FraPPE: Fast and Efficient Preference-Based Pure Exploration
A computationally efficient algorithm for identifying the exact Pareto optimal set with fixed confidence, under any preference cone, in a vector-valued bandit. FraPPE is provably asymptotically optimal and numerically achieves the lowest sample complexity.
Abstract
Reviews and Discussion
This paper studies the Preference-based Pure Exploration (PrePEx) problem, focusing on identifying the set of Pareto optimal arms in a vector-valued bandit. In this problem, the reward vectors are ordered using a given preference cone. The authors propose an algorithm based on a computationally tractable reduction of the minimization problem and the Frank-Wolfe optimizer. They prove both the algorithm's sample and computational complexity to demonstrate its effectiveness. Additionally, the authors validate the algorithm through experiments. Results demonstrate that their algorithm has advantages in both accuracy and efficiency.
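For readers less familiar with cone-induced Pareto optimality, the notion the review refers to can be sketched as follows (our own toy illustration, for the componentwise order induced by the non-negative orthant cone; the paper's algorithm handles arbitrary preference cones):

```python
import numpy as np

def pareto_set(means, dominates):
    """Indices of arms whose mean vector is not dominated by any
    other arm under the given cone-induced order."""
    K = len(means)
    return [i for i in range(K)
            if not any(dominates(means[j], means[i])
                       for j in range(K) if j != i)]

# Order induced by the non-negative orthant cone C = R^L_+ :
# x dominates y iff (x - y) lies in C and x != y.
orthant = lambda x, y: bool(np.all(x - y >= 0) and np.any(x - y > 0))

means = np.array([[1.0, 0.0],   # arm 0
                  [0.0, 1.0],   # arm 1
                  [0.5, 0.5],   # arm 2
                  [0.2, 0.2]])  # arm 3: dominated by arm 2
print(pareto_set(means, orthant))  # -> [0, 1, 2]
```

Replacing `orthant` with a membership test for a narrower or wider cone changes which arms survive, which is exactly the dependence on the preference cone that the paper exploits.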
Strengths and Weaknesses
Strengths:
- The authors design a computationally efficient and statistically optimal PrePEx algorithm that addresses the preference-based pure exploration problem with exponential family distributions.
- The authors provide a thorough analysis of their algorithm. Their algorithm's sample complexity and computational complexity are proven.
- The authors utilize real-world datasets to validate the algorithm's effectiveness experimentally.
Weakness:
-
The paper's definitions should adhere to standard, classical formulations, particularly for fundamental concepts such as partial order and the L-parameter exponential family. Regarding the latter, my assessment suggests that the authors' definition of the L-parameter exponential family deviates from its widely accepted form. This discrepancy could impact the validity of the paper's conclusions.
-
Lemma 2 stands out as the most crucial conclusion in the paper. It purports to demonstrate that the algorithm achieves asymptotic optimality as two tolerance parameters approach zero. However, while the definition of the first parameter is briefly touched upon in the appendix, the definition of the second is absent. This omission prevents me from evaluating whether these two values can indeed be arbitrarily close to zero, which makes it impossible to fully assess the meaningfulness of Lemma 2's conclusion.
-
The authors directly reference results from related papers for the lower bound of the problem.
Questions
-
My assessment suggests that the authors' definition of the L-parameter exponential family deviates from its widely accepted form. This might mean the problem the authors are solving is actually limited to one-parameter exponential families, rather than truly extending to multi-parameter ones as the framing suggests.
-
Could the authors please clarify the precise definition of the second tolerance parameter in Lemma 2? Furthermore, it is not immediately clear whether both parameters can be arbitrarily close to zero. Gaining clarity on these points would be crucial for me to fully assess whether the algorithm achieves asymptotic optimality as claimed.
If the authors can address my concerns regarding Lemma 2, I would consider improving my evaluation of the paper. Separately, because I'm not deeply familiar with the Pareto Set Identification problem, I haven't been able to thoroughly check all the proofs, despite spending considerable time on them. Therefore, I'll be looking to other reviewers' insights to help inform any adjustments to my overall assessment.
Limitations
Yes.
Final Justification
This paper presents a solid, technically sound approach to Preference-based Pure Exploration. The method proposed is both novel and valuable, offering a promising solution to the problem being addressed.
However, the paper uses an existing lower bound from previous work, without providing further analysis of the lower bound under the new conditions. This leaves room for improvement in the theoretical aspects.
Formatting Issues
I did not notice any major formatting issues in this paper. The paper appears to follow the NeurIPS 2025 Paper Formatting Instructions correctly.
We thank the reviewer for taking their time to review our work. We respond to their concerns below.
-
Standard definition of partial order: We use the classical definition of order induced by a preference cone in vector optimisation literature [6,7]. This has been used across the board in the Pareto set identification with bandit feedback literature [2,3,4,5].
-
L-parameter exponential family: We think that there is a miscommunication here. We now include the following detailed definition of the L-parameter exponential family in the paper.
- Definition: Specifically, let us consider an exponential family defined as $p_\eta(x) = h(x)\exp(\langle \eta, T(x)\rangle - A(\eta))$ with $x \in \mathbb{R}^L$. Here, $\eta$ is the natural parametrisation, $T(x)$ is the sufficient statistic for the natural parameter, $h(x)$ is the base density, and $A(\eta)$ is the log-normaliser or log-partition function. The family is fully defined by $T$, $h$, and $A$. Specifically, the valid natural parameters are the ones for which $A(\eta)$ remains finite (Chapter 18 in [8]).
- Lemma 2: Thank you for pointing this out for elaboration.
-
a. The first parameter is the closeness parameter necessary to define the ''good events''. These events are crucial to ensure convergence of the sequential estimation problem. In our case, the two events are (i) convergence of the sub-differential set, and (ii) convergence of the objective function with respect to the mean estimates, for any allocation policy (Equations (18) and (19)).
-
b. As correctly pointed out by the reviewer, there is no formal definition of the second parameter. It arises while estimating the minimum time required to ensure convergence of the allocation policy, and consequently of the stopping-time bound. Mathematically, the corresponding inequality holds beyond a finite time threshold (Theorem 5), and taking the parameter appropriately small then yields the stated bound. This is a standard algebraic trick used in the existing literature (e.g., [1]).
Both these constants can be arbitrarily small, and hence the asymptotic optimality of FraPPE follows. We will incorporate the definition of the second parameter in the updated manuscript.
We hope that our explanations clarify all the doubts regarding correctness of our results and will motivate the reviewer to adjust their score appropriately. We remain available for further clarifications and discussions.
Reference:
[1] Wang, Po-An, Ruo-Chun Tzeng, and Alexandre Proutiere. "Fast pure exploration via frank-wolfe." Advances in Neural Information Processing Systems 34 (2021): 5810-5821.
[2] Ararat, C. and Tekin, C. (2023). Vector optimization with stochastic bandit feedback. In PMLR.
[3] Karagözlü, E. M., Yıldırım, Y. C., Ararat, C., and Tekin, C. (2024). Learning the pareto set under incomplete preferences: Pure exploration in vector bandits. PMLR.
[4] Shukla, Apurv, and Debabrota Basu. (2024). Preference-based pure exploration. NeurIPS.
[5] Crepon, E., Garivier, A., and M Koolen, W. (2024). Sequential learning of the Pareto front for multi-objective bandits. PMLR.
[6] Jahn, Johannes, ed. Vector optimization. Berlin: Springer, 2009.
[7] Löhne, Andreas. Vector optimization with infimum and supremum. Springer Science & Business Media, 2011.
[8] DasGupta, Anirban. "The exponential family and statistical applications." Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. New York, NY: Springer New York, 2011. 583-612.
Thank you for your response to our questions. The clarification regarding the two convergence parameters has helped us more precisely assess the paper’s contribution. However, several concerns remain. As noted in our original review, our main focus is on the authors’ treatment of these parameters, but we also encourage greater care in the use of key terminology to ensure that the claims are stated accurately.
We list our comments point by point below:
1. Definition of partial order:
There may be a typographical issue in the current definition: the direction of the order relation in the stated condition appears to be reversed. While this does not affect the main results, we ask the authors to verify this point for correctness. The authors do not need to respond further on this point.
2. L-parameter exponential family:
The explanation provided in the response is clearer than that in the paper, but we suggest adopting a more standard definition. In particular, L should refer to the dimension of the parameter vector, not the dimension of the observed variable.
More importantly, we remain concerned about the paper’s claim of addressing the general L-parameter exponential family. Based on our understanding, the authors are working with L independent one-parameter exponential families, where each component of the observation depends only on its own parameter. This is distinct from the general L-parameter exponential family setting, where each component may depend on the entire parameter vector.
The one-parameter setting is a strict subset of the general case. If the paper addresses only this more restricted model, it should not claim results for the general exponential family setting. These two cases differ significantly in generality and complexity. The authors should clearly state which setting their results apply to, so that the conclusions correctly reflect the scope of the work.
3. Clarification on the convergence parameters:
The authors’ discussion of these parameters has clarified their role in the main result. We believe this analysis is important and should be included in the paper to improve clarity. However, the current explanation still leaves a potential gap.
Specifically, the authors assume that the stopping time exceeds a certain threshold. However, the paper’s core contribution is an upper bound on the stopping time, so if the confidence parameter is not small enough, the assumed condition may not be satisfied. We believe that a lower bound on the stopping time in terms of the relevant parameter is needed to resolve this inconsistency. Moreover, if that parameter cannot be made arbitrarily small, the claim of asymptotic optimality is potentially problematic.
Summary:
We appreciate the clarifications and acknowledge the additional insight they provide. However, we still observe a gap between the claimed conclusions and the theoretical results currently justified in the paper. If the authors can either rigorously support the claims as they stated, or revise them to accurately match the technical findings, we would be glad to consider increasing the score.
Thank you for your detailed responses to my questions. Your answers have sufficiently addressed my concerns. I recommend incorporating this discussion into the main text or the appendix of the paper to enhance its completeness and clarity.
Additionally, please use the formal definition of the L-parameter exponential family in the next version of the paper, and clearly specify the relationship and distinction between the dimension of the parameter vector and that of the observation.
Based on these improvements, I will be updating my score to borderline accept.
Dear Reviewer,
As the rebuttal period is terminating, we thank you for your thorough and thoughtful evaluation of our submission. Your constructive feedback has provided valuable guidance in improving the clarity and quality of our work.
Sincerely, The Authors
Clarification on the convergence parameters: We think there is a misunderstanding here. We address the question in two layers.
In the lower bound-based pure exploration literature, we try to solve the lower bound with empirical estimates of the means of the underlying arms at each time step. Thus, the proof often consists of two concentration phenomena: concentration of the empirical mean to the true mean, and then concentration of the empirical lower bound to the true one, due to using estimates and the alt-set optimisation. Finally, these together lead to convergence in allocation and, thus, correctness. This always introduces a closeness parameter to control the convergence, and often further parameters to explicitly write the time to concentrate to the aforementioned desired events. Examples are the two parameters in our case, and their analogues in the seminal work of [1], in [3] that developed the Frank-Wolfe method, and in the following literature [2,4,5].
One universal observation is that the terms involving these parameters in the upper bounds are independent of the confidence level $\delta$. Also, note that given any fixed $\delta$, we choose the two parameters as fixed constants for proving the upper bound. Thus, we first divide the upper bound by $\log(1/\delta)$ and take $\delta \to 0$. This vanishes all the additive terms that do not scale with $\log(1/\delta)$. Now, we can take the limits of the upper bound w.r.t. the two parameters (pages 29-30, [1]; page 29, [2]; Appendix I.2, [3]; pages 27-28, [4], etc.).
If this limit matches the lower bound, we call the algorithm "asymptotically optimal". We follow this nomenclature while clarifying the nuance of setting the two parameters arbitrarily small in Lemma 2.
- An elaborate proof of how we arrive at "asymptotic optimality":
We provide below a detailed discussion of the proof and why taking these limits is accepted in Lemma 2. We omitted this since the algebraic trick was proposed in [3], which is a standard work in the domain proposing the use of the Frank-Wolfe optimiser for asymptotically optimal BAI, and has since been reiterated throughout the literature. We will now add the following elaboration for completeness.
Step 1: Let us recall the good events:
Event (i): The approximation error of the solution in the Frank-Wolfe updates (Equation 7) is bounded by the first tolerance parameter.
Event (ii): The error in the objective function due to using the empirical means is bounded by the second tolerance parameter.
Hence, the Frank-Wolfe optimiser can stop when it is close to the true maximum.
Following the concentration results, we can show that both good events hold true after a finite time threshold, by Theorem 5.
Here, the constants involved, including the time threshold, are as defined in the paper.
Step 2: Now, given the good events hold true, we can upper bound the stopping time of FraPPE by chaining the corresponding inequalities.
Please check the next comment for continuation.
Step 3: Finally, we define a constant collecting the terms that depend on the two convergence parameters.
Now, we are ready to introduce the small constant, chosen such that the required inequality holds once the confidence level is small enough. This choice is reflected in the non-asymptotic sample complexity upper bound (Lemma 6 in Appendix E.3).
Now, following the algebra in [3], we obtain the corresponding sample complexity bound.
Step 4: This finally leads to Equation (20) in our draft. We observe that the terms due to the two parameters are independent of $\delta$. Thus, if we divide both sides of Equation (20) by $\log(1/\delta)$ and take the limit $\delta \to 0$, those terms vanish. This yields our final bound in Lemma 2.
Now, as we can set both parameters arbitrarily small and independently of $\delta$, this bound is known to be asymptotically optimal, as in [1,2,3,4,5]. To better explicate the nuances, we also provide a non-asymptotic bound in Lemma 6.
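Schematically, writing $\varepsilon$ and $\gamma$ as placeholder names for the two convergence parameters, $\tau_\delta$ for the stopping time, and $T^\star(\mu)$ for the characteristic time in the lower bound ($C$ and $g$ below are illustrative stand-ins, not the paper's exact constants), the limiting argument reads:

```latex
\mathbb{E}[\tau_\delta]
  \;\le\; C(\varepsilon,\gamma)
  \;+\; \bigl(1 + g(\varepsilon,\gamma)\bigr)\, T^\star(\mu)\, \log(1/\delta),
\qquad
g(\varepsilon,\gamma) \xrightarrow[\varepsilon,\gamma \to 0]{} 0 .
```

Dividing by $\log(1/\delta)$ and letting $\delta \to 0$ removes the additive term $C(\varepsilon,\gamma)$; letting $\varepsilon,\gamma \to 0$ afterwards gives $\limsup_{\delta\to 0} \mathbb{E}[\tau_\delta]/\log(1/\delta) \le T^\star(\mu)$, matching the lower bound.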
L-parameter exponential family: Thank you for your further comments on the definition. We would like to elaborate on this. In our setting, we have knowledge of the covariance matrix; only the L-dimensional mean vectors are unknown. Thus, the mean parametrisation, which is L-dimensional, remains sufficient. To emphasise this, we refer to it as an L-parameter exponential family.
- Example: Multivariate Gaussian with mean $\mu$ and covariance $\Sigma$. The density function is
$$p(x) = (2\pi)^{-L/2}\,|\Sigma|^{-1/2} \exp\Big(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1}(x-\mu)\Big).$$
We can rewrite it as
$$p(x) = \underbrace{(2\pi)^{-L/2}\,|\Sigma|^{-1/2} \exp\big(-\tfrac{1}{2} x^\top \Sigma^{-1} x\big)}_{h(x)} \cdot \exp\Big(\big\langle \Sigma^{-1}\mu,\; x \big\rangle - \tfrac{1}{2}\mu^\top \Sigma^{-1}\mu\Big).$$
Thus, the base density is $h(x) = (2\pi)^{-L/2}|\Sigma|^{-1/2}\exp(-\tfrac{1}{2}x^\top \Sigma^{-1} x)$. The natural parameter is $\eta = \Sigma^{-1}\mu$. The sufficient statistic is $T(x) = x$. The log-normaliser or log-partition function is $A(\eta) = \tfrac{1}{2}\mu^\top\Sigma^{-1}\mu = \tfrac{1}{2}\eta^\top\Sigma\,\eta$. The natural parameter of the general Gaussian family would be of higher order, but, as a consequence of knowing $\Sigma$ beforehand, only the mean parameterisation of order $L$ is identifiable.
We have carefully checked the whole manuscript and observe that we never use independence among the indices of the L-dimensional random variable in any of the proofs. Thus, we maintain that our theoretical results hold for the general L-parameter exponential family.
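As a quick numerical sanity check of the decomposition above (a standalone sketch; the variable names are ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 3
mu = rng.normal(size=L)
M = rng.normal(size=(L, L))
Sigma = M @ M.T + L * np.eye(L)         # a known, well-conditioned covariance
Sinv = np.linalg.inv(Sigma)
x = rng.normal(size=L)

# Direct multivariate Gaussian density.
norm = (2 * np.pi) ** (L / 2) * np.sqrt(np.linalg.det(Sigma))
direct = np.exp(-0.5 * (x - mu) @ Sinv @ (x - mu)) / norm

# Exponential-family form with Sigma known: eta = Sigma^{-1} mu, T(x) = x.
eta = Sinv @ mu
h = np.exp(-0.5 * x @ Sinv @ x) / norm  # base density absorbs the x-quadratic
A = 0.5 * eta @ Sigma @ eta             # log-partition = (1/2) mu^T Sinv mu
ef = h * np.exp(eta @ x - A)

print(np.isclose(direct, ef))  # True
```

The two expressions agree for any `x`, which is exactly the point that only the L-dimensional mean parameterisation varies once $\Sigma$ is fixed.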
Let us know if the above explanations made our claims in the paper clearer. We are eager to discuss any further concerns.
- References
[1] Garivier, Aurélien, and Emilie Kaufmann. "Optimal best arm identification with fixed confidence." Conference on Learning Theory. PMLR, 2016.
[2] Jedra, Yassir, and Alexandre Proutiere. "Optimal best-arm identification in linear bandits." Advances in Neural Information Processing Systems 33 (2020): 10007-10017.
[3] Wang, Po-An, Ruo-Chun Tzeng, and Alexandre Proutiere. "Fast pure exploration via frank-wolfe." Advances in Neural Information Processing Systems 34 (2021): 5810-5821.
[4] Mukherjee, Arpan, and Ali Tajer. "Best Arm Identification in Stochastic Bandits: Beyond optimality." arXiv preprint arXiv:2301.03785 (2023).
[5] Degenne, Rémy, Wouter M. Koolen, and Pierre Ménard. "Non-asymptotic pure exploration by solving games." Advances in Neural Information Processing Systems 32 (2019).
This paper investigates the design of computationally efficient algorithms for the preference-based pure exploration problem. The authors reduce the corresponding lower-bound optimization problem to a computationally tractable form and solve it using the Frank-Wolfe algorithm. The authors further provide theoretical guarantees for the proposed algorithm on both computational and sample complexity. Both theoretical results and empirical experiments demonstrate the superiority of the proposed algorithm.
Strengths and Weaknesses
Strengths:
- This work explores a computationally efficient algorithm for solving the preference-based pure exploration problem. It presents a rigorous theoretical analysis on both the computational and sample complexity of the proposed method.
Weaknesses:
-
In the problem formulation part of Section 2, the definition of the reward distribution seems somewhat unconventional, as the authors only define its mean. In contrast, (Shukla and Basu, 2024) further define the corresponding covariance matrix to account for noisy feedback.
-
In Figure 1, the authors intend to illustrate how the Pareto frontier varies under different preference cones. However, the two Pareto frontiers overlap, which easily confuses readers. The authors could consider plotting the Pareto frontiers for two clearly distinct cones to better illustrate the difference.
-
Some concepts are used without formal definitions, such as the term "stationary policies" in line 209.
-
In line 298, the authors claim that the Frank-Wolfe algorithm can handle non-smooth optimization objectives by constructing the corresponding approximate sub-differential set. Can this approximation method guarantee recovery of the original optimal solution? In addition, please justify the choice of the associated approximation hyper-parameter.
-
A minor writing suggestion: in line 205, the concept of pure policies has not been formally introduced. To avoid potential confusion, it is advisable to use the notation $\Delta_K \setminus \{\pi^*\}$ rather than $\Delta_K \setminus \{\pi_i^*\}$.
-
It is recommended that the authors provide a comprehensive table comparing different preference-based pure exploration methods in terms of algorithmic complexity (including time and space complexity) as well as the upper bounds on sample complexity. This would help readers better understand the strengths and weaknesses of the proposed method in a broader context.
-
The authors could elaborate more on the technical challenges in accelerating algorithms solving the preference-based pure exploration problem, and clarify the theoretical contributions of their proposed approach.
Questions
Please see the strengths and weaknesses section.
Limitations
N/A
Final Justification
My concerns have been addressed during the discussion. I keep my positive score.
Formatting Issues
N/A
We thank the reviewer for reviewing and acknowledging our work. We respond to the concern raised below:
Covariance matrix: In the problem formulation, we did not highlight the covariance matrix of the reward distribution, as it is known in our setup, and the main challenge in bandits is that the means of the reward distributions are unknown. This is also reflected in the derivation of the lower bound in [1]. Note that we also allow the reward distributions to have higher moments. In Assumption 1, we adhered to the definition of a multi-parameter exponential family (Chapter 18 in [2]), where $A$ is the log-partition function and $\eta$ is the natural parametrisation. Differentiating $A$ w.r.t. $\eta$ gives the mean, and the second-order derivative yields the covariance matrix. We will explicitly mention the covariance in Section 2 to avoid any confusion around the definition.
Figure 1: Thank you for pointing this out. The intent was to show that different preference cones result in different Pareto optimal arms, though the resulting sets can intersect. We will change the figure so that the frontiers do not intersect and the difference is more clearly understandable.
Stationary policy: By stationary policy, we wanted to convey that the pure policies (each standing for playing a single arm) do not change over time. We will definitely include a formal definition.
Line 205: We will formally define pure policies (policies that have all the probability mass on a single arm). Thank you for pointing this out.
Choice of the approximation parameter: As per the existing literature on Frank-Wolfe [3], the approximation parameter should shrink with the iteration count at the prescribed rate. This ensures the convergence of the empirical allocation to the optimal allocation (Theorem 5). In practice, we choose it accordingly.
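For intuition, the shape of a Frank-Wolfe iteration on the simplex can be sketched as below (a self-contained toy on a synthetic concave objective — not FraPPE's actual objective or tolerance schedule; all names are ours):

```python
import numpy as np

def frank_wolfe_simplex(grad, K, iters=2000):
    """Maximise a smooth concave function over the probability simplex.
    The linear subproblem max_{v in simplex} <grad(w), v> is solved by a
    vertex, i.e., a pure policy, so each step mixes one arm into w."""
    w = np.full(K, 1.0 / K)
    for t in range(iters):
        v = np.zeros(K)
        v[np.argmax(grad(w))] = 1.0   # linear maximiser: a pure policy
        step = 2.0 / (t + 2.0)        # standard step-size schedule
        w = (1 - step) * w + step * v
    return w

# Toy concave objective f(w) = -||w - target||^2, maximised at w = target.
target = np.array([0.5, 0.3, 0.2])
w_star = frank_wolfe_simplex(lambda w: -2.0 * (w - target), K=3)
print(np.round(w_star, 2))  # close to target
```

The vertex structure of the linear subproblem is what makes each iteration cheap: it only requires an argmax over arms, not a projection.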
Convergence of sub-differential set: First, we prove continuity of the mapping from allocation policy to sub-differential set in Corollary 1 (Appendix D.2). Then, we show the convergence of Frank-Wolfe iterates (Equation (7)) while we iteratively optimise over the allocation policy (Appendix D.3).
Comparison table: This is a very nice suggestion. With the additional available page, we will definitely include a table that summarises the different approaches to solving Pareto set identification and their corresponding complexities.
Contribution and technical challenges: In this paper, we successfully address the open problem posed in [4] by proposing the first computationally efficient and asymptotically optimal algorithm that works for arbitrary preference cones and exponential family distributions, and that also achieves the lowest sample complexity and probability of error for identifying the exact set of Pareto optimal arms. We list all our contributions in lines 88 to 107 of the main manuscript. The key technical challenges are discussed at the beginning of Sections 3.1 and 3.2, i.e., solving the optimisation over the alternating instances and over the Pareto optimal policy set. Complying with the reviewer's remark, we will summarise the technical challenges tackled in this work.
We hope that we have addressed all the concerns. We are eager to discuss further if the reviewer has more concerns or comments.
Reference:
[1] Shukla, Apurv, and Debabrota Basu. (2024). Preference-based pure exploration. NeurIPS.
[2] DasGupta, Anirban. "The exponential family and statistical applications." Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. New York, NY: Springer New York, 2011. 583-612.
[3] Wang, Po-An, Ruo-Chun Tzeng, and Alexandre Proutiere. "Fast pure exploration via frank-wolfe." Advances in Neural Information Processing Systems 34 (2021): 5810-5821.
[4] Crepon, E., Garivier, A., and M Koolen, W. (2024). Sequential learning of the Pareto front for multi-objective bandits. PMLR.
Thanks for the detailed response. I have no further questions.
Dear Reviewer,
We are grateful for the time and attention you have devoted to evaluating our submission. Your detailed and constructive feedback is much appreciated.
Warm regards, The Authors
This paper tackles Preference-based Pure Exploration (PrePEx) in multi-objective bandits, aiming to identify the Pareto-optimal arms efficiently and confidently. The authors propose FraPPE, a computationally efficient and asymptotically optimal algorithm that leverages structural reductions, a Frank-Wolfe optimizer, and a novel stopping rule. Empirical results demonstrate gains in sample efficiency and stability over prior methods.
Strengths and Weaknesses
Strengths:
• The paper tackles an interesting and well-motivated problem; namely preference-based pure-exploration.
• The work proposes a practical approach which reduces the computational complexity of solving the tackled problem.
• The paper results are supported by theoretical guarantees and empirical analysis on both synthetic and real-world problems.
Weaknesses:
• Additional tests on larger-scale or more diverse real-world settings would strengthen the empirical claims.
• For clearer presentation and better contrast with prior work, it would be helpful to summarize the main differences and contributions in a comparison table.
• (minor) While multi-armed bandits provide a foundational framework for studying sequential decision-making under uncertainty, describing them as the "archetypal setup" of reinforcement learning is somewhat misleading (line 36). Bandits represent a simplified special case without state transitions or delayed consequences, which are central features of reinforcement learning.
Questions
Can the same approach (FraPPE) be extended or scaled to combinatorial pure exploration problems, such as combinatorial bandits?
How sensitive is FraPPE to mild deviations from its theoretical assumptions in practical scenarios?
Could the authors report or discuss FraPPE's actual runtime in wall-clock seconds (rather than just iteration or sample complexity), especially on the real-world dataset? This would provide practical insight into computational efficiency beyond theoretical complexity.
Limitations
Yes.
Final Justification
My concerns have been addressed, and I have decided to raise my score to 5.
Formatting Issues
No.
We thank the reviewer for their valuable review and remarks. We respond to the concerns raised below:
- Experimental validation and computational bottleneck:
-
a. Runtime of FraPPE: While using only CPUs (without parallel processing and tensorisation), each iteration of FraPPE takes on the order of milliseconds for COVBOOST, for DB [1], and for the synthetic data with correlated objectives; the exact per-iteration timings and instance sizes are as reported in our experiments. These results are obtained by averaging over 1000 iterations.
-
b. Runtime of baselines: The implementations available in the literature differ in their setup and programming languages: [2] uses Julia, [3] uses C++. In contrast, to make the code accessible to a wider audience, we chose Python3 for the implementation, with only three well-known packages: NumPy, CVXPY, and SciPy. Thus, we surmise it would be unfair to compare the wall-clock runtimes of the different baseline implementations. But, resonating with the reviewer's remark, we have now conducted a runtime analysis with varying problem sizes to validate our theoretical claims.
-
c. Conic programming: Yes, we agree that conic programming is the computationally costliest module here. However, while solving the lower bound, conic programs are unavoidable, as they are required to find the Pareto set for generic cones. Thus, this has been part of the open problem posed by [2], i.e., to reduce the computational complexity per iteration to that of Pareto set computation.
-
d. Larger dataset - DB [1]: We have tested on all the datasets used in the SOTA paper for exact Pareto set identification, PSIPS [3]. We are currently testing FraPPE on the DB (Disc Brake) dataset, which presents an efficiency and safety optimisation problem in disc brake manufacturing for automotives. In the literature, larger datasets like DB are used only to attain approximate Pareto set identification [5,6], while FraPPE aims for exact identification and succeeds as per the preliminary results.
Comparison table: This is indeed a very nice suggestion. With the extra available space, we will definitely include a table that summarises the complexity of different approaches to solve Pareto set identification and their corresponding complexity.
FraPPE in Combinatorial Bandits: We think that, like most algorithmic frameworks for solving pure exploration in multi-objective bandits, e.g., successive arm-elimination [1] or Bayesian posterior sampling [2], FraPPE can also be extended to solve pure exploration in combinatorial bandits. That means the goal will be to identify the optimal subset of arms among the possible subsets (or a policy whose basis consists of the pure policies corresponding to the optimal subset of arms) that maximises the reward. Additionally, the agent will observe feedback for each of the arms in the subset chosen at any time step. This problem is more complex due to exploration among exponentially many options rather than the usual number of actions. This bottleneck calls for careful adaptation of the existing lower bound to the combinatorial setup (which might depend on the complexity of differentiating between the optimal subset and the second-best subset of arms). Still, this problem seems solvable by adapting FraPPE directly with the necessary extensions.
Mild deviations from theoretical assumptions: The only two assumptions that we require are: knowledge of the cone, and the reward distribution being in an exponential family with non-zero second moment. The first is a fundamental requirement, as the lower bound of [6], which is our starting point, requires it. If we have a misspecified cone, the algorithm would still run and identify a part of the Pareto optimal arms, but we cannot say anything theoretically about correctness and optimality. The second is more of a theoretical requirement to prove the optimality of FraPPE. The lower bound and the algorithmic technique would work for any distribution. But since the GLRT stopping rule is known to be tight for exponential families, not only FraPPE but any lower bound-driven pure exploration algorithm [7,8,9], even in simple best arm identification, assumes an exponential family to prove optimality [1,2,6]. In brief, we can run FraPPE on instances deviating from these assumptions, but we can no longer guarantee its theoretical properties.
We hope that we have addressed all the concerns. We are eager to discuss further if the reviewer has more concerns or comments.
References:
[1] Crepon, E., Garivier, A., and M Koolen, W. (2024). Sequential learning of the Pareto front for multi-objective bandits. AISTATS.
[2] Kone, C., Jourdan, M., and Kaufmann, E. (2025). Pareto set identification with posterior sampling. AISTATS.
[3] Tanabe, R. and Ishibuchi, H. (2020). An easy-to-use real-world multi-objective optimization problem suite. Applied Soft Computing, 89:106078.
[4] Ararat, C. and Tekin, C. (2023). Vector optimization with stochastic bandit feedback. In PMLR.
[5] Karagözlü, E. M., Yıldırım, Y. C., Ararat, C., and Tekin, C. (2024). Learning the pareto set under incomplete preferences: Pure exploration in vector bandits. PMLR.
[6] Shukla, Apurv, and Debabrota Basu. (2024). Preference-based pure exploration. NeurIPS.
[7] Garivier, A. and Kaufmann, E. (2016). Optimal best arm identification with fixed confidence. COLT.
[8] Wang, P.-A., Tzeng, R.-C., and Proutiere, A. (2021). Fast pure exploration via Frank-Wolfe. NeurIPS.
[9] Degenne, R., Ménard, P., Shang, X., and Valko, M. (2020). Gamification of pure exploration for linear bandits. ICML.
Dear Reviewer,
Thank you for taking the time to read our work so closely. We truly appreciate your constructive suggestions and the perspective you brought to the review.
With appreciation, The Authors
Thank you for your detailed response. I have also reviewed other reviewers’ comments and your replies. Please incorporate the additional experimental results and discussions into the revised version of the paper, as they will strengthen its contribution. My concerns have been addressed, and I have decided to raise my score to 5.
This paper considers pure exploration in a K-armed stochastic bandit problem with L-dimensional rewards. The goal is to identify the Pareto set of policies (distributions over arms) given a fixed confidence level in as few evaluations as possible. The paper builds an approach based on the lower bound on Pareto front identification. Using structural properties, it forms a computationally tractable alternative to the minimax optimization problem in the lower bound. A track-and-stop inspired arm selection and stopping rule is proposed based on the tractable lower bound. Asymptotic optimality of the proposed rule is established. Experiments are carried out to show how the proposed rule compares with SoTA.
Strengths and Weaknesses
Strengths:
The paper is clearly written. The authors did a good job of explaining this complicated pure exploration problem in nine pages.
This paper solves an important open problem in bandit-based vector optimization by proposing a computationally efficient, asymptotically optimal, fixed-confidence pure exploration algorithm. The proposed algorithm utilizes a tractable reformulation of the intractable optimization problem. Theoretical analysis of the computational complexity demonstrates a significant improvement over existing approaches. The paper establishes a non-asymptotic bound on the sample complexity that is asymptotically optimal. Experiments support the theoretical claims.
Weaknesses:
There is no major weakness.
Questions
- It is hard to parse the contribution of Section 3.3. What is the exact reduction in computational complexity when the polar cone representation of alternating instances is used?
- Please elaborate on how the $L$-parameter exponential family extends the scalar exponential family. What conditions are required on the functions $\eta$, $T$, and $h$?
Limitations
Yes.
Final Justification
The authors have sufficiently addressed my comments regarding clarifications. This is solid work. Please incorporate these clarifications in the final version of the paper.
Formatting Issues
N/A.
We thank the reviewer for acknowledging our work and taking the time to review it. We respond to the questions raised below:
- Section 3.3: In this section, we leverage the structure of the alternating instance to gain computational efficiency for the two innermost minimisation problems in Equation (3).
- (a) Direct optimisation over the alternating instances can have complexity $O(K^2 L)$, since optimising an $L$-dimensional vector inside a cone has complexity $O(L)$. With the polar cone representation of the alternating instances (Proposition 1), the alternative mean can be expressed through an $L$-dimensional vector on the boundary of the polar cone $\mathcal{C}^\circ$ of the given preference cone $\mathcal{C}$. Thus, for a fixed pair of arms, we now optimise over an $L$-dimensional vector rather than a $K \times L$ matrix.
- (b) Further, with this reduction, we observe that both optimisation problems are defined on the cone $\mathcal{C}$ and its polar cone $\mathcal{C}^\circ$. Thus, any efficient second-order conic program solver (e.g., CVXPY) can solve both of the optimisation problems efficiently.
Hence, Section 3.3 reduces the total computational complexity from $O(K^2 L)$ to $O(KL \min\{K, L\})$.
- L-parameter exponential family: Since we adopted the standard definition of the multi-parameter exponential family from statistics (Chapter 18 in [1]), which also generalises the multivariate Gaussian of dimension $L$ with mean $\mu$ and covariance $\Sigma$, we omitted a detailed discussion. Thanks for the remark. We clarify it here and will add the details with the extra page available.
- Definition: Specifically, let us consider an exponential family defined as $p_\theta(x) = h(x)\exp\left(\langle \eta(\theta), T(x)\rangle - A(\theta)\right)$ with $\theta \in \Theta \subseteq \mathbb{R}^L$ and $x \in \mathcal{X}$. Then $\eta(\theta)$ is the natural parametrization for some $\theta \in \Theta$. $T(x)$ is the sufficient statistic for the natural parameter. $h(x)$ is the base density such that $h(x) \geq 0$. Finally, $A(\theta)$ is the log-normaliser or log-partition function. It is fully defined by $\eta$, $T$, and $h$. Specifically, the valid natural parameters are the ones for which $A$ remains finite.
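To make the definition concrete, here is a small sanity check (ours, not from the paper; `mu` and `sigma` are arbitrary illustrative values) verifying that the scalar Gaussian with known variance fits this form:

```python
import numpy as np
from scipy.stats import norm

# Illustrative check (not from the paper): the scalar Gaussian N(mu, sigma^2)
# with known sigma is a 1-parameter exponential family with
#   natural parameter    eta  = mu / sigma^2,
#   sufficient statistic T(x) = x,
#   base density         h(x) = exp(-x^2 / (2 sigma^2)) / sqrt(2 pi sigma^2),
#   log-partition        A    = eta^2 sigma^2 / 2.
mu, sigma = 1.3, 0.7
eta = mu / sigma**2
A = eta**2 * sigma**2 / 2.0

def h(x):
    return np.exp(-x**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

xs = np.linspace(-3.0, 3.0, 7)
p_expfam = h(xs) * np.exp(eta * xs - A)   # h(x) exp(eta * T(x) - A)
```

The values `p_expfam` coincide with the usual Gaussian density, confirming the parametrisation.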
We hope that our response has addressed the concerns and questions. We remain available for further discussion.
Reference: [1] DasGupta, Anirban. "The exponential family and statistical applications." Probability for Statistics and Machine Learning: Fundamentals and Advanced Topics. New York, NY: Springer New York, 2011. 583-612.
Thanks for the clarifications. My questions are addressed.
Dear reviewer,
As the rebuttal period is coming to an end, we thank you for taking the time to thoroughly review our work. We appreciate your constructive feedback.
Regards, Authors
This paper addresses an open problem in preference-based pure exploration (PrePEx) in multi-objective bandits, where the goal is to identify all Pareto optimal arms under a given preference cone with fixed confidence. The authors propose FraPPE, a computationally efficient algorithm that provides an asymptotically optimal solution for arbitrary preference cones and exponential family distributions. The key contributions include:
- Structural reductions of the optimization problem, avoiding convex hull computations,
- Application of Frank-Wolfe optimization for efficient lower bound tracking, and
- Theoretical guarantees of asymptotic optimality with improved computational complexity.
- Simulation results to support the proposed results.
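As a generic illustration of the Frank-Wolfe scheme mentioned above (a sketch of the optimizer family, not FraPPE's exact lower-bound tracking update; `c` is an arbitrary illustrative target), minimising a smooth convex function over the probability simplex only needs a linear minimisation oracle, whose solution is always a vertex:

```python
import numpy as np

# Generic Frank-Wolfe sketch (not FraPPE's exact update):
# minimise f(w) = 0.5 * ||w - c||^2 over the probability simplex.
# The linear minimisation oracle over the simplex returns a vertex
# (a single coordinate), so iterates remain feasible by construction.
c = np.array([0.1, 0.5, 0.4])           # target (lies in the simplex)

def grad(w):
    return w - c                        # gradient of f

w = np.ones(3) / 3.0                    # start at the uniform allocation
for t in range(5000):
    s = np.zeros(3)
    s[np.argmin(grad(w))] = 1.0         # oracle: vertex minimising <grad(w), s>
    gamma = 2.0 / (t + 3.0)             # standard diminishing step size
    w = (1.0 - gamma) * w + gamma * s   # convex combination stays feasible
```

Because each iterate is a convex combination of simplex points, no projection step is ever needed, which is the computational appeal of Frank-Wolfe-style trackers in this setting.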
Strengths and Weaknesses
Strengths:
Novel Theoretical and Algorithmic Contributions: The paper makes significant progress on an open problem posed by Crepon et al. (2024) by providing a computationally efficient and asymptotically optimal algorithm for arbitrary preference cones. The decomposition of the Alt-set without requiring convex hull computations is a significant theoretical advance. The reduction from O(K²L) to O(KL min{K,L}) computational complexity is non-trivial and practically important.
Problem Relevance and Motivation: The multi-objective bandit setting with preference cones is well-motivated through concrete applications in clinical trials (COV-BOOST), material design, and RLHF. The inability to scalarize objectives a priori is a genuine challenge in these domains, making the preference cone framework particularly relevant.
Strong Literature Positioning: The paper does a great job of placing itself amid related work. The authors clearly articulate the limitations of existing approaches: successive elimination methods that only find approximate Pareto sets, the computational intractability of PreTS, and the restriction of previous optimal algorithms to other distributions. The connection to pure exploration with multiple answers and the Frank-Wolfe literature is also well stated.
Clear Presentation: The paper is well-structured with a logical flow from problem formulation through theoretical contributions to algorithm design. The use of illustrative examples aids a deeper understanding of the work. Technical concepts are introduced progressively with appropriate mathematical insights.
Weaknesses:
Limited Experimental Validation: While the COV-BOOST results are compelling, the experimental section would benefit from more ablation studies: (i) a runtime breakdown to validate the computational efficiency claims, (ii) scaling effects with varying K and L, and (iii) ablation studies on other parameters.
Assumptions and Practical Considerations: The assumption of perfect knowledge of preference cones may be restrictive in practice. Maybe a discussion of robustness to misspecified cones would be a nice addition.
Questions
Preference Cone Specification in Practice: What are the real-world models with the preference cone C, and what are their typical values? Also, how sensitive is the algorithm's performance to misspecification of the preference cone C? In real applications, domain experts may have uncertainty about their preferences. Have you considered extensions where the cone is learned or refined during exploration?
Computational Bottlenecks: Despite the theoretical improvements, the conic programs solved per iteration could still be computationally demanding. Would an experimental breakdown of which steps of the algorithm form the time bottleneck make sense?
Limitations
It would be great if the authors added a dedicated limitations section discussing the points touched upon in the Weaknesses and Questions section.
Final Justification
I am happy with the authors' response. I would like to retain the score. I really like the paper as it stands.
Formatting Issues
The paper is structured quite well. No major formatting issues were observed. The paper follows standard mathematical formatting conventions with proper theorem environments, notation, and references.
We thank the reviewer for taking the time to review our work, and for appreciating the novelty and clarity of our contributions. We respond to the questions below, and we include the promised improvements in our updated draft.
- Experimental validation and computational bottleneck:
- a. Runtime of FraPPE: While only using CPUs (without parallel processing and tensorisation), each iteration of FraPPE completes in milliseconds for COVBOOST, for DB [1], and for synthetic data with correlated objectives. These results are obtained by averaging over 1000 iterations.
- b. Runtime of baselines: The implementations available in the literature differ in their setups and programming languages: [2] uses Julia, [3] uses C++. In contrast, to make the code accessible to a wider audience, we chose Python3 for implementation with only three well-known packages: Numpy, CVXPY, and Scipy. Thus, we surmise it would be unfair to compare the wall-clock runtimes of the different baseline implementations. But resonating with the reviewer's remark, we have now conducted a runtime analysis with varying $K$ and $L$ to validate our theoretical claims.
- c. Conic programming: Yes, we agree that conic programming is the computationally costliest module here. However, we believe that conic programs are unavoidable while solving the lower bound, as they are required to find the Pareto set for generic cones. Thus, reducing the per-iteration computational complexity down to the complexity of Pareto set computation has been part of the open problem posed by [2].
- d. Larger dataset, DB [1]: We have tested on all the datasets used in the SOTA paper for exact Pareto set identification, PSIPS [3]. We are currently testing FraPPE on the DB (Disc Brake) dataset, which presents an efficiency and safety optimisation problem in disc brake manufacturing for automotives. In the literature, larger datasets like DB are used only to attain $\epsilon$-approximate Pareto set identification [5,6], while FraPPE aims for exact identification and succeeds according to preliminary results.
- Specification of the cone:
- a. Practicality of knowing $\mathcal{C}$: We assume the preference cone to be known, following the present literature on PrePEx, where SOTA results are mostly available for the positive orthant as the cone. Knowledge of $\mathcal{C}$ is available in multiple real-world systems: polyhedral cones are used in aircraft design optimisation [7], the positive orthant cone is used in portfolio optimisation balancing return, risk, and liquidity [8], and customised asymmetric cones are used in climate policy optimisation with multiple national goals [9].
- b. Learning $\mathcal{C}$: We agree that sequentially learning the cone from multiple experts' feedback while exploring is an interesting problem with its own set of challenges. But if we do not know $\mathcal{C}$, or encounter misspecification, the lower bound of [10] is not valid and we should derive a new lower bound. This is an ongoing study, and we will mention it in future work.
We hope that our response has addressed the concerns and questions. We remain available for further discussion.
References:
[1] Tanabe, R. and Ishibuchi, H. (2020). An easy-to- use real-world multi-objective optimization problem suite. Applied Soft Computing, 89:106078.
[2] Crepon, E., Garivier, A., and M Koolen, W. (2024). Sequential learning of the Pareto front for multi-objective bandits. PMLR.
[3] Kone, C., Jourdan, M., and Kaufmann, E. (2024). Pareto set identification with posterior sampling.
[4] Ararat, C. and Tekin, C. (2023). Vector optimization with stochastic bandit feedback. In PMLR.
[5] Karagözlü, E. M., Yıldırım, Y. C., Ararat, C., and Tekin, C. (2024). Learning the pareto set under incomplete preferences: Pure exploration in vector bandits. In PMLR.
[6] Bradley, R. A. and Terry, M. E. (1952). Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika.
[7] Mavrotas, G. (2009). Effective implementation of the ε-constraint method in Multi-Objective Mathematical Programming problems. Applied Mathematics and Computation.
[8] Ehrgott, M. (2005). Multicriteria Optimization. Springer.
[9] Keeney, R. L., & Raiffa, H. (1993). Decisions with Multiple Objectives: Preferences and Value Trade-Offs. Cambridge University Press.
[10] Shukla, A. and Basu, D. (2024). Preference-based pure exploration. Advances in Neural Information Processing Systems, 37:17313-17347.
Thank you for addressing my concerns. I am satisfied with the responses to my queries
Dear reviewer,
Thanks for the time spent reviewing our work. We appreciate your constructive feedback.
Regards, Authors
Summary: This paper focuses on pure exploration with preference feedback, which identifies the best or an $\epsilon$-optimal item using duels rather than numeric rewards, under standard stochastic preference models. To address this challenging problem, the paper proposes FraPPE, which (i) adaptively focuses comparisons on an active set of contenders, (ii) chooses informative pairs via confidence-interval geometry to accelerate elimination, and (iii) maintains computational efficiency with incremental updates and pair selection. A theoretical performance analysis of FraPPE is presented. The instance-dependent sample complexity scales with natural hardness measures of the preference matrix. Finally, experimental results are presented to validate the performance of FraPPE.
Strengths:
- The problem itself is interesting and important as pure exploration with pairwise feedback is central to ranking, recommendation, and RLHF-style evaluation.
- The design of the active set and the informative pair selection are elegant and simple to implement in practice.
- The performance of FraPPE is theoretically guaranteed with formal $\delta$-correctness and instance-dependent complexity bounds. The analysis highlights the role of preference gaps.
- The performance of FraPPE is further validated via experiments including comparisons to baselines in several key metrics.
Weaknesses:
- The performance guarantees rely on standard stochastic preference assumptions. This paper could further discuss robustness when these are violated.
- While instance-dependent bounds are valuable, this paper could clarify the gap to known minimax lower bounds (tightness) and when FraPPE might be suboptimal in adversarially hard instances.
Reasons for acceptance: This paper demonstrated clear contributions from both theory and practice. The proposed FraPPE advances the state of the art for preference-based pure exploration with a design that is simultaneously fast, sample-efficient, and analyzed.
Discussion & rebuttal assessment: Reviewers reached a consensus that this is solid work with an interesting problem, nice algorithm design, theoretically sound performance guarantees, and experimental validation.