PaperHub
Overall rating: 6.0/10 (Poster; 4 reviewers; min 4, max 7, std. dev. 1.2)
Individual ratings: 4, 7, 7, 6
Confidence: 3.3 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.0
NeurIPS 2024

Preference-based Pure Exploration

Submitted: 2024-05-15 · Updated: 2025-01-16
TL;DR

A new algorithm for vectorial optimization under bandit feedback

Abstract

We study the preference-based pure exploration problem for bandits with vector-valued rewards and a set of preferences imposed over them. Specifically, we aim to identify the most preferred policy over a set of arms according to the preferences induced on the reward vectors by an ordering cone $C$. First, to quantify the impact of preferences, we derive a novel lower bound on the sample complexity for identifying the most preferred arm with confidence level $1-\delta$. Our lower bound shows how the geometry of the preferences and reward vectors changes the hardness of this problem. We further explicate this geometry for Gaussian distributions of rewards, and provide a convex reformulation of the lower bound solvable with linear programming. Then, we leverage this convex reformulation of the lower bound to design the Track and Stop with Preferences (TSwP) algorithm that identifies the most preferred policy. Finally, we derive a new concentration result for vector-valued rewards, and show that TSwP achieves a matching sample complexity upper bound.
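As an editorial illustration of the cone-induced preference order described in the abstract: the classic Pareto case takes the ordering cone $C = \mathbb{R}^d_{\geq 0}$, under which one mean vector is preferred to another iff it dominates it componentwise. The helper names below are assumptions for this sketch, not from the paper:

```python
import numpy as np

def dominates(mu_a, mu_b):
    """True if mu_a is preferred to mu_b under the orthant cone C = R^d_{>=0}:
    componentwise >= with at least one strict inequality (Pareto dominance)."""
    diff = np.asarray(mu_a) - np.asarray(mu_b)
    return bool(np.all(diff >= 0) and np.any(diff > 0))

def pareto_front(means):
    """Indices of arms whose mean vectors are dominated by no other arm."""
    return [i for i, m in enumerate(means)
            if not any(dominates(other, m)
                       for j, other in enumerate(means) if j != i)]

means = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
         np.array([0.5, 0.5]), np.array([0.2, 0.2])]  # last arm is dominated
print(pareto_front(means))  # -> [0, 1, 2]
```

A general ordering cone replaces the orthant and can make more (or fewer) pairs of arms comparable, which is exactly the geometry the lower bound captures.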
Keywords

Pure exploration, multi-armed bandits, vector-valued rewards, preferences

Reviews and Discussion

Review (Rating: 4)

This work focuses on the setting of multi-armed bandits with vectorized rewards and studies the identification of the Pareto optimal arms with a fixed confidence. Compared with existing works, this work considers a more general preference definition (i.e., induced by a preference cone) and targets finding the entire set of Pareto optimal arms. A lower bound is derived, providing both statistical and geometrical implications of the hardness of the problem. Based on the lower bound, a track-and-stop style algorithm is proposed, with its efficiency proved by a corresponding upper bound.

Strengths

  • The studied problem itself is novel and well-motivated, given the commonly occurring scenarios with multiple objectives. According to the summarized literature, there have been no efforts in the exact problem setup studied here.
  • The overall flow is clear, and the techniques adopted are reasonable and intuitive, extending the studies from single-objective BAI: starting with the lower bound, and based on it, leveraging track-and-stop style algorithms.
  • Despite certain confusion about the notations/definitions (see weakness), I believe the overall results should be sound.

Weaknesses

  • First, I believe the overall paper needs careful proofreading and some revisions to ensure the readers’ correct understanding. For example, (1) Definition 3: $\mu$ and $\mu'$ are not labeled clearly; (2) the maximizations in Equation 1 and subsequent sections are confusing per my understanding (please let me know if I missed the definition somewhere); (3) Eqn. 1, lines 231, 233: sometimes there is a transpose over $M$ but sometimes not; (4) Eqn. 6: is $M$ a vector on the right-hand side? Also, the minimization is noted over $\hat{M}_t$ while the expression seems to be over $M$.
  • Second, while the adopted techniques are sound based on my intuition, it seems they largely follow the original track-and-stop paper. The authors have made efforts to highlight the challenges and differences (e.g., the discussions beneath the lower bound, and the distances introduced in Sec. 5 to accommodate Pareto fronts). However, the overall idea is still the classic track-and-stop one.

Questions

  • I would greatly appreciate some clarification over my confusion on notations listed in Weakness.
  • Also, the algorithm design begins with a convexification of the lower bound, which I have several questions about. (1) Is this design purely to benefit computation? In other words, setting computation aside, can we directly track the lower bound and obtain an asymptotic upper bound? (2) As the final performance approaches the convexified characteristic time, I am wondering whether it is possible to compare it with the true lower bound so we can understand how much loss is introduced by the convexification (i.e., a tradeoff between computation and performance).

Limitations

I understand that this work is of a theoretical nature so there are no major societal impacts to be discussed. However, I do not find justifications for related questions in the checklist (e.g., 1. Claims, 2. Limitations, 3. Theory Assumptions and Proofs).

Author Response

We thank the reviewer for the time spent reviewing and for pointing out several notational issues to improve the manuscript. We have proofread the paper and fixed the typos and issues. We address the other concerns here.

  1. Novelty with respect to Track and Stop: As pointed out by reviewer E3uE: "While the overall structure closely follows the track-and-stop literature, there seems to be a lot of technical novelty in its extension to the pareto front setting. The characteristic time has a much more complicated geometric structure, which is reflected in the lower bound. New concentration inequalities need to be derived for the pareto frontier estimation (along with a novel distance)." We would like to build upon this observation and highlight the salient features of our contribution and the relation to existing work.

  2. Purpose of convexification: Convexification serves several purposes, not limited to providing computational benefits. Without optimising over the convex set, we cannot guarantee that the algorithm is asymptotically optimal. In particular, claim (4) of Theorem 4 will not hold. Therefore, tracking the lower bound as per the Track-and-Stop strategy will not lead to asymptotic optimality. There are other key points which will not hold if we do not optimize over the convex hull. For example, we cannot directly use Donsker-Varadhan in the proof of Lemma 6, but instead have to prove compactness and other properties of the original non-convex set.

  3. Convexification of Characteristic Time: In the setting with Gaussian bandits, as we show in Theorem 2 of the paper, nothing is lost, since we are able to obtain the exact form of the hardest instance analytically. In the non-Gaussian setting, we need to conduct a separate study, both computational and theoretical, in order to study the gap between solutions of the original optimization problem and the convex relaxation. It will non-trivially depend on the interaction of the preference cone and the space of confusing instances. This would be an interesting avenue for future work. Further, please see lines 316-324 in the updated paper.

  4. Explaining the maximisation problem in Equation (1): For better understanding, we explain the maximisation problem of Equation (1) as "Alternatively, this vector optimization problem can be represented in the policy space as finding a policy $\pi \in \Delta_K$ supported on the Pareto optimal set of arms. This is given by..." (lines 189-191 in the revised manuscript). Let us know if that clarifies your concerns.

We hope that our rebuttal addresses your concerns and questions. We are eager to discuss if you have further queries.

Comment

Thank you for the responses.

  1. The authors’ response is quite vague and the additional techniques beyond the original Track and Stop remain unclear. I would encourage the authors to better highlight the novelties, especially compared with TaS.

  2. I think the purpose of convexification needs further clarification. If it is for computational purposes, it should be positioned that way and have discussions on the results, assuming an ideal computational oracle. Suppose it is for statistical purposes (i.e., even with a computational oracle, the proposed algorithm still cannot work). In that case, the illustrations should focus on this perspective and mention the computational benefits as an additional one. From the paper, the main discussions are on the computational purpose (line 298, line 322, etc). However, from the response, it seems there are also statistical purposes in terms of the proofs. This confusion should be better clarified in the paper.

  3. Given the loss from the convexification, I believe it is not rigorous/fair to call the algorithm "asymptotically optimal" (e.g., line 93, line 391) or to claim a "matching sample complexity upper bound" (line 13). These statements may only be made for Gaussian bandits, if I understand correctly.

  4. Thank you for the clarifications.

Given these factors, I decided to keep my score.

Comment

(1) Novelty with respect to TaS: We believe there is some confusion regarding the novelty with respect to TaS, and we would really like to clarify it.

Track-and-Stop provides a generic wrapper for designing lower bound-based pure exploration algorithms. But for the preference-based pure exploration problem, the lower bound is significantly different from that of BAI. Thus, we have to derive a new lower bound (and a "good" relaxation) and propose multiple components to design PreTS, which were absent in the TaS paper and the following literature.

We enlist these components here:

1.1) Stopping rule: The standard Chernoff-based stopping rule (Eq. (6) of Kaufmann et al., JMLR, 2016) does not work for the PrePEx problem, as here the term inside the KL needs to account for the preferences (Equation (8), line 338). Thus, we propose a new preference-aware stopping rule, which requires deriving a novel concentration inequality (Theorem 5). These design contributions and their theoretical validity (one of the main contributions of the paper) are highlighted in lines 331-351. This is a contribution to both the design and the analysis, over and above the existing TaS algorithms.

1.2) Use of convexified set for tracking: Creating the convex hull of alternating instances, and further using it for tracking and computing allocations, is novel; it has not been needed in existing BAI techniques. We dedicated Section 4.1 to this and, following your previous feedback, have now also added a commentary on the need and impact of convexification.

1.3) Analysis of PreTS: The analysis of PreTS (Section 5) involves several new theoretical techniques and results: (a) defining a distance metric between Pareto fronts instead of the Hausdorff metric (Definition 8), (b) showing concentration of the good event with respect to this metric (Lemma 4), and (c) a preference-dependent concentration of the difference of means (Lemma 2). We extend the discussion in lines 91-94 to highlight these components.
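For readers less familiar with Chernoff/GLR-type stopping rules like the one referenced in 1.1, the classic single-objective version can be sketched for two-arm, unit-variance Gaussian BAI. This is an editorial illustration with made-up numbers, not the paper's preference-aware rule; `glrt_statistic` and the crude threshold are assumptions:

```python
import numpy as np

def glrt_statistic(mu_hat, counts):
    """GLR stopping statistic for two-arm, unit-variance Gaussian BAI:
    the minimal 'transportation cost' to a confusing instance in which
    the empirical best arm is no longer best."""
    a = int(np.argmax(mu_hat))          # empirical best arm
    b = 1 - a                           # the other arm
    n_a, n_b = counts[a], counts[b]
    # The closest confusing instance moves both means to their weighted average,
    # since KL(N(x,1) || N(y,1)) = (x - y)^2 / 2.
    bar = (n_a * mu_hat[a] + n_b * mu_hat[b]) / (n_a + n_b)
    return n_a * (mu_hat[a] - bar) ** 2 / 2 + n_b * (mu_hat[b] - bar) ** 2 / 2

mu_hat = np.array([1.0, 0.5])
counts = np.array([50, 50])
stat = glrt_statistic(mu_hat, counts)
threshold = np.log(1.0 / 0.05)  # crude log(1/delta); practical rules add log-log terms
print(stat, stat > threshold)   # -> 3.125 True
```

In the preference-based setting, per the response above, the KL term must additionally account for the cone-induced preferences, which is why a new rule and concentration inequality are needed.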

We hope that this exposition of the non-trivial changes in algorithm design and analysis clarifies your concern regarding the novelty of PreTS with respect to TaS. Based on your previous feedback, we have further added an explicit remark in the main paper highlighting these novelties.

(2) A case for convexification: The advantages of convexification are both statistical and computational in nature. A thorough discussion of the computational benefits/tradeoffs with existing approaches was avoided since this was not the main objective of the paper, and it is more numerical and optimization-based in nature. Since this paper is mostly theoretical and statistical in nature, and we had to explain several novel design and analysis ideas, we feel that computational discussions are best left for a separate work. We have now added a discussion at the end of Section 4.1 to clarify the need and impact of convexification. We quote it here verbatim:

"Discussion: Cost and Need of Convexification. For Gaussian bandits, as we can get the analytical form of the most confusing instance $\tilde{M}$ (Theorem 2), we do not pay any extra cost for convexification. In non-Gaussian settings, where we cannot find such analytical forms for the most confusing instances, the minimum value of the inner minimisation problem over the convex hull (Equation (5)) can go lower than the minimum value found in the original non-convex set of instances (Equation (3)). Thus, the characteristic time attained by solving the convex relaxation might be higher than or equal to that of the original lower bound (Equation (3)). Hence, an algorithm solving the convex relaxation might have a higher stopping time. But convexification is essential for the computational feasibility of a lower bound-tracking algorithm for PrePEx, and also to prove Theorem 4, which is essential to the statistical design of a tracking-type algorithm. In future, it will be interesting to study this computational-statistical trade-off."
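The gap described in the quoted discussion can be seen in a toy numerical example (editorial, not from the paper): a quadratic objective minimized over a two-point non-convex set versus its convex hull, whose midpoint attains a strictly smaller value.

```python
import numpy as np

# A toy non-convex feasible set: two isolated points in R^2.
S = np.array([[0.0, 2.0], [2.0, 0.0]])

def f(x):
    # Quadratic 'divergence-like' objective of the kind appearing inside a lower bound.
    return float(np.sum((np.asarray(x) - np.array([1.0, 1.0])) ** 2))

min_over_S = min(f(p) for p in S)   # minimum over the original non-convex set
midpoint = S.mean(axis=0)           # (1, 1): lies in conv(S) but not in S
min_over_hull = f(midpoint)         # the hull admits a strictly smaller value

print(min_over_S, min_over_hull)    # -> 2.0 0.0
```

A smaller inner minimum translates into a larger (relaxed) characteristic time, matching the statement that the relaxation can only increase the stopping time.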

(3) Thanks, we agree with this observation. We remove the qualifier "asymptotically optimal" for PreTS from the paper. We edit the phrase "matching sample complexity upper bound" in line 13 to "a convex relaxation of the lower bound as the sample complexity upper bound". We also edit the other three mentions (lines 93, 381, 391) similarly, mentioning that PreTS tracks a convex relaxation of the lower bound and achieves the corresponding sample complexity for general reward functions.

Let us know if this response clarifies your concerns. We would be happy to elaborate further if it does not. If we have been successful in clarifying them, we would ask you to kindly reconsider your score. Thanks again for reviewing our work.

Comment

We thank you again for your review and the following interactions. We are going to include the relevant discussions in the final version of the paper.

We hope that our responses have addressed all your concerns within the scope of the current paper, and we would really appreciate it if you could adjust your score accordingly.

Review (Rating: 7)

This paper studies the preference-based pure exploration problem for bandits with vector-valued rewards and a set of preferences imposed over them. The objective is to identify the most preferred policy over a set of arms according to the preferences induced on the reward vectors by an ordering cone C. The technical contributions are threefold. First, a lower bound on the sample complexity for identifying the most preferred arm with confidence level $1-\delta$ is proved. The lower bound shows how the geometry of the preferences and reward vectors changes the hardness of this problem. This geometry is further explicated for Gaussian distributions of rewards, and a convex reformulation of the lower bound solvable with linear programming is provided. Second, this convex reformulation of the lower bound is utilized to design the Track and Stop with Preferences (TSwP) algorithm that identifies the most preferred policy. Third, a new concentration result for vector-valued rewards is proved, and it is shown that TSwP achieves a matching sample complexity upper bound.

Strengths

This paper establishes a lower bound on the sample complexity for identifying the most preferred arm with confidence level $1-\delta$. This lower bound reveals the hardness of PrePEx problems. This lower bound also shows how the geometry of the preferences and reward vectors changes the hardness of this problem.

It is shown that the optimization problem in the lower bound involves minimization with a provably non-convex set. A convex reformulation of the problem based on ideas from disjunctive programming is provided. This convex reformulation of the lower bound is utilized to design the Track and Stop with Preferences (TSwP) algorithm that identifies the most preferred policy.

A new concentration result for vector-valued rewards is proved. It is shown that TSwP achieves a matching sample complexity upper bound.

Weaknesses

I only identified some minor writing issues, such as some missing citation numbers, e.g., line 21.

Questions

No.

Limitations

No.

Author Response

We would like to thank the reviewer for appreciating the strength of our contributions. We have rectified the errors, typos, and notational issues in our revised manuscript.

Comment

Many thanks for the response. I would like to maintain my score.

Review (Rating: 7)

This paper generalizes the track-and-stop style of best arm identification analysis and algorithm design to the setting of vector-valued bandit problems where the Pareto frontier must be found. Novel upper and lower bounds are proposed, as well as a convex relaxation of the lower bound that yields an implementable algorithm.

Strengths

While the overall structure closely follows the track-and-stop literature, there seems to be a lot of technical novelty in its extension to the pareto front setting. The characteristic time has a much more complicated geometric structure, which is reflected in the lower bound. New concentration inequalities need to be derived for the pareto frontier estimation (along with a novel distance).

The explanations are clear and presented in a coherent order. The comparisons with previous track-and-stop papers provide a convenient frame for understanding the novelties of this paper.

Weaknesses

I wish the authors would describe the technical novelties e.g. in the proof techniques in more detail in the main body. Some interesting examples illustrating the geometric insights would greatly aid the reader’s understanding. Finally, the paper needs a thorough proofread - there are many broken references, misplaced or missing periods (especially at the end of equations), etc.

Questions

What is lost with the convex relaxation? What happens in problem instances where the optimal action is not in the relaxed region?

Limitations

N/A for societal impact, but I found the discussion on limitations mostly absent.

Author Response

We thank the reviewer for acknowledging the strengths and novelty of the contributions along with pertinent questions regarding the computational approach. We respond to them here.

  1. Cost of Convex Relaxation: In the setting with Gaussian bandits, as we show in Theorem 2 of the paper, nothing is lost, since we are able to obtain the exact form of the hardest instance analytically. In the non-Gaussian setting, where we cannot find such an analytical form for the confusing instance, we observe that the minimum value of the inner minimisation problem over the convex hull can be lower than the minimum value found in the original non-convex set of instances. Thus, the characteristic time attained by the algorithm using the convex relaxation can be higher than that of the original lower bound, and an algorithm solving the convex relaxation might have a higher stopping time. Further, please see lines 316-324 in the updated paper.

  2. What happens in problem instances where the optimal action is not in the relaxed region? Since our approach takes the convex hull of the feasible set, which is a super-set of the true set of alternating instances, the optimal action set always lies in this set.

  3. Technical Novelties and Insights: The technical novelties of this work can be categorised into three parts: (i) lower bound: taking the policy space perspective, (ii) algorithm design: convexification of the lower bound and a new stopping rule and (iii) proving asymptotic optimality of PreTS requires coming up with a new distance metric and establishing concentration inequalities under these metrics.

We hope that our response addresses your questions. We would be glad to respond if you have any further query.

Comment

Thanks for the response. I remain supportive of acceptance.

Review (Rating: 6)

The paper considers a generalization of fixed-confidence best arm identification. Specifically, in this setting we have a collection of arms, each with a mean vector associated with it. Further, we have an ordering that establishes a preference over the vectors. The objective is to identify the Pareto front according to the preference model. The paper establishes lower bounds, designs an algorithm based on a convex relaxation, and establishes upper bounds on its sample complexity.

Strengths

-The proposed problem is interesting and relevant.

-I did not verify the correctness of the theorems, but the techniques seem involved with lower bounds being established.

-One can see possible extensions to this setting where we learn a Pareto set.

Weaknesses

1. The major weakness of the paper is the presentation:

(A)There are many broken references.

(B) Definition 3 for the partial order: twice the definition says $\mu \leq_C \mu$; I think the prime is missing after $\mu$.

(C) On page 4, at times the paper uses $C$ and at others $\mathcal{C}$, but aren’t they both referring to the cone?

(D) "$c_1=$" in line 334 is not given. Further, I think Lemma 1 should say "There exists a constant $c$" instead of $c_1$.

(E) no space between words in lines 112-116: “settingby”, “)and”

2. The paper is a bit confusing, since not the whole Pareto front needs to be identified, only a subset of it, correct? At the end of Section 2, what does “active support” mean? How is it different from just saying support?

3. What is $\bar{\mathcal{T}}_{\mathcal{F}}$ in Theorem 6? Does this imply that the algorithm achieves almost optimal performance?

Questions

See the points under Weaknesses.

Limitations

I think the limitations were addressed.

Author Response

We would like to thank the reviewer for pointing out several avenues for improving our work. We address the concerns below.

Editorial edits: We have revised our manuscript to rectify all the errors and typos as mentioned by the reviewer.

Identification of the Pareto Front: This paper is about identifying the entire Pareto front. This resonates with the line of literature paved by "Sequential Learning of the Pareto Front for Multi-objective Bandits", Crepon, Garivier, Koolen, AISTATS 2024, and "Pareto Front Identification from Stochastic Bandit Feedback", Auer et al., AISTATS 2016. They consider similar problems but without generic preferences and linear bandit feedback. A similar but not identical problem was considered in "Adaptive algorithms for relaxed pareto set identification", Kone et al., NeurIPS 2023, wherein an approximation of the Pareto set has to be identified. Our problem is a non-trivial generalization of those papers in the sense that we consider generic preferences (described through a preference cone) and linear bandit feedback.

Terminology: Thank you for pointing out. The word "support" suffices for our purposes.

Asymptotic Optimality and Theorem 6: $\mathcal{T}_{\mathcal{F}}$ is the characteristic time associated with the PrePEx problem. You are indeed correct: the claim of Theorem 6 is that the proposed algorithm is asymptotically optimal.

We hope that our response addresses the reviewer's questions. Let us know if you have any other query.

Comment

Thanks again for reviewing our paper. As the review period is coming to an end, please let us know if our response has been able to address your concerns. If yes, we would be grateful if you reconsider your assessment.

Author Response

We would like to thank the reviewers for providing several valuable comments to improve the presentation and writing of our manuscript. We have incorporated those comments and are uploading a revised version here.

Comment

Dear Reviewers,

Thank you for reviewing our paper and providing feedback. Please let us know if our responses have been successful in clarifying your concerns.

As the discussion period is running out, we would really appreciate it if you could communicate any further concerns and comments. We are eager to engage in discussion and to use the remaining time to address them.

Thanks and regards, Authors of Paper 11288

Final Decision

The paper reports a study on preference based pure exploration, with vector based rewards on arms and a set of preferences over these vectors. The overall opinions received from the reviewers are positive. The reviewers acknowledge the novelty of the problem formulation and the technical contribution on the lower bound and upper bound results. I recommend acceptance of the paper, and encourage the authors to revise the paper thoroughly to address all comments from the reviewers.

One thing I would like to point out is that during the rebuttal stage, the authors submitted a revised paper pdf file. As specified by the FAQ (https://neurips.cc/Conferences/2024/PaperInformation/NeurIPS-FAQ) of NeurIPS'2024, this is not allowed. After consulting with the SAC, I informed all the reviewers that the submitted pdf file should be ignored and that all discussions and the decision are based on the textual comments and replies, not on this pdf file. I suggest the authors carefully check the rules of the conference for submission and rebuttal and comply with them in the future.