5.8

/10

Rejected4 位审稿人

最低5最高7标准差0.8

2.5

置信度

正确性2.3

贡献度2.5

表达2.3

NeurIPS 2024

Optimal, Efficient and Practical Algorithms for Assortment Optimization

Aadirupa Saha,Pierre Gaillard

OpenReview PDF

提交: 2024-05-12更新: 2024-11-06

摘要

关键词

Active online assortment optimizationPreference feedbackSubsetwise utility maximizationAssortment selection algorithmsPlackett Luce modelRegret minimizationPairwise Rank-BreakingConcentration guaranteePractical algorithmsEmpirical evaluations

评审与讨论

审稿意见

评分: 6置信度: 22024-07-01

This paper considers the Adaptive Optimal Assortment (AOA) problem a.k.a. Utility Maximization with Subset Choices. The goal of this problem is to find the optimal profit-maximizing subset of size up to m (Top-m-Objective) or its weighted variant (Wtd-Top-m-Objective). Given a selected subset, the feedback follows the Plackett-Luce (PL) choice model that returns an item from the subset or a "no-choice" option. The probability of choosing each item is proportional to their underlying score/utility values.

The paper proposes a new algorithm, AOA-RB, that is claimed to be practical, efficient, and optimal. Compared to previous works, this algorithm does not require sampling the same subset repeatedly nor assumes a strongest default item. Later, the authors extend this algorithm with adaptive pivots that further improves performance.

The theoretical analysis shows that AOA-RB obtains regret guarantees that build on a novel "Rank-Breaking" parameter estimation technique for the discrete choice model.

The performance of AOA-RB is further demonstrated in numerical experiments using synthetic datasets.

优点

Problem Statement

Clear presentation and motivation. Easy-to-follow section.

Algorithm

Clear strengths are the relaxation of previous assumptions, e.g., repeated sampling of the same subset or the assumption of a strong default item
The algorithm is well-presented and easy-to-follow.
The adaptive pivot extension of the AOA-RB is a clear improvement that provides significant improvements

Theoretical Analysis

The new concentration lemmas in Section 3.2. are claimed to be novel by the authors.
Regret guarantees are provided for both objectives. The main strength is Theorem 6 which analyses the regret of the adaptive pivot version of the algorithm and shows a regret bound that does not blow to $\infty$ in corner cases.

Experiments

The numerical experiments section further demonstrates the performance improvement of AOA-RB over the state-of-the-art MNL-UCB algorithm. It highlights especially the benefits of the adaptive pivots.

缺点

Introduction, Related Works, and Contribution

Certain claims are not supported, e.g., Line 21 "Studies have shown that it is often easier..." but it lacks citation which studies the authors refer to.
I found some citations to be misplaced or non-supportive of the claims it is used for, e.g., [11] is used in Line 62 as a reference for Battling Bandits while it is a survey of dueling bandits. Similarly, citations [45, 46] are used for dueling bandits while they are only two examples from the literature. It would be great if authors could use consistent citations, e.g., surveys when they refer to broader literature and individual publications when specifics are important.
Table 1 is provided for the comparison of regret guarantees but the authors do not describe it. It would be great if they could comment on the differences between the algorithms.

Problem Setting

Limitations are not mentioned in the problem statement. For example, how restrictive is the Plackett-Luce model, and whether the approach could be extended to other models? I see that it is mentioned in Remark 1 but could be commented on in Section 2 as well.
Both Top-m and Wtd-Top-m consider the (weighted) utility optimization problem. However, for most of the applications used as motivation, e.g., assortment optimization and recommender systems, the utility of the user which dictates the selected feedback, and the utility/profit of the subset selection (platform) are misaligned. Could the authors comment on how to formulate these problems in their setting?

Algorithm

The $argmax_{S\subseteq [K], |S|\leq m}$ optimization is non-trivial and could be computationally expensive for large values for $K$ .
The authors claim that AOA-RB is practical, efficient, and optimal. While the theoretical analysis supports the last two claims, I struggle to find the intuition behind the algorithm. Could the authors elaborate further on this point?

Experiments

Numerical experiments demonstrate performance only in synthetic data. Given the clear application and motivation of the paper, I would like to see experiments that reflect these problems.
I recommend the authors to use larger figures. Axes and titles are hardly visible in the printed version.
Only one baseline is considered. It would be appreciated if the authors could include the other algorithms mentioned in Table 1 for numerical comparison besides the theoretical one.

While the paper is easy to read and follow even for readers not familiar with all the works in the area, the inconsistent citations and unsupported claims have to be addressed before the paper would reach publication standards.

问题

It would be appreciated if the authors could comment on the questions outlined in the Weaknesses section. Some further questions are the following:

It is not evident to me how the proposed framework applies to the application areas described in the Introduction section. Could the authors further comment on it and formulate some more rigorously for demonstration?
The paper compares its performance closely to MNL-UCB [2]. Could the authors further elaborate on the similarities and differences between their proposed algorithms and the MNL-UCB?

局限性

Limitations are mentioned in the paper, however, it is often not directly connected, e.g., the assumption of the PL model is only addressed in Remark 1. I would suggest the authors address limitations more clearly when they appear for easier readability. The work is mainly theoretical without any immediate direct societal impact.

评论- Rebuttal by authors

2024-08-07

Thanks for your review

Weaknesses --

Q1. Non supported claims in line 21

-- Sudies in brain, psychology and cognitive neuroscience corroborate the fact. We will add references such as:

Kahneman and Tversky. The psychology of preferences. Scientific American 1982
Musallam, Corneil, Greger, Scherberger, and Andersen. Cognitive control signals for neural prosthetics. 2004.

Q2. Reference issues

About [11]. We believe that [11] is a valid reference for Battling bandits (BB). First, DB is a special case of BB. More importantly, if you note Tables 8,9 of [11] that mentions BB algorithms in details, in general Sec 6 of [11] elaborately discusses BB.
Note also that [45,46] are valid references for DB, in fact, we cited [41, 3, 45, 46, 44] as DB in Line 31, where we introduced DB for the first time.
About using consistent citations. Thanks for the suggestion, will make sure to cite accordingly.

Q3. Differences between the algorithms in Table 1

-- See global rebuttal

Q4. Generalization to other models

-- See global rebuttal

Q5. Misalignment between platform and user utilities

-- Please note we have a concept of revenue/weights ( $r_i$ for each item $i \in [K]$ ) that captures the utility of the sellers, while $\theta_i$ captures the utility of the customers. So the MNL formulation does consider the utility of both users and the customers.

However, if the reviewer is asking how to develop an algorithm that simultaneously offers the `best-valued' assortment to the customers while maximizing the revenue of the sellers, that can be either (i) formulated as a multi-objective problem that balances the tradeoff between both utility notions or (ii) simply assume the seller weights/reviews ( $w_i$ s) are an increasing function of the customer scores/values ( $\theta_i$ s), in which case maximizing one notion will, in turn, maximize the other with our Wtd-Top-m ( $Reg_T^{wtd}$ ) objective. Please let us know if you have any other questions in mind

Q6. Computation cost of the argmax

-- Please note this step is efficient since it is just a static assortment optimization problem under MNL model with known parameters, as already established in the literature:

Avadhanula, Bhandari, Goyal, Zeevi. 2016. On the tightness of an LP relaxation for rational optimization and its applications. Operations Research Letters.
Rusmevichientong, Shen, DShmoys. 2010. Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research.

Thanks for the comment, we will add the citations in the final draft.

Q7. Intuition behind the algorithm

-- Please see the global rebuttal

Q8. Experiments reflecting the problem

-- We are unaware of any open-source datasets suitable for validating our algorithms in this specific problem setting. The challenge lies in the inability to know the ground truth parameters $(\boldsymbol{\theta})$ of the MNL model, making it difficult to evaluate regret performance. This is why prior works, such as [1] and [24], either did not report experiments or only used synthetic data. While [2] claims to use the UCI Car Evaluation Dataset, they actually conducted a synthetic experiment by modifying item features and assigning synthetic scores to the PL parameters (see Sec 7.3 [2]).

Also our primary focus has been theoretical and experiments support our theory. If the reviewer knows of any suitable datasets, we would be happy to report comparative performance on those.

We will enlarge the figures. Thanks for the suggestions

Q9. Other algorithms in the experiments

-- As we elaborated in the global rebuttal, the algorithms of [2] and [24] for our problem setting are the same. Further [1] is only a TS variant of [2] with an identical algorithm structure which thus suffers from the same issues. We will be however happy to add MNL-TS in our experiments.

Questions --

Q10. How the framework applies to the applications in the Intro?

As per the examples discussed in Sec1, e.g. Web search, online shopping 36 (Amazon, App stores, Google Flights), recommender systems (Youtube, Netflix, Google News/Maps, Spotify), typically involve users expressing preferences by choosing one result from a subset of offered items in terms of a purchase or click which we capture thought the MNL-choice model.

On the other hand, the system (algorithm) designer, which could be the platform or the seller, may want to converge to the ‘most-profitable’ or `revenue-maximizing' set, which is captured through our regret objective: Top-m ( $Reg_T^{top}$ ) could be relevant for Google, Netflix, Spotify, News/Maps Recommendation etc (to maximize click through rates), while Wtd-Top-m ( $Red_T^{wtd}$ ) could be relevant for Amazon, Walmart, Google Flights, App Stores where the objective is to maximize the total revenue.

Q11. Comparing MNL-UCB [2]

-- Pls see Q3 or Global Rebuttal

We urge you to kindly reconsider the score based on the rebuttal

评论- Requesting your feedback

2024-08-09

Dear Reviewer L6d3,

I am writing you to kindly request your feedback on our rebuttal. As the discussion phase is currently underway and we have a limited timeline, I am eager to address any remaining concerns you may have. We believe to have answered all your queries and are happy to provide any further clarifications in the hope of potentially raising your scores.

Requesting you to kindly engage in a discussion at your earliest convenience.

Thank you for your consideration of this matter.

Thanks, Authors

2024-08-09

Dear Authors,

Thank you for providing further details and answers to my questions.

Q7: In the general rebuttal, you refer to Q3 of R2 but I do not see the rebuttal for R2. Could you provide further details for me as well about the intuition behind $\theta_{i,t}^{ucb}$ ? As you claim in the rebuttal, your main contribution lies here therefore I would be curious.

评论- Contd. Follow-up clarification for Reviewer-L6d3 (Intuition behind $\theta_{i,t}^{ucb}$)

2024-08-10

While the above discussion establishes the novelties of Alg1, we also observe a potential limitation of our Alg1 that paved the path for our final algorithm Alg2 (AOA-RBPL-Adaptive,Sec4). But it will be worth understanding the limitation of Alg1 first before we proceed to the intuition behind Alg2.

Limitation of Alg1 (for certain realizations of MNL parameters $\boldsymbol{\theta}$ ): While Alg1 carves out the basic building block of our $\theta_{i,t}^{ucb}$ , one caveat lies in its concentration rate which shrinks at the rate of $O(1/\sqrt{n_{i0,t}})$ --- as established is Lem8, $n_{i0,t}$ being the number of rank-broken pairs of $(i,0)$ till time $t$ (Line174-175). Thus ideally we would need $n_{i0,t}$ to grow fast for a fast and tight convergence of $\theta_{i,t}^{ucb}$ to $\theta_i$ . However, as Lem2 reflects, this might not be true unless either $0$ (NC) or $i$ is a `strong item' with sufficiently large MNL parameters (comparable to $\theta_{\max}$ ). This is since, as Lem2 shows $n_{i0,t}$ (roughly) grows proportional to $\frac{(\theta_i + \theta_0)}{\theta_{\max}}$ which could be quite small if both Item-i and 0 (NC) happens to be a "weak-items" in terms of their MNL scores, i.e. $\max\{\theta_i,\theta_0\} << \theta_{\max}$ . If we think, this is also intuitive since, if both $\theta_i$ and $\theta_0 = 1$ are small compared to $\theta_{\max}$ , chances are very low that either of them will be picked as a winner of any round $t$ , even if $i \in S_t$ and as a result they will never be "rank-broken" against each other resulting in a very small value of $n_{i0,t}$ , weaker UCB estimates $\theta_{i,0}^{ucb}$ and finally a weighted regret bound of $Reg_T^{wtd} = \tilde O(\sqrt{\theta_{\max} KT})$ (Thm5) which could be large when $\theta_{\max}$ is large!

Understanding the problem, we remedy this with Alg2 (AOA-RBPL-Adaptive) in Sec4, where we devised a smarter UCB estimate of $\boldsymbol{\theta}$ , while still keeping the basic intuitions from Alg1 (AOA-RBPL) intact:

$\boldsymbol{\theta}\_t^{ucb}$ justification for Alg2: As explained above, we realized "pivoting" on the NC for estimating $\theta_i$ could be a bad idea, especially if $1 = \theta_0 << \theta_{\max}$ . Towards this, we made the following (seemingly interesting) observations:

(1) We first note that: $\theta_i = \frac{\theta_i}{\theta_0} = \frac{\theta_i}{\theta_j} \frac{\theta_j}{\theta_0}$ for any $j \in [K] \setminus \\{i\\}$ .
(2) Then drawing motivation from the UCB estimates of Alg1, we further set $\theta_{i,t}^{ucb} = \gamma_{ij,t}^{ucb}\gamma_{j0,t}^{ucb}$ , where $\gamma_{ij,t}^{ucb} = \frac{p_{ij,t}^{ucb}}{(1 - p_{ij,t}^{ucb})_+}, ~\forall i,j \in [K]\cup\{0\}$ (display after Line252).
(3) The hope is if we can find a "strong element j" and pivot our rank-broken pairwise estimates ( $p_{ij,t}^{ucb}$ s) around $j$ for all $i \in [K] \cup \{0\}$ , hopefully the that will remedy the caveat of Alg1 as detailed above.
(4) To find such a "strong pivot j" (such that $\theta_j$ is comparable to $\theta_{\max}$ ), we set a dynamic $j = \arg\min_{j \in [K] \cup\{0\}} \gamma_{ij,t}^{ucb} \gamma_{j0,t}^{ucb}$ for each item $i \in [K]$ and time $t$ , which by definition happens to be a relatively stronger item than Item- $i$ and 0 (NC).
(5) Further, similar to Lem1, we also find the rate of concentration of our new UCB estimates $\theta_{i,t}^{ucb}$ in Lem10 (see Appendix B.6), which, as intuitive, is shown to shrink at the rate of $O(\frac{1}{\sqrt{n_{i,j,t}n_{j0,t}}})$ . The nice trick was to note that this yields a sharp concentration for $\theta_{i,t}^{ucb}$ owing to our clever choice of the dynamic pivot j as described in the point above -- this saves us from the caveat arose in Alg1 due to a poor choice of the static NC pivot (Item-0).
(6) Finally the above sharp concentration of $\theta_{i,t}^{ucb}$ in Lem10 ensures the final weighted regret of $Reg_T^{wtd} = \tilde O(\sqrt{\min\\{\theta_{\max},K\\} KT})$ (Thm6), which could be notably at most $\tilde O(K\sqrt T)$ even if $\theta_{\max} \to \infty$ . Here lies the drastic improvement of our algorithm compared to [1,2,24] which either needs to assume $\theta_{\max} = \theta_0 = 1$ or their regret scales as $O(\sqrt{\theta_{\max}KT})$ which yields a trivia regret as $\theta_{\max} \to \infty$ or even if $\theta_{\max} = \Omega(T)$ . We detailed this after Thm6, as well as empirically validated in Sec6.

We hope that was helpful and clarifies your doubts around our choice of UCB estimates $\boldsymbol{\theta}_t^{ucb}$ for both Alg1 (AOA-RBPL, Sec3) and Alg2 (AOA-RBPL-Adaptive, Sec4).

Finally, we would like to convey that, we have put considerable effort into developing our work as well as addressing the reviewers' questions. We hence urge you to kindly reconsider your scores based on the above clarifications if you find them useful. Please let us know if you need any further explanation or any remaining clarification you may require.

Thanks, Authors

评论- Follow up

2024-08-12

Dear Reviewer L6d3,

We wanted to write you again to understand if we have clarified your question regarding the "intuition behind \theta_{i,t}^{ucb}" with inadequate detail. In summary:

For both Alg1 (AOA-RBPL, Sec3) and Alg2 (AOA-RBPL-Adaptive, Sec4), our choice of $\theta_{i,t}^{ucb}$ ensures that it's a valid and tight upper confidence bound (UCB) on $\theta_{i}, ~\forall i \in [K]$ (Lem1,2,4 for Alg1, and Lem9,10,11 for Alg2).
This, in turn, ensures our improved regret guarantees (Thm6) in the worst case (for large $\theta_{\max} >> \theta_0 = 1$ ) compared to the vacuous $\infty$ regret bound of the existing works.
Our new ideas behind $\theta_{i,t}^{ucb}$ estimation gave a practical & efficient algorithm with the clever trick of rank-broken $\theta_i$ estimate (instead of repeated querying of same subsets multiple times until NC, as used in all the previous works for the purpose [1,2,24]). Please see Sec1 for our contributions, our remarks after the main theorems, as well as Sec5.

We provided detailed clarifications of these points in our response above. Please let us know if we can clarify anything further.

We urge you to kindly consider your scores based on our response. Of course, we are happy to explain any remaining details you may require.

Thanks, Authors

评论- Few Hours Until the Author-Reviewer Discussion Ends

2024-08-13

Dear Reviewer L6d3,

Thank you again for your time and insightful comments. Since we are only a few hours away from the end of the author-reviewer discussion phase, we wanted to check if we were able to clarify your question regarding $\theta_{i,t}^{ucb}$ and if we can clarify any remaining concerns.

Please let us know and we would be happy to.

Sincerely, The authors

2024-08-13

Dear Authors,

Thank you for addressing my question in detail. It clarified the novelty of your approach and I updated my score accordingly.

评论- Thank you

2024-08-13

Dear Reviewer L6d3,

Thank you for engaging in the discussion and considering increasing your score. We appreciate your time and feedback. Of course, we will be happy to clarify any remaining questions you may have, please let us know in case.

Thanks again, The Authors.

评论- Follow-up clarification for Reviewer-L6d3 (Intuition behind ${\theta}_{i,t}^{ucb}$)

2024-08-10

Dear Reviewer L6d3,

Thanks for considering our rebuttal and your question. We apologize for the confusion regarding Q3 or R2 (we meant "our answer of Q3 of Reviewer-mX9J"). We explain below the intuition behind $\theta_{i,t}^{ucb}$ in detail:

Note we gave two algorithms, (Alg1) AOA-RBPL in Sec3, and (Alg2) AOA-RBPL-Adaptive, in Sec4, both of which use two different estimates of $\theta_{i,t}^{ucb}, \forall i \in [K]$ , $\theta_{i,t}^{ucb}$ being the upper confidence bound (UCB) estimate of the MNL parameter $\theta_i$ at round $t$ , for all $i \in [K]$ . Further note, due to the scale independence of the MNL model, we always assume that the parameter of the no-choice (NC) item is always $\theta_0 = 1$ (see Line153).

How $\boldsymbol{\theta}^{ucb}$ was used in the regret analysis (Thm5, Thm 6): We will explain the intuition of $\theta_{i,t}^{ucb}$ for Alg1 and Alg2, but before that let us understand how $\theta_{i,t}^{ucb}$ were used in the regret analysis of Thm5 and Thm3 (resp. for Alg1 and Alg2):

(1) Towards this, an important observation to note is both our regret proofs (i.e. proof of Thm5 and Thm6) use a key wtd-utility inequality $\mathcal R(S^*, \boldsymbol{\theta}) \leq \mathcal R(S^*, \boldsymbol{\theta}^{ucb})$ , where $\mathcal R(S^*, \boldsymbol{\theta})$ and $\mathcal R(S^*, \boldsymbol{\theta}^{ucb})$ respectively denote the weighted-utility of set $S^*$ under MNL parameters $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{ucb}$ (see Eq2): Please see the inequalities used in the displays of Eq14 and Eq18, in proof of Thm5 and Thm6 resp., to note how we used the above key inequality to derive the final regret upper bound ( $Reg_T^{wtd}$ ) for Alg1 and Alg2).
(2) However, to achieve the above property we need to ensure, $\theta_i^{ucb} \geq \theta_i$ as we showed in Lem4: Essentially it shows if $\theta_i^{ucb}$ is a valid UCB of $\theta_i, ~\forall i \in [K]$ , then the estimated wtd-utility $\mathcal R(S^*, \boldsymbol{\theta}^{ucb})$ is also a valid UCB of $\mathcal R(S^*, \boldsymbol{\theta})$ .
(3) Hence the question is how to assign the values of $\theta_i^{ucb}$ so that they represent a valid and tight UCB of the corresponding (true) MNL parameters $\theta_i$ s?

We now justify our choice of $\theta_i^{ucb}$ s for all $i \in [K]$ for which the above properties are satisfied, both for Alg1 and Alg2. But let us understand some important properties of the MNL model before that.

$\boldsymbol{\theta}$ estimate from MNL pairwise preferences:

(1) We first note that $\textbullet$ for any two items $i, j \in [K]\cup \{0\}$ , the pairwise preference of $i$ over $j$ is $p_{ij} = \frac{\theta_i}{\theta_i + \theta_j}$ , by definition of MNL choice model (Eq1).
(2) But since $\theta_0 = 1$ (Line153), this implies $\textbullet ~\theta_{i} = \frac{p_{i0}}{1 - p_{i0}}, ~\forall i \in [K]$ .
(3) However $p_{i0}$ is unknown, but can we estimate it? Here we have a key idea that by exploiting the Independence of Irrelevant Alternatives (IIA) property of MNL model once can indeed maintain unbiased pairwise preference estimates ( $p_{ij}$ s) of any pair $(i,j)$ , $i,j \in [K]\cup\{0\}$ using Rank-Breaking (Please see our response to Q5 of Reviewer-dQv5 for details). We denoted them by $\hat p_{ij,t}$ at round $t$ (see Line171).

We first understand the rationale behind our choice of $\boldsymbol{\theta}$ for Alg1.

$\boldsymbol{\theta}_t^{ucb}$ justification for Alg1 (uses the NC item as pivot):

(1) Upon realizing $\theta_i = \frac{p_{i0}}{1 - p_{i0}}$ and having access to $\hat p_{i0,t}$ , the unbiased estimates of $p_{i0}$ derived through rank breaking as described above, we first find a good upper confidence bound (UCB) estimate of $p_{i0}$ , denoted by $p_{i0,t}^{ucb}$ as described in Eq3. Further, we derive the concentration rate (tightness) of the UCB estimate $p_{i0,t}^{ucb}$ in Lem8 (Appendix B.1).
(2) The next idea was to use $p_{i0,t}^{ucb}$ to define a natural estimate of $\theta_{i,t}^{ucb} = \frac{ p_{i0,t}^{ucb} }{ (1 - p_{i0,t}^{ucb})\_{+}}$ (see the display after Eq3), drawing inspiration from $\theta_i = \frac{p_{i0}}{1 - p_{i0}}$ . Further, Lem1 establishes the rate of concentration of the UCB estimate $\theta_{i,t}^{ucb}$ using Lem8 (see proof of Lem1 and 8 in Appendix B.1, B.2).

This concludes our intuition behind our choice of $\theta_{i,t}^{ucb}$ in Alg1, which provably yields a UCB estimate of the true MNL parameters $\theta_i$ for all $i \in [K]$ , as desired.

Alg1 contributes to our basic intuition behind dynamically estimating $\boldsymbol{\theta}_{t}^{ucb}$ s using the novel rank-breaking technique and alleviates the problem of repeatedly querying same subsets $S_t$ in consecutive rounds as used in the previous works to estimate MNL parameters $\boldsymbol{\theta}$ along this line [1,2,24]. We detailed this in Line105-107 as well as in our Global Rebuttal (please see Limitations of previous algorithms).

审稿意见

评分: 7置信度: 32024-07-06

The authors consider the online MNL assortment optimization problem, where the goal is to learn MNL parameters while suggesting assortments, with the goal of either learning the top-m highest utility items or learning the maximum revenue set with m items. They use a UCB-based approach on pairwise win rates to get a UCB for utilities, which can then be fed into a traditional assortment optimization algorithm. The authors show this approach achieves asymptotically optimal regret and does not require assumptions used by previous approached. The basic algorithm relies on comparisons between each item and the no-choice option, but they also introduce a more sophisticated adaptive pivot approach that works better when the no-choice option is rarely selected. In experiments on synthetic data, their assortment optimization approach performs significantly better than the previous state-of-the-art.

优点

The problem studied is natural and important.
The presentation is generally clear.
The technical quality seems good, although I cannot attest to the correctness of all the proofs in the appendix.
The UCB approach on pairwise win rates is clever and appears original.

缺点

The algorithms and proofs could use some additional description/intuition. Some of the steps in the proofs take rather large leaps.

问题

The regret bounds in Table 1 for [2] (Thm 1) and [2] (Thm 4) don't seem to match up with the bounds actually in [2], which have some extra terms.
Eq. (1) is a multinomial logit, but it's referred to in the paper as Plackett-Luce. Plackett-Luce is the repeated-selection ranking version of MNL, so I think the model in this paper should be referred to as MNL. (Ah, I see that lines 294-300 use Placket-Luce, but I think it makes more sense to call the model MNL throughout the paper, since that's the focus.)
After line 253, what is the justification for defining $\hat \theta$ as that product of $\gamma$ s? This definition seems to still depend on observations of comparisons between j and the no-choice option, doesn't it?
The title is quite general, but the paper is very specifically about online assortment optimization rather than the static problem. It would be good to make this clear from the title.

Minor comments:

The statement of Lemma 1 needs to be proofread and fixed
Line 196: shouldn't this be $\tilde O(\sqrt{KT})$ ?

局限性

I think the limitations of the paper were adequately stated

评论- Rebuttal by authors

2024-08-07

Thanks for your positive detailed review and the insightful questions.

Q1. Missing terms in the regret bounds in Table 1

-- You are right. We only included the main leading term for conciseness, ignoring logarithmic and constant terms. We will clarify this in the caption.

Q2. Multinomial vs Plackett Luce

-- Thanks for the suggestion, we will call the model as MNL throughout the paper to avoid any confusion.

Q3. Justification for defining $\theta^{ucb}$ (after line 253)

Since we compare i with 0 through j, when j is a "strong item" that is often selected as the winner, both $\gamma_{ij,t}^{ucb}$ (that estimates $\theta_i/\theta_j$ by above) and $\gamma_{j0,t}^{ucb}$ (that estimates $\theta_j/\theta_0$ by above) are sharp estimators. Hence, $\theta_{i,t}^{ucb} = \min_j \gamma_{ij,t}^{ucb} \gamma_{j0,t}^{ucb}$ is a sharp upper confidence bound for $\theta_i$ .

This definition of $\theta_{i,t}^{ucb}$ in turn satisfies the condition of Lem4, which requires $\theta_{i,t}^{ucb} \geq \theta_i$ for $Reg(S^*, \boldsymbol{\theta}) \leq Reg(S^*, \boldsymbol{\theta}^{ucb})$ to hold good. Lem 4 is further crucially used our final proof of Thm6 (see the inequality used in Eq18, Appendix B.6). This Lemma would not hold if we used $\gamma_{ij,t}^{ucb}$ directly without multiplying by $\gamma_{j0,t}^{ucb}$ since it would not be a valid upper bound of $\theta_i$ .

Q4. The title is quite general

-- We understand. We will be happy to include the term "online assortment optimization" and also "MNL model" to the title. Thanks for the suggestion.

Minor: Line 196 which should be $\tilde O (\sqrt{KT})$

-- Yes, that is correct, thank you for noting the typo, will update accordingly.

We thank you again for your positive review, please let us know if we could clarify anything else.

2024-08-10

Thanks for the response! That answers my questions.

评论- Thank you

2024-08-10

Dear Reviewer mX9J,

Thank you for taking the time to read our rebuttal, we appreciate your time and attention. We would also like to draw your attention to our "Follow-up clarification for Reviewer-L6d3 (Intuition behind $\theta_{i,t}^{ucb}$ )" where we provide a very detailed explanation of the rationale behind our specific choices of $\theta_{i,t}^{ucb}$ , both for Alg1 (AOA-RBPL, Sec3) and Alg2 (AOA-RBPL-Adaptive), from an intuitive viewpoint. If you have time, please go over it as you may find it useful for your Q3 as well.

Thanks again for your feedback, Authors

审稿意见

评分: 5置信度: 32024-07-08

The paper addresses the problem of active online assortment optimization problem with preference feedback, which has been extensively studied. The paper argues that the previous studies have some unrealistic assumptions such as: there is a ‘strong reference’ which is always included in the choice sets; the same assortments can be repeatedly selected. Without these assumptions, they propose some efficient algorithms for the problem of regret minimization in assortment selection with Plackett Luce (PL) based user choices.

优点

The paper proves the regret bounds of the proposed online learning algorithms. The regret bounds are proved based on some concentration guarantee for estimating the score parameters of the PL model using ‘Pairwise Rank-Breaking’.

缺点

I cannot fully understand the motivation of the paper. The paper says that two major drawbacks of the previous studies include: the existing algorithms assume that the no-choice’’ option is stronger than the other choices, and they may query the same set of items for multiple times. It seems that the focus of the paper is to address these drawbacks. However, I think that these drawbacks’’ may not be real. First, it is natural that most of the customers will not choose any product, so it is very reasonable to assume that no-choice option is stronger. Second, in the typical assortment optimization scenario where customers arrive online one by one, showing the same set of items to different customers for multiple times absolutely will not cause any problem. So I think that addressing these ``drawbacks’’ has very limited value.
The regret bounds proposed by the paper is actually K\sqrt{T}\log T. It seems that this regret bound is weaker than those of the previous studies such as [2] (at least by log factors on T). The authors may argue that their bounds are better when \theta_{max}\rightarrow \infty, but this depends on the assumptions made on specific application scenarios, which is questionable as explained in my last comment.
The experiments are conducted using some specific values of \theta and hence are not very convincing. I think that more experiments on more applications are necessary to demonstrate the superiority of the paper.

问题

Since the feedbacks from the same customer may be correlated, how do you use concentration bounds to learn the underlying distributions?
The proposed algorithms seem to be UCB-style algorithms. Can you explain the key differences and novelties of your algorithm compared to the classic UCB algorithm?

局限性

see the above

评论- Rebuttal by authors

2024-08-07

We thank the reviewer for the comments.

Weaknesses --

Q1. "Drawbacks may not be real. .. very limited value"

-- We respectfully but absolutely disagree. We strongly believe the comment 'the drawbacks may not be real' only depicts a very personal viewpoint of the reviewer which we believe should be considered unfair. Especially since the reviewer did not provide any justification behind such claims. The likelihood of "No Choice" (NC) varies by application. It may not always exist, and its preference depends on the context:

In recommender systems like YouTube, Spotify or Netflix, users typically make choices.
In news, music, flight, or Google Maps recommendations, NC is unlikely, almost never selected as the person needs to commute or book a flight nevertheless!
Similarly, in many language models or chatbot applications, NC is improbable.

-- This is also the reason the original MNL-UCB paper tried to relax the assumption of $\theta_0 = \theta_\max$ but their regret bound stands vacuous in the regime of small $\theta_0$ . A key motivation of our contribution is that our algorithm (Alg 2, AOA-RBPL-Adative) works in the $\theta_0 \to 0$ regime (equivalently $\theta_{\max} \to \infty$ ) (see Thm 7) where we achieved largely improved regret bound of $\tilde O()$ compared to the vacuous $\infty$ regret bound of Agrawal et al (2019). It is also supported by our experiments (please see Fig 2, Sec 5).

Thus we believe it is unfair to overlook the merits and nice ideas we offered in this work that help us to overcome the limitations of the previous works. Please also refer to the Global Rebuttal for a detailed comparison of our algorithm and results with that of state-of-the-art approaches (in Table1).

Q2. "It seems that this regret bound is weaker ... last comment".

Continuing from Q1, reviewers point of view and fairness is still on question. Firstly this statement "regret bounds proposed by the paper is actually $K\sqrt{T}\log T$ " is incorrect. In the MNL bandit setting of [2] where $\theta_0 = 1 = \theta_{\max}$ , our regret is actually $\tilde O(\sqrt{KT})$ . Further in the limit of large $\theta_{\max}\rightarrow \infty$ , our $\tilde O(K\sqrt{T})$ regret bound (Thm6) is way too better compared to the vacuous infinite regret bound of [2]. Despite the attempt of [2] to lift the assumption of $\theta_0 = 1 = \theta_{\max}$ (Sec6.2, [2]), they derived a vacuous bound and failed to achieve our regret bound of Thm6. It seems totally unfair and unjustified that the reviewer questions our improved bounds under relaxed assumptions, rather than appreciating the novel techniques we offered to achieve our results.

Q3. Experiments conducted on specific values of $\theta$ and hence not very convincing

-- Again, we believe the comment is unreasonable. Synthetic experiments have to be reported for some specific choices of the parameters by virtue; same has been done in the [2,24] as well where they forcefully set $\theta_0 = 1$ which is specific to their problem assumption. If that is justified, we only relaxed that assumption in our experiments, adhering to our problem setting, as expected. And clearly, the previous work performs poorly in those regimes, due to their restrictive assumptions. This is the genuine way to corroborate the claims of any theoretical guarantees, we are completely unable to see the rationale behind the reviewer's point of content, once again!

Questions ---

Q4. Feedbacks from the same customer may be correlated

-- By definition of the MNL-AOA problem, it assumes at each round, given $S_t$ the choice-feedback is sampled independently from the underlying MNL $(\boldsymbol{\theta})$ choice-model. Assuming correlated feedback deviates from the MNL assumption, leads to a different choice model. There are several ways to model such dependencies including (but not limited to) taking cues from rotting/rising/restless/sleeping bandits theory. Each of these directions are an independent body of work and will lead to new research problems of AOA. But it simply lies beyond the scope of this work.

Q5. Explain the differences and novelties of your algorithm compared to the classic UCB?

-- Please check our Global Rebuttal for a detailed discussion on our algorithmic and analytical novelties over classic UCB algorithms.

We understand the reviewer is inclined to reject the paper, but we do not see a single rationale or logical argument behind the decision. In particular, claiming that `generalizing a setting' will not be helpful (even with clear experimental backup) seems to be a personal viewpoint and unjustified. A personal view can not be pitted against facts that we presented through our examples, relaxing the limitations of the previous works, theorems, and experiments. We urge the reviewer to kindly clarify from a technical (and not a personal) viewpoint if there are any remaining concerns you may have or would you kindly reconsider the score, please?

2024-08-08

thank you for your response.

评论- Few Hours Until the Author-Reviewer Discussion Ends

2024-08-13

Dear Reviewer 1RHQ,

Thank you for considering our rebuttal. Since we are only a few hours away from the end of the author-reviewer discussion phase, we wanted to check if we can clarify any remaining concerns. Please let us know and we would be happy to answer promptly.

Thank you, The authors

审稿意见

评分: 5置信度: 22024-07-12

The paper studies the problem of active assortment optimization in MNL model.

In the problem of assortment optimization, we have a large universe of products i=1,2,\dots N, each of which generates a given revenue r_i for the seller. In MNL model each product i has a value \theta_i to the customers and when customers are offered a subset of products they choose each item (including the no-choice option) with a probability proportional to their value. We also assume there is a no-choice option with revenue 0. The seller’s objective is to identify the assortment of products which generates maximum expected revenue.

In the active version of the problems, the values of items \theta_1, \theta_2,\dots , \theta_N, are not known to the seller. Thus, the seller shows a subset of items from the universe to the customers at rounds 1,2, \dots ,T and estimates \theta_i s based on the observations. After approximating these values, the seller may solve the problem in static setting and find the optimal assortment. This strategy is known as exploration and exploitation.

In active assortment optimization, the objective is to minimize the regret of the algorithm which is defined as the summation over rounds t=1,2,\dots T the difference of the expected revenue in each round from the optimal revenue.

Prior works for instance [2] provided an algorithm for this problem by estimating at each round a high probability upper bound for the values \theta_i, and then solve the static problem using the upper-bounds. In [2] the authors assume that \theta_0 (the value assigned to no-choice option and thus its probability ) is the highest among all items.

The submitted manuscript claims that they provide an algorithm with a similar regret bound to [2] which does not have the restriction of assuming the no-choice option has the highest value. Their suggested approach is similar to that of [2] (finding high probability upper bounds for the parameters) but it is hard to follow all details of obtaining the upper bound and how it removes the restriction imposed on the value of the no-choice option.

The result, if true, is interesting but I found the paper hard to read and got lost in section 3.1. I think that the paper will benefit greatly from rewriting and improving the presentation.

I will detail my confusions as follows:

In Equation (3) on line 173 there is a variable x which is not defined up to this point. I understand that x appears to bound the probability of error in Lemma 1. But you have to introduce it before you use it the first time.
Between line 176 and 177 what is the + sign on the denominator of the equation? you use this notation again in another equation between lines 252 and 253.
In equation 3 you show an upper bound on \hat{p_ijt} which then turns to a bound on \theta_i s. But in Lemma 1 you have shown a different upper bound for \theta_i. Can you explain the connection of these two bounds.

A few minor typos:

Lemma 1. atleast-> at least ^ucb is sometimes with roman font and sometimes normal font.

[2] Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research, 67(5):1453–1485, 2019.

优点

The problem of active assortment optimization is a fundamental problem in revenue management
The result is interesting if correct, as it removes an important restriction from prior algorithms.

缺点

The results are poorly presented and it is hard to follow the paper. The paper lacks an explanation of main intuitions .
The technique seems to be similar to [2] as both papers obtain high probability upper bounds for the parameters and then solve it in an static setting. An intuitive explanation of how the given different upper bound is obtained, why it is correct, and how it removed the restriction on no-choice option is not provided.

问题

I am mainly confused by the high probability upper bounds in Eq (3). Can you provide some intuition where is comes from and how it turns to an upper bound on \theta_i? For instance in the introduction you mention you use Rank-Breaking technique. Can you explain in simple words what rank breaking is? how does this technique make your work (and in particular deriving the upper bound on \theta_i s ) different from [2] and how it shows up in equation (3)

My confusions were addresses after reading the rebuttal.

局限性

limitations are not discussed but there are several future directions that have been discussed.

评论- Rebuttal by authors

2024-08-07

Thanks for your review. Please see our responses below. Please note the misunderstanding and we tried our best to clarify it. We will be glad to clarify any further questions.

Q1. Clarification of variable $x$ in In Eq3

-- $x$ is the input to the algorithm. It can be any positive number for which Lem1 holds good. In Thm1 we specifically mentioned the choice of $x = 2\log T$ for the regret bound of Alg1. We agree, we should not have used a forward reference and will add the line `` $x> 0$ can be any positive number as defined in Lem1" before Eq3 Thanks for noting this.

Q2. What is the + sign between line 176 and 177

-- For any real number $a \in \mathbb R$ , $a_+ := \max(a,0)$ . We assumed the notation is standard in bandit literature, but we will certainly clarify this in the notation section to avoid any confusion.

Q3. Upper bound on $\hat p_{ijt}$ in Eq3 and Lem1

-- Please note there is a misunderstanding as you seem to have confused definitions (Eq3) and concentration bounds (Lem1). Firstly, (a) By Eq3, you may mean the display after Eq.3 that defines $\theta\_{i,t}^{ucb} = p\_{i0,t}^{ucb}/(1-p\_{i0,t}^{ucb})\_+$ . Please note this is the definition of $\theta_{i,t}^{ucb}$ . Similarly, Eq3 gives the definition of $p_{ij,t}^{ucb}$ . These turn out to be both upper bounds for $p_i$ and $\theta_i$ respectively as proved in the proof of Lem1 (AppB.2). Only the one of $\theta_i$ is important in the rest of the analysis and thus displayed in Lem1.

Q4. Comparision with [2]

-- We describe the intuition of our algorithm in detail in the global rebuttal, as well as pinpoint our algorithm novelties and advantages over [2].

Q5. Confusion about by the high probability upper bounds in Eq (3)

-- Please read Q3 above, as detailed, Eq3 does not provide any high probability upper bound but a definition of $p_{ij,t}^{ucb}$ . Let us know if anything is not clear still.

Q6. Explanation about rank-breaking (RB), how that make your work different from [2], and how it show up in Eq(3).

Please note RB is described in Appendix A.2. It is the idea that involves extraction of consistent pairwise comparisons from (partial) ranking data, as obtained by treating each pairwise-comparison in the (partial)-ranking independently. E.g. in a set $S = {1,2,3,4}$ if 1 is the choice winner, then RB extracts the pairs $(1 \succ 2)$ , $(1 \succ 3)$ , $(1 \succ 4)$ . Similarly for a full ranking feedback, e.g. $4 \succ 2 \succ 3 \succ 1$ , RB extracts all the 6 pairwise comparisons $(4 \succ 2)$ , $(4 \succ 3)$ , $(4 \succ 1)$ , $(2 \succ 3)$ , $(2 \succ 1)$ , $(3 \succ 1)$ .
RB is used to derive our empirical pairwise estimates $\hat p\_{ij,t} = w\_{ij,t}/n\_{ij,t}$ (Eq3), where $w_{ij,t}$ and $n_{ij,t}$ denotes the total pairwise win counts of item i over j after rank-breaking and total number of rank broken pairs of $(i,j)$ respectively, till time $t$ . We will add these descriptions in the main draft (in Sec 3.1) for ease of exposition.
Re. How RB is used and give us advantage over [2]: We figured the novel Rank-Breaking (RB) technique to derive an important key idea which (roughly) says that the empirical pairwise preferences of item pair $(i,j) \in [\tilde K]\times [\tilde K]$ , see $\hat p_{ij,t}$ in Eq3, gives an unbiased estimate of the true pairwise preference $p_{i,j}: = \theta_i/(\theta_i + \theta_j)$ . This leads to sharper and more efficient UCB estimator of $\theta_i$ , given by $\theta_{i,t}^{ucb} = \min_{j \in [K] \cup \{0\}}\gamma_{ij,t}^{ucb}\gamma_{j0,t}^{ucb}$ where $\gamma\_{ij,t}^{ucb}:=p\_{i0,t}^{ucb}/(1-p\_{i0,t}^{ucb})\_+$ (Line 253). We further prove in Lem9 and Lem10 how $\theta_{i,t}^{ucb}$ yields a much sharper and flexible UCB estimator of $\theta_i$ for each $i \in [K]$ without the requirement of $\theta_0 > \max_{i \in [K]}\theta_i$ unlike all the previous works. Please also see Q3 of R-mX9J to get more intuition of our $\theta_{i,t}^{ucb}$ estimate --- Here really lies our main contribution that helps us to overcome the existing limitations of the previous works (see Table1) and leads to multiple advantages as listed in the global rebuttal.

Please also read the Global Rebuttal which summarizes the existing algorithms and how we overcome the limitations of the state of the art methods ([1,2,14], etc).

We believe we have answered all your questions, please let us know if you have further questions. Otherwise, kindly reconsider your scores in light of the clarifications.

2024-08-10

My confusions were addresses in the rebuttal. I think the results are interesting and probably valid. The presentation can still improve. Hence, I am updating my score to borderline accept.

评论- Thank you

2024-08-10

Dear Reviewer dQv5,

Thank you for taking the time to go over our rebuttal, we appreciate your time and feedback.

We will make sure to add all the details clarified above in the final version, polish the writing in the remaining draft and add more intuitions of the proof details.

Please let us know if you have any remaining questions that we can clarify further.

Thanks Authors.

评论- Requesting your feedback

2024-08-09

Dear Reviewer dQv5,

We are writing you to kindly request your feedback on our rebuttal. As the discussion phase is currently underway and we have a limited timeline, I am eager to address any remaining concerns you may have. We believe to have answered all your queries and are happy to provide any further clarifications in the hope of potentially raising your scores.

Requesting you to kindly engage in a discussion at your earliest convenience.

Thank you for your consideration of this matter.

Thanks, Authors

作者回复

2024-08-07

We thank all the reviewers for their feedbacks. We make every effort to address any concerns raised and hope that our response will clarify any questions you may have. We emphasise here our contributions with respect to existing work.

Description of existing algorithms [2,1,24] (as given in Table 1).

[2]: This is the classical MNL-UCB work, the idea is to estimate the true PL parameter $\boldsymbol \theta = (\theta_1,\ldots,\theta_K)$ s. They estimate it by repeatedly querying the same set (i.e. assortment $S_t$ ) multiple times and keeping a count of the average number of times an item $i \in [K]$ is selected until no items (NC) are selected. They further maintain a UCB of the estimated PL parameters, $(\hat \theta_1,\ldots,\hat \theta_K)$ , and the assortment of the next phase optimistically based on the UCB estimates. The process is repeated until time step T.
[1]: This is another follow-up work of [1]. The algorithm MNL-TS is almost same as MNL-UCB above, with the exception being this paper uses Thompson Sampling (TS) with Beta posteriors, instead of the UCB estimates used in [2], to maintain the estimates of $\boldsymbol{\theta}$ s.
[24]: The objective of this work and preference model of this work is slightly different than that of [1,2] as their objective is `learning to rank' (LTR), i.e. to find the best ordered subset based on some position bias $\lambda_i > 0$ for position $i \in [m]$ . Their algorithms are inspired by the one of [1,2] adapted to their model.

Limitations of previous algorithm. The key idea adapted in all of the above algorithms are same which essentially keeps playing a fixed assortment until no-choice (NC) is picked, upon which they update the estimated parameters. Hence the algorithms work under a semi-batch feedback model, and not in a fully active manner. Consequently, all the above algorithms heavily rely on the assumption that NC item must be the strongest, i.e. $\theta_0 > \theta_i, ~\forall i \in [K]$ , as otherwise if $\theta_0$ is small, the NC item is (almost) never selected, and they never update their estimated parameters despite collecting new information from every customer. This essentially leads to the wastage of the collected information. It is hence not surprising that their regret analysis breaks otherwise (precisely the concentration bound of $\boldsymbol{\theta}$ breaks), as we detailed after Thm5 (Line 217-225) and Thm6 (Line 261-267), as well as corroborated in our experiments in Sec 5.

Description of our main algorithm. We now describe our main algorithm AOA-RBPL-Adaptive (Sec4), which eliminates the above two limitations of [1,2,24]. We figured the novel Rank-Breaking (RB) technique (see Appendix A.2) used in the MNL choice literature to derive an important key idea which (roughly) says that the empirical pairwise preferences of item pair $(i,j)$ , see $\hat p_{ij,t}$ in Eq3, gives an unbiased estimate of the true pairwise preference $p_{i,j}$ . This leads us to come up with a much sharper and efficient UCB estimator of $\theta_i$ . We further prove in Lem9 and Lem10 how $\theta_{i,t}^{ucb}$ yields a sharper and flexible UCB estimator for all $\theta_i$ without the requirement of $\theta_0 > \max_{i} \theta_i$ unlike previous works. Please also see Q3 of R2 to get more intuition of our $\theta_{i,t}^{ucb}$ estimate --- Here really lies our main contribution that leads to the following advantages.

How to generalize our approach to other models. We have mentioned in Rem1 how our algorithm AOA-RBPL-Adaptive generalizes nicely for any general RUM-based choice model [8,9]. This is owing to the fact that [37] shows how pairwise-preference estimates $(\hat p_{ij,t})$ s can be used to estimate score parameters $(\boldsymbol{\theta})$ for general any RUM models. Using the RB based RUM-parameter estimation technique of [37], we can show a regret bound of $\tilde O(\sqrt{\min(\theta_{\max},K) KT}/c\_{rum})$ for our proposed algorithm AOA-RBPL, where $c_{rum}$ is the parameter associated to the minimum advantage ratio (min-AR) of the underlying RUM $(\boldsymbol{\theta})$ model, as defined in Thm6 of [37]. In particular, $c_{rum}$ can be shown to be a constant given a fixed RUM model, e.g. $c_{rum} = 1/4$ for Exp(1), Gamma(2, 1), $c_{rum} = 1/(4\sigma)$ for Gumbel $(\mu,\sigma)$ , $c_{rum} = \lambda/4$ for Weibull $(\lambda, 1)$ , $c_{rum} = 1/3$ for Gaussian $(0,1)$ , etc (Cor5, [37]).

Main contributions of our approach.

We overcome the limitations of existing algorithms (no-choice and mutliple queries) described above.
Our concentration bounds (Lem9 and Lem10) are different than that of [2] as well as the main regret bound analysis (Thm6). See Appendix B.6 for the technical details.
Our regret analysis results in a multiplicative improvement in $\theta_{\max}$ in the final regret performance, as discussed after Thm6 and shown in our experiments (Sec5).
Our algorithms are general and could be extended to any RUM-based choice models including RUM with Gamma, Gaussian, Weibull, Gumbel noise (e.g. MNL is also a special case of RUM models for Gumbel noise, see [8,9]).

In summary. Our work substantially deviates from the standard MNL-UCB approach that directly tries to estimate $\boldsymbol{\theta}$ used in [1,2,24] and used an RB-based UCB estimate improving the existing results.

评论- A small clarification

2024-08-10

Dear Reviewers,

By Q3 of R2 in our global rebuttal above, we meant "our response for Q3 of Reviewer-mX9J".

We would also like to draw your attention to our "Follow-up clarification for Reviewer-L6d3 (Intuition behind $\theta_{i,t}^{ucb}$ )" where we provide a very detailed explanation of the rationale behind our specific choices of $\theta_{i,t}^{ucb}$ , both for Alg1 (AOA-RBPL, Sec3) and Alg2 (AOA-RBPL-Adaptive), from an intuitive viewpoint. Please refer to that response for an in-depth understanding of our algorithmic novelties, new parameter estimation ideas, as well as the logical sequence of proof techniques (for Thm 5 and Thm6).

Thanks Authors

评论- Few Hours Until the Author-Reviewer Discussion Ends

2024-08-13

Dear Reviewers,

Thank you again for your time and insightful comments. Since we are only a few hours away from the end of the author-reviewer discussion phase, we wanted to check for the last time if we can clarify any remaining concerns. Please let us know and we would be happy to answer promptly.

Sincerely, The authors

最终决定Reject

2024-09-25

The reviewers appreciated the algorithmic contribution. Compared to prior work, the proposed algorithm removes several restrictive assumptions that are relevant and important for applications, and thus it is more broadly applicable. After the discussion, all of the reviewers agreed that the assumptions used in prior work are indeed restrictive and their removal is a valuable contribution. However, although the contribution is solid and valuable, there remained some concerns regarding its significance. Overall, the contribution was deemed to be borderline.