Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
In this paper, we investigate an adversarial contextual multinomial logit (MNL) bandit problem and establish minimax optimal lower and upper regret bounds.
Abstract
Reviews and Discussion
This paper studies the minimax optimal regret for a series of contextual MNL bandit problems, covering both uniform and non-uniform reward cases and varying values of the outside option utility. The authors first derive regret lower bounds with explicit dependence on the key parameters. They then propose an efficient algorithm, OFU-MNL+, that achieves a matching upper bound up to logarithmic factors. Simulations demonstrate the tight dependence of the derived bounds.
Strengths
- Despite many recent advances in contextual MNL bandit problems, no previous studies have proven minimax optimality. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension and the maximum assortment size. This paper closes this gap. The technical contribution is strong and sufficient.
- Despite being a theoretical paper, it is well written and easy to follow. I enjoyed reading the paper.
Weaknesses
- My major question is about Assumption 2, which requires a uniform lower bound for every item, any assortment set, and all rounds. Requiring this to be a constant would be a strong assumption in practice. Since , it would naturally depend on the size . In this case, would it be possible to derive tighter results by explicitly including this dependence in the lower and upper bounds?
Below are a few minor suggestions:
- I would suggest introducing Section 4 (Problem Setting) before Section 3 (Existing Gap). Readers who are not familiar with contextual MNL bandit problems might be confused while reading Section 3.
- Section 3 overlaps with many discussions in the Introduction section. It might be helpful to shorten that discussion and instead add more on (1) why proving the new lower and upper bound results is non-trivial; (2) proof outlines of the key steps used in the proofs.
- In the Introduction, line 31 on page 1, it would be helpful to add some clarification on the definition of . The current terminology "utility of the outside option" is not very clear.
- In line 162 on page 4, the authors mention that the set of items can vary over time. While I understand the assortment set could change over time, I am confused about how could vary over time. The notation means . What do you mean by this set varying over time? Do you allow new items to enter the market or some items to be removed over time?
Questions
see Weakness
Limitations
N/A
We are delighted that you enjoyed reading our paper and appreciated the value of our contributions. Thank you very much for your valuable feedback. Based on your detailed and insightful suggestions, we will make further improvements to our paper.
Regarding Assumption 2
We would like to clarify that the existence of a non-zero $\kappa$ in our Assumption 2 should NOT be considered a weakness. It is important to note that $\kappa$ is always non-zero, as all MNL probabilities are non-zero for bounded feature vectors and bounded parameters, and $\kappa$ represents the minimum value of the product of two MNL probabilities. Thus, it is more appropriate to "define" $\kappa$ rather than assume its existence. In Goyal and Perivier (2021), $\kappa$ is also defined rather than assumed. For clarity, we will revise Assumption 2 to be a Definition in our final version.
Moreover, this is a common assumption in logistic, MNL, and generalized linear model (GLM) bandits, and should NOT be considered a unique limitation of our work.
More interestingly, although $\kappa$ can be exponentially small, our algorithm does not need to know $\kappa$. Moreover, we avoid the $\kappa$-dependence in the leading term of our regret bound, making our algorithm much more practical.
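For concreteness, a sketch of how this quantity could be stated as a definition, using the MNL choice probabilities $p_t(i \mid S, \mathbf{w}^\star)$ (the exact ranges of the minima and the notation in the paper may differ):
$
\kappa \;:=\; \min_{t \in [T]} \; \min_{S \subseteq [N],\, |S| \leq K} \; \min_{i \in S} \; p_t(i \mid S, \mathbf{w}^\star)\, p_t(0 \mid S, \mathbf{w}^\star) \;>\; 0,
$
which is strictly positive whenever the feature vectors and the parameter are bounded, since every MNL probability is then bounded away from zero.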
Problem-dependent regret bounds
It turns out that problem-dependent upper and lower bounds under uniform rewards are achievable. By making some modifications to our results, we can easily achieve this.
1. Upper bound
By adapting Theorem 2 from Faury et al. (2022) [21] to the multinomial and online parameter setting, we can achieve a regret upper bound of , where . See Section A in the comment below for the proof.
2. Lower bound
We can also derive a regret lower bound of by applying Theorem 2 from Abeille et al. (2021).
In the proof of Theorem 1, both the optimal assortment and an alternative assortment (which only contains feature vectors that maximize the utility in ) have the same feature vectors within each assortment. Let and . Then, by Proposition D.1, we have
Here, maps from to . Since the input to the function is a single vector , can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of .
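To make the reduction concrete, here is a sketch of the function being used, under the illustrative assumption (matching the construction above) that every item in the assortment shares the same feature vector $x$, where $v_0$ denotes the utility of the outside option and $K$ the assortment size:
$
f(x; \mathbf{w}) \;=\; \sum_{i \in S} p(i \mid S, \mathbf{w}) \;=\; \frac{K e^{x^\top \mathbf{w}}}{v_0 + K e^{x^\top \mathbf{w}}} \;=\; \sigma\!\left( x^\top \mathbf{w} - \log \tfrac{v_0}{K} \right),
$
i.e., a shifted logistic (sigmoid) function of the scalar $x^\top \mathbf{w}$, which is why the lower-bound argument for logistic bandits in Abeille et al. (2021) can be reused.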
Suggestions
1. Introducing Section 4 before Section 3: That’s a great suggestion. Thank you for your input. We will incorporate these changes into the final version.
2. More discussion on (1) why proving the new lower and upper bounds is non-trivial; (2) providing some proof outlines for the key steps used in the proofs: Due to the page limit, we have briefly outlined the technical novelties used in each proof in the discussions section of the main paper, while providing more comprehensive details in the Appendix. If we receive an extra page for the camera-ready version, as you suggested, we will expand the discussions to further emphasize the importance of our bounds and elaborate on the key steps used in the proofs.
3. Clear definition of : The utility for the outside option is the attraction parameter for not choosing any item, and it is used to construct the MNL probability. For better presentation, we will include a reference to the definition of MNL probabilities (Eq.1) when we first introduce the term and provide additional explanation about .
4. The meaning of "the set of items can vary over time" (Line 162): The statement indicates that we allow for the entry of new items into the market or the removal of existing items over time, which could also lead to changes in the total number of items, denoted as . For clarity, we need to define the item set at time as . Thus, the item set can vary over time. Thank you for pointing this out. We will clarify this notation in the final version of our paper.
A. Proof of problem-dependent upper bound
We start the proof from line 748 in the proof of Theorem 2.
$
\sum_{t=1}^T \nabla Q(\mathbf{u}_t^\star)^\top (\mathbf{u}_t - \mathbf{u}_t^\star) \leq 2 \beta_T(\delta) \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}.
$
Let $g_t(S_t; \mathbf{w}) := \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}) p_t(0 \mid S_t, \mathbf{w}) \| x_{ti} \|_{H_t^{-1}}$. Let $J_1 = \{ t \in [T] \mid \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})\}$ and $J_2 = [T] \setminus J_1$. Thus, we get
$
\text{RHS} = 2 \beta_T(\delta) \sum_{t=1}^T g_t(S_t; \mathbf{w}^\star) = 2 \beta_T(\delta) \sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) + 2 \beta_T(\delta) \sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star).
$
To bound the sum over $J_1$, by the mean value theorem, we get
$
\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) + \sum_{t \in J_1} \nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}),
$
where $\bar{\mathbf{w}}_t$ is a convex combination of $\mathbf{w}^\star$ and $\mathbf{w}_{t+1}$. The first term can be bounded by
$
\begin{aligned}
\sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})} \sqrt{\sum_{t \in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti}\|_{H_t^{-1}}^2 } \\ &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) \leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) ,
\end{aligned}
$
where the second-to-last inequality holds by the definition of $J_1$ and the elliptical potential lemma in our paper, and the last inequality holds by Lemma 11 in Goyal and Perivier [24]. Moreover, the second term can be bounded by
$
\begin{aligned}
\left|\sum_{t \in J_1}\nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \right| &=\Bigg|\sum_{t \in J_1} p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) x_{ti}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}} - 2 p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) \| x_{ti} \|_{H_t^{-1}} \sum_{j \in S_t}p_t(j \mid S_t, \bar{\mathbf{w}}_t) x_{tj}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \Bigg| \\ &\leq 3 \beta_{T}(\delta) \sum_{t=1}^T \max_{i \in S_t} \| x_{ti} \|_{H_t^{-1}}^2 = \tilde{\mathcal{O}}(d^{3/2}/\kappa).
\end{aligned}
$
Hence, we get .
Now, we bound the sum over $J_2$.
$
\begin{aligned}
\sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star)
&\leq \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&\leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \sqrt{ \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&= \tilde{\mathcal{O}}\left( \sqrt{d} \cdot \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \right),
\end{aligned}
$
where the second inequality holds by the definition of $J_2$ and Lemma 11 in Goyal and Perivier [24]. Hence, we get . Therefore, by solving , we can conclude that .
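For completeness, the final step presumably uses the standard self-bounding argument, stated here with a generic constant $C$ collecting the $\tilde{\mathcal{O}}(\beta_T(\delta)\sqrt{d})$ factors above:
$
\text{Reg}_T \;\leq\; C \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T}
\quad \Longrightarrow \quad
\text{Reg}_T \;\leq\; C \sqrt{\sum_{t=1}^T \kappa_t^\star} + C^2,
$
which follows by solving the quadratic inequality in $\text{Reg}_T$.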
This paper studies MNL bandit problem and proposes computationally efficient algorithms in both uniform and non-uniform settings. The authors present lower bounds for these two settings and show their results are minimax optimal up to logarithmic factors.
Strengths
- The paper achieves the minimax optimal results for the MNL bandit problem in both uniform and non-uniform settings, contributing new findings to the community.
- This paper introduces computationally efficient algorithms that enhance the computational complexity over prior works.
- Empirical evidence provided by the authors convincingly demonstrates the effectiveness of the proposed algorithms, strengthening the paper's claims with solid experimental validation.
Weaknesses
- My main concern is about the contribution and technical novelty of this paper. From the statistical perspective, the regret for the uniform setting does not advance beyond the results presented in Goyal and Perivier (2021), and the -independent bound for the non-uniform setting has been achieved in Agrawal et al. (2023) (a comparison with Agrawal et al. (2023) in the table would be beneficial for highlighting any improvement). The main contribution of this paper is the improvement in computational complexity. However, this enhancement comes from the online parameter estimation, which is almost the same as that in Zhang and Sugiyama (2023), with only minor modifications. As the MNL bandits considered in this paper and MLogB bandits are quite similar, this modification is not significant.
- The claim that "This complexity arises because the optimistic revenue...requiring a different analysis" in Remark 2 seems overstated. The issue of assortment optimization to maximize optimistic revenue has been previously addressed by Oh and Iyengar (2021). The approach in the paper does not deviate from their method, which makes the purported contribution questionable.
- The presentation of lower bounds for both settings to demonstrate result optimality is appreciated, yet the inclusion of a problem-dependent lower bound that illustrates the influence of the problem-specific parameter would have added significant value, as seen in studies on logistic bandits.
Questions
- Can the proposed algorithms be directly applied to achieve the optimal results for binary logistic bandits? It would be beneficial for the paper to discuss this applicability.
- The empirical performance shown in Figure 1 suggests UCB outperforms TS in non-uniform settings, contrary to the fact that TS usually achieves better empirical performance than UCB in the literature. Moreover, though OFU-MNL+ outperforms the two other baselines, all algorithms appear to incur linear regret in specific scenarios (e.g., (N=100, K=10, d=5, Uniform R)). Could the authors provide some intuition for these phenomena?
Limitations
\
We appreciate your time in reviewing our paper and your feedback. We would like to clarify some possible misunderstandings regarding our main contributions. Our main contributions are to establish the first minimax regret for MNL bandits, along with providing several novel insights related to variables such as and reward structures. We strongly believe that our results significantly advance what has been known in the field of MNL bandits. We will address your feedback point by point below. We sincerely hope to clarify any misunderstandings.
Regarding our main contributions
Our main contribution is NOT the improvement of computational complexity (although it is still an advantage of our proposed method). While we employ the online parameter estimation adapted from Zhang and Sugiyama [47], it is important to note that new analyses based on improved self-concordant-like properties and considering varying assortment sizes are introduced in our work.
We clearly emphasize that our main contributions are closing the gap between the upper and lower bounds in contextual MNL bandits for the first time and proposing a nearly minimax optimal algorithm. Additionally, we believe it is a significant and novel insight that the regret: 1) varies depending on the value of , 2) is influenced by the reward structure (uniform versus non-uniform), and 3) can improve as the assortment size increases under uniform rewards. Our contributions can be summarized as follows:
- First minimax results in terms of and even
- First proofs demonstrating how regret varies with and reward structure
- First lower bound under non-uniform rewards
- First tractable -free upper bound under non-uniform rewards
- Adaptation of the online parameter estimation while maintaining the above regret bound
Each of our results tackles distinct non-trivial challenges, which we have successfully resolved in our paper. For instance, we have enhanced the self-concordance-like properties and the elliptical potential method, and introduced novel "centralized features" to prove the upper bound for non-uniform rewards. Please refer to the discussions for each theorem for more details.
Furthermore, our regret bound (Theorem 2) is a clear improvement over those presented by Goyal and Perivier (2021) and Agrawal et al. (2023).
- vs Goyal and Perivier (2021) [24]: We have improved upon the regret results of Goyal and Perivier (2021) by a factor of : reducing it from (let in the leading term) to . This advancement was achieved by enhancing the self-concordant-like property of the loss function (Proposition C.1) and refining the elliptical potential lemma (Lemma E.2). Moreover, we significantly improved the regret associated with the transient term, reducing it from to , by utilizing an improved bound for the second derivative of the revenue (Lemma E.3). For further details, please refer to the discussion of Theorem 2 in our paper.
- vs Agrawal et al. (2023) [5]: In fact, their paper contains significant technical errors (refer to Section A in the comments below), rendering a direct comparison to our work meaningless. This is the main reason we initially chose not to compare our results with theirs. Nevertheless, in response to requests from some reviewers, we will consider including a discussion about this in our camera-ready version. Even if all errors were corrected, our result (Theorem 2) still shows an improvement over theirs in terms of . Agrawal et al. (2023) considered a uniform rewards setting (with ) and achieved a regret of . However, the corrected regret should be .
Remark 2 is intended solely for comparing our setting with that of Zhang and Sugiyama [47].
The purpose of Remark 2 is to explain that our setting fundamentally differs from those described by Zhang and Sugiyama [47], necessitating a different analysis. Therefore, this distinction should NOT be viewed as a weakness. Again, our main contribution is proving the first minimax results in MNL bandits.
Problem-dependent () lower bound
The main goal of this paper is to explicitly show how the values of and affect the regrets (both upper and lower bounds). To clearly demonstrate the dependency of regret on these values, we have presented the results in terms of and , rather than .
As per your request, we believe that a problem-dependent lower bound can be easily derived from our intermediate results. In the proof of Theorem 1, both the optimal assortment and an alternative assortment (which only includes feature vectors that maximize the utility in ) have the same feature vectors within each assortment. Given that the input to the MNL probability function is a single feature vector, our problem can be reduced to the logistic bandit problem. Subsequently, using similar analyses to those in the proof of Theorem 2 from Abeille et al. (2021) [4], we can derive a regret lower bound of where . See Section B in the comments below for a more detailed proof.
We are more than happy to include this lower bound. However, we do NOT believe that not stating the problem-dependent lower bound should be considered a weakness, as our focus is to show the regret dependence on and .
Due to space limitations, we will address your questions in the following comments.
B. Proof of problem-dependent () lower bound
In the proof of Theorem 1, both the optimal assortment and an alternative assortment (which only contains feature vectors that maximize the utility in ) have the same feature vectors within each assortment. Let and . Then, by Proposition D.1, we have
Here, maps from to . Since the input to the function is a single vector , can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of where .
Thank you for your response. I appreciate that my main concerns were addressed, so I've improved my score. I'm also grateful to the author for highlighting the technical errors in the previous article. However, I still have a few technical questions regarding the paper. The analysis by Zhang & Sugiyama (2023) seems to rely heavily on the one-step distance of online mirror descent (Lemma 20 of their paper). However, I could not find the one-step distance analysis within the paper. Could you please provide more details on this?
Thank you very much for kindly increasing the score. We sincerely hope that we have addressed all of your concerns and that our work’s significance is now clearer. Your thoughtful feedback has been instrumental, and we believe that the final version will be strengthened as a result.
Questions
Q1. Achieving the optimal results for binary logistic bandits
Yes, it is explained in Lines 291-293. When , and , our problem can be reduced to the logistic bandit problem. Under these conditions, we achieve a regret of by Corollary 1 in Zhang and Sugiyama [49].
Q2-1. Figure 1 suggests UCB outperforms TS in non-uniform settings, contrary to the fact that TS usually achieves better empirical performance than UCB.
While TS often outperforms UCB algorithms in many bandit problems, it is NOT guaranteed to perform better. For example, even in linear bandits, the worst-case regret of TS (Agrawal and Goyal [6]) is known to be worse than that of UCB (Abbasi-Yadkori et al. [1]) as many experts would know. That is, there are instances where UCB can empirically outperform TS. Additionally, the empirical results from Oh and Iyengar [39] further demonstrate that UCB performs better than TS in MNL bandits.
Q2-2. All algorithms appear to incur linear regret in specific scenarios (e.g., (N=100, K=10, d=5, Uniform R)).
First, we would like to clarify that the regret upper bounds of the baseline algorithms (UCB-MNL and TS-MNL) are inversely proportional to . As we explained in Line 196 of our paper, since , the curvature of the regret curves of the baselines decreases as the assortment size increases. This is why their curves appear to be linear.
However, it's important to note that while the regret curves of these baselines may appear linear, they are actually sub-linear, curving very slowly. This indicates that these algorithms learn at a significantly slower rate compared to our algorithm. The table included below clearly demonstrates that the regret curves are indeed sub-linear. The average cumulative regret and the increase in regret every 500 rounds under N=100, K=10, d=5, uniform rewards:
| round | UCB-MNL (cumulative) | UCB-MNL (increase) | TS-MNL (cumulative) | TS-MNL (increase) | Ours (cumulative) | Ours (increase) |
|---|---|---|---|---|---|---|
| 1 | 0.044908 | | 0.046225 | | 0.033627 | |
| 501 | 16.33583 | 16.290922 | 26.57288 | 26.526655 | 7.65057 | 7.616943 |
| 1001 | 31.896535 | 15.560705 | 52.93464 | 26.36176 | 11.883341 | 4.232771 |
| 1501 | 47.225106 | 15.328571 | 79.144988 | 26.210348 | 14.85561 | 2.972269 |
| 2001 | 62.326848 | 15.101742 | 105.145026 | 26.000038 | 17.124088 | 2.268478 |
| 2501 | 77.195593 | 14.868745 | 130.847381 | 25.702355 | 19.058557 | 1.934469 |
| 3001 | 91.963978 | 14.768385 | 156.315643 | 25.468262 | 20.676582 | 1.618025 |
Other scenarios show similar trends.
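As a quick sanity check of the sub-linearity claim, the cumulative-regret values from the table above can be differenced to confirm that the per-500-round increments are strictly decreasing for every algorithm. Below is a minimal illustrative script (not part of the paper's code); the numbers are copied from the cumulative columns, rounded to two decimals.

```python
# Minimal sketch: verify that per-500-round regret increments shrink,
# i.e., the cumulative regret curves are sub-linear (values rounded from
# the table above; rounds 501 through 3001).
cumulative = {
    "UCB-MNL": [16.34, 31.90, 47.23, 62.33, 77.20, 91.96],
    "TS-MNL": [26.57, 52.93, 79.14, 105.15, 130.85, 156.32],
    "OFU-MNL+ (Ours)": [7.65, 11.88, 14.86, 17.12, 19.06, 20.68],
}

for name, values in cumulative.items():
    increments = [round(b - a, 2) for a, b in zip(values, values[1:])]
    decreasing = all(x > y for x, y in zip(increments, increments[1:]))
    print(f"{name}: increments per 500 rounds = {increments}, strictly decreasing: {decreasing}")
```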
A. Technical errors in Agrawal et al. (2023) [5]
1. Equation (16): , where is a design matrix whose columns are the attribute vectors of the items in the assortment and .
- It appears that the authors may have intended to derive this equation using a first-order exact Taylor expansion. However, in MNL bandits, this equation generally does not hold. Consider a counterexample where , for , and . In this case, the left-hand side (LHS) equals (since ), but the right-hand side (RHS) is not , because the denominators of each and differ. This equation only holds in special cases, such as when , which corresponds to the logistic bandit case. Equation (16) serves as the foundation for the entire proof in their paper. Consequently, all subsequent results derived from it are also incorrect.
2. Cauchy-Schwarz inequality in the regret analysis (Page 46)
- When using the Cauchy-Schwarz inequality on the regret before applying the elliptical potential lemma, they indeed incur an additional factor:
- Hence, their regret should actually be , which is worse than the result in our Theorem 2 (upper bound under uniform rewards) by a factor of .
Continued in the next comment.
Thank you once again for your positive reevaluation of our paper. We are glad that we were able to address your major concern!
To answer your question: We could have used Lemma 20 from Zhang & Sugiyama (2023). However, since the leading term of the regret upper bound remains unchanged, we determined that the lemma was not essential and decided not to include it.
First, let’s revisit Lemma 20 from Zhang & Sugiyama (2023). Notably, their bound can be improved by a factor of . Specifically, because (rather than as they suggested), we can derive (instead of ), where .
With this revised version of Lemma 20, we could apply it to derive the same regret upper bound. In Line 746, by taking the absolute value of the regret, we can use a second-order Taylor expansion around (instead of ) as follows:
$
\sum_{t=1}^T | Q(\mathbf{u}_t) - Q(\mathbf{u}_t^\star) | \leq \underbrace{\sum_{t=1}^T | \nabla Q(\mathbf{u}_t)^\top (\mathbf{u}_t^\star - \mathbf{u}_t) |}_{\text{(A)}} + \underbrace{\frac{1}{2} \sum_{t=1}^T| (\mathbf{u}_t^\star - \mathbf{u}_t)^\top \nabla^2 Q(\bar{\mathbf{u}}_t) (\mathbf{u}_t^\star - \mathbf{u}_t) |}_{\text{(B)}}
$
The term (A) can be bounded using a similar analysis to our proof of Theorem 2, except that we need to bound instead of (refer to Lines 763 and 768). Since and , applying Lemma 20 of Zhang & Sugiyama (2023) results in a tighter bound. However, we only apply Lemma 20 to non-leading terms—the second and third terms of Eq. (E.4) (Line 757). Specifically, by using Lemma 20, we can bound these terms by which is dominated by the bound for term (B) at . Thus, the leading term of the regret upper bound remains unchanged.
We hope our response addresses your question. If you have any additional questions or comments, we would be more than happy to address them.
Thank you for your reply. I have no further questions. I will further raise my score to 7.
We sincerely appreciate your decision to further increase the score. It has been a pleasure discussing our work with you. Thank you very much!
The paper closes the gap between upper and lower bounds for contextual MNL bandits over various scenarios, including for varying and uniform/non-uniform rewards. Furthermore, the paper proposes a computationally efficient algorithm that achieves such nearly optimal upper bound, which is validated numerically as well.
Strengths
- Clear and very well-written
- Clear-cut contributions that close all the gaps remaining in the contextual MNL bandits
- Numerical verification
- Various technical contributions, mainly for the analysis but also for the algorithm design (Eqn. (6))
Weaknesses
- Nothing noticeable; see questions
Questions
Questions
- In Remark 2, the authors mention that the optimistic bonus incurs exponential computational cost. But, disregarding the computational complexity, is there any advantage of doing so from the statistical/theoretical perspective? Somewhat related, although the authors have avoided the MLE-based approach due to computational complexity, could using MLE with the intuitions provided here lead to better regret guarantees in other dependencies?
- In the regret upper bounds, there is a transient term . Is this an artifact of the proof? Or is there a hidden warmup phase done implicitly by the OFU-MNL+? Then, if one is okay with incurring such transient terms anyhow, is there potential for doing even better through forced exploration-type approaches (e.g., [1,2] for logistic bandits)?
- Although the -boundedness of is standard in MNL bandits, if we relax this to -boundedness for some , then does this imply that may scale exponentially in , like logistic bandits [3]?
- Can the lower bound in Theorem 1 be written as a local minimax lower bound (e.g., Theorem 2 of [3]) centered around a specific instance? This seems like a better presentation than the current representation, which seems to be a generic minimax, where the max is over all possible problem instances. Also, the lower-bound construction, although inspired by prior works on MNL bandits, resembles a bit of [3] as well. I encourage the authors to check Appendix D of [3], especially the Averaging Hammer technique, and if applicable, include some discussions on this as well. Such instance-specific local minimax lower bound may make the result stronger.
Suggestions
- In the main text, for all the theorem statements, please include (hyper)ref for where the full proof is in the Appendix.
- Although a bit complex, for practitioners, it might be beneficial to include the precise constants for the in Lemma 1 (or point to the Appendix with the constants). A follow-up question: for the experiments, do the authors use the precise parameter as specified theoretically?
- Might be beneficial to move the problem setting to an earlier part of the paper, but not too sure here.
[1] https://arxiv.org/abs/2404.06831
Limitations
Yes
Thank you very much for your positive evaluation of our paper and for your insightful and valuable feedback. Each question you raised is an intriguing research topic, shedding light on new areas for future study.
Advantage of optimistic bonus
First, we want to clarify that the simple adaptation of Zhang and Sugiyama [49] to our method incurs "exponential computational cost" -- this is NOT an inherent drawback of our method. To the best of our knowledge, the adaptation of an optimistic bonus, as done in Zhang and Sugiyama [49], to our method does not provide any statistical advantage, despite the exponential computation cost.
Could using MLE with the intuitions provided here lead to better regret guarantees in other dependencies?
To the best of our knowledge, given that our regret upper bound is minimax optimal (up to logarithmic factors), we believe that the use of MLE may not improve our regret bound polynomially. However, it might be possible to affect the bound logarithmically, which could be of future interest.
Regarding and the potential for forced exploration-type approaches
The term is an "artifact" of the proof, resulting from the second-order Taylor expansion of the regret. We do not have any warm-up phase or forced exploration in the algorithm. Hence, we are not sure whether forced exploration-type approaches can be beneficial. Based on your feedback, we considered adding an initial/intermediate forced-sampling step, but we do not believe the regret can be improved (in our current analysis). This would be an interesting area for future work.
If we relax the bound of to -boundedness for some , then does this imply that may scale exponentially in , like logistic bandits [3]?
Yes, by the definition, is affected by . Specifically, (which appears in the lower-order term in our upper bounds) scales with .
Instance-specific local minimax lower bound
As you suggested, we have checked the Averaging Hammer technique and found that an instance-dependent lower bound is indeed achievable. Thank you very much for your valuable feedback. Your insights have greatly strengthened our paper!
We can achieve a regret lower bound of by applying Theorem 2 from Abeille et al. (2021), where .
In the proof of Theorem 1, both the optimal assortment and an alternative assortment (which only contains feature vectors that maximize the utility in ) have the same feature vectors within each assortment. Let and . Then, by Proposition D.1, we have
Here, maps from to . Since the input to the function is a single vector , can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of . We will include this result in our final paper.
Suggestions
We appreciate all your suggestions and will incorporate them into the paper as you proposed.
To answer your additional question about whether we used precise parameters in the experiments: yes, we used the exact values.
Thank you for your detailed responses. I'm quite happy that the lower bound holds in a local minimax sense, which I believe is a much stronger notion for claiming instance-wise optimality of the algorithms. All my questions and concerns have been addressed, and after reading through other reviews, other significances of this paper (not requiring the knowledge of for the algorithm, instance-specific upper/lower bounds available, additional experiments) have come to my attention.
Thus, I'm further raising the score and willing to advocate strongly for the paper. Congratulations on the nice work!
Some additional questions/suggestions:
- Exactly where in the proof is the transient term popping up? Could one hope for an arm-set geometry-dependent transient term, as in the "detrimental arms" of Abeille et al. (2021)?
- I don't think this was mentioned properly in the paper. Is the obtained confidence sequence (CS) the first -free CS in contextual MNL bandits? If time allows, for the camera ready (not the rebuttal), it would be interesting to see the resulting CS at the final stage and compare it to existing CSs.
- For the camera-ready ver, I would like to see explicit dependencies on as well, with the more general assumption. Also, under the -boundedness, would all the constants for the algorithm () as well as the regret also depend on ?
We truly appreciate your strong support for our paper! The discussions with you during this rebuttal period have been both highly valuable and genuinely enjoyable. Each of your comments has been very constructive and has helped us improve our work. Below are our responses to your additional questions.
Q1. Regarding the transient term
The transient term pops up when bounding the second-order term of the regret (term (B) in Line 746), analogous to what is observed in logistic bandits as discussed by Abeille et al. (2021) [3] and Faury et al. (2020) [18]. In their work, the second-order regret term also scales with . Global information (measured through ) is pushed into a second-order term that vanishes quickly.
To address the question of whether an arm-set geometry-dependent transient term is possible, we believe that directly applying the proof from logistic bandits is not feasible. Extending it to our setting presents several challenges:
- Setting: We would need to shift from our current setting (adversarial contexts) to a fixed-arm setting.
- Multiple item selection: In Abeille et al. (2021), the tighter bound for the second-order term is largely achieved by focusing on whether holds or not. However, in our setting, the optimal assortment includes multiple items, which makes it much more complex to divide the cases in a similar way.
- Lemma 9 is not applicable: To bound the second-order term, Abeille et al. (2021) use Lemma 9, which is only valid when the input to the function is a scalar. Since our function (Line 742) takes a -dimensional vector as input, this lemma does not apply to our case.
There are likely additional complexities that make it challenging to derive a bound for the second-order term that depends on the arm structure. Nonetheless, we believe this is a highly interesting direction for future research. Thank you for the intriguing question!
Q2. Regarding confidence sequence
Goyal and Perivier (2022) [24] also achieved a -free CS (see Proposition 2 in their paper). However, they constructed their CS based on the MLE, which incurs a computational cost of per round to estimate the parameter. More importantly, their CS is non-convex, making it intractable to compute an optimistic parameter .
We plan to compare our resulting confidence set (Lemma 1) with other works in the camera-ready version. Thank you for the constructive suggestion!
Q3. Explicit dependencies on
We agree that assuming -boundedness of the parameter norm would enhance clarity. We will update the assumption and the corresponding constant in our camera-ready version. With the -boundedness assumption, the constants for the algorithms are as follows:
- :
- :
- :
$
\begin{aligned}
&\sqrt{(\log(K+1)+B+1) \left( ( 3\log(1+(K+1)t) + B + 2 ) \left( \frac{17}{16}\lambda + 2\sqrt{\lambda} \log \left( \frac{2 \sqrt{1+2t}}{\delta} \right) + 16 \left( \log\left( \frac{2\sqrt{1+2t}}{\delta} \right) \right)^2 \right) + 2 + \frac{7\sqrt{6}}{6} \eta d \log \left( 1 + \frac{t+1}{2\lambda}\right) \right) + 2 B^2} \\ &= \mathcal{O}(B \sqrt{d} \log K \log t )
\end{aligned}
$
The paper studies the contextual multinomial logit bandit problem. In this problem, a learner interacts with the environment over a sequence of rounds. At each round, the learner receives features for arms and decides to select up to of them (called an assortment) and play them. The environment might select one of these arms with a probability determined by a multinomial distribution, and the learner receives a reward for the selected arm. The goal is to select the optimal assortment at each round that minimizes the cumulative regret of the algorithm.
The authors design algorithms for cases when the rewards are uniform (same for each arm) and non-uniform. They also provide matching lower bounds (for large enough ). Finally numerical experiments are performed to demonstrate the practical performance of the algorithm.
Strengths
The authors do a great job of explaining the problem statement and prior work and position their work appropriately. I believe this is a very important direction since bandit algorithms have recently started getting popular in practical settings where the assumption of linear rewards no longer holds. Getting optimal regret guarantees for such non-linear models has gained good attention recently and the paper contributes to this in a clear non-trivial way.
It's great to see that the authors prove matching lower bounds for the uniform case (for general ) and for non-uniform case (for bounded ). This adds further credibility to their algorithm and the upper bound.
Weaknesses
I am not an expert on the MNL problem, but I was wondering if the algorithm offers anything new to the logistic bandit setting. Recently, a lot of work has gone into obtaining computationally efficient algorithms with optimal regret (independent of a problem non-linearity parameter ). It would be good to see if anything new can be said about that setting using the results in this paper. Also, empirical comparisons on both regret and time complexity (for the logistic bandit, i.e., K=1, as well as larger K) would be a good addition to the paper.
Minor comment - Typo on line 183. The correct definition should be .
My expertise in this topic is limited and I will wait to see what the other reviewers think as well.
Questions
Is the problem-dependent constant in line 189 the same as the non-linearity defined in prior work for logistic bandits (see [4,20,21])? It seems to be defined slightly differently here, so I was curious whether it means the same thing, at least in the logistic case.
Limitations
N/A
Thank you for acknowledging the value of our work and providing a positive evaluation. We will address your questions below and make improvements to our paper based on your suggestions.
Anything new to the logistic bandit
There are interesting points that can be said about the logistic bandit setting:
1. The MNL bandit is better than the logistic bandit (under uniform rewards).
According to Theorem 2 in our paper, a large can be beneficial under uniform rewards. This suggests that in recommendation systems, it is advantageous to recommend multiple items (MNL bandit) rather than just a single item (logistic bandit). This aligns with the intuition that offering multiple items allows us to gather more information from user feedback. However, in real-world scenarios, offering a large assortment might lead to user fatigue. It would be interesting for future research to consider this user fatigue and determine an appropriate size of the assortment .
2. As the value of the utility for the outside option increases, both the upper and lower regret bounds for the logistic bandit decrease.
When , our setting simplifies to the logistic bandit scenario. In this case, the upper and lower bounds are and respectively. This highlights an interesting aspect about how the value of influences the regrets in the logistic bandit.
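For reference, a sketch of how $v_0$ enters the choice model (standard MNL notation; the paper's Eq. (1) may differ in minor details): the probability of choosing item $i$ from assortment $S$, and of choosing nothing, are
$
p_t(i \mid S, \mathbf{w}) \;=\; \frac{e^{x_{ti}^\top \mathbf{w}}}{v_0 + \sum_{j \in S} e^{x_{tj}^\top \mathbf{w}}},
\qquad
p_t(0 \mid S, \mathbf{w}) \;=\; \frac{v_0}{v_0 + \sum_{j \in S} e^{x_{tj}^\top \mathbf{w}}},
$
so a larger $v_0$ makes the outside option more likely, and with a single offered item and $v_0 = 1$ the item probability reduces to the logistic function $\sigma(x_{t1}^\top \mathbf{w})$, recovering the binary logistic bandit mentioned above.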
Additional empirical comparison
Following your suggestion, we have conducted additional experiments for the logistic bandit case (K=1) and cases with larger (). See the "global" response above. These additional results also corroborate our theoretical claims. Particularly, under uniform rewards, we observed that regret decreases as increases, aligning with Theorems 1 and 2. In contrast, under non-uniform rewards, regret remains unchanged regardless of changes in , consistent with Theorems 3 and 4.
We are pleased to provide these additional results that further substantiate the claims in our paper. Thank you very much!
Typo
Thank you very much for identifying the typo. We will make the necessary correction and ensure it is reflected in the final version of the paper. Your attention to detail is greatly appreciated.
The authors consider the problem of sequentially picking assortments of items in an online manner over a horizon of rounds, where a user then selects from the items or not according to a multinomial logit model with a fixed unknown preference vector. The items are represented as (known) feature vectors which can change adversarially from round to round. The authors prove tighter cumulative regret lower bounds for both uniform and non-uniform cases and propose an algorithm with matching upper bounds. The authors account for the probability of a user not choosing from the given options in the bounds. The authors also run experiments to compare their algorithm to baselines.
Strengths
Major
- Improved lower bounds for contextual MNL bandits in the uniform-reward setting, accounting for the impact of the probability of the null action.
- improved lower bounds also in the non-uniform reward setting.
- A tractable algorithm is proposed with matching regret upper bounds for both uniform and non-uniform reward setting
Minor
- The authors include experiments for non-uniform rewards, showing significant improvements in both run-time and cumulative regret for included baselines.
- Good discussions and remarks on effects of problem setup, such as and whether rewards are uniform or not on regret bounds, and relations with prior works (among those discussed to some length).
- Overall I found the writing to be good. Table 1 is helpful.
Weaknesses
Minor
- Zhang and Luo "Contextual Multinomial Logit Bandits with General Value Functions" [48] is only mentioned once, in line 128, in a sentence about past works claiming -independent regret lower bounds while not accounting for (with the next sentence pointing out the issue that several lower bounds used ). However, in [48] they propose algorithms and show regret bounds for a harder problem (adversarial contexts and adversarial rewards, and for more general settings than MNL models) that avoid dependence in the leading term, though for the simpler problem studied in the current paper they may have loose assortment-size dependence. In [48] it looks to me like they consider , not as implied by the current paper. Thus, while [48] is cited, to me it seems to be improperly cited; the results should be more thoroughly discussed and, unless some justification is given, should be included in Table 1.
- State-of-the-art upper bounds for non-contextual MNL are not discussed. Non-contextual MNL lower bounds are included in Table 1 and discussed, but unless I overlooked it, I did not see anything on upper bounds, even from papers like Agrawal et al. (2019) "MNL-bandit..." [8] that were discussed for lower bounds. (Or do the contextual upper bounds with recover them in the current state of the art?)
- Agrawal et al. “A tractable ..” 2023 [5] is cited in a few places regarding the problem setting (like in line 186 for assumptions 1 and 2, line 282 for , …), but not included in Table 1 or discussed in detail. They achieve improved regret upper bound dependence on from prior works using a reportedly tractable algorithm. Agrawal et al. [5] consider a (contextual) combinatorial assortment problem using an MNL model with adversarial contexts and uniform rewards and .
- Choi et al. “Cascading Contextual Assortment Bandits” in NeurIPS 2023 is not cited. How would their results compare? Their regret upper bound (specialized to the non-cascading setting you consider) looks competitive for and uniform reward setting. Another cascading assortment paper by Cao and Sun “Tiered Assortment: Optimization and Online Learning” in Management Science is also not cited, though they acknowledge in their Remark 2 that in the special case of a single tier, their result matches Agrawal et al. (2019) “MNL-bandit...” [8].
- The authors consider the parameter vector bounded by instead of bounded by a known constant as in previous works (such as Agrawal et al. 2023 [5] and Lee et al. 2024 [32]). As pointed out by Lee et al. 2024 [32], "These quantities [] can scale exponentially in S in the worst-case (Faury et al., 2020)." How would the results be impacted by accounting for such a known bound instead of setting it to 1?
- A clearer discussion of the computational complexity of the methods listed as "intractable" (in particular Goyal and Perivier 2022 [24]) would be better, both the worst case and (if not included in experiments) a conjecture about where they would have been in Fig. I.1 (e.g., even for those instances, would they have been so slow as to be off the charts?).
Very Minor
- [24] cites an arxiv version of a NeurIPS 2022 paper
Questions
(some questions are included above in comments made in “Weaknesses”)
Limitations
Yes
Thank you very much for your insightful and valuable feedback. As we start our discussion, we first would like to ask for your patience with our lengthy responses. We have much to share, as we have genuinely appreciated your feedback and look forward to our discussion with you! We will ensure that all the following discussions are included in our final version.
Regarding Zhang and Luo (2024)
First, we would like to clarify that our setting also considers adversarial contexts and adversarial rewards. In our work, the term "the non-uniform rewards" is equivalent to the adversarial rewards. Thus, apart from the fact that Zhang and Luo (2024) [48] use a "general value function", other problem settings in [48] are not necessarily harder than ours.
Even though their tightest regret bound (Corollary 17) avoids dependence on based on the general value function, their regret still scales with and the number of items . Moreover, as they mentioned in their paper, there is no efficient way to implement the algorithm.
To clarify, in Line 128, our point in citing [48] was to emphasize that previous studies' oversight of 's impact on regret lower bounds has caused substantial confusion in the field. They stated that in the non-contextual setting is optimal; however, this claim only holds true when .
Upper bounds for non-contextual MNL bandits
We plan to discuss the upper bounds in the non-contextual setting to clearly illustrate how previous studies have derived regret bounds under varying conditions, such as different values and reward structures. For instance, Agrawal et al. (2019) achieved an upper bound with " and non-uniform rewards". However, the known tightest lower bound, , was proven under settings of " and uniform rewards".
Comparison to Agrawal et al. (2023)
In fact, their paper contains significant technical errors (see Section A in the comments below), rendering a direct comparison to our work meaningless. This is the main reason we initially chose not to compare our results with theirs.
Even if all errors were corrected, our result (Theorem 2) still shows an improvement over theirs in terms of . Agrawal et al. (2023) considered a uniform rewards setting (with ) and achieved a regret of . However, the corrected regret should be .
Comparison to cascading assortment problems
Recently, both works have considered the cascading assortment bandit problem, which encompasses the MNL bandit problem as a special case where the cascading length is . However, these works do not encompass our results for the following reasons.
- vs Choi et al. (2023): They consider uniform rewards, and achieve a regret upper bound of , which avoids dependence on both the cascading length and in the leading term. When the cascading length is , our result (Theorem 2) improves upon theirs by a factor of . Moreover, their computation cost per round is since they employ MLE to estimate the parameter.
- vs Cao and Sun (2023): They consider non-uniform rewards, and achieve a regret upper bound of , where is the cascading length and for all , , and . However, their regret bound still suffers from a harmful dependence on , which can be exponentially large.
What happens if ? (+ problem-dependent bounds)
When (we use instead of to avoid notational overlap), the confidence bound is . Consequently, the regret upper bounds are as follows:
- Uniform rewards: . It is important to note that one of our main goals is to explicitly demonstrate how the regret depends on and . In deriving such a result, the dependence on is unavoidable to the best of our knowledge. Note that for the instance-dependent regret, the regret bound does not depend on (see below).
- Non-uniform rewards: .
And the regret lower bounds remain the same, because we can always construct an instance where even if . Note that a small norm of typically indicates a more challenging instance, as it becomes harder to identify the optimal arm. Overall, our upper and lower bounds in terms of , and (and ) are still tight, maintaining the (near) minimax optimality in terms of those problem parameters, which is our main focus.
Moreover (as per other reviewers' request), it turns out that instance-dependent upper and lower bounds under uniform rewards (without dependence on ) are achievable. By making some modifications to previous results, we can easily achieve this. See Section B in the comments below for the proofs.
Computational complexity of Goyal and Perivier (2022)
The computational bottleneck of Goyal and Perivier (2022) [24] is to find an optimistic parameter . However, since the confidence set is non-convex, computing is extremely expensive, and the worst-case cost is not well-defined. A possible solution could be a convex relaxation, as suggested by Proposition 3 in Abeille et al. (2021) [4], though exploring this is beyond the scope of our work.
$
\begin{aligned}
\left|\sum_{t \in J_1}\nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \right| &=\Bigg|\sum_{t \in J_1} p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) x_{ti}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}} - 2 p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) \| x_{ti} \|_{H_t^{-1}} \sum_{j \in S_t}p_t(j \mid S_t, \bar{\mathbf{w}}_t) x_{tj}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \Bigg| \\ &\leq 3 \beta_{T}(\delta) \sum_{t=1}^T \max_{i \in S_t} \| x_{ti} \|_{H_t^{-1}}^2 = \tilde{\mathcal{O}}(d^{3/2}/\kappa).
\end{aligned}
$
Hence, we get .
Now, we bound the sum over $J_2$.
$
\begin{aligned}
\sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star)
&\leq \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&\leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \sqrt{ \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&= \tilde{\mathcal{O}}\left( \sqrt{d} \cdot \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \right),
\end{aligned}
$
where the second inequality holds by the definition of $J_2$ and Lemma 11 in Goyal and Perivier [24]. Hence, we get . Therefore, by solving , we can conclude that .
2. Lower bound
We can also derive a regret lower bound of by applying Theorem 2 from Abeille et al. (2021).
Proof of the lower bound: In the proof of Theorem 1, both the optimal assortment and an alternative assortment (which only contains feature vectors that maximize the utility in ) have the same feature vectors within each assortment. Let and . Then, by Proposition D.1, we have
$
\sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in S_t} p (i \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in \tilde{S}_t} p (i \mid \tilde{S}_t, \mathbf{w}^\star) =: f(x;\mathbf{w}^\star) - f(\hat{x};\mathbf{w}^\star).
$
Here, maps from to . Since the input to the function is a single vector , can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of .
Thanks to the authors for the detailed response. I have read the rebuttal and decided to increase my score. I do not have further questions.
We sincerely appreciate your decision to increase the score. Your valuable and constructive comments will greatly contribute to strengthening the final version of our paper. Thank you for the insightful and enjoyable discussion.
A. Technical errors in Agrawal et al. (2023) [5]
1. Equation (16): , where is a design matrix whose columns are the attribute vectors of the items in the assortment and .
- It appears that the authors may have intended to derive this equation using a first-order exact Taylor expansion. However, in MNL bandits, this equation generally does not hold. Consider a counterexample where , for , and . In this case, the left-hand side (LHS) equals (since ), but the right-hand side (RHS) is not , because the denominators of each and differ. This equation only holds in special cases, such as when , which corresponds to the logistic bandit case. Equation (16) serves as the foundation for the entire proof in their paper. Consequently, all subsequent results derived from it are also incorrect.
2. Cauchy-Schwarz inequality in the regret analysis (Page 46)
- When using the Cauchy-Schwarz inequality on the regret before applying the elliptical potential lemma, they indeed incur an additional factor:
- Hence, their regret should actually be , which is worse than the result in our Theorem 2 (upper bound under uniform rewards) by a factor of .
B. Problem-dependent bounds
1. Upper bound
By adapting Theorem 2 from Faury et al. (2022) [21] to the multinomial and online parameter setting, we can achieve a regret upper bound of , where .
Proof of the upper bound: We start the proof from line 748 in the proof of Theorem 2.
$
\sum_{t=1}^T \nabla Q(\mathbf{u}_t^\star)^\top (\mathbf{u}_t - \mathbf{u}_t^\star) \leq 2 \beta_T(\delta) \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}.
$
Let $g_t(S_t; \mathbf{w}) := \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}) p_t(0 \mid S_t, \mathbf{w}) \| x_{ti} \|_{H_t^{-1}}$. Let $J_1 = \{ t \in [T] \mid \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})\}$ and $J_2 = [T] \setminus J_1$. Thus, we get
$
\text{RHS} = 2 \beta_T(\delta) \sum_{t=1}^T g_t(S_t; \mathbf{w}^\star) = 2 \beta_T(\delta) \sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) + 2 \beta_T(\delta) \sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star).
$
To bound the sum over $J_1$, by the mean value theorem, we get
$
\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) + \sum_{t \in J_1} \nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}),
$
where $\bar{\mathbf{w}}_t$ is a convex combination of $\mathbf{w}^\star$ and $\mathbf{w}_{t+1}$. The first term can be bounded by
$
\begin{aligned}
\sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})} \sqrt{\sum_{t \in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti}\|_{H_t^{-1}}^2 } \\ &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) \leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) ,
\end{aligned}
$
where the second-to-last inequality holds by the definition of $J_1$ and the elliptical potential lemma in our paper, and the last inequality holds by Lemma 11 in Goyal and Perivier [24]. Moreover, the second term can be bounded by
Continued in the next comment.
We would like to express our heartfelt thanks to all the reviewers for their constructive feedback. Your insights and suggestions are greatly appreciated.
In response to Reviewer 8Ks7's request, we conducted additional experiments for the logistic case () and for larger values of (). The results of these experiments are detailed in the attached PDF. These results also support our theoretical claims. Specifically, under uniform rewards, we observe that the regret decreases as increases, aligning with Theorems 1 and 2. Conversely, under non-uniform rewards, the regret remains unaffected by changes in , consistent with Theorems 3 and 4.
We are pleased to provide additional results that further strengthen the claims in our paper. Thank you very much!
This paper explores the problem of contextual multinomial logit (MNL) bandits, an elegant abstraction that captures many real-world decision-making scenarios. Previous research has shown a clear theoretical gap between upper and lower bounds, particularly with respect to the maximum assortment size . The authors thoroughly investigate this issue and propose several innovative algorithms with provable regret bounds, demonstrating their results' near-minimax optimality. These findings are very nice, and the reviewers unanimously recognize the value of these results. On the other hand, they also highlight several points for improvement. Key points include addressing experimental issues and providing clearer distinctions between the paper’s contributions and earlier work, such as Zhang and Luo [48] and Zhang and Sugiyama [49]. The authors are encouraged to revise the manuscript to incorporate the reviewers' suggestions and enhance its impact on the field.