Improved Online Confidence Bounds for Multinomial Logistic Bandits
We establish an improved online confidence bound for MNL models and, leveraging this result, propose a constant-time algorithm that achieves the tightest variance-dependent optimal regret in contextual MNL bandits.
Abstract
Reviews and Discussion
This work extends MLE-based online confidence sequences for generalized linear models that are free of the norm bound of the parameter set to the setting of multinomial logistic bandits, giving a statistically SOTA algorithm (in terms of its dependence on the problem parameters) that is also variance-dependent. The work also provides a computationally optimal algorithm under the relatively well-known "online Newton step" framework based on ellipsoidal confidence sets.
Questions for Authors
- The online confidence sequence is claimed to be the "tightest", so could the authors provide some lower bounds on the confidence radius itself, or at least some intuition on why we could not do better?
Claims and Evidence
Although calling the confidence sequence the "tightest" is a bit of an over-claim (absent a concrete lower bound), the mathematical results themselves are very clear and correct.
Methods and Evaluation Criteria
N/A for this theoretical paper.
Theoretical Claims
The results for both the computationally optimal algorithm and the statistically SOTA algorithm are well-supported by theorems and lemmas. The paper is well-written and easy to follow.
Experimental Design and Analysis
The simulation results are consistent with the theoretical claims. The proposed algorithm outperforms Thompson-sampling-based baselines in simulation, even down to constant factors, which is surprising.
Supplementary Material
The supplementary material is well-written and the proofs are complete and correct.
Relation to Existing Literature
The setting of multinomial bandits is well-motivated in operations research and well-established in the literature, whether UCB-based or posterior-sampling-based. The paper fits well in this line of research.
The techniques on mirror descent fit in the line of classic online convex optimization, which is also a well-established area.
The statistical techniques for sharp confidence intervals for generalized linear models have been developed recently, and the paper fits well in this line of research.
Missing Important References
No significant oversight.
Other Strengths and Weaknesses
N/A
Other Comments or Suggestions
N/A
Thank you very much for your positive evaluation of our paper and for your valuable feedback! Below is our response to your question:
The online confidence sequence is claimed to be the "tightest", so could the authors provide some lower bounds on the confidence radius itself, or at least some intuition on why we could not do better?
When we referred to our result as the "sharpest online confidence bound," we meant that, to the best of our knowledge, it is the sharpest among existing works—not that it is fundamentally optimal or cannot be further improved. In the final version, we will clarify this point to prevent any potential misunderstanding or impression of overclaiming. We sincerely appreciate your constructive feedback, and please don’t hesitate to reach out if you have any further questions.
This paper enhances the regret bound for multinomial logistic bandits. The multinomial logistic bandit problem is a contextual bandit setting where each item is associated with a context vector, the player can select up to K items simultaneously, and the reward is determined probabilistically by a logistic model. Specifically, the reward distribution is governed by multiple logits, each computed as the inner product between an item’s context vector and a hidden environment vector. To maximize rewards, the player must simultaneously learn this hidden vector while selecting high-reward items. A common approach involves an optimistic strategy, which first constructs a confidence set that is likely to contain the hidden vector and then acts as if the best possible vector within this set is the true environment vector. This paper introduces a refined confidence set that is tighter than existing ones, leading to improved regret guarantees. The authors analyze the regret of playing with this confidence set and demonstrate that it achieves better regret bounds with reasonable computational cost, maintaining constant per-round complexity.
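To make the setup concrete, here is a minimal simulation sketch of the MNL choice model and of an optimistic assortment-selection step. All names (X, theta_star, K, and the sampled candidate set standing in for a confidence set) are illustrative assumptions; this is not the paper's algorithm or notation.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
d, n_items, K = 5, 10, 3                 # feature dimension, number of items, max assortment size
theta_star = rng.normal(size=d)          # hidden environment vector
X = rng.normal(size=(n_items, d))        # context vector for each item
rewards = np.ones(n_items)               # uniform item rewards, for simplicity

def mnl_probs(assortment, theta):
    """MNL choice probabilities over the offered items (remaining mass is the no-purchase option)."""
    logits = X[list(assortment)] @ theta
    m = logits.max()
    expu = np.exp(logits - m)
    return expu / (np.exp(-m) + expu.sum())   # exp(u_i) / (1 + sum_j exp(u_j)), stabilized

def expected_reward(assortment, theta):
    return mnl_probs(assortment, theta) @ rewards[list(assortment)]

# Optimistic step (schematic): act as if the best parameter in a "confidence set" were true.
# A finite sample around a rough estimate stands in for the confidence set here.
theta_hat = theta_star + 0.3 * rng.normal(size=d)
candidates = [theta_hat + 0.2 * rng.normal(size=d) for _ in range(20)]
best_assortment = max(
    combinations(range(n_items), K),
    key=lambda S: max(expected_reward(S, th) for th in candidates),
)
print("optimistic assortment:", best_assortment)
```

The paper's actual method replaces the sampled candidate set with a rigorously constructed confidence set and keeps per-round computation constant; the snippet only illustrates the problem interface.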
Questions for Authors
- Could you provide more applications where the multinomial logistic bandit can be utilized? Discussing real-world use cases would help strengthen the motivation for this work.
- Is there a known lower bound on the regret? Additionally, can you discuss whether the variance-dependent term has an optimal form?
Claims and Evidence
All of their claims are supported by theoretical proofs.
Methods and Evaluation Criteria
The proposed methods make sense.
Theoretical Claims
The proofs of all the theorems appear to be correct, although I have not verified every detail. In the proof of Theorem 4.2, the observation that self-concordance holds with respect to the norm is particularly interesting, and the introduction of a novel intermediary term effectively completes the proof. The proof technique is non-trivial and highly insightful. Other theorems are also well-developed and thoroughly discussed.
Experimental Design and Analysis
The experimental results are limited, as they only consider synthetic environments. However, given the theoretical nature of this work, this limitation is not a significant concern.
Supplementary Material
The supplementary material is extensively dedicated to proofs. Unfortunately, I do not have sufficient time to verify the entirety of the 30-page-long proof.
Relation to Existing Literature
I believe the proof technique could be applicable to other bandit settings.
Missing Important References
None
Other Strengths and Weaknesses
Strengths
- This paper improves the regret bound for the Multinomial Logistic Bandit without increasing computational complexity.
- The proof is not merely a combination of existing ideas but introduces novel techniques.
Weaknesses
- The paper would benefit from a more detailed discussion of the motivations and potential applications of the multinomial logistic bandit.
- Including discussions of real-world scenarios and experimental evaluations could further strengthen the paper.
Other Comments or Suggestions
None
Thank you for your positive review! We are very pleased that the reviewer appreciates the significance of our technical contributions. In particular, we’re delighted that you found our proposed -norm self-concordant property interesting. We hope this new property will inspire further research on MNL models, extending beyond the MNL bandit setting.
Please find our responses to your questions below:
Motivation and Applications
The MNL bandit model offers a powerful and flexible framework for modeling choice behavior in sequential decision-making settings, where a decision-maker presents a subset of items (an assortment) and observes user selections based on relative utilities. This structure naturally arises in a wide range of real-world applications, making the study of MNL bandits both practically relevant and theoretically important. Some key domains include:
- E-commerce and Online Retail: Platforms choose a subset of products to display to each customer. The customer's purchase behavior is naturally modeled using discrete choice models such as MNL.
- Recommender Systems: Services like Netflix or Spotify present a small selection of items (e.g., movies or songs), and the user selects one. Modeling this behavior with an MNL model enables the system to learn user preferences and optimize engagement.
- Online Advertising: Ad platforms select a set of ads to display, aiming to maximize click-through rate (CTR) or conversions. The MNL model captures how user choice probabilities depend on both the ad features and the composition of the ad set.
- Dynamic Pricing and Assortment Optimization: Retailers dynamically select assortments subject to pricing or inventory constraints. The MNL model reflects how customer choices shift in response to changes in availability and pricing.
- Potential Application to RLHF: The MNL bandit framework also has promising applications in Reinforcement Learning from Human Feedback (RLHF). Human preferences in RLHF are often modeled using Bradley-Terry-Luce (BTL) or Plackett-Luce (PL) models. Since MNL generalizes BTL to multi-choice settings (see the display below) and serves as a building block for PL (ranking models), it provides a natural and flexible foundation for modeling richer forms of preference feedback in RLHF.
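To make the connection to BTL concrete, recall the standard MNL choice probability and its two-item, no-outside-option specialization (the utility notation $u_i$ below is illustrative):

$$
\mathbb{P}(\text{choose } i \mid S) = \frac{e^{u_i}}{1 + \sum_{j \in S} e^{u_j}}, \qquad \mathbb{P}_{\text{BTL}}(i \succ j) = \frac{e^{u_i}}{e^{u_i} + e^{u_j}},
$$

so offering exactly two items with no outside option recovers the BTL comparison probability.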
We believe that the MNL bandit model provides not only a unifying theoretical framework but also a practically impactful tool for real-world sequential decision-making. We will consider expanding this motivation section in the final version. We sincerely appreciate the suggestion to highlight these broader applications.
Variance-dependent lower bound
As discussed in Remark 4.7, our proposed algorithm achieves a regret upper bound that is nearly minimax optimal in both uniform and non-uniform reward settings, when compared to the worst-case (variance-independent) lower bounds established by Lee and Oh (2024). To the best of our knowledge, a formal variance-dependent lower bound has not yet been established for MNL bandits. The most closely related result is the instance-dependent lower bound of Lee and Oh (2024), which depends on the variance of the optimal assortment under uniform rewards (see Proposition 1 in their paper and Appendix E.2 of ours for further discussion) rather than on the assortment selected by an algorithm.
Thus, whether a fully variance-dependent lower bound can be derived in the MNL bandit setting remains an open question. Very recently (less than three weeks ago), He and Gu (2025) [1] established the first variance-dependent lower bound for general variance sequences in linear contextual bandits. Their techniques may be adaptable to the MNL setting, and we view this as an interesting direction for future research.
[1] He, Jiafan, and Quanquan Gu. "Variance-Dependent Regret Lower Bounds for Contextual Bandits." arXiv preprint arXiv:2503.12020 (2025).
This paper studies the multinomial logistic bandits problem, in which the learner submits an assortment of at most K arms and then receives binary feedback following the MNL model. The main contribution of the paper is an improved method for this problem, yielding a tighter online confidence set that improves upon previous work in several problem-parameter factors. In addition, the paper provides a variance-dependent bound for the multinomial logistic bandit model and an MLE-based method that removes one such dependence entirely.
Questions for Authors
- Could you further emphasize the challenges of extending the warm-up technique from Faury et al. (2022) from an offline to an online setting?
- Could you provide more discussion on the significance of the variance-aware bound in cases where the variance depends on the selected set of arms?
- Although Theorem 4.5 successfully removes the problematic factor from the leading term, the dependence on it in the remaining term appears large compared to Lee et al. (2024). Could you elaborate on the tightness of this dependence?
I would be happy to increase my score if the above concerns are adequately addressed.
Claims and Evidence
From my perspective, the paper's claims are supported by clear evidence.
Methods and Evaluation Criteria
The regret bound is used as the standard metric for the logistic bandit problem.
Theoretical Claims
The proofs seem sound, although I haven't verified the details exhaustively.
Experimental Design and Analysis
From my perspective, the experimental comparison is adequate.
Supplementary Material
I have briefly reviewed Appendix D.
Relation to Existing Literature
This paper presents an improved confidence set for the MNL bandit problem. Its primary results improve on the work of Lee and Oh (2024) in several parameter factors, and a term that was previously multiplicative is now additive.
Missing Important References
The related work is appropriately cited and discussed; however, the technical contribution relative to Faury et al. (2022) remains somewhat ambiguous. Please refer to the strengths and weaknesses section for further details.
Other Strengths and Weaknesses
Strengths of this Paper:
- The paper offers several improvements over previous work on MNL bandits, with improved dependence on several problem parameters. Additionally, it introduces a variance-dependent regret bound.
- Experiments are provided to demonstrate the effectiveness of the proposed methods.
Weaknesses of the Paper: Although the improvements are interesting, my main concern pertains to the overall significance of the results and the technical contribution.
- Regarding the improved parameter dependence, the warm-up technique employed here appears closely related to that of Faury et al. (2022). As discussed in Remark 4.8, the primary difference is that the previous work utilises MLE, while the current study adopts online mirror descent. However, since Faury et al. (2022) also propose an adaptive method to verify the condition on the fly, it remains unclear how challenging this extension would be for the MNL problem.
- Concerning the other improvement, one of the main techniques is the constant self-concordant property of the MNL loss, which was already established by Lee and Oh (2024).
- The variance-dependent optimal bound is interesting; however, it is unclear how large the variance term may be, given that it depends on the set of arms selected by the learner. This dependence is somewhat unfavourable, as previous variance-dependent bounds in linear bandits are typically independent of the learner's predictions [1,2].
[1] Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP. NeurIPS 2021.
[2] Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. NeurIPS 2022.
Other Comments or Suggestions
Please see the comments above.
Thank you for taking the time to review our paper and provide valuable feedback.
We would like to clarify that our main contribution is the development of the first -free, variance-dependent optimal regret bound for MNL bandits, which cannot be achieved by existing algorithms such as those proposed by Faury et al. (2022) or Lee and Oh (2024). We provide detailed explanations below and sincerely hope this clarification helps convey the novelty of our approach.
Comparison to Faury et al. (2022)
While our algorithm shares some conceptual similarities with Faury et al. (2022)—notably the adaptive warm-up strategy—it fundamentally differs from their method in several key aspects beyond just the use of online updates:
- Decision rule: The decision rule used in the adaptive algorithm (ada-OFU-ECOLog) by Faury et al. (2022) does NOT yield a -free confidence bound. Their confidence bound, (defined in Equation (15) of their paper), has an order of . This -dependence arises primarily because their method does not guarantee that the diameter of the search space, , remains small or bounded by a constant over time.
This is notably distinct from their non-adaptive algorithm (OFU-ECOLog, fixed-arm setting), which achieves a tighter confidence bound of order . We believe the reviewer may have inadvertently confused the confidence bound of ada-OFU-ECOLog with that of the non-adaptive version (OFU-ECOLog).
In contrast, our proposed decision rule always guarantees a confidence bound that is completely independent of . Combined with our new decision rule and the novel online confidence bound established in Theorem 4.2—which incorporates several technical novelties such as -norm self-concordance, bounded intermediate terms, and a -free regularization parameter—we successfully achieve a -free regret bound.
- Prior knowledge: Faury et al. (2022) require prior knowledge of this parameter, which is often impractical or even impossible to know beforehand in real-world settings. In contrast, our algorithm does not rely on any such prior knowledge, making it significantly more practical.
- Gram matrix update: In Faury et al. (2022), the Gram matrix is updated overly conservatively using a fixed weight parameter. Since this weight is very small (in logistic bandits, and even smaller in MNL bandits), the updates become too conservative. As a result, their search space shrinks too slowly. In contrast, our algorithm does NOT apply such conservative updates to the warm-up Hessian matrix. Instead, we update it using the second derivative of the loss, which enables the search space to shrink more rapidly and tightly; a rough schematic is sketched below.
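As an illustration of this difference in the single-logit case (`kappa_lb` and `mu_dot` are hypothetical names for a worst-case curvature weight and the logistic derivative; this is our sketch, not code from either paper):

```python
import numpy as np

def mu_dot(z):
    """Derivative of the logistic mean function mu(z) = 1 / (1 + exp(-z))."""
    m = 1.0 / (1.0 + np.exp(-z))
    return m * (1.0 - m)

d = 5
rng = np.random.default_rng(0)
theta_hat = rng.normal(size=d)
kappa_lb = 1e-3              # hypothetical tiny worst-case lower bound on the curvature
H_conservative = np.eye(d)   # grown with the fixed worst-case weight
H_adaptive = np.eye(d)       # grown with the realized second derivative of the loss

for _ in range(1000):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)
    H_conservative += kappa_lb * np.outer(x, x)            # uniform weight: ellipsoid shrinks slowly
    H_adaptive += mu_dot(x @ theta_hat) * np.outer(x, x)   # local curvature: ellipsoid shrinks quickly

# Larger minimum eigenvalue corresponds to a tighter search space in every direction.
print(np.linalg.eigvalsh(H_conservative).min(), np.linalg.eigvalsh(H_adaptive).min())
```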
-norm self-concordance
Our self-concordant property is constructed with respect to the -norm, whereas Lee and Oh (2024) use the -norm. This distinction is crucial: naively applying an -norm-based self-concordant property leads to a step size that depends on and , which in turn causes the confidence bound to also depend on these parameters.
We believe this novel -norm self-concordant property is of independent theoretical interest and has potential applicability beyond bandits, extending more broadly into the literature on MNL models.
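For intuition only, the scalar analogue of a self-concordance-like property for the logistic mean function $\mu(z) = (1 + e^{-z})^{-1}$ is the well-known inequality below (this is the standard single-logit statement from the logistic bandit literature, not the paper's multinomial, norm-dependent property):

$$
|\ddot{\mu}(z)| \le \dot{\mu}(z) \quad \text{for all } z \in \mathbb{R},
$$

which follows from $\dot{\mu} = \mu(1-\mu)$ and $\ddot{\mu} = \dot{\mu}(1-2\mu)$ with $|1-2\mu| \le 1$. Roughly speaking, inequalities of this type allow the curvature at nearby parameters to be compared up to moderate multiplicative constants, which is what prevents confidence-set radii from absorbing worst-case curvature factors.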
Variance-dependent bound
It appears the reviewer may be confusing the definition of variance used in the cited papers [1,2]. To clarify, similar to our setting (where the variance depends on the selected assortment), the variance in [1,2] also explicitly depends on the chosen arm. Moreover, the regret bounds in both our work and [1,2] are not dependent on the learner's predictions; they depend solely on the variance under the true parameter.
Tightness of the dependence in the non-leading term
Since Theorem 4.12 (MLE version) achieves a regret bound with a non-leading term that is free of any polynomial dependence on , we believe that the -dependence in the non-leading term of Theorem 4.5 is likely not optimal. Eliminating this dependency is an interesting direction for future work. We appreciate the insightful question that brought attention to this issue.
We sincerely believe that we have addressed all of your concerns. If you agree, we would greatly appreciate it if you could kindly reconsider your evaluation. Additionally, please feel free to reach out if you have any further questions or require clarification on any point.
Thank you for the detailed feedback. I believe that the comparison and distinction between this work and that of Faury et al. (2022) deserve a more thorough and careful discussion in the paper. In addition, I remain somewhat uncertain regarding the significance of the variance-aware bound. Since it depends on the arms selected by the learner, it is not entirely clear to me how substantial this improvement is. It would be helpful if the authors could provide concrete examples illustrating situations in which the variance-aware bound proves particularly beneficial. For instance, in the linear bandit setting, a variance-aware bound may lead to improved guarantees for linear function approximation in RL. Moreover, in the case of linear bandits, as presented in Section 3 of [1], it seems that the variance could be independent of the selected action, given that the random noise may itself be independent of it.
We truly appreciate your reply and the chance to provide more clarification.
We would like to reiterate that our result is strictly more general, since the logistic model is a special case of the MNL model, and also stronger (sharper), as we provide a regret bound with tighter parameter dependence than Faury et al. (2022). Furthermore, when reduced to their logistic setting (a special case of ours with uniform rewards), we can derive an instance-dependent bound (see Proposition 4.10) consistent with their result. As suggested by the reviewer, we are more than happy to provide a more detailed comparison with Faury et al. (2022) in the final version. We appreciate the suggestion; it will certainly strengthen our paper!
Next, we address the reviewer's final question regarding the variance-dependent bound. As clearly stated in the "Discussion of Theorem 4.5", our result is a strict improvement over the previous best bound established by Lee & Oh (2024); the improvement follows directly from the definition of the variance term. We also note that, to the best of our knowledge, there was NO prior variance-dependent or instance-dependent regret bound for MNL bandits under non-uniform rewards, which is a strictly more general setting than the uniform-rewards case.
Regarding the dependence of the variance on the chosen action in [1], we would like to clarify that the variance there is indeed defined as a function of the chosen action. Recalling its definition and supposing, for simplicity, that the noise is independent of the history given the chosen action, it is exactly the conditional variance given that action. Hence, by definition, it depends on the chosen action. Note that [1] does NOT state or assume that the variance is independent of the chosen action. In fact, in their RL setting, the variance of the value function does depend on the state.
Moreover, the form of our (conditional) variance is identical to that in [1]. Specifically, our variance is the variance of the realized reward, where the expectation is taken with respect to the MNL choice distribution given the selected assortment.
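For concreteness, a minimal sketch of such a conditional variance under assumed notation (utilities u of the offered items, per-item rewards r, and a zero-reward outside option); this is our illustration of the quantity described above, not the paper's exact definition:

```python
import numpy as np

def mnl_probs(u):
    """MNL choice probabilities over the offered items; the remaining mass is the outside option."""
    expu = np.exp(u)
    return expu / (1.0 + expu.sum())

def conditional_reward_variance(u, r):
    """Variance of the realized reward given the offered assortment, under the MNL distribution."""
    p = mnl_probs(u)
    mean = p @ r                      # outside option contributes reward 0
    return p @ (r ** 2) - mean ** 2   # E[R^2 | assortment] - (E[R | assortment])^2

u = np.array([0.5, -0.2, 1.0])        # utilities of the offered items (illustrative)
r = np.array([1.0, 0.8, 0.3])         # per-item rewards (illustrative)
print(conditional_reward_variance(u, r))
```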
We hope this clarifies the reviewer’s concern regarding the significance of our variance-dependent bound. That said, we would like to emphasize that, while the variance-dependent result is certainly interesting in its own right, our main theoretical contribution lies in the development of the first -free online confidence bound and the sharp regret bound derived from it. Even in the absence of the variance-dependent result, we believe our findings offer substantial theoretical contributions and merit recognition.
In light of this clarification, we kindly and respectfully ask the reviewer to take these key contributions into account and to consider adjusting the score accordingly.
The main contribution of this paper is an improved online confidence bound for multinomial logistic models. Moreover, the authors apply their results to MNL bandits to obtain improved guarantees. They also present numerical experiments.
Questions for Authors
I do not have any specific question.
Claims and Evidence
I verified the correctness of some theorems. Moreover, numerical experiments make sense to me.
Methods and Evaluation Criteria
The evaluation criteria are natural.
Theoretical Claims
I verified the correctness of the theoretical results in Appendix C.
Experimental Design and Analysis
The experimental results make sense to me.
Supplementary Material
I verified the correctness of some theoretical results in Appendix C. I also skimmed all other parts.
Relation to Existing Literature
Bandits are a fundamental topic in learning theory, and any good result in this context is valuable. This paper provides a new tool, namely a confidence bound for multinomial logistic models, and applies it to MNL bandits to achieve improved results. This is a solid theoretical paper, and I think it is a good contribution to the field of statistical learning theory.
Missing Important References
Other Strengths and Weaknesses
I like the writing of the paper. In particular, I like proof sketches and intuitions.
Other Comments or Suggestions
Ethics Review Issues
We sincerely appreciate your positive support and recognition of the value of our work! We truly hope that this research helps to illuminate a new direction for MNL models and bandit algorithms. Please feel free to reach out if you have any further questions.
This paper makes solid contributions to improving the bounds. The paper is well-written, provides nice proof sketches and intuitions, and the numerical experiments show the effectiveness of the proposed methods.