Overall: 7.2/10 · Poster · 4 reviewers (ratings 4, 4, 3, 4; min 3, max 4, std 0.4)
ICML 2025

Improved Online Confidence Bounds for Multinomial Logistic Bandits

Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

We establish an improved online confidence bound for MNL models and, leveraging this result, propose a constant-time algorithm that achieves the tightest variance-dependent optimal regret in contextual MNL bandits.

Abstract

In this paper, we propose an improved online confidence bound for multinomial logistic (MNL) models and apply this result to MNL bandits, achieving variance-dependent optimal regret. Recently, Lee & Oh (2024) established an online confidence bound for MNL models and achieved nearly minimax-optimal regret in MNL bandits. However, their results still depend on the norm-boundedness of the unknown parameter $B$ and the maximum size of possible outcomes $K$. To address this, we first derive an online confidence bound of $\mathcal{O}(\sqrt{d \log t} + B \sqrt{d} )$, which is a significant improvement over the previous bound of $\mathcal{O} (B \sqrt{d} \log t \log K )$ (Lee & Oh, 2024). This is mainly achieved by establishing tighter self-concordant properties of the MNL loss and applying Ville’s inequality to bound the estimation error. Using this new online confidence bound, we propose a constant-time algorithm, **OFU-MNL++**, which achieves a variance-dependent regret bound of $\mathcal{O} \Big( d \log T \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $ for sufficiently large $T$, where $\sigma_t^2$ denotes the variance of the rewards at round $t$, $d$ is the dimension of the contexts, and $T$ is the total number of rounds. Furthermore, we introduce a Maximum Likelihood Estimation (MLE)-based algorithm, **OFU-M$^2$NL**, which achieves an anytime $\operatorname{poly}(B)$-free regret of $\mathcal{O} \Big( d \log (BT) \sqrt{ \sum_{t=1}^T \sigma_t^2 } \Big) $.
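For context, the regret being bounded is the standard contextual MNL bandit regret; in generic notation (which may differ slightly from the paper's), with the MNL choice probability of item $i$ in assortment $S$ at round $t$ given by $p_{ti}(S)$,

$$
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \left( \max_{S \,:\, |S| \le K} \sum_{i \in S} r_{ti}\, p_{ti}(S) \;-\; \sum_{i \in S_t} r_{ti}\, p_{ti}(S_t) \right),
\qquad
p_{ti}(S) \;=\; \frac{\exp(x_{ti}^\top \theta^\star)}{1 + \sum_{j \in S} \exp(x_{tj}^\top \theta^\star)},
$$

where $S_t$ is the assortment chosen at round $t$ and $\theta^\star$ is the unknown parameter.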
Keywords

Bandit, Multinomial Logistic Bandit, Confidence Bound, Regret, Online Mirror Descent

Reviews and Discussion

Official Review
Rating: 4

This work extends the MLE-based $\operatorname{poly}(B)$-free online confidence sequences for generalized linear models, where $B$ is the norm bound of the parameter set, to the setting of multinomial logistic bandits, yielding a statistically SOTA algorithm (in terms of the dependence on $B$ and $K$) that is also variance-dependent. The work also provides a computationally optimal algorithm under the relatively well-known "online Newton step" framework based on ellipsoidal confidence sets.

Questions for the Authors

  • The online confidence sequence is claimed to be the "tightest", so could the authors provide some lower bounds on the confidence radius itself, or at least some intuition on why we could not do better?

Claims and Evidence

Although the "tightest" confidence sequence is a bit "over-claimed" (without a concrete lower bound), the mathematical results themselves are very clear and correct.

Methods and Evaluation Criteria

N/A for this theoretical paper.

Theoretical Claims

The results for both the computationally optimal algorithm and the statistically SOTA algorithm are well-supported by theorems and lemmas. The paper is well-written and easy to follow.

Experimental Design and Analysis

The simulation results are consistent with the theoretical claims. The proposed algorithm even outperforms Thompson-sampling-based baselines in simulation, down to constant factors, which is surprising.

Supplementary Material

The supplementary material is well-written and the proofs are complete and correct.

Relation to Broader Scientific Literature

The setting of multinomial bandits is well motivated in operations research and well established in the literature, with both UCB-based and posterior-sampling-based approaches. The paper fits well in this line of research.

The mirror-descent techniques fit within the line of classic online convex optimization, which is also a well-established area.

The statistical techniques for sharp confidence intervals for generalized linear models have been developed recently, and the paper fits well in this line of research.

Essential References Not Discussed

No significant oversight.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

Thank you very much for your positive evaluation of our paper and for your valuable feedback! Below is our response to your question:

The online confidence sequence is claimed to be the "tightest", so could the authors provide some lower bounds on the confidence radius itself, or at least some intuition on why we could not do better?

When we referred to our result as the "sharpest online confidence bound," we meant that, to the best of our knowledge, it is the sharpest among existing works—not that it is fundamentally optimal or cannot be further improved. In the final version, we will clarify this point to prevent any potential misunderstanding or impression of overclaiming. We sincerely appreciate your constructive feedback, and please don’t hesitate to reach out if you have any further questions.

Official Review
Rating: 4

This paper enhances the regret bound for multinomial logistic bandits. The multinomial logistic bandit problem is a contextual bandit setting where each item is associated with a context vector, the player can select up to K items simultaneously, and the reward is determined probabilistically by a logistic model. Specifically, the reward distribution is governed by multiple logits, each computed as the inner product between an item’s context vector and a hidden environment vector. To maximize rewards, the player must simultaneously learn this hidden vector while selecting high-reward items. A common approach involves an optimistic strategy, which first constructs a confidence set that is likely to contain the hidden vector and then acts as if the best possible vector within this set is the true environment vector. This paper introduces a refined confidence set that is tighter than existing ones, leading to improved regret guarantees. The authors analyze the regret of playing with this confidence set and demonstrate that it achieves better regret bounds with reasonable computational cost, maintaining constant per-round complexity.
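For concreteness, the choice model described above can be sketched as follows (an illustrative snippet, not taken from the paper; the names and the outside option with zero utility are assumptions):

```python
import numpy as np

def mnl_choice_probs(theta, X_assortment):
    """Choice probabilities for items in an assortment under an MNL model.

    theta:        hidden parameter vector, shape (d,)
    X_assortment: context vectors of the offered items, shape (|S|, d)
    Returns the probability of choosing each offered item and the
    outside-option (no choice) probability, assumed to have utility 0.
    """
    logits = X_assortment @ theta            # utility of each offered item
    weights = np.exp(logits)
    denom = 1.0 + weights.sum()              # "+1" accounts for the outside option
    return weights / denom, 1.0 / denom

def expected_reward(theta, X_assortment, rewards):
    """Expected reward of offering the assortment (outside option gives 0)."""
    item_probs, _ = mnl_choice_probs(theta, X_assortment)
    return float(item_probs @ rewards)

# Toy usage: an assortment of 3 items with 2-dimensional contexts
rng = np.random.default_rng(0)
theta_star = rng.normal(size=2)
X = rng.normal(size=(3, 2))
r = np.array([1.0, 0.5, 0.8])
print(expected_reward(theta_star, X, r))
```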

Questions for the Authors

  1. Could you provide more applications where the multinomial logistic bandit can be utilized? Discussing real-world use cases would help strengthen the motivation for this work.

  2. Is there a known lower bound on the regret? Additionally, can you discuss whether the variance-dependent term has an optimal form?

Claims and Evidence

All of their claims are supported by theoretical proofs.

Methods and Evaluation Criteria

The proposed methods make sense.

Theoretical Claims

The proofs of all the theorems appear to be correct, although I have not verified every detail. In the proof of Theorem 4.2, the observation that self-concordance holds with respect to the $\ell_\infty$ norm is particularly interesting, and the introduction of a novel intermediary term effectively completes the proof. The proof technique is non-trivial and highly insightful. Other theorems are also well-developed and thoroughly discussed.

Experimental Design and Analysis

The experimental results are limited, as they only consider synthetic environments. However, given the theoretical nature of this work, this limitation is not a significant concern.

Supplementary Material

The supplementary material is extensively dedicated to proofs. Unfortunately, I do not have sufficient time to verify the entirety of the 30-page-long proof.

Relation to Broader Scientific Literature

I believe the proof technique could be applicable to other bandit settings.

Essential References Not Discussed

None

Other Strengths and Weaknesses

Strengths

  1. This paper improves the regret bound for the Multinomial Logistic Bandit without increasing computational complexity.

  2. The proof is not merely a combination of existing ideas but introduces novel techniques.

Weaknesses

  1. The paper would benefit from a more detailed discussion on the motivations and potential applications of the multinomial logistic bandit.

  2. Including discussions on real-world scenarios and experimental evaluations could further strengthen the paper.

Other Comments or Suggestions

None

Author Response

Thank you for your positive review! We are very pleased that the reviewer appreciates the significance of our technical contributions. In particular, we’re delighted that you found our proposed $\ell_\infty$-norm self-concordant property interesting. We hope this new property will inspire further research on MNL models, extending beyond the MNL bandit setting.

Please find our responses to your questions below:


Motivation and Applications

The MNL bandit model offers a powerful and flexible framework for modeling choice behavior in sequential decision-making settings, where a decision-maker presents a subset of items (an assortment) and observes user selections based on relative utilities. This structure naturally arises in a wide range of real-world applications, making the study of MNL bandits both practically relevant and theoretically important. Some key domains include:

  • E-commerce and Online Retail: Platforms choose a subset of products to display to each customer. The customer's purchase behavior is naturally modeled using discrete choice models such as MNL.

  • Recommender Systems: Services like Netflix or Spotify present a small selection of items (e.g., movies or songs), and the user selects one. Modeling this behavior with an MNL model enables the system to learn user preferences and optimize engagement.

  • Online Advertising: Ad platforms select a set of ads to display, aiming to maximize click-through rate (CTR) or conversions. The MNL model captures how user choice probabilities depend on both the ad features and the composition of the ad set.

  • Dynamic Pricing and Assortment Optimization: Retailers dynamically select assortments subject to pricing or inventory constraints. The MNL model reflects how customer choices shift in response to changes in availability and pricing.

  • Potential Application to RLHF: The MNL bandit framework also has promising applications in Reinforcement Learning from Human Feedback (RLHF). Human preferences in RLHF are often modeled using BTL or Plackett-Luce (PL) models. Since MNL generalizes BTL to multi-choice settings and serves as a building block for PL (ranking models), it provides a natural and flexible foundation for modeling richer forms of preference feedback in RLHF.

We believe that the MNL bandit model provides not only a unifying theoretical framework but also a practically impactful tool for real-world sequential decision-making. We will consider expanding this motivation section in the final version. We sincerely appreciate the suggestion to highlight these broader applications.


Variance-dependent lower bound

As discussed in Remark 4.7, our proposed algorithm achieves a regret upper bound that is nearly minimax optimal in both uniform and non-uniform reward settings, when compared to the worst-case (variance-independent) lower bounds established by Lee and Oh (2024). To the best of our knowledge, a formal variance-dependent lower bound has not yet been established for MNL bandits. The most closely related result is the instance-dependent lower bound (specifically, $\kappa^\star_t$-dependent) proposed by Lee and Oh (2024); see Proposition 1 in their paper. Here, $\kappa^\star_t$ is the variance of the optimal assortment $S^\star_t$ under uniform rewards (see Appendix E.2 of our paper for further discussion), rather than the assortment $S_t$ selected by an algorithm.

Thus, whether a fully variance-dependent lower bound can be derived in the MNL bandit setting remains an open question. Very recently (less than three weeks ago), He and Gu (2025) [1] established the first variance-dependent lower bound for general variance sequences in linear contextual bandits. Their techniques may be adaptable to the MNL setting, and we view this as an interesting direction for future research.

[1] He, Jiafan, and Quanquan Gu. "Variance-Dependent Regret Lower Bounds for Contextual Bandits." arXiv preprint arXiv:2503.12020 (2025).

Official Review
Rating: 3

This paper studies the multinomial logistic bandits problem, in which the learner submits an assortment of at most $K$ arms and then receives binary feedback following the MNL model. The main contribution of the paper is the proposal of an improved method for this problem, yielding an online confidence set of $\mathcal{O}(\sqrt{d\log t}+B)$, which improves upon previous work by factors of $\log t$, $\log K$, and $B$. In addition, the paper provides a variance-dependent bound for the multinomial logistic bandit model and an MLE-based method that completely removes the dependence on $B$.

Questions for the Authors

  • Could you further emphasize the challenges of extending the warm-up technique from Faury et al. (2022) from an offline to an online setting?
  • Could you provide more discussion on the significance of the variance-aware bound in cases where $\sigma_t$ depends on $x_t$?
  • Although Theorem 4.5 successfully removes $B$ from the leading term, the $B$-dependence of the remaining term appears large compared to Lee and Oh (2024). Could you elaborate on the tightness of the dependence on $B$?

I would be happy to increase my score if the above concerns are adequately addressed.

Claims and Evidence

From my perspective, the paper's claims are supported by clear evidence.

Methods and Evaluation Criteria

The regret bound is used as the standard metric for the logistic bandit problem.

Theoretical Claims

The proofs seem sound, although I haven't verified the details exhaustively.

Experimental Design and Analysis

From my perspective, the experimental comparison is adequate.

Supplementary Material

I have briefly reviewed Appendix D.

Relation to Broader Scientific Literature

This paper presents an improved confidence set for the MNL bandit problem. Its primary results improve on the work of Lee and Oh (2024) by factors of $\mathcal{O}(\log K)$ and $\mathcal{O}(\log T)$, and the previously multiplicative term involving $B$ is now an additive term.

Essential References Not Discussed

The related work is appropriately cited and discussed; however, the technical contribution relative to Faury et al. (2022) remains somewhat ambiguous. Please refer to the strengths and weaknesses section for further details.

Other Strengths and Weaknesses

Strengths of this Paper:

  • The paper offers several improvements over previous work on MNL bandits, with improved dependence on $\log T$, $\log K$, and $B$. Additionally, it introduces a variance-dependent regret bound.
  • Experiments are provided to demonstrate the effectiveness of the proposed methods.

Weaknesses of the Paper: Although the improvements are interesting, my main concern pertains to the overall significance of the results and the technical contribution.

  • Regarding the improvement in the dependence on $B$, the warm-up technique employed here appears closely related to that of Faury et al. (2022). As discussed in Remark 4.8, the primary difference is that the previous work utilises MLE, while the current study adopts online mirror descent. However, since Faury et al. (2022) also propose an adaptive method to verify the condition on the fly, it remains unclear how challenging this extension would be for the MNL problem.
  • Concerning the improvement in $K$, one of the main techniques is the constant self-concordant property of the MNL loss, which was already established by Lee and Oh (2024).
  • The variance-dependent optimal bound is interesting; however, it is unclear how large the term $\sigma_t^2$ may be, given that it depends on $S_t$, the set of arms selected by the learner. This dependence is somewhat unfavourable, as previous variance-dependent bounds in linear bandits are typically independent of the learner’s predictions [1,2].

[1] Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP. NeurIPS 2021.
[2] Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. NeurIPS 2022.

Other Comments or Suggestions

Please see the comments above.

Author Response

Thank you for taking the time to review our paper and provide valuable feedback.

We would like to clarify that our main contribution is the development of the first $B,K$-free, variance-dependent optimal regret bound for MNL bandits, which cannot be achieved by existing algorithms such as those proposed by Faury et al. (2022) or Lee and Oh (2024). We provide detailed explanations below and sincerely hope this clarification helps convey the novelty of our approach.


Comparison to Faury et al. (2022)

While our algorithm shares some conceptual similarities with Faury et al. (2022)—notably the adaptive warm-up strategy—it fundamentally differs from their method in several key aspects beyond just the use of online updates:

  • Decision rule: The decision rule used in the adaptive algorithm (ada-OFU-ECOLog) by Faury et al. (2022) does NOT yield a $B$-free confidence bound. Their confidence bound, $\eta_t(\delta)$ (defined in Equation (15) of their paper), has an order of $\mathcal{O}(B d \log t)$. This $B$-dependence arises primarily because their method does not guarantee that the diameter of the search space, $\text{diam}_{\mathcal{A}}(\Theta_t)$, remains small or bounded by a constant over time.

    This is notably distinct from their non-adaptive algorithm (OFU-ECOLog, fixed-arm setting), which achieves a tighter confidence bound $\sigma_t(\delta)$ of order $\mathcal{O}(d \log t + B)$. We believe the reviewer may have inadvertently confused the confidence bound of ada-OFU-ECOLog with that of the non-adaptive version (OFU-ECOLog).

    In contrast, our proposed decision rule always guarantees a confidence bound that is completely independent of $B$. Combined with our new decision rule and the novel online confidence bound established in Theorem 4.2—which incorporates several technical novelties such as $\ell_\infty$-norm self-concordance, bounded intermediate terms, and a $d$-free regularization parameter—we successfully achieve a $B$-free regret bound.

  • Prior knowledge of $\kappa$: Faury et al. (2022) require prior knowledge of $\kappa$, which is often impractical or even impossible to know beforehand in real-world settings. In contrast, our algorithm does not rely on any prior knowledge of $\kappa$, making it significantly more practical.

  • Gram matrix update: In Faury et al. (2022), the Gram matrix $V^{\mathcal{H}}_t$ is updated overly conservatively using a weight parameter $\kappa$ as $V^{\mathcal{H}}_t \leftarrow \sum_{a\in \mathcal{H}_{t+1}} \kappa a a^\top + \gamma_t(\delta)I_d$. Since $\kappa$ is very small—specifically, $\kappa = \mathcal{O}(e^{-3B})$ in logistic bandits, and even smaller in MNL bandits with $\kappa = \mathcal{O}(K^{-2} e^{-3B})$—the updates become too conservative. As a result, their search space $\Theta_t$ shrinks too slowly. In contrast, our algorithm does NOT apply such conservative updates to the warm-up Hessian matrix $H^w_{t+1}$. Instead, we update it using the second derivative of the loss, which enables the search space $\mathcal{W}^w_t(\delta)$ to shrink more rapidly and tightly (a schematic comparison of the two update styles is sketched below).
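Below is a schematic sketch of the two update styles (our own simplification for illustration, using the binary logistic case instead of the full MNL loss; it is not the actual implementation of either algorithm):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conservative_update(V, x, kappa):
    """Worst-case (kappa-weighted) Gram-matrix update: every observation is
    down-weighted by the global constant kappa, so V grows very slowly when
    kappa is tiny (e.g. exponentially small in B)."""
    return V + kappa * np.outer(x, x)

def hessian_update(H, x, theta_hat):
    """Curvature-based update: weight each observation by the second derivative
    of the (logistic, for illustration) loss at the current estimate, which is
    typically much larger than the worst-case constant kappa."""
    p = sigmoid(x @ theta_hat)
    return H + p * (1.0 - p) * np.outer(x, x)

# Toy comparison
d = 3
x = np.ones(d) / np.sqrt(d)
theta_hat = np.zeros(d)                 # p = 0.5, so local curvature weight = 0.25
V = np.eye(d)
H = np.eye(d)
print(np.trace(conservative_update(V, x, kappa=1e-4)))   # barely grows
print(np.trace(hessian_update(H, x, theta_hat)))         # grows with local curvature
```

The point is only qualitative: with a tiny global weight $\kappa$, the Gram matrix (and hence the confidence set) changes very slowly, whereas a curvature-weighted update adapts at the local scale of the loss.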


$\ell_\infty$-norm self-concordance

Our self-concordant property is constructed with respect to the $\ell_\infty$-norm, whereas Lee and Oh (2024) use the $\ell_2$-norm. This distinction is crucial: naively applying an $\ell_2$-norm-based self-concordant property leads to a step size $\eta$ that depends on $K$ and $B$, which in turn causes the confidence bound to also depend on these parameters.

We believe this novel $\ell_\infty$-norm self-concordant property is of independent theoretical interest and has potential applicability beyond bandits, extending more broadly into the literature on MNL models.


Variance-dependent bound

It appears the reviewer may be confusing the definition of variance used in the cited papers [1,2]. To clarify, similar to our setting—where variance depends on the selected assortment $S_t$—the variance in [1,2] also explicitly depends on the chosen arm $x_t$. Moreover, the regret bounds in both our work and [1,2] are not dependent on the learner’s predictions—they depend solely on the variance under the true parameter.


Tightness of $B$ in the non-leading term

Since Theorem 4.12 (MLE version) achieves a regret bound with a non-leading term that is free of any polynomial dependence on $B$, we believe that the $B$-dependence in the non-leading term of Theorem 4.5 is likely not optimal. Eliminating this dependency is an interesting direction for future work. We appreciate the insightful question that brought attention to this issue.


We sincerely believe that we have addressed all of your concerns. If you agree, we would greatly appreciate it if you could kindly reconsider your evaluation. Additionally, please feel free to reach out if you have any further questions or require clarification on any point.

Reviewer Comment

Thank you for the detailed feedback. I believe that the comparison and distinction between this work and that of Faury et al. (2022) deserve a more thorough and careful discussion in the paper. In addition, I remain somewhat uncertain regarding the significance of the variance-aware bound. Since the model depends on $S_t$, it is not entirely clear to me how substantial this improvement is. It would be helpful if the authors could provide concrete examples illustrating situations in which the variance-aware bound proves particularly beneficial. For instance, in the linear bandit setting, variance-aware bounds may lead to improved guarantees for linear function approximation in RL. Moreover, in the case of linear bandits, as presented in Section 3 of [1], it seems that the variance $\sigma_k$ could be independent of the selected action $x_t$, given that the random noise $\epsilon_k$ may itself be independent of $x_t$.

Author Comment

We truly appreciate your reply and the chance to provide more clarification.

We would like to reiterate that our result is strictly more general—since the logistic model is a special case of the MNL model. Moreover, our result is also stronger (sharper), as we provide a tighter regret bound in terms of $B$ than Faury et al. (2022). Furthermore, when reduced to their logistic setting (i.e., the special case of $K=1$ and uniform rewards in ours), we can derive an instance-dependent bound (see Proposition 4.10), consistent with their result. As suggested by the reviewer, we are more than happy to provide a more detailed comparison with Faury et al. (2022) in the final version. We appreciate the suggestion—it will certainly strengthen our paper!

Next, we address the reviewer's final question regarding the variance-dependent bound. As clearly stated in the "Discussion of Theorem 4.5", our result of $\mathcal{O}\left(d \log T \sqrt{\sum_{t=1}^T \sigma_t^2}\right)$ is a strict improvement over the previous best bound of $\mathcal{O}\left(B^{3/2} d \log K (\log T)^{3/2} \sqrt{T}\right)$ established by Lee & Oh (2024). This improvement follows from the fact that, by definition, $\sigma_t^2 \leq 1$. We also note that, to the best of our knowledge, there was NO prior variance-dependent or instance-dependent regret bound for MNL bandits under non-uniform rewards ($r_{ti} \in [0,1]$), which is a strictly more general setting than the uniform rewards case ($r_{ti}=1$).
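To spell out this comparison step (simple algebra, not quoted verbatim from the paper): since $\sigma_t^2 \le 1$ for every $t$,

$$
d \log T \sqrt{\textstyle\sum_{t=1}^{T} \sigma_t^2} \;\le\; d \log T \sqrt{T} \;\le\; B^{3/2}\, d\, \log K\, (\log T)^{3/2} \sqrt{T},
$$

where the second inequality holds whenever $B^{3/2} (\log K) \sqrt{\log T} \ge 1$; the variance-dependent bound can be much smaller when the realized variances are small.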

Regarding the dependence of $\sigma_t$ on $x_t$ in [1], we would like to clarify that $\sigma_t$ is indeed defined as a function of $x_t$. Recall the definition of $\sigma_t^2$: $\sigma_t^2 := \mathbb{E}[\epsilon^2_t \mid \mathcal{F}_t] = \mathbb{E}[ (r_t - x_t^\top \theta^\star)^2 \mid x_1, \epsilon_1, \dots, x_{t-1}, \epsilon_{t-1}, x_t]$. For simplicity, suppose that $\epsilon_t$ is independent of the history, given $x_t$. Then we can write $\sigma_t^2 := \mathbb{E}[\epsilon^2_t \mid x_t]$, which shows that $\sigma_t^2$ is the conditional variance given $x_t$. Hence, by definition, $\sigma_t^2$ depends on $x_t$. Note that [1] does NOT state or assume that $\epsilon_t$ is independent of $x_t$. In fact, in their RL setting, the variance of the value function does depend on the state.

Moreover, the form of our (conditional) variance is identical to that in [1]. Specifically, our variance is defined as $\sigma_t^2 := \mathbb{E}[(r_{ti} - \mathbb{E}[r_{tj} \mid S_t])^2 \mid S_t]$, where the expectation is with respect to the MNL choice distribution over $i \in S_t$.
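For illustration, here is a minimal numerical sketch of this conditional variance (our own code, not part of the paper; it assumes the standard MNL choice probabilities with an outside option that yields zero reward):

```python
import numpy as np

def assortment_variance(theta, X_assortment, rewards):
    """Conditional reward variance sigma_t^2 given the offered assortment S_t,
    under MNL choice probabilities; the outside option is assumed to give
    reward 0 (an illustrative convention, not taken from the paper)."""
    logits = X_assortment @ theta
    w = np.exp(logits)
    denom = 1.0 + w.sum()                 # "+1": outside option
    p_items = w / denom
    p_outside = 1.0 / denom
    mean = p_items @ rewards              # outside option contributes 0 reward
    return float(p_items @ (rewards - mean) ** 2 + p_outside * mean ** 2)

# Toy check: with a single item of reward 1, the variance equals p * (1 - p)
theta = np.array([0.3, -0.2])
x = np.array([[1.0, 0.5]])
print(assortment_variance(theta, x, np.array([1.0])))
```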

We hope this clarifies the reviewer’s concern regarding the significance of our variance-dependent bound. That said, we would like to emphasize that, while the variance-dependent result is certainly interesting in its own right, our main theoretical contribution lies in the development of the first $B,K$-free online confidence bound and the sharp regret bound derived from it. Even in the absence of the variance-dependent result, we believe our findings offer substantial theoretical contributions and merit recognition.

In light of this clarification, we kindly and respectfully ask the reviewer to take these key contributions into account and to consider adjusting the score accordingly.

Official Review
Rating: 4

The main contribution of this paper is proposing an improved online confidence bound for multinomial logistic models. The authors then apply this result to MNL bandits to obtain improved regret guarantees, and they also present numerical experiments.

Questions for the Authors

I do not have any specific question.

Claims and Evidence

I verified the correctness of some theorems. Moreover, numerical experiments make sense to me.

Methods and Evaluation Criteria

The evaluation criteria are natural.

Theoretical Claims

I verified the correctness of the theoretical results in Appendix C.

Experimental Design and Analysis

The experimental results make sense to me.

Supplementary Material

I verified the correctness of some theoretical results in Appendix C. Also, I skimmed all other parts.

Relation to Broader Scientific Literature

Bandits are a fundamental topic in learning theory. Any good result in this context will be valuable. This paper provides a new tool, namely a confidence bound for multinomial logistic models, and applies it to MNL bandits to achieve improved results. This is a solid theoretical paper. I think it is a good contribution to the field of statistical learning theory.

Essential References Not Discussed

Other Strengths and Weaknesses

I like the writing of the paper. In particular, I like proof sketches and intuitions.

Other Comments or Suggestions

Ethics Review Issues

Author Response

We sincerely appreciate your positive support and recognition of the value of our work! We truly hope that this research helps to illuminate a new direction for MNL models and bandit algorithms. Please feel free to reach out if you have any further questions.

Final Decision

This paper makes solid contributions to improving the regret bounds for multinomial logistic bandits. The paper is well written, provides nice proof sketches and intuitions, and the numerical experiments show the effectiveness of the proposed methods.