PaperHub
Rating: 7.3 / 10
Poster · 4 reviewers (ratings: 5, 5, 4, 4; min 4, max 5, std. dev. 0.5)
Confidence: 3.0
Novelty: 3.0
Quality: 3.0
Clarity: 3.0
Significance: 3.0
NeurIPS 2025

Tractable Multinomial Logit Contextual Bandits with Non-Linear Utilities

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We propose a computationally tractable multinomial logit contextual bandit algorithm, which is designed to handle generic non-linear parametric utility functions.

Abstract

We study the _multinomial logit_ (MNL) contextual bandit problem for sequential assortment selection. Although most existing research assumes utility functions to be linear in item features, this linearity assumption restricts the modeling of intricate interactions between items and user preferences. A recent work (Zhang & Luo, 2024) has investigated general utility function classes, yet its method faces fundamental trade-offs between computational tractability and statistical efficiency. To address this limitation, we propose a computationally efficient algorithm for MNL contextual bandits leveraging the upper confidence bound principle, specifically designed for non-linear parametric utility functions, including those modeled by neural networks. Under a realizability assumption and a mild geometric condition on the utility function class, our algorithm achieves a regret bound of $\tilde{\mathcal{O}}(\sqrt{T})$, where $T$ denotes the total number of rounds. Our result establishes that sharp $\tilde{\mathcal{O}}(\sqrt{T})$-regret is attainable even with neural network-based utilities, without relying on strong assumptions such as neural tangent kernel approximations. To the best of our knowledge, our proposed method is the first computationally tractable algorithm for MNL contextual bandits with non-linear utilities that provably attains $\tilde{\mathcal{O}}(\sqrt{T})$ regret. Comprehensive numerical experiments validate the effectiveness of our approach, showing robust performance not only in realizable settings but also in scenarios with model misspecification.
Keywords

multinomial logit choice model · contextual bandit · non-linear function approximation · regret analysis

Reviews and Discussion

Review (Rating: 5)

The paper studies multinomial logistic bandits where the action is a set of $N$ items and the feedback is the item chosen. The choice model is a logistic model where the utility of an item depends on the context and the item. The utility function is assumed to be a non-linear parametric function, and the learner has access to the parametric function class. The paper proposes a two-stage algorithm where the first stage performs uniform exploration and the second stage uses the principle of optimism to select the most optimistic set of items. A naive explore-then-commit type of algorithm gives a suboptimal regret bound. The paper introduces novel regularization and accompanying analysis to obtain the optimal $\sqrt{T}$ regret bound.

Strengths and Weaknesses

Strengths:

The paper is clearly written and the algorithm is well-motivated by its accompanying analyses.

Weaknesses:

There are some parts of the results presented that raise some questions. (See below).

Questions

Q1) What if the parameter $\kappa$ in the regret bound is very small? Is there a trivial lower bound on $\kappa$? What is the intuition for the dependence on $\kappa^{-1}$? For example, why would the presence of an item that is rarely chosen make the problem harder?

Q2) In Line 194, the paper says the context vectors after Phase I may be chosen adversarially. This calls into question the role of Phase I. What if the context distribution is only along one direction, and the contexts are chosen adversarially to be orthogonal to that direction? In this case, wouldn't the data in Phase I be obsolete?

Limitations

Yes.

Final Justification

The authors have addressed my concerns: the question on the problem-dependent constant $\kappa$ and the clarification on the assumption. The submission is technically strong with a clear theoretical contribution.

Formatting Issues

None.

Author Response

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We deeply appreciate your recognition of our work and the constructive comments you have provided. Below, we address each of your comments and questions in detail:


[Dependence on $\kappa$ (Q1)]

By its definition, $\kappa$ is lower bounded by $\Omega(1/K^2)$, i.e., $\kappa^{-1} = O(K^2)$. However, we would like to emphasize that its behavior is often far more favorable in practice. For example, as shown in the experiments of Lee & Oh [17] (Figure 1), the gap between the regret of $\kappa^{-1}$-dependent algorithms (e.g., UCB-MNL [23], TS-MNL [22]) and that of $\kappa^{-1}$-improved algorithms (e.g., OFU-MNL+ [17]) diminishes as $K$ increases. This implies that the regret of $\kappa^{-1}$-dependent algorithms [22, 23] does not grow significantly with $K$, even as $K$ increases. This suggests that equating $\kappa$ with $1/K^2$, or treating it as equivalent to $K$, is overly pessimistic.

For intuition behind $\kappa$: the parameter $\kappa$ is closely related to the curvature of the expected negative log-likelihood function. Consider the expected loss defined as $L(\mathbf{w}) = \mathbb{E}[-\log p(i \mid X, S, \mathbf{w})]$. Then, the Hessian of $L(\mathbf{w})$ is given by

$$\nabla^2 L(\mathbf{w}) = \mathbb{E}\left[ \sum_{i \in S} p(i \mid X, S, \mathbf{w}) \nabla f_{\mathbf{w}}(\mathbf{x}_i) \nabla f_{\mathbf{w}}(\mathbf{x}_i)^\top - \sum_{j \in S} \sum_{k \in S} p(j \mid X, S, \mathbf{w})\, p(k \mid X, S, \mathbf{w}) \nabla f_{\mathbf{w}}(\mathbf{x}_j) \nabla f_{\mathbf{w}}(\mathbf{x}_k)^\top \right].$$

Using standard calculations (see Lemma B.5 in [25]), this Hessian can be lower-bounded as:

$$\nabla^2 L(\mathbf{w}) \succeq \mathbb{E}\left[ \sum_{i \in S} p(i \mid X, S, \mathbf{w})\, p(0 \mid X, S, \mathbf{w}) \nabla f_{\mathbf{w}}(\mathbf{x}_i) \nabla f_{\mathbf{w}}(\mathbf{x}_i)^\top \right],$$

where $p(0 \mid X, S, \mathbf{w})$ denotes the probability of the outside option. In this context, $\kappa$ serves as a uniform lower bound on the product $p(i \mid X, S, \mathbf{w})\, p(0 \mid X, S, \mathbf{w})$, and therefore provides a lower bound on the minimum eigenvalue of $\nabla^2 L(\mathbf{w})$. Since smaller eigenvalues indicate a more ill-conditioned log-loss landscape, a smaller $\kappa$ implies that the learning problem is more challenging.
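
As a concrete illustration of this lower bound, the following sketch (our own, not from the paper; it assumes a linear utility $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, so that $\nabla f_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}$) builds both sides of the inequality for a random single-context instance and checks that their difference is positive semidefinite:

```python
import numpy as np

rng = np.random.default_rng(0)
d, S = 4, 3                       # parameter dimension, assortment size
X = rng.normal(size=(S, d))       # item features; with f_w(x) = w^T x, grad f_w(x) = x
w = rng.normal(size=d)

u = X @ w                                  # utilities of the S offered items
p = np.exp(u) / (1.0 + np.exp(u).sum())    # MNL choice probabilities (outside option has utility 0)
p0 = 1.0 - p.sum()                         # probability of the outside option

# Hessian of the expected log-loss for this single (X, S) instance
H = sum(p[i] * np.outer(X[i], X[i]) for i in range(S)) \
    - sum(p[j] * p[k] * np.outer(X[j], X[k]) for j in range(S) for k in range(S))

# Claimed lower bound: sum_i p(i) p(0) grad f grad f^T
LB = sum(p[i] * p0 * np.outer(X[i], X[i]) for i in range(S))

# H - LB should be positive semidefinite: its smallest eigenvalue is >= 0 up to rounding
print(np.linalg.eigvalsh(H - LB).min())
```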

Alternatively, we can see how the probability of the outside option relates to learning difficulty as follows. Consider the following example with two items $i_1, i_2$:

  • [Case 1] $p(0) = 0.3$, $p(i_1) = 0.35$, $p(i_2) = 0.35$, so $\kappa = 0.3 \times 0.35 = 0.105$
  • [Case 2] $p(0) = 0.9$, $p(i_1) = 0.09$, $p(i_2) = 0.01$, so $\kappa = 0.9 \times 0.01 = 0.009$

In Case 2, the high probability of the outside option reduces the likelihood of observing choices of items $i_1$ and $i_2$. As a result, the expected number of rounds needed to observe feedback for those items is significantly higher than in Case 1. This demonstrates that a small value of $\kappa$ requires more samples to gather informative feedback on individual items, thus increasing the overall sample complexity of learning.
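
To make this sample-complexity intuition concrete, here is a small simulation (our own illustration, not from the paper) of the two cases above, measuring how many rounds pass before both items have been chosen at least once:

```python
import numpy as np

rng = np.random.default_rng(1)
cases = {
    "Case 1": [0.30, 0.35, 0.35],   # [p(0), p(i1), p(i2)]
    "Case 2": [0.90, 0.09, 0.01],
}

for name, probs in cases.items():
    p0 = probs[0]
    kappa = min(p0 * probs[1], p0 * probs[2])   # uniform lower bound on p(0) * p(i)
    waits = []
    for _ in range(2000):
        seen, t = set(), 0
        while not {1, 2} <= seen:               # until both items observed at least once
            t += 1
            seen.add(int(rng.choice(3, p=probs)))
        waits.append(t)
    print(f"{name}: kappa = {kappa:.3f}, "
          f"mean rounds until both items observed = {np.mean(waits):.1f}")
# Case 1 finishes in a handful of rounds; Case 2 needs on the order of 1/p(i2) = 100 rounds.
```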


[Role of Phase I (Q2)]

We are happy to answer your question. As described in Assumption 3, the context distribution is assumed to be supported on the context space $\mathcal{X}$. This implies that the distribution in Phase I is not restricted to a single direction. Therefore, the scenario the reviewer describes, in which all Phase I contexts lie along a single direction and become obsolete once Phase II contexts are chosen orthogonal to that direction, cannot occur.

Comment

Thank you for the explanation. My questions have been fully addressed. I will take your response into consideration when making the final recommendation.

Comment

We are very glad to have addressed the reviewer’s questions! We will incorporate all valuable feedback and discussions into the final revision. Once again, we sincerely appreciate the reviewer’s supportive feedback and recognition of our contributions.

Review (Rating: 5)

This paper studies the multinomial logit bandit problem with parameterized non-linear utility functions. Unlike prior works on this problem, computational and statistical efficiency are simultaneously achieved; the authors propose a new algorithm that achieves the order-optimal $\sqrt{T}$ rate while allowing tractable implementation. The algorithm has two phases: the first is a random exploration phase where an initial estimate of the latent parameter defining the utility function is learned, while in the second phase, a UCB approach is adopted. The estimated parameter vector is refined in the second phase by minimizing at each round a regularized negative log-likelihood function that employs a linear relaxation of the utility function, while the predictions (assortments) are chosen so as to maximize the expected reward under optimistic estimates of the utility values. The achieved regret bound depends neither on the number of arms nor on the cardinality constraint, but does depend on a problem-specific constant $\kappa$.

Strengths and Weaknesses

On the positive side, by proposing a provably statistically efficient and tractable algorithm, the paper seems to bring forth a significant contribution to a challenging but practically relevant problem that combines a combinatorial structure with a complex utility function and feedback mechanism. The paper also seems to provide new technical contributions, such as employing a linearized MNL model and establishing a new concentration result. The presentation of the paper is mostly clear and the writing is of good quality.

On the other hand, maybe I missed this, but it does not seem that the authors have established the tractability of solving the optimization problem through which the initial estimate is obtained. In fact, it has been mentioned on two occasions that the negative log-likelihood function lacks important properties here compared to the linear case, but only its effect on the analyses is highlighted. Moreover, it is not entirely clear whether the dependence of the regret on $\kappa$ is in any way natural; it also seems that Zhang and Luo [34] argue that this could result in an exponential dependence on some problem parameters. This limitation is mentioned in the conclusion, but elsewhere it is asserted that this is preferred over depending on the number of arms or the cardinality constraint as in [34]. Maybe the authors can elaborate more on why this is the case. It is also worth mentioning that, in obtaining their results, the authors employ extra assumptions compared to [34], even if their relative mildness is extensively argued for. Lastly, it seems that a similar 'reverse Lipschitz' property has already been established in [34, Lemma 3]; hence, some credit should be given. Maybe the paper could benefit from a more explicit comparison with the techniques in [34].

Questions

  • Can you argue for the tractability of solving the optimization problem through which the initial estimate is obtained?
  • Why is the dependence on $\kappa$ preferred over a dependence on $K$ or $N$? Are there interesting cases that can motivate this?
  • Do you believe that Assumption 4 is necessary for obtaining a sharp regret rate efficiently?

Limitations

The authors acknowledge that the dependence of the regret on (the reciprocal of) $\kappa$ can be a limitation. No other limitations/shortcomings seem to be highlighted in the paper.

Final Justification

After reading the other reviews and the rebuttal, I maintain my positive evaluation of the paper as the contributions seem significant. I hope the authors will incorporate the discussion about the worst-case behaviour of $\kappa$ and the computation required for obtaining the pilot estimator in the revised version.

Formatting Issues

No formatting concerns.

Author Response

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We deeply appreciate your recognition of our work and the constructive comments you have provided. Below, we address each of your comments and questions in detail:


[Computational tractability (W, Q1)]

We are glad to address this comment. We would like to clarify that the notion of tractability in both Zhang & Luo [34] and our work refers to the computational feasibility of the exploration strategy, i.e., the process of selecting an assortment given a current utility estimator. The computation required for obtaining our pilot estimator is comparable to what is already used in [34]; for instance, their use of offline regression oracles in the stochastic setting and online regression oracles in the adversarial setting (see Assumptions 2 and 3 in [34]). As in [34], any tractable empirical risk minimization method (e.g., SGD) is sufficient for computing our pilot estimator in practice. We will incorporate this point into the revised manuscript.
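
To illustrate what such an ERM computation looks like, here is a minimal sketch (our own, with a toy non-linear utility; the paper's pilot estimator is defined by its Phase I objective, and any standard solver may stand in for the regression oracle):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
d, K, n = 3, 4, 500          # feature dimension, assortment size, Phase I rounds

def utility(w, X):
    """Toy non-linear parametric utility f_w(x) = a^T tanh(B x); w packs (B, a)."""
    B, a = w[:d * d].reshape(d, d), w[d * d:]
    return np.tanh(X @ B.T) @ a

# Simulated Phase I data: uniformly explored assortments and MNL choice feedback
w_true = rng.normal(size=d * d + d)
data = []
for _ in range(n):
    X = rng.normal(size=(K, d))
    u = utility(w_true, X)
    p = np.exp(u) / (1.0 + np.exp(u).sum())          # outside option has utility 0
    y = rng.choice(K + 1, p=np.concatenate([[1.0 - p.sum()], p]))  # 0 = outside option
    data.append((X, y))

def nll(w):
    """Empirical negative log-likelihood of the MNL choice model."""
    total = 0.0
    for X, y in data:
        u = np.concatenate([[0.0], utility(w, X)])   # prepend the outside option
        total += np.logaddexp.reduce(u) - u[y]       # -log p(y | X, S, w)
    return total / n

w0 = rng.normal(size=d * d + d, scale=0.1)
pilot = minimize(nll, w0, method="L-BFGS-B")         # pilot estimator via ERM
print("training NLL of the pilot estimator:", pilot.fun)
```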


[$\kappa$ dependence (W, Q2)]

We are happy to answer this question. By the definition of $\kappa$, we have $\kappa^{-1} = O(K^2)$ in the worst-case scenario. Note that the exponential dependence can be adjusted by rescaling the utility values. Zhang & Luo [34], for instance, define the MNL probability model using bounded utilities without applying the exponential transformation; see Eq. (1) in [34]. Consequently, our reverse Lipschitzness result (Lemma 4) can be used to recover the reverse Lipschitz bound in [34] (Lemma B.3). Nevertheless, we adopt a $\kappa$-dependent analysis because, while $\kappa^{-1}$ can scale as $O(K^2)$ in the worst case, its behavior is often far more favorable in practice. For example, as shown in the experiments of Lee & Oh [17] (Figure 1), the gap between the regret of $\kappa^{-1}$-dependent algorithms (e.g., UCB-MNL [23], TS-MNL [22]) and that of $\kappa^{-1}$-improved algorithms (e.g., OFU-MNL+ [17]) diminishes as $K$ increases. This implies that the regret of $\kappa^{-1}$-dependent algorithms [22, 23] does not grow significantly with $K$, even as $K$ increases. This suggests that equating $\kappa$ with $1/K^2$, or treating it as equivalent to $K$, is overly pessimistic.

In contrast, the dependence on $N$, the total number of items, is a more serious limitation. As discussed in the main text, our regret bound is independent of $N$. However, all algorithms in [34] exhibit polynomial dependence on $N$, regardless of tractability. This poses a significant obstacle for real-world applications involving large item sets, for example on platforms like Amazon, where the number of items can reach into the millions.


[Assumption 4 (W, Q3)]

Even though Assumption 4 is used as a sufficient condition to establish the fast convergence rate of the pilot estimator, we do not believe it is a necessary condition. We expect that comparable convergence rates could still be achieved without relying on Assumption 4 if one instead uses parametric models with specific structures, such as one-hidden-layer neural networks with ReLU activation [31] or quadratic activation [32], rather than broad classes of non-linear parametric functions. However, it is worth noting that even with such structural assumptions, existing convergence analyses often require the context vectors to be sampled from specific distributions (e.g., uniform over the unit sphere [31]), which limits their applicability in practical settings.

Comment

Thank you for addressing my questions and concerns. Please also incorporate the discussion regarding the worst-case behaviour of $\kappa$ in the revised version. I will maintain my positive evaluation.

Comment

We are very glad to have addressed the reviewer’s questions! We will incorporate all valuable feedback and the discussion on the worst-case behavior of $\kappa$ into the final revision. Once again, we sincerely appreciate the reviewer’s supportive feedback and recognition of our contributions.

Review (Rating: 4)

This paper studies the contextual multinomial logistic (MNL) bandit problem with a non-linear utility function. Based on the upper confidence bound (UCB) principle, a novel computationally efficient algorithm for contextual MNL is provided. The algorithm is applicable to utility functions modeled by neural networks, and achieves a near-optimal regret $O(\sqrt{T})$ without relying on over-parameterized neural tangent kernel approximations. Numerical experiments are conducted to validate the effectiveness and robustness of the proposed algorithm, in both the realizable setting and model misspecification scenarios.

Strengths and Weaknesses

Strengths

(1) This paper proposes a novel UCB-based algorithm for the multinomial logistic bandit problem, which is computationally tractable and statistically efficient. The regret bound of $\tilde{O}(\sqrt{T})$ is near optimal under non-linear utility settings.

(2) The algorithm uses two-phase optimization: in the first phase it runs a uniform exploration and empirical risk minimization strategy to output a pilot estimator as the initial utility parameter for the second phase; in the second phase, the algorithm adopts a sophisticated strategy: it uses the estimated parameter $\hat{\boldsymbol{w}}_t$ to construct a tight confidence set and the optimistic utility estimate, then constructs the optimistic reward based on the optimistic utility estimate, and finally chooses an assortment set by maximizing the optimistic expected reward.

(3) In the design of negative log-likelihood in MNL model, the authors apply a first-order Taylor approximation of the non-linear utility function to circumvent the non-linear challenges.

(4) The proofs are technical. I checked the proofs as far as I could; the crucial technique is to leverage Lemma D.1 in Zhang and Luo [34] and the quadratic growth condition in Assumption 4 to derive a high-probability convergence rate of $O(1/n)$ (via induction) for $\min_{\boldsymbol{w} \in \mathcal{W}^{\star}} \lVert \hat{\boldsymbol{w}}_t - \boldsymbol{w} \rVert_2^2$, use it to set the upper confidence bound on $\min_{\boldsymbol{w} \in \mathcal{W}^{\star}} \lVert \hat{\boldsymbol{w}}_t - \boldsymbol{w} \rVert_{\mathbf{V}_t}^2$, and eliminate the dependence on the number $N$ of items. Finally, this upper confidence bound is used to derive the near-optimal regret $O(\sqrt{T \log T})$.

(5) Experiments on the realizable and misspecification settings further validate the effectiveness and lower regret of the proposed algorithm.

Weaknesses

(1) The major concern lies in whether the assumptions used are reasonable enough to derive the improved results. For example, the boundedness of the loss function/its gradient/its Hessian in Assumption 2 is a very strong assumption, especially in non-convex deep neural network settings. It is not very clear whether such an assumption is used in related MNL works, or holds in the neural network-based bandit setting.

(2) The comparisons with related works are not fair enough. It is true that this work obtains a sharper regret bound than that in Zhang and Luo [34], but the theoretical results in [34] do not rely on the very strong boundedness assumption in Assumption 2 of this work. Besides, [34] does not use the generalized geometric condition in Assumption 4 either (although I know Assumption 4 is the key assumption for deriving the improved regret bound of this paper). Therefore, I would suggest the authors include a table (like Table 1 in [34]) in the main paper or appendix, to explicitly compare the necessary assumptions and regret bounds of existing works and this work.

(3) The quadratic growth condition in Assumption 4 is the key to deriving the tighter confidence bound. However, more concrete examples (especially in the non-convex deep neural network bandit setting) should be added to explain the reasonableness of this assumption.

(4) In Theorem 1, $\kappa = \min_{\boldsymbol{w}, X, S, i} p(0 \mid X, S, \boldsymbol{w})\, p(i \mid X, S, \boldsymbol{w})$ is the minimum over many probabilities, and its value affects the number of iterations $n/T$. Will it be close to zero? More experiments on the value of this hyperparameter should be added.

Questions

(1) Could you please give more examples to show the reasonableness of Assumptions 2 and 4?

(2) Could you please give fairer comparisons (i.e., the assumptions used to derive the regret bounds) with related works (e.g., [34]) in the discussion of Theorem 1?

(3) Could you please show empirically the value of the hyperparameter $\kappa$?

Limitations

(1) The Assumption 2 used seems very strong, and more examples of neural network-based bandit models (beyond the logit bandit with negative log-likelihood loss) are missing.

Final Justification

Thanks to the authors for their detailed explanations. I have read the rebuttal and the other reviews, and most of my concerns have been addressed. This is a nice work and I still keep my positive evaluation.

Formatting Issues

I found no major formatting issues in this paper.

Author Response

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We deeply appreciate your recognition of our work and the constructive comments you have provided. Below, we address each of your comments and questions in detail:


[Reasonableness of assumptions (W1, W3, Q1)]

We thank the reviewer for providing us with the opportunity to clarify this point. First, we respectfully clarify that Assumption 2 is a condition on the utility function $f$ itself, not on the loss function. The properties in Assumption 2 (boundedness, differentiability, and smoothness) are standard in the bandit literature that employs parametric reward models [1, 8, 9, 10, 18, 19, 20, 27], and they are also satisfied in NTK-based neural bandits [36, 38]. For instance, in over-parameterized neural networks, both the gradient and the Hessian are bounded; see Lemmas B.6 and C.2 in [38]. Moreover, in most of the existing MNL bandit literature [17, 22, 23, 25, 37], the utility function is assumed to be linear, in which case Assumption 2 holds trivially.

We also clarify that Assumption 4 does not require a quadratic growth condition. Instead, it introduces local strong convexity and a $(\tau, \gamma)$-growth condition (for $0 < \gamma < 2$) on the expected loss $\ell(\mathbf{w})$ around the equivalence set. Also, Assumption 4 is satisfied by a broad class of neural networks studied in prior work. For example, the local strong convexity of the population risk has been shown for two-layer neural networks (Zhong et al., 2017), quadratic neural networks [31], and ReLU networks (Zhang et al., 2019), all of which fall within the scope of our assumption. These models typically have multiple global minima (due to symmetries), and thus fall outside the scope of earlier geometric conditions that rely on uniqueness of the global minimum. We will add this comment in the revised version.
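
As a one-dimensional toy illustration of such a landscape (ours, not from the paper): $\ell(w) = (w^2 - 1)^2$ has two global minima at $w = \pm 1$ (a sign symmetry), is non-convex globally, yet is locally strongly convex around each minimum:

```python
import numpy as np

loss = lambda w: (w**2 - 1.0) ** 2     # two global minima at w = +/-1 (sign symmetry)
hess = lambda w: 12.0 * w**2 - 4.0     # second derivative of the loss

for w in [-1.0, 1.0]:
    print(f"w = {w:+.0f}: loss = {loss(w):.1f}, loss'' = {hess(w):.1f} > 0 "
          "(locally strongly convex around this minimum)")
print(f"w =  0: loss'' = {hess(0.0):.1f} < 0 (the landscape is non-convex in between)")
```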

Now, we would like to reiterate that Assumption 4 is strictly weaker than the global strong convexity assumed in prior parametric bandits [1, 8, 9, 10, 19], and does not require the existence of a unique global minimum [18, 20, 27]. This allows for an arbitrary number of spurious local minima and global minima in the landscape of the expected loss, enabling our analysis to cover highly non-convex and non-linear utility functions. In summary, the function class covered by Assumption 4 generalizes the parametric function classes studied in prior bandit literature, and additionally includes a broader family of non-linear, non-convex functions that have not been addressed in earlier works.

(References)

  • Zhong et al. (2017), "Recovery Guarantees for One-hidden-layer Neural Networks", ICML
  • Zhang et al. (2019), "Learning One-hidden-layer ReLU Networks via Gradient Descent", AISTATS

[Comparison with Zhang & Luo (2024) (W2, Q2)]

We appreciate the reviewer's suggestion and will include a table that compares the assumptions and regret bounds of prior work and ours.

| | Utility class | Assumptions | Tractability | Regret |
|---|---|---|---|---|
| Corollary 3.5 [34] | Boundedness, Lipschitzness | Stochastic context, offline regression oracle | O | $\tilde{O}\big((dNK)^{1/3} T^{2/3}\big)$ |
| Corollary 3.8 [34] | Boundedness, Lipschitzness | Stochastic context, offline regression oracle | X | $\tilde{O}\big(K^2 \sqrt{dNT}\big)$ |
| Corollary 4.4 [34] | Boundedness, Lipschitzness | Adversarial context, online regression oracle | O | $\tilde{O}\big((NK)^{1/3} T^{5/6}\big)$ |
| Corollary 4.8 [34] | Boundedness, Lipschitzness | Adversarial context, online regression oracle | X | $\tilde{O}\big(K^2 \sqrt{dNT}\big)$ |
| This work | Boundedness, Lipschitzness, smoothness, geometric condition | Stochastic context (Phase I) | O | $\tilde{O}\big(\kappa^{-2} \mu^{-1} d_w \sqrt{T}\big)$ |

(O = tractable, X = not tractable)

We would like to clarify that Zhang & Luo [34] also make boundedness and Lipschitz continuity assumptions. As mentioned in the main text, Assumption 4 is required in Phase I to establish the convergence rate of the pilot estimator. In contrast, the analysis in [34] relies on the generalization error of the (offline or online) regression oracle measured in terms of log-loss (Assumptions 2 and 3 in [34]), rather than on the distance between the estimator and the true parameter. This implies that even if a geometric condition like our Assumption 4 were imposed in the setting of [34], it would not yield any advantage in terms of computational tractability.

On the other hand, under the same generalization error rate of $O(1/n)$ established in [34], we show that, when combined with Assumption 4, this is sufficient to guarantee a fast convergence rate of the estimator. The main contribution of our work lies in proposing an algorithm for MNL bandits with non-linear utilities that is both computationally tractable and statistically efficient, under Assumption 4, which is satisfied by a broad class of parametric models and neural networks. We believe it remains an open question how large the class of utility functions is in practical scenarios where Assumption 4 does not hold, and whether a suitable regression oracle, as required in [34], exists for such function classes.


[About $\kappa$ (W4, Q3)]

By definition, $\kappa^{-1} = O(K^2)$ in the worst-case scenario, and we use $\kappa = \frac{\exp(-1)}{(1 + K \exp(1))^2}$ in our numerical experiments. We would like to note that in our current analysis, the dependence of $n$ or $T$ on $\kappa$ serves as a sufficient condition for achieving the $\tilde{O}(\sqrt{T})$ regret bound. However, as shown in our experiments, the algorithm performs well even with a small number of pure exploration rounds, suggesting that the theoretical $\kappa$-dependence may be overly conservative in practice. We believe that closing this gap, or deriving regret bounds with improved dependence on $\kappa$ in the context of non-linear MNL models, remains a valuable direction for future research.
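
For reference, the following snippet (ours) tabulates this choice of $\kappa$ and shows that the resulting $\kappa^{-1}$ grows roughly as $K^2$:

```python
import numpy as np

for K in [5, 10, 20, 50, 100]:
    kappa = np.exp(-1) / (1 + K * np.exp(1)) ** 2   # value used in our experiments
    print(f"K = {K:3d}   kappa = {kappa:.2e}   "
          f"1/kappa = {1/kappa:.2e}   (1/kappa)/K^2 = {1/(kappa * K**2):.1f}")
```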

Comment

We sincerely appreciate the time and effort the reviewer has devoted to reviewing our paper. We hope that our responses have adequately addressed the reviewer’s requests and provided the necessary clarifications. With our recent comment ("Key Contributions") in mind, we would be happy to address any further questions or concerns the reviewer may have. Otherwise, if our responses have sufficiently addressed the reviewer’s comments, we respectfully hope that the reviewer will kindly reconsider their assessment in light of the strengths and contributions of our work. Thank you again for your thoughtful feedback and consideration.

Review (Rating: 4)

This paper tackles the contextual multinomial logistic (MNL) bandit problem, a key model for sequential assortment selection in applications like online recommendation and retail. Most prior work assumes linear utility functions, which limits expressiveness. The authors propose a new algorithm—ONL-MNL—that:

  1. Handles general non-linear parametric utility functions (e.g., neural networks),
  2. Achieves a regret bound of order $\sqrt{T}$,
  3. Is computationally tractable (unlike prior work that trades off tractability for statistical efficiency),
  4. Avoids reliance on neural tangent kernel (NTK) assumptions or overparameterization.

The algorithm uses a two-phase strategy: uniform exploration followed by optimistic exploration using a novel confidence set construction. Theoretical guarantees are supported by a new generalized geometric condition and a concentration inequality. Empirical results validate the method under both realizable and misspecified settings.

Strengths and Weaknesses

Strengths:

  1. The problem is relevant and well-motivated.
  2. The regret bound is great. However, I did not check the math in great detail.

Weaknesses:

  1. Can you provide some real-world numerical evaluations? Experiments are limited to synthetic settings with fixed $N$, $K$, and $d$. There is no exploration of scalability (e.g., larger $N$ or deeper networks), nor any real-world datasets.
  2. As there can be multiple global minima, the convergence rate of the algorithm may depend on which global minimum the algorithm finds. This may affect the regret bound. Did you consider this in the design and analysis of your algorithm?
  3. The paper is dense and highly technical. Some sections (e.g., the regret analysis) could benefit from more intuition.
  4. The regret bound depends on a problem-specific parameter $\epsilon$, which may be small in worst-case scenarios.
  5. The baselines are mostly linear or from a single prior work on general utilities. Recent advances in neural bandits (e.g., NeuralUCB, NeuralTS) are not included.

Questions

Refer to weaknesses.

Limitations

Refer to weaknesses.

Final Justification

Based on the rebuttal, I have decided to maintain my score.

Formatting Issues

N/A

Author Response

Thank you for taking the time to review our paper and for your thoughtful and valuable feedback. We deeply appreciate your recognition of our work and the constructive comments you have provided. Below, we address each of your comments and questions in detail:


[Real-world dataset (W1)]

Since the sequential assortment selection problem addressed in our paper involves online decision making, it is inherently challenging to obtain real-world datasets that exactly match our problem setting. First, ground-truth choice probabilities are typically unobservable in real-world environments. Second, suppose there are $N$ total base items that a user can choose from. In practice, a real-world dataset would not contain responses for all combinatorially many assortments (at least $\binom{N}{K}$ combinations; e.g., with $N = 100$ and $K = 5$ there are already more than $7.5 \times 10^7$ assortments). If we only have responses for a small subset of these assortments, it becomes unclear how to evaluate an algorithm when it selects an assortment for which no feedback is available in the dataset.

Furthermore, even if suitable offline datasets exist, semi-synthetic experiments are still necessary to compute online regret—an essential evaluation metric in our setting. For these reasons, we—along with the majority of prior MNL bandit work [2, 3, 4, 17, 22, 23, 24, 25, 37]—rely on synthetic datasets that simulate realistic choice behavior, enabling reliable and consistent comparisons between algorithms. Additionally, we emphasize that we evaluate our method under model misspecification, where the true utility function differs from the estimated model, in order to better reflect practical scenarios.


[Scalability (W1)]

We conducted experiments across a wide range of values for $K$, $d$, and different network structures, as much as time permitted. We adopted the default configuration used in the original paper, setting $N = 100$, $K = 5$, $d = 3$, $L = 2$, and $m = 3$, where $L$ denotes the number of layers and $m$ denotes the number of hidden units per layer. In each experiment, we varied one or more of these parameters to assess the performance of the algorithms. We report the cumulative regret at $T = 1000$, averaged over 5 independent runs.

| Parameters | True utility & Context | `ONL-MNL` (ours) | `UCB-MNL` | `TS-MNL` | `OFU-MNL+` | $\varepsilon$-greedy-MNL |
|---|---|---|---|---|---|---|
| K=10 | NN & Gaussian | 5.70 | 25.93 | 30.01 | 25.36 | 18.49 |
| | NN & uniform | 11.60 | 33.45 | 41.04 | 31.93 | 27.20 |
| | cos & Gaussian | 13.79 | 33.20 | 55.32 | 30.04 | 23.38 |
| | cos & uniform | 10.73 | 24.23 | 70.05 | 19.46 | 17.40 |
| K=15 | NN & Gaussian | 4.29 | 17.64 | 20.16 | 16.93 | 16.02 |
| | NN & uniform | 5.53 | 23.15 | 27.35 | 22.01 | 20.48 |
| | cos & Gaussian | 9.33 | 21.52 | 34.64 | 19.53 | 20.66 |
| | cos & uniform | 6.18 | 16.67 | 45.20 | 12.32 | 15.00 |
| K=10, d=5 | NN & Gaussian | 7.01 | 23.78 | 27.60 | 23.35 | 18.40 |
| | NN & uniform | 8.20 | 28.84 | 34.15 | 27.96 | 19.12 |
| | cos & Gaussian | 11.82 | 31.91 | 54.45 | 29.05 | 17.52 |
| | cos & uniform | 6.56 | 19.33 | 65.45 | 15.27 | 12.03 |
| K=10, d=10 | NN & Gaussian | 11.54 | 38.11 | 43.65 | 37.70 | 21.28 |
| | NN & uniform | 12.81 | 41.97 | 49.71 | 41.45 | 18.89 |
| | cos & Gaussian | 9.04 | 28.96 | 52.84 | 28.31 | 17.09 |
| | cos & uniform | 4.36 | 14.13 | 50.74 | 12.92 | 10.17 |
| L=3, m=20 | NN & Gaussian | 11.54 | 57.77 | 179.70 | 62.62 | 16.50 |
| | NN & uniform | 8.05 | 55.81 | 191.43 | 58.03 | 14.42 |
| | cos & Gaussian | 38.85 | 60.86 | 112.21 | 64.62 | 51.41 |
| | cos & uniform | 24.11 | 39.47 | 133.57 | 41.40 | 39.80 |
| L=5, m=20 | NN & Gaussian | 29.11 | 112.91 | 156.37 | 116.29 | 82.12 |
| | NN & uniform | 17.26 | 95.73 | 178.60 | 96.70 | 33.35 |
| | cos & Gaussian | 37.48 | 60.85 | 110.66 | 65.28 | 55.02 |
| | cos & uniform | 28.38 | 37.81 | 138.50 | 42.02 | 68.51 |

Across all tested settings, our algorithm consistently achieves lower cumulative regret than baselines. We also note that experiments with larger NN are already provided in Appendix G for the reviewer’s reference.


[Multiple global minima (W2)]

The existence of multiple global minima does not affect the main regret bound presented in Theorem 1. This is because our regret analysis depends not on convergence to a particular global minimizer, but rather on how quickly the algorithm approaches the equivalence set $\mathcal{W}^{\ast}$. Based on Assumption 4, which characterizes the local geometry of the expected loss around the equivalence set $\mathcal{W}^{\ast}$ (as opposed to the unknown utility parameter $\mathbf{w}^{\ast}$), we establish a convergence guarantee for the pilot estimator $\hat{\mathbf{w}}_0$ relative to $\mathcal{W}^{\ast}$. Subsequently, the construction of the confidence set and the computation of the optimistic utility are both based on this convergence behavior with respect to the equivalence set. Thus, the regret bound remains valid no matter which global minimum the algorithm converges to.


[Density (W3)]

Since non-linear utility functions in the MNL bandit setting have not been extensively explored in prior literature, we aimed to provide as much technical detail as possible to clearly convey the challenges of learning non-linear utilities within this framework. We respectfully believe that this level of depth is a strength rather than a weakness of the paper.

In response to the reviewer's concern, we have supplemented the intuition behind our regret analysis. As the reviewer is likely aware, the core technical components of the UCB algorithm involve:

(i) constructing a confidence set for the unknown true parameter (see Lemmas 1 and 2), and

(ii) computing an optimistic estimate within that confidence set (see Lemma 3).

In Section 3.3, we explain how these two elements are instantiated in the context of MNL bandits with non-linear utility functions. If additional space becomes available at the camera-ready stage, we would be happy to further incorporate intuition and discussion behind our regret analysis.


[κ\kappa dependence (W4)]

Since we do not use an $\varepsilon$ in our manuscript, we interpret the reviewer's question as referring to the problem-dependent constant $\kappa$. Indeed, our current regret bound depends on $\kappa$. This type of dependence has also been observed in prior work on MNL bandits with linear utilities [10, 19, 22, 23], and only recently have algorithms been proposed that achieve regret bounds with improved dependence on $\kappa$ [8, 17, 37]. A key ingredient that made such improvements possible is that the negative log-likelihood in the linear MNL model exhibits a self-concordant-like property. However, this property is generally not expected to hold in non-linear utility settings, because it requires the underlying utility function to satisfy a strong third-order smoothness condition, specifically that the third derivative is controlled by the second derivative. Such functions constitute a very restricted class. Therefore, we believe that developing regret bounds with weaker $\kappa$-dependence in the non-linear utility setting remains an open and challenging problem, likely requiring new analytical tools.


[Baselines (W5)]

To the best of our knowledge, except for Zhang & Luo [34], almost all prior work on contextual MNL bandits assumes a linear utility function [5, 24, 22, 4, 23, 25, 2, 37, 17]. This is why the baselines we compare against are predominantly based on linear models.

Meanwhile, reward models used in neural bandits are fundamentally mismatched with the MNL bandit setting. Neural bandit algorithms are designed for multi-armed bandit problems that learn reward functions for individual arms, whereas MNL bandit algorithms are designed for the assortment selection problem, where the goal is to model choice probabilities over item sets. This model mismatch is also why standard linear contextual bandits like LinUCB or LinTS are not typically included in comparisons within the linear MNL bandit literature. However, we would be happy to include additional baselines if the reviewer could suggest any methods that handle non-linear utilities in the MNL setting; we are more than willing to run further experiments within the remaining review period!

Comment

We sincerely appreciate the time and effort the reviewer has devoted to reviewing our paper. We hope that our responses have adequately addressed the reviewer’s requests and provided the necessary clarifications. In particular, as requested by the reviewer, the additional experiment using a real-world dataset—reported in our recent official comment ("Additional Experiments on Real-World Datasets")—demonstrates that our proposed algorithm outperforms existing baselines even in practical settings. We are grateful that the reviewer’s suggestion led us to design this experiment and obtain results that further strengthen the practical significance of our contributions.

With our recent comment ("Key Contributions") in mind, we would be happy to address any further questions or concerns the reviewer may have. Otherwise, if our responses have sufficiently addressed the reviewer’s comments, we respectfully hope that the reviewer will kindly reconsider their assessment in light of the strengths and contributions of our work. Thank you again for your thoughtful feedback and consideration.

Comment

In light of Reviewer FLnn's comment (which we appreciate) regarding the use of real-world data, we have carefully designed an additional semi-synthetic experiment leveraging a real-world dataset. As noted in our previous response to Reviewer FLnn, this approach serves as the most viable and practical alternative in the absence of access to live online field experimentation.

We used the IMDB Large Movie Review dataset (Maas et al., 2011), which consists of 50,000 movie reviews in text form, each labeled as either positive or negative. To evaluate online assortment selection algorithms, as previously mentioned, one needs access to ground-truth choice probabilities for given assortments, which are typically unobservable in real-world datasets. To address this, we first transformed the review texts into vector representations using TF-IDF (Term Frequency–Inverse Document Frequency), implemented via the TfidfVectorizer from scikit-learn. We then applied truncated SVD to reduce the dimensionality of the vectors to $d = 30$. This process yielded a dataset consisting of 50,000 context feature vectors with corresponding binary labels (0 or 1).

We then used 40,000 of these samples as training data to fit a binary classification model. Specifically, we trained a two-layered neural network with 32 hidden nodes to classify the binary labels. The learned model was subsequently used to define the true utility for each movie based on its extracted context feature, allowing us to formulate an online assortment selection task. We then evaluated both the MNL bandit baselines and our proposed algorithm under this setting.
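
A minimal sketch of this preprocessing and utility-fitting pipeline (our own reconstruction from the description above; `texts` and `labels` are assumed to be loaded from the IMDB dataset, the vocabulary cap is a hypothetical choice, and the mapping from classifier output to utility is one natural option rather than the exact one used):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neural_network import MLPClassifier

# `texts` (50,000 review strings) and `labels` (array of 0/1) are assumed to be
# loaded from the IMDB Large Movie Review dataset (Maas et al., 2011).
tfidf = TfidfVectorizer(max_features=20_000).fit_transform(texts)
contexts = TruncatedSVD(n_components=30, random_state=0).fit_transform(tfidf)

# Fit the "true utility" model on 40,000 training samples:
# a network with one hidden layer of 32 units (i.e., two layers of weights).
train_X, held_out = contexts[:40_000], contexts[40_000:]
clf = MLPClassifier(hidden_layer_sizes=(32,), random_state=0).fit(train_X, labels[:40_000])

def true_utility(x):
    """Utility of a movie = positive-class logit of the trained network."""
    p = np.clip(clf.predict_proba(x.reshape(1, -1))[0, 1], 1e-6, 1 - 1e-6)
    return np.log(p / (1 - p))

# Each bandit round: sample N = 100 held-out movies as the candidate set,
# from which the algorithm must choose an assortment of size K = 5.
idx = np.random.default_rng(0).choice(len(held_out), size=100, replace=False)
round_contexts = held_out[idx]
```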

At each round of the online experiment, we randomly sampled $N = 100$ movies from the remaining 10,000 held-out samples, and asked the algorithm to choose an assortment of size $K = 5$. As in our main experiments, we used a uniform revenue of 1 for each item in the assortment (since revenue information is not given in the dataset). The experimental results (cumulative regret at time step $T = 1000$) are summarized in the table below.

| Method | `ONL-MNL` (ours) | `UCB-MNL` | `TS-MNL` | `OFU-MNL+` | $\varepsilon$-greedy-MNL |
|---|---|---|---|---|---|
| Cumulative regret at $T = 1000$ | 4.72 ± 0.93 | 14.48 ± 1.46 | 63.80 ± 5.10 | 25.34 ± 1.70 | 10.42 ± 2.31 |

According to the table above, our proposed algorithm significantly outperforms the baseline methods. We are happy to offer these additional experiments, which go beyond synthetic settings and provide a meaningful way to evaluate MNL bandit algorithms under non-linear utilities and features drawn from real-world observations. We are pleased that reviewer FLnn's suggestion led us to design this additional experiment and obtain results that further support our contributions. We will also add this experiment to the final revision.

We hope our additional response has helped evaluate our method. If there are any remaining questions or points of clarification, please feel free to let us know at your convenience so we have sufficient time for further discussion.


(Reference)

  • Maas et al. (2011) "Learning word vectors for sentiment analysis." Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies.
Comment

We would like to sincerely thank the reviewers once again for taking the time to review our manuscript, provide constructive feedback, and offer a positive evaluation. As the author-reviewer discussion period draws to a close, we would like to take this opportunity to revisit the key contributions of our work, reflecting on the invaluable feedback provided by the reviewers.


  • Novel Algorithm: In this paper, we propose a novel algorithm for the MNL bandit with generic non-linear utilities, which is both computationally tractable and statistically efficient. Unlike prior work on general utility function classes [34], our method leverages optimistic exploration based on UCB, enabling polynomial-time computation while achieving a $\tilde{O}(\sqrt{T})$ regret bound. Moreover, the regret bound of our proposed algorithm is independent of the total number of items, making it particularly suitable for large-scale applications.

  • Relaxed Assumptions: Our algorithm is built on two main conditions: boundedness (Assumption 2) and a generalized geometric condition (Assumption 4). The boundedness assumption (along with differentiability and smoothness) is standard in the bandit literature involving parametric reward models and is also satisfied in NTK-based neural bandits. More importantly, the generalized geometric condition introduced in our work is strictly weaker than the global strong convexity assumptions used in prior works. Besides, it does not require the existence of a unique global minimum and applies to a broader family of non-linear, non-convex functions beyond those covered by existing bandit algorithms.

  • Novel Analytical Techniques: The proof techniques in our paper significantly differ from those used in the previous MNL bandit literature. Leveraging fast generalization error bounds of the regression oracle in terms of log-loss, we establish the convergence rate of the pilot estimator. Building on this $O(1/n)$ rate, we employ a first-order Taylor approximation to linearize the non-linear MNL model and analyze the resulting log-loss. Combined with the newly established concentration inequality for the MNL utility parameter (Lemma 5) and a carefully chosen regularization parameter $\lambda = \tilde{\mathcal{O}}(\sqrt{T})$, we derive the concentration bound $\min_{\tilde{\mathbf{w}} \in \mathcal{W}^{\ast}} \lVert \hat{\mathbf{w}} - \tilde{\mathbf{w}} \rVert_{\mathbf{V}}^2 = \tilde{O}(d_w)$ for the estimated parameter. These analytical techniques are novel in the context of MNL bandits and we believe they offer promising directions for future research.

  • Extensive Empirical Evaluation: While the prior work [34] did not provide any numerical experiments for MNL bandits with general utilities, we present empirical results across a variety of settings to validate our approach. Besides, in light of Reviewer FLnn's comment on scalability, we conducted experiments at various scales and further included semi-synthetic evaluations based on real-world datasets. Across all experimental settings, our proposed algorithm consistently outperformed baselines, demonstrating both its robustness and practicality.


In summary, we remain confident in the substantial contributions of our work—not only in terms of introducing a new algorithm and relaxed assumptions, but also through refined regret analysis and practical applicability. We believe that this work offers a comprehensive contribution, bringing together deep theoretical rigor, novel analysis and practical relevance.

Final Decision

The paper introduces a new optimistic algorithm and stronger regret guarantees for multinomial logistic bandits with nonlinear utilities. This is a very general and practically important setting, and the theoretical and algorithmic contributions are significant. However, these results are derived under strong Lipschitz-continuity and smoothness conditions, which often scale poorly in deep networks. Furthermore, the authors should clarify the role of $\kappa$ in their regret bound as well as the computational cost of their algorithm.