PaperHub
Overall rating: 7.4/10 (Poster, 5 reviewers; min 7, max 9, std 0.8)
Individual ratings: 7, 7, 9, 7, 7
Confidence: 3.4 · Correctness: 3.2 · Contribution: 3.0 · Presentation: 3.4
NeurIPS 2024

Nearly Minimax Optimal Regret for Multinomial Logistic Bandit

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2025-01-15
TL;DR

In this paper, we investigate an adversarial contextual multinomial logit (MNL) bandit problem and establish minimax optimal lower and upper regret bounds.

Abstract

In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{\mathcal{O}}(d\sqrt{\smash[b]{T/K}})$. We also provide instance-dependent minimax regret bounds under uniform rewards. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality --- for either uniform or non-uniform reward setting --- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
Keywords
Bandit, Contextual Bandit, Multinomial Logistic Bandit, Minimax Regret

Reviews and Discussion

Review
Rating: 7

This paper studies the minimax optimal regret for a series of contextual MNL bandit problems, covering uniform and non-uniform reward cases and varying values of the outside option $v_0$. The authors first derive a regret lower bound with explicit dependence on the key parameters. They then propose an efficient algorithm, OFU-MNL+, that achieves a matching upper bound up to logarithmic factors. Simulations were conducted to demonstrate the tight dependence of the derived bounds.

Strengths

  1. Despite many recent advances in contextual MNL bandit problems, no previous study has proven minimax optimality. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the feature dimension $d$ and the maximum assortment size $K$. This paper closes this gap. The technical contribution is strong and sufficient.

  2. Although this is a theoretical paper, it is well written and easy to follow. I enjoyed reading the paper.

Weaknesses

  1. My major question is about Assumption 2, which requires a uniform lower bound for every item $i \in \mathcal{S}$, any assortment set, and all rounds $t$. Requiring $\kappa$ to be a constant would be a strong assumption in practice. Since $\min_i p_t(i \mid S, w) \le 1/K$, $\kappa$ would naturally depend on the size $K$. In this case, would it be possible to derive tighter results by explicitly including $\kappa$ in the lower and upper bounds?

Below are a few minor suggestions:

  1. I would suggest introducing Section 4 (Problem Setting) before Section 3 (Existing Gap). Readers who are not familiar with contextual MNL bandit problems might be confused while reading Section 3.

  2. Section 3 overlaps with many discussions in the Introduction section. It might be helpful to shorten the discussion and add more discussion on (1) why proving the new lower and upper bound results is non-trivial; (2) proof outlines of the key steps used in the proofs.

  3. In the Introduction, line 31 on page 1, it would be helpful to add some clarification on the definition of $v_0$. The current terminology "utility of the outside option" is not very clear.

  4. In line 162 on page 4, the authors mention that the set of items $[N]$ can vary over time. While I understand the assortment set $\mathcal{S}$ could change over time, I am confused about how $[N]$ could vary over time. The notation $[N]$ means $\{1, \ldots, N\}$. What do you mean by this set varying over time? Do you allow new items to enter the market or some items to be removed over time?

Questions

See Weaknesses.

Limitations

N/A

Author Response

We are delighted that you enjoyed reading our paper and appreciated the value of our contributions. Thank you very much for your valuable feedback. Based on your detailed and insightful suggestions, we will make further improvements to our paper.


Regarding Assumption 2

We would like to clarify that the existence of a non-zero $\kappa$ in our Assumption 2 should NOT be considered a weakness. It is important to note that $\kappa$ is always non-zero, as all MNL probabilities are non-zero for bounded feature vectors and bounded parameters, and $\kappa$ represents the minimum value of the product of two MNL probabilities. Thus, it is more appropriate to "define" $\kappa$ rather than assuming its existence. In Goyal and Perivier (2021), $\kappa$ is also defined rather than assumed. For clarity, we will revise Assumption 2 to be a Definition in our final version.

Moreover, this is a common assumption in logistic, MNL, and generalized linear model (GLM) bandits, and should NOT be considered a unique limitation of our work.

More interestingly, although $\kappa$ can be exponentially small, our algorithm does not need to know $\kappa$, and we avoid $\kappa$-dependence in the leading term of our regret analysis, making our algorithm much more practical.
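
As a concrete numerical illustration of this point, below is a minimal Python sketch (our own, not the authors' code; the toy dimensions, random features, and number of sampled assortments are illustrative assumptions). It evaluates the quantity behind Assumption 2, $\min_{i} p_t(i \mid S, \mathbf{w})\, p_t(0 \mid S, \mathbf{w})$, over random assortments with bounded features and a bounded parameter, showing that it is strictly positive yet can be quite small.

```python
import numpy as np

def mnl_probs(X, w, v0=1.0):
    """MNL choice probabilities for items with feature rows X (K x d) and
    outside-option utility v0; index 0 of the output is the outside option."""
    utils = np.exp(X @ w)                    # exp(x_i^T w) for each item in S
    denom = v0 + utils.sum()
    return np.concatenate(([v0 / denom], utils / denom))

rng = np.random.default_rng(0)
d, K, v0 = 5, 10, 1.0
w = rng.normal(size=d)
w /= np.linalg.norm(w)                       # ||w||_2 <= 1
kappa = np.inf
for _ in range(1000):                        # sample random assortments
    X = rng.uniform(-1, 1, size=(K, d))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))  # ||x_i||_2 <= 1
    p = mnl_probs(X, w, v0)
    kappa = min(kappa, (p[1:] * p[0]).min()) # min_i p(i|S,w) p(0|S,w)
print(f"kappa over sampled assortments: {kappa:.4e}")  # positive, but small
```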


Problem-dependent regret bounds

It turns out that problem-dependent upper and lower bounds under uniform rewards are achievable. By making some modifications to our results, we can easily obtain them.

1. Upper bound
By adapting Theorem 2 from Faury et al. (2022) [21] to the multinomial and online parameter setting, we can achieve a regret upper bound of $\tilde{\mathcal{O}}\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star} + d^2/\kappa \right)$, where $\kappa_t^\star = \sum_{i \in S_t^\star} p_t(i \mid S_t^\star, \mathbf{w}^\star)\, p_t(0 \mid S_t^\star, \mathbf{w}^\star)$. See Section A in the comment below for the proof.

2. Lower bound
We can also derive a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$ by applying Theorem 2 from Abeille et al. (2021).

In the proof of Theorem 1, both the optimal assortment $S^\star$ and an alternative assortment $\tilde{S}_t$ (which contains only the feature vector that maximizes the utility in $S_t$) have the same feature vectors within each assortment. Let $S^\star = \{x, \dots, x\}$ and $\tilde{S}_t = \{\hat{x}, \dots, \hat{x}\}$. Then, by Proposition D.1, we have

$\sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in S_t} p(i \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in \tilde{S}_t} p(i \mid \tilde{S}_t, \mathbf{w}^\star) =: f(x;\mathbf{w}^\star) - f(\hat{x};\mathbf{w}^\star).$

Here, $f(x;\mathbf{w}^\star) = \frac{K\exp(x^\top\mathbf{w}^\star)}{v_0 + K \exp(x^\top \mathbf{w}^\star)}$ maps from $\mathbb{R}^d$ to $\mathbb{R}$. Since the input to the function $f$ is a single vector $x$, $f$ can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$.
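
To make the reduction explicit, note the following algebraic identity (our own clarifying step, not text from the original proof): with $\sigma(z) = 1/(1+e^{-z})$,

$

f(x;\mathbf{w}^\star) = \frac{K\exp(x^\top\mathbf{w}^\star)}{v_0 + K\exp(x^\top\mathbf{w}^\star)} = \frac{1}{1 + (v_0/K)\exp(-x^\top\mathbf{w}^\star)} = \sigma\!\left(x^\top\mathbf{w}^\star + \log(K/v_0)\right),

$

so $f$ is exactly a sigmoid of a linear score shifted by $\log(K/v_0)$, and logistic-bandit lower-bound constructions apply to it directly.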


Suggestions

1. Introducing Section 4 before Section 3: That’s a great suggestion. Thank you for your input. We will incorporate these changes into the final version.

2. More discussion on (1) why proving the new lower and upper bounds is non-trivial; (2) providing some proof outlines for the key steps used in the proofs: Due to the page limit, we have briefly outlined the technical novelties used in each proof in the discussions section of the main paper, while providing more comprehensive details in the Appendix. If we receive an extra page for the camera-ready version, as you suggested, we will expand the discussions to further emphasize the importance of our bounds and elaborate on the key steps used in the proofs.

3. Clear definition of $v_0$: The utility of the outside option, $v_0$, is the attraction parameter for not choosing any item, and it is used to construct the MNL probability. For better presentation, we will include a reference to the definition of the MNL probabilities (Eq. 1) when we first introduce the term $v_0$ and provide additional explanation about $v_0$.

4. The meaning of "the set of items $[N]$ can vary over time" (Line 162): The statement indicates that we allow for the entry of new items into the market or the removal of existing items over time, which could also lead to changes in the total number of items, denoted as $N_t$. For clarity, we need to define the item set at time $t$ as $\mathcal{I}_t := \{ i_{t1}, \dots, i_{tN_t} \}$. Thus, the item set $\mathcal{I}_t$ can vary over time. Thank you for pointing this out. We will clarify this notation in the final version of our paper.

Comment

A. Proof of problem-dependent upper bound

We start the proof from line 748 in the proof of Theorem 2.

$

\sum_{t=1}^T \nabla Q(\mathbf{u}_t^\star)^\top (\mathbf{u}_t - \mathbf{u}_t^\star) \leq 2 \beta_T(\delta) \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}.

$

Let $g_t(S; \mathbf{w}) := \sum_{i \in S} p_t(i \mid S, \mathbf{w})\, p_t(0 \mid S, \mathbf{w}) \| x_{ti} \|_{H_t^{-1}}$. Let $J_1 = \{ t \in [T] \mid \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)\, p_t(0 \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})\, p_t(0 \mid S_t, \mathbf{w}_{t+1}) \}$ and $J_2 = [T] \setminus J_1$. Thus, we get

$

\text{RHS} = 2 \beta_T(\delta) \sum_{t=1}^T g_t(S_t; \mathbf{w}^\star) = 2 \beta_T(\delta) \sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) + 2 \beta_T(\delta) \sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star).

$

To bound $\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star)$, by the mean value theorem, we get

$

\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) + \sum_{t \in J_1} \nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}),

$

where $\bar{\mathbf{w}}_t$ is a convex combination of $\mathbf{w}^\star$ and $\mathbf{w}_{t+1}$. The first term can be bounded by

$

\sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})} \sqrt{\sum_{t \in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti}\|_{H_t^{-1}}^2 } \\ &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) \leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) ,

$

where the second-to-last inequality holds by the definition of $J_1$ and the elliptical potential lemma in our paper, and the last inequality holds by Lemma 11 in Goyal and Perivier [24] with $\kappa_t^\star := \sum_{i \in S_t^\star} p_t(i \mid S_t^\star, \mathbf{w}^\star)\, p_t(0 \mid S_t^\star, \mathbf{w}^\star)$. Moreover, the second term can be bounded by

$

\left|\sum_{t \in J_1}\nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \right| &=\Bigg|\sum_{t \in J_1} p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) x_{ti}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}} - 2 p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) \| x_{ti} \|_{H_t^{-1}} \sum_{j \in S_t}p_t(j \mid S_t, \bar{\mathbf{w}}_t) x_{tj}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \Bigg| \\ &\leq 3 \beta_{T}(\delta) \sum_{t=1}^T \max_{i \in S_t} \| x_{ti} \|_{H_t^{-1}}^2 = \tilde{\mathcal{O}}(d^{3/2}/\kappa).

$

Hence, we get $2 \beta_T(\delta) \sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \tilde{\mathcal{O}}\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} + d^{2}/\kappa \right)$.

Now, we bound $\sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star)$.

$

\sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star)
&\leq \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}^2 }
\\
&\leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \sqrt{ \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}}^2 }
\\
&= \tilde{\mathcal{O}}\left( \sqrt{d} \cdot \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T}  \right),

$

where the second inequality holds by the definition of $J_2$ and Lemma 11 in Goyal and Perivier [24]. Hence, we get $2 \beta_T(\delta) \sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star) = \tilde{\mathcal{O}}\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \right)$. Therefore, by solving $\text{Reg}_T = \tilde{\mathcal{O}} \left( d \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} + d^2/\kappa \right)$, we can conclude that $\text{Reg}_T = \tilde{\mathcal{O}}\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star} + d^2/\kappa \right)$.
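
For completeness, the final "solving" step can be spelled out as follows (a standard manipulation added here for clarity; it is not written out above). Writing $R = \text{Reg}_T$, $a = \tilde{\mathcal{O}}(d)$, $b = \tilde{\mathcal{O}}(d^2/\kappa)$, and $c = \sum_{t=1}^T \kappa_t^\star$, the bound $R \leq a\sqrt{c + R} + b$ implies

$

R \leq a\sqrt{c} + a\sqrt{R} + b \leq a\sqrt{c} + \tfrac{1}{2}R + \tfrac{1}{2}a^2 + b
\quad\Longrightarrow\quad
R \leq 2a\sqrt{c} + a^2 + 2b = \tilde{\mathcal{O}}\!\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star} + d^2/\kappa \right),

$

where we used $\sqrt{c+R} \leq \sqrt{c} + \sqrt{R}$, the AM-GM inequality $a\sqrt{R} \leq \tfrac{1}{2}(a^2 + R)$, and the fact that $a^2 = \tilde{\mathcal{O}}(d^2)$ is dominated by $d^2/\kappa$ (since $\kappa \leq 1$).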

Review
Rating: 7

This paper studies the MNL bandit problem and proposes computationally efficient algorithms in both the uniform and non-uniform reward settings. The authors present lower bounds for these two settings and show their results are minimax optimal up to logarithmic factors.

Strengths

  1. The paper achieves the minimax optimal results for the MNL bandit problem in both uniform and non-uniform settings, contributing new findings to the community.
  2. This paper introduces computationally efficient algorithms that improve the computational complexity over prior works.
  3. Empirical evidence provided by the authors convincingly demonstrates the effectiveness of the proposed algorithms, strengthening the paper's claims with solid experimental validation.

Weaknesses

  1. My main concern is about the contribution and technical novelty of this paper. From the statistical perspective, the regret for the uniform setting does not advance beyond the results presented in Goyal and Perivier (2021), and the $\kappa$-independent bound for the non-uniform setting has been achieved in Agrawal et al. (2023) (a comparison with Agrawal et al. (2023) in the table would be beneficial for highlighting any improvement). The main contribution of this paper is the improvement in computational complexity. However, this enhancement comes from the online parameter estimation, which is almost the same as that in Zhang and Sugiyama (2023), with only minor modifications. As the MNL bandits considered in this paper and MLogB bandits are quite similar, this modification is not significant.
  2. The claim that "This complexity arises because the optimistic revenue...requiring a different analysis" in Remark 2 seems overstated. The issue of assortment optimization to maximize optimistic revenue has been previously addressed by Oh and Iyengar (2021). The approach in the paper does not deviate from their method, which makes the purported contribution questionable.
  3. The presentation of lower bounds for both settings to demonstrate result optimality is appreciated, yet the inclusion of a problem-dependent lower bound that illustrates the influence of the problem-specific parameter $\kappa$ would have added significant value, as seen in studies on logistic bandits.

Questions

  1. Can the proposed algorithms be directly applied to achieve the optimal results for binary logistic bandits? It would be beneficial for the paper to discuss this applicability.
  2. The empirical performance shown in Figure 1 suggests UCB outperforms TS in non-uniform settings, contrary to the fact that TS usually achieves better empirical performance than UCB in the literature. Moreover, though OFU-MNL+ outperforms the two other baselines, all algorithms appear to incur linear regret in specific scenarios (e.g., (N=100, K=10, d=5, Uniform R)). Could the authors provide some intuition for these phenomena?

Limitations


Author Response

We appreciate your time in reviewing our paper and your feedback. We would like to clarify some possible misunderstandings regarding our main contributions. Our main contributions are to establish the first minimax regret bounds for MNL bandits, along with providing several novel insights related to quantities such as $v_0$ and the reward structure. We strongly believe that our results significantly advance what has been known in the field of MNL bandits. We will address your feedback point by point below. We sincerely hope to clarify any misunderstandings.


Regarding our main contributions

Our main contribution is NOT the improvement in computational complexity (although it is still an advantage of our proposed method). While we employ the online parameter estimation adapted from Zhang and Sugiyama [47], it is important to note that our work introduces new analyses based on improved self-concordant-like properties and on varying assortment sizes.

We clearly emphasize that our main contributions are closing the gap between the upper and lower bounds in contextual MNL bandits for the first time and proposing a nearly minimax optimal algorithm. Additionally, we believe it is a significant and novel insight that the regret: 1) varies depending on the value of $v_0$, 2) is influenced by the reward structure (uniform versus non-uniform), and 3) can improve as the assortment size $K$ increases under uniform rewards. Our contributions can be summarized as follows:

  1. First minimax results in terms of $d, T, K$, and even $v_0$
  2. First proofs demonstrating how regret varies with $v_0$ and the reward structure
  3. First lower bound under non-uniform rewards
  4. First tractable $\kappa$-free upper bound under non-uniform rewards
  5. Adaptation of the online parameter estimation while maintaining the above regret bounds

Each of our results tackles distinct non-trivial challenges, which we have successfully resolved in our paper. For instance, we have enhanced the self-concordance-like properties and the elliptical potential method, and introduced novel "centralized features" to prove the upper bound for non-uniform rewards. Please refer to the discussions for each theorem for more details.

Furthermore, our regret bound (Theorem 2) is a clear improvement over those presented by Goyal and Perivier (2021) and Agrawal et al. (2023).

  • vs Goyal and Perivier (2021) [24]: We have improved upon the regret results of Goyal and Perivier (2021) by a factor of $K$: reducing it from $\tilde{\mathcal{O}}(d\sqrt{KT} + d^2K^4/\kappa)$ (letting $\kappa^\star = \mathcal{O}(1/K)$ in the leading term) to $\tilde{\mathcal{O}}(d\sqrt{T/K} + d^2/\kappa)$. This advancement was achieved by enhancing the self-concordant-like property of the loss function (Proposition C.1) and refining the elliptical potential lemma (Lemma E.2). Moreover, we significantly improved the regret associated with the transient term, reducing it from $d^2K^4/\kappa$ to $d^2/\kappa$, by utilizing an improved bound for the second derivative of the revenue (Lemma E.3). For further details, please refer to the discussion of Theorem 2 in our paper.

  • vs Agrawal et al. (2023) [5]: In fact, their paper contains significant technical errors (refer to Section A in the comments below), rendering a direct comparison to our work meaningless. This is the main reason we initially chose not to compare our results with theirs. Nevertheless, in response to requests from some reviewers, we will consider including a discussion about this in our camera-ready version. Even if all errors were corrected, our result (Theorem 2) still shows an improvement over theirs in terms of $K$. Agrawal et al. (2023) considered a uniform rewards setting (with $v_0=1$) and achieved a regret of $\tilde{\mathcal{O}}(d\sqrt{T} + d^2/\kappa)$. However, the corrected regret should be $\tilde{\mathcal{O}}(d\sqrt{KT} + d^2/\kappa)$.


Remark 2 is intended solely for comparing our setting with that of Zhang and Sugiyama [47].

The purpose of Remark 2 is to explain that our setting fundamentally differs from those described by Zhang and Sugiyama [47], necessitating a different analysis. Therefore, this distinction should NOT be viewed as a weakness. Again, our main contribution is proving the first minimax results in MNL bandits.


Problem-dependent ($\kappa$) lower bound

The main goal of this paper is to explicitly show how the values of $K$ and $v_0$ affect the regret (both upper and lower bounds). To clearly demonstrate the dependency of regret on these values, we have presented the results in terms of $K$ and $v_0$, rather than $\kappa$.

As per your request, we believe that a problem-dependent lower bound can be easily derived from our intermediate results. In the proof of Theorem 1, both the optimal assortment $S^\star$ and an alternative assortment $\tilde{S}_t$ (which contains only the feature vector that maximizes the utility in $S_t$) have the same feature vectors within each assortment. Given that the input to the MNL probability function is a single feature vector, our problem can be reduced to the logistic bandit problem. Subsequently, using analyses similar to those in the proof of Theorem 2 from Abeille et al. (2021) [4], we can derive a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$, where $\kappa^\star := \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star)\, p(0 \mid S^\star, \mathbf{w}^\star)$. See Section B in the comments below for a more detailed proof.

We are more than happy to include this lower bound. However, we do NOT believe that the absence of a problem-dependent lower bound should be considered a weakness, as our focus is to show the regret dependence on $K$ and $v_0$.

Due to space limitations, we will address your questions in the following comments.

Comment

B. Proof of the problem-dependent ($\kappa$) lower bound

In the proof of Theorem 1, both the optimal assortment $S^\star$ and an alternative assortment $\tilde{S}_t$ (which contains only the feature vector that maximizes the utility in $S_t$) have the same feature vectors within each assortment. Let $S^\star = \{x, \dots, x\}$ and $\tilde{S}_t = \{\hat{x}, \dots, \hat{x}\}$. Then, by Proposition D.1, we have

$\sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in S_t} p(i \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in \tilde{S}_t} p(i \mid \tilde{S}_t, \mathbf{w}^\star) =: f(x;\mathbf{w}^\star) - f(\hat{x};\mathbf{w}^\star).$

Here, $f(x;\mathbf{w}^\star) = \frac{K\exp(x^\top\mathbf{w}^\star)}{v_0 + K \exp(x^\top \mathbf{w}^\star)}$ maps from $\mathbb{R}^d$ to $\mathbb{R}$. Since the input to the function $f$ is a single vector $x$, $f$ can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$, where $\kappa^\star := \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star)\, p(0 \mid S^\star, \mathbf{w}^\star)$.

Comment

Thank you for your response. I appreciate that my main concerns were addressed, so I've improved my score. I'm also grateful to the author for highlighting the technical errors in the previous article. However, I still have a few technical questions regarding the paper. The analysis by Zhang & Sugiyama (2023) seems to rely heavily on the one-step distance of online mirror descent (Lemma 20 of their paper). However, I could not find the one-step distance analysis within the paper. Could you please provide more details on this?

Comment

Thank you very much for kindly increasing the score. We sincerely hope that we have addressed all of your concerns and that our work’s significance is now clearer. Your thoughtful feedback has been instrumental, and we believe that the final version will be strengthened as a result.

Comment

Questions

Q1. Achieving the optimal results for binary logistic bandits

Yes, this is explained in Lines 291-293. When $K=1$, $r_{ti}=1$, and $v_0=1$, our problem reduces to the logistic bandit problem. Under these conditions, we achieve a regret of $\tilde{\mathcal{O}}(d\sqrt{\kappa T})$ by Corollary 1 in Zhang and Sugiyama [49].

Q2-1. Figure 1 suggests UCB outperforms TS in non-uniform settings, contrary to the fact that TS usually achieves better empirical performance than UCB.

While TS often outperforms UCB algorithms in many bandit problems, it is NOT guaranteed to perform better. For example, even in linear bandits, the worst-case regret of TS (Agrawal and Goyal [6]) is known to be worse than that of UCB (Abbasi-Yadkori et al. [1]) as many experts would know. That is, there are instances where UCB can empirically outperform TS. Additionally, the empirical results from Oh and Iyengar [39] further demonstrate that UCB performs better than TS in MNL bandits.

Q2-2. All algorithms appear to incur linear regret in specific scenarios (e.g., (N=100, K=10, d=5, Uniform R)).

First, we would like to clarify that the regret upper bounds of the baseline algorithms (UCB-MNL and TS-MNL) are inversely proportional to $\kappa$. As we explained in Line 196 of our paper, since $1/\kappa = \mathcal{O}(K^2)$, the curvature of the baselines' regret curves decreases as the assortment size $K$ increases. This is why their curves appear to be linear.

However, it's important to note that while the regret curves of these baselines may appear linear, they are actually sub-linear, curving very slowly. This indicates that these algorithms learn at a significantly slower rate compared to our algorithm. The table below clearly demonstrates that the regret curves are indeed sub-linear. The average cumulative regret and the increase in regret every 500 rounds under $N=100$, $K=10$, $d=5$, uniform rewards:

| Round | UCB-MNL | Δ UCB | TS-MNL | Δ TS | Ours | Δ Ours |
|---|---|---|---|---|---|---|
| 1 | 0.044908 | | 0.046225 | | 0.033627 | |
| 501 | 16.33583 | 16.290922 | 26.57288 | 26.526655 | 7.65057 | 7.616943 |
| 1001 | 31.896535 | 15.560705 | 52.93464 | 26.36176 | 11.883341 | 4.232771 |
| 1501 | 47.225106 | 15.328571 | 79.144988 | 26.210348 | 14.85561 | 2.972269 |
| 2001 | 62.326848 | 15.101742 | 105.145026 | 26.000038 | 17.124088 | 2.268478 |
| 2501 | 77.195593 | 14.868745 | 130.847381 | 25.702355 | 19.058557 | 1.934469 |
| 3001 | 91.963978 | 14.768385 | 156.315643 | 25.468262 | 20.676582 | 1.618025 |

Other scenarios show similar trends.


A. Technical errors in Agrawal et al. (2023) [5]

1. Equation (16): $\alpha_i(\mathbf{X}_{S_t}, \theta_t, \theta^{\star}) x_{ti} := \mu_i(\mathbf{X}_{S_t}^\top \theta^\star) - \mu_i(\mathbf{X}_{S_t}^\top \theta_t)$, where $\mathbf{X}_{S_t}$ is a design matrix whose columns are the attribute vectors $x_{ti}$ of the items in the assortment $S_t$ and $\mu_i(\mathbf{X}_{S_t}^\top \theta) = P_t(i \mid S_t, \theta)$.

  • It appears that the authors may have intended to derive this equation using a first-order exact Taylor expansion. However, in MNL bandits, this equation generally does not hold. Consider a counterexample where $x_{ti}=0$, $x_{tj} \neq 0$ for $j \neq i$, and $\theta^\star \neq \theta_t$. In this case, the left-hand side (LHS) equals $0$ (since $x_{ti}=0$), but the right-hand side (RHS) is not $0$, because the denominators of $\mu_i(\mathbf{X}_{S_t}^\top \theta^\star)$ and $\mu_i(\mathbf{X}_{S_t}^\top \theta_t)$ differ. This equation only holds in special cases, such as when $K=1$, which corresponds to the logistic bandit case. Equation (16) serves as the foundation for the entire proof in their paper. Consequently, all subsequent results derived from it are also incorrect.

2. Cauchy-Schwarz inequality in the regret analysis (Page 46)

  • When using the Cauchy-Schwarz inequality on the regret before applying the elliptical potential lemma, they indeed incur an additional $\sqrt{K}$ factor: $\sum_{t=1}^T \min \left(\sum_{i \in S_t}\|x_{ti}\|_{\mathbf{J}_t^{-1}}, 1 \right) \leq \sqrt{KT} \sqrt{\sum_{t=1}^T \min \left(\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}^2, 1 \right)}$ (a short derivation of this step is sketched after this list).

  • Hence, their regret should actually be $\tilde{\mathcal{O}}(d\sqrt{KT} + d^2/\kappa)$, which is worse than the result in our Theorem 2 (upper bound under uniform rewards) by a factor of $K$.
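
For clarity, here is a short sketch (our own, under the notation above) of where the extra $\sqrt{K}$ comes from: applying Cauchy-Schwarz first over the (at most $K$) items in $S_t$ and then over the $T$ rounds,

$

\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}} \leq \sqrt{K}\sqrt{\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}^2}
\quad\Longrightarrow\quad
\sum_{t=1}^T \min\!\left(\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}, 1\right)
\leq \sqrt{K}\sum_{t=1}^T \sqrt{\min\!\left(\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}^2, 1\right)}
\leq \sqrt{KT}\sqrt{\sum_{t=1}^T \min\!\left(\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}^2, 1\right)},

$

after which the elliptical potential lemma bounds the remaining sum, yielding the $d\sqrt{KT}$ leading term stated above.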


Continued in the next comment.

Comment

Thank you once again for your positive reevaluation of our paper. We are glad that we were able to address your major concern!

To answer your question: We could have used Lemma 20 from Zhang & Sugiyama (2023). However, since the leading term of the regret upper bound remains unchanged, we determined that the lemma was not essential and decided not to include it.

First, let's revisit Lemma 20 from Zhang & Sugiyama (2023). Notably, their bound can be improved by a factor of $\sqrt{K}$. Specifically, because $\| \nabla \ell_t (\mathbf{w}_t) \|_2 \leq 2$ (rather than $\| \nabla \ell_t (\mathbf{w}_t) \|_2 \leq 2\sqrt{K}$ as they suggested), we can derive $\| \mathbf{w}_{t+1} - \mathbf{w}_t \|_{H_t} \leq \frac{\alpha}{\lambda}$ (instead of $\| \mathbf{w}_{t+1} - \mathbf{w}_t \|_{H_t} \leq \frac{\alpha \sqrt{K}}{\lambda}$), where $\alpha = \mathcal{O}(\log K)$.

With this revised version of Lemma 20, we could apply it to derive the same regret upper bound. In Line 746, by taking the absolute value of the regret, we can use a second-order Taylor expansion around $\mathbf{u}_t$ (instead of $\mathbf{u}_t^\star$) as follows:

$

\sum_{t=1}^T | Q(\mathbf{u}_t) - Q(\mathbf{u}_t^\star) | \leq \underbrace{\sum_{t=1}^T | \nabla Q(\mathbf{u}_t)^\top (\mathbf{u}_t^\star - \mathbf{u}_t) |}_{\text{(A)}} + \underbrace{\frac{1}{2} \sum_{t=1}^T| (\mathbf{u}_t^\star - \mathbf{u}_t)^\top \nabla^2 Q(\bar{\mathbf{u}}_t) (\mathbf{u}_t^\star - \mathbf{u}_t) |}_{\text{(B)}}

$

The term (A) can be bounded using a similar analysis to our proof of Theorem 2, except that we need to bound $\| \mathbf{w}_{t+1} - \mathbf{w}_t\|_{H_t}$ instead of $\| \mathbf{w}_{t+1} - \mathbf{w}^\star \|_{H_t}$ (refer to Lines 763 and 768). Since $\| \mathbf{w}_{t+1} - \mathbf{w}_t\|_{H_t} \leq \frac{\alpha}{\lambda} = \mathcal{O}(\frac{1}{d})$ and $\| \mathbf{w}_{t+1} - \mathbf{w}^\star \|_{H_t} \leq \beta_t(\delta) = \mathcal{O}(\sqrt{d}\log t \log K)$, applying Lemma 20 of Zhang & Sugiyama (2023) results in a tighter bound. However, we only apply Lemma 20 to non-leading terms, namely the second and third terms of Eq. (E.4) (Line 757). Specifically, by using Lemma 20, we can bound these terms by $\tilde{\mathcal{O}}(\frac{1}{\kappa})$, which is dominated by the bound for term (B) at $\tilde{\mathcal{O}}(\frac{d^2}{\kappa})$. Thus, the leading term of the regret upper bound remains unchanged.

We hope our response addresses your question. If you have any additional questions or comments, we would be more than happy to address them.

Comment

Thank you for your reply. I have no further questions. I will further raise my score to 7.

Comment

We sincerely appreciate your decision to further increase the score. It has been a pleasure discussing our work with you. Thank you very much!

Review
Rating: 9

The paper closes the gap between upper and lower bounds for contextual MNL bandits across various scenarios, including varying $v_0$ and uniform/non-uniform rewards. Furthermore, the paper proposes a computationally efficient algorithm that achieves this nearly optimal upper bound, which is validated numerically as well.

Strengths

  • Clearly, very well-written
  • Clear-cut contributions that close all the gaps remaining in the contextual MNL bandits
  • Numerical verification
  • Various technical contributions, mainly for the analysis but also for the algorithm design (Eqn. (6))

Weaknesses

  • Nothing noticeable; see questions

Questions

Questions

  • In Remark 2, the authors mention that the optimistic bonus incurs exponential computational cost. But, disregarding the computational complexity, is there any advantage of doing so from the statistical/theoretical perspective? Somewhat related: although the authors have avoided the MLE-based approach for computational reasons, could using MLE with the intuitions provided here lead to better regret guarantees in other dependencies?
  • In the regret upper bounds, there is a transient term $\frac{d^2}{\kappa}$. Is this an artifact of the proof? Or is there a hidden warmup phase done implicitly by OFU-MNL+? Then, if one is okay with incurring such transient terms anyhow, is there potential for doing even better through forced exploration-type approaches (e.g., [1,2] for logistic bandits)?
  • Although the $1$-boundedness of $w^\star$ is standard in MNL bandits, if we relax this to $B$-boundedness for some $B > 0$, does this imply that $\kappa$ may scale exponentially in $B$, as in logistic bandits [3]?
  • Can the lower bound in Theorem 1 be written as a local minimax lower bound (e.g., Theorem 2 of [3]) centered around a specific instance? This seems like a better presentation than the current representation, which seems to be a generic minimax, where the max is over all possible problem instances. Also, the lower-bound construction, although inspired by prior works on MNL bandits, resembles a bit of [3] as well. I encourage the authors to check Appendix D of [3], especially the Averaging Hammer technique, and if applicable, include some discussions on this as well. Such instance-specific local minimax lower bound may make the result stronger.

Suggestions

  • In the main text, for all the theorem statements, please include (hyper)ref for where the full proof is in the Appendix.
  • Although a bit complex, for practitioners, it might be beneficial to include the precise constants for $\beta_t(\delta)$ in Lemma 1 (or point to the Appendix with the constants). A follow-up question: for the experiments, do the authors use the precise parameters as specified theoretically?
  • Might be beneficial to move the problem setting to earlier part of the paper, but not too sure here.

[1] https://arxiv.org/abs/2404.06831

[2] https://arxiv.org/abs/2202.02407

[3] https://proceedings.mlr.press/v130/abeille21a.html

Limitations

Yes

Author Response

Thank you very much for your positive evaluation of our paper and for your insightful and valuable feedback. Each question you raised is an intriguing research topic, shedding light on new areas for future study.


Advantage of optimistic bonus

First, we want to clarify that it is the simple adaptation of Zhang and Sugiyama [49] to our setting that incurs the "exponential computational cost" -- this is NOT an inherent drawback of our method. To the best of our knowledge, adapting an optimistic bonus, as done in Zhang and Sugiyama [49], to our method does not provide any statistical advantage, despite the exponential computational cost.


Could using MLE with the intuitions provided here lead to better regret guarantees in other dependencies?

To the best of our knowledge, given that our regret upper bound is minimax optimal (up to logarithmic factors), we believe that the use of MLE may not improve our regret bound polynomially. However, it might be possible to affect the bound logarithmically, which could be of future interest.


Regarding $d^2/\kappa$ and the potential for forced exploration-type approaches

The term $d^2/\kappa$ is an "artifact" of the proof, resulting from the second-order Taylor expansion of the regret. We do not have any warm-up phase or forced exploration in the algorithm. Hence, we are not sure whether forced exploration-type approaches can be beneficial. Based on your feedback, we considered adding an initial/intermediate forced-sampling step, but we do not believe the regret can be improved (under our current analysis). This would be an interesting area for future work.


If we relax the bound of $\|\mathbf{w}^\star\|_2$ to $B$-boundedness for some $B>0$, does this imply that $\kappa$ may scale exponentially in $B$, like logistic bandits [3]?

Yes, by definition, $\kappa$ is affected by $B$. Specifically, $1/\kappa$ (which appears in the lower-order term of our upper bounds) scales with $\exp(B)$.
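
For intuition, here is a crude illustrative bound (our own back-of-the-envelope calculation, not a statement from the paper), assuming $\|x_{ti}\|_2 \leq 1$ and $\|\mathbf{w}\|_2 \leq B$:

$

\exp(x^\top \mathbf{w}) \in [e^{-B}, e^{B}]
\;\;\Longrightarrow\;\;
p(i \mid S, \mathbf{w})\, p(0 \mid S, \mathbf{w}) \;\geq\; \frac{e^{-B}}{v_0 + K e^{B}} \cdot \frac{v_0}{v_0 + K e^{B}},

$

so $1/\kappa$ can grow exponentially in $B$ (up to factors depending on $K$ and $v_0$).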


Instance-specific local minimax lower bound

As you suggested, we have checked the Averaging Hammer technique and found that an instance-dependent lower bound is indeed achievable. Thank you very much for your valuable feedback. Your insights have greatly strengthened our paper!

We can achieve a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$ by applying Theorem 2 from Abeille et al. (2021), where $\kappa^\star := \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star)\, p(0 \mid S^\star, \mathbf{w}^\star)$.

In the proof of Theorem 1, both the optimal assortment $S^\star$ and an alternative assortment $\tilde{S}_t$ (which contains only the feature vector that maximizes the utility in $S_t$) have the same feature vectors within each assortment. Let $S^\star = \{x, \dots, x\}$ and $\tilde{S}_t = \{\hat{x}, \dots, \hat{x}\}$. Then, by Proposition D.1, we have

$\sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in S_t} p(i \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in \tilde{S}_t} p(i \mid \tilde{S}_t, \mathbf{w}^\star) =: f(x;\mathbf{w}^\star) - f(\hat{x};\mathbf{w}^\star).$

Here, $f(x;\mathbf{w}^\star) = \frac{K\exp(x^\top\mathbf{w}^\star)}{v_0 + K \exp(x^\top \mathbf{w}^\star)}$ maps from $\mathbb{R}^d$ to $\mathbb{R}$. Since the input to the function $f$ is a single vector $x$, $f$ can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of $\Omega(d\sqrt{\kappa^\star T})$. We will include this result in our final paper.


Suggestions

We appreciate all your suggestions and will incorporate them into the paper as you proposed.

To answer your additional question about whether we used precise parameters in the experiments: yes, we used the exact values.

Comment

Thank you for your detailed responses. I'm quite happy that the lower bound holds in a local minimax sense, which I believe is a much stronger notion for claiming instance-wise optimality of the algorithms. All my questions and concerns have been addressed, and after reading through other reviews, other significant aspects of this paper (not requiring knowledge of $\kappa$ for the algorithm, instance-specific upper/lower bounds, additional experiments) have come to my attention.

Thus, I'm further raising the score and willing to advocate strongly for the paper. Congratulations on the nice work!

Some additional questions/suggestions:

  • Exactly where in the proof does the transient term $d^2/\kappa$ pop up? Could one hope for an arm-set geometry-dependent transient term, as in the "detrimental arms" of Abeille et al. (2021)?
  • I don't think this was mentioned properly in the paper. Is the obtained confidence sequence (CS) the first $\kappa$-free CS in contextual MNL bandits? If time allows, for the camera-ready version (not the rebuttal), it would be interesting to see the resulting CS at the final stage and compare it to existing CSs.
  • For the camera-ready version, I would like to see explicit dependencies on $B$ as well, with the more general assumption. Also, under the $B$-boundedness, would all the constants for the algorithm ($\lambda, \eta, \beta_t(\delta)$) as well as the regret also depend on $B$?

Comment

We truly appreciate your strong support for our paper! The discussions with you during this rebuttal period have been both highly valuable and genuinely enjoyable. Each of your comments has been very constructive and has helped us improve our work. Below are our responses to your additional questions.


Q1. Regarding the transient term $d^2/\kappa$

The transient term $d^2/\kappa$ pops up when bounding the second-order term of the regret (term (B) in Line 746), analogous to what is observed in logistic bandits as discussed by Abeille et al. (2021) [3] and Faury et al. (2020) [18]. In their work, the second-order regret term also scales with $1/\kappa$. Global information (measured through $1/\kappa$) is pushed into a second-order term that vanishes quickly.

To address the question of whether an arm-set geometry-dependent transient term is possible, we believe that directly applying the proof from logistic bandits is not feasible. Extending it to our setting presents several challenges:

  1. Setting: We would need to shift from our current setting (adversarial contexts) to a fixed-arm setting.

  2. Multiple item selection: In Abeille et al. (2021), the tighter bound for the second-order term is largely achieved by focusing on whether $x_{\star}(\theta_\star)^\top \theta_{\star} \geq 0$ holds or not. However, in our setting, the optimal assortment includes multiple items, which makes it much more complex to divide the cases in a similar way.

  3. Lemma 9 is not applicable: To bound the second-order term, Abeille et al. (2021) use Lemma 9, which is only valid when the input to the function $f(\cdot)$ is a scalar. Since our function $Q(\cdot)$ (Line 742) takes a $K$-dimensional vector as input, this lemma does not apply to our case.

There are likely additional complexities that make it challenging to derive a bound for the second-order term that depends on the arm structure. Nonetheless, we believe this is a highly interesting direction for future research. Thank you for the intriguing question!


Q2. Regarding the confidence sequence

Goyal and Perivier (2022) [24] also achieved a $\kappa$-free CS (see Proposition 2 in their paper). However, they constructed their CS based on the MLE, which incurs a computational cost of $\mathcal{O}(t)$ per round to estimate the parameter. More importantly, their CS is non-convex, making it intractable to compute an optimistic parameter $\tilde{\theta}_{ti} = \arg\max_{\theta \in \mathcal{C}_t(\delta)} x_{ti}^\top \theta$.

We plan to compare our resulting confidence set (Lemma 1) with other works in the camera-ready version. Thank you for the constructive suggestion!


Q3. Explicit dependencies on $B$

We agree that assuming $B$-boundedness of the parameter norm would enhance clarity. We will update the assumption and the corresponding constants in our camera-ready version. With the $B$-boundedness assumption, the constants for the algorithm are as follows:

  • $\eta$: $\frac{1}{2} \log(K+1) + B + 1$
  • $\lambda$: $84\sqrt{2}\, d \eta$
  • $\beta_t(\delta)$:

$

&\sqrt{(\log(K+1)+B+1) \left( ( 3\log(1+(K+1)t) + B + 2 ) \left( \frac{17}{16}\lambda + 2\sqrt{\lambda} \log \left( \frac{2 \sqrt{1+2t}}{\delta} \right) + 16 \left( \log\left( \frac{2\sqrt{1+2t}}{\delta} \right) \right)^2 \right) + 2 + \frac{7\sqrt{6}}{6} \eta d \log \left( 1 + \frac{t+1}{2\lambda}\right) \right) + 2 \ B^2} \\ &= \mathcal{O}(B \sqrt{d} \log K \log t )

$
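
For practitioners, the constants above can be transcribed directly into code. Below is an illustrative Python sketch (ours, not the authors' implementation) that simply evaluates the expressions listed above; the example arguments are arbitrary.

```python
import numpy as np

def ofu_mnl_plus_constants(t, d, K, B, delta):
    """Evaluate the constants eta, lambda, beta_t(delta) stated above
    under the B-boundedness assumption ||w*||_2 <= B."""
    eta = 0.5 * np.log(K + 1) + B + 1
    lam = 84 * np.sqrt(2) * d * eta
    log_term = np.log(2 * np.sqrt(1 + 2 * t) / delta)
    inner = (3 * np.log(1 + (K + 1) * t) + B + 2) * (
        (17 / 16) * lam + 2 * np.sqrt(lam) * log_term + 16 * log_term**2
    ) + 2 + (7 * np.sqrt(6) / 6) * eta * d * np.log(1 + (t + 1) / (2 * lam))
    beta = np.sqrt((np.log(K + 1) + B + 1) * inner + 2 * B**2)
    return eta, lam, beta

# Example: rough magnitudes for a moderate instance (illustrative values only).
print(ofu_mnl_plus_constants(t=10_000, d=5, K=10, B=1.0, delta=0.05))
```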

Review
Rating: 7

The paper studies the contextual multinomial logit bandit problem. In this problem, a learner interacts with the environment over a sequence of rounds. At each round, the learner receives features for $N$ arms and decides to select up to $K$ of them (called an assortment) and play them. The environment might select one of these arms with a probability determined by a multinomial distribution, and the learner receives a reward for the selected arm. The goal is to select the optimal assortment at each round so as to minimize the cumulative regret of the algorithm.

The authors design algorithms for the cases where the rewards are uniform (the same for each arm) and non-uniform. They also provide matching lower bounds (for large enough $T$). Finally, numerical experiments are performed to demonstrate the practical performance of the algorithm.

Strengths

The authors do a great job of explaining the problem statement and prior work and position their work appropriately. I believe this is a very important direction since bandit algorithms have recently started getting popular in practical settings where the assumption of linear rewards no longer holds. Getting optimal regret guarantees for such non-linear models has gained good attention recently and the paper contributes to this in a clear non-trivial way.

It's great to see that the authors prove matching lower bounds for the uniform case (for general $v_0$) and for the non-uniform case (for bounded $v_0$). This adds further credibility to their algorithm and the upper bound.

Weaknesses

I am not an expert on the MNL problem, but I was wondering if the algorithm offers anything new for the logistic bandit setting. Recently, a lot of work has gone into obtaining computationally efficient algorithms with optimal regret (independent of a problem non-linearity parameter $\kappa$). It would be good to see if anything new can be said about that setting using the results in this paper. Also, empirical comparisons on both regret and time complexity (for the logistic bandit, i.e., K=1, as well as larger K) could be a good addition to the paper.

Minor comment - Typo on line 183. The correct definition should be $S_t^\star = \arg\max_{S\in \mathcal{S}} R_t(S, w^\star)$.

My expertise in this topic is limited and I will wait to see what the other reviewers think as well.

Questions

Is the problem-dependent constant $\kappa$ in line 189 the same as the non-linearity parameter $\kappa$ defined in prior work for logistic bandits (see [4,20,21])? It seems to be defined slightly differently here, so I was curious whether it means the same thing, at least in the logistic case.

Limitations

N/A

Author Response

Thank you for acknowledging the value of our work and providing a positive evaluation. We will address your questions below and make improvements to our paper based on your suggestions.


Anything new to the logistic bandit

There are interesting points that can be said about the logistic bandit setting:

1. The MNL bandit is better than the logistic bandit (under uniform rewards).
According to Theorem 2 in our paper, a large $K$ can be beneficial under uniform rewards. This suggests that in recommendation systems, it is advantageous to recommend multiple items (MNL bandit) rather than just a single item (logistic bandit). This aligns with the intuition that offering multiple items allows us to gather more information from user feedback. However, in real-world scenarios, offering a large assortment might lead to user fatigue. It would be interesting for future research to consider this user fatigue and determine an appropriate assortment size $K$.

2. As the value of the utility of the outside option $v_0$ increases, both the upper and lower bounds of regret for the logistic bandit decrease.
When $K=1$, our setting simplifies to the logistic bandit scenario. In this case, the upper and lower bounds are $\tilde{\mathcal{O}}(d\sqrt{T/v_0})$ and $\Omega(d\sqrt{T/v_0})$, respectively. This highlights an interesting aspect of how the value of $v_0$ influences the regret in the logistic bandit.


Additional empirical comparison

Following your suggestion, we have conducted additional experiments for the logistic bandit case ($K=1$) and cases with larger $K$ ($K=20, 30$). See the "global" response above. These additional results also corroborate our theoretical claims. In particular, under uniform rewards, we observed that regret decreases as $K$ increases, aligning with Theorems 1 and 2. In contrast, under non-uniform rewards, regret remains unchanged regardless of changes in $K$, consistent with Theorems 3 and 4.

We are pleased to provide these additional results that further substantiate the claims in our paper. Thank you very much!


Typo

Thank you very much for identifying the typo. We will make the necessary correction and ensure it is reflected in the final version of the paper. Your attention to detail is greatly appreciated.

Review
Rating: 7

The authors consider the problem of sequentially picking assortments of items in an online manner over a horizon of $T$ rounds, where a user then selects one of the items, or none, according to a multinomial logit model with a fixed unknown preference vector. The items are represented as (known) feature vectors which can change adversarially from round to round. The authors prove tighter cumulative regret lower bounds for both the uniform and non-uniform cases and propose an algorithm with matching upper bounds. The authors account for the probability of a user not choosing from the given options in the bounds. The authors also run experiments to compare their algorithm to baselines.

Strengths

Major

  • Improved lower bounds for contextual MNL bandits in the uniform-reward setting, accounting for the impact of the probability of the null action.
  • Improved lower bounds also in the non-uniform reward setting.
  • A tractable algorithm is proposed with matching regret upper bounds for both the uniform and non-uniform reward settings.

Minor

  • The authors include experiments for non-uniform rewards, showing significant improvements in both run-time and cumulative regret for included baselines.
  • Good discussions and remarks on the effects of the problem setup, such as $v_0$ and whether rewards are uniform or not, on the regret bounds, and on relations with prior works (among those discussed at some length).
  • Overall I found the writing to be good. Table 1 is helpful.

Weaknesses

Minor

  1. Zhang and Luo "Contextual Multinomial Logit Bandits with General Value Functions" [48] is only mentioned once, in line 128, in a sentence about past works claiming $K$-independent regret lower bounds while not accounting for $v_0$ (with the next sentence pointing out the issue that several lower bounds used $v_0 = K$). However, in [48] they propose algorithms and show regret bounds for a harder problem (adversarial contexts and adversarial rewards, and for more general settings than MNL models) that avoid $\kappa$ dependence in the leading term, though for the simpler problem studied in the current paper they may have loose dependence on the assortment size $K$. In [48] it looks to me like they consider $v_0=1$, not $v_0=K$ as implied by the current paper. Thus, while [48] is cited, to me it seems to be improperly cited; the results should be more thoroughly discussed and, unless there is some justification, included in Table 1.
  2. State-of-the-art upper bounds for non-contextual MNL are not discussed. Non-contextual MNL lower bounds are included in Table 1 and discussed, but unless I overlooked it, I did not see anything on upper bounds, even from papers like Agrawal et al. (2019) "MNL-bandit..." [8] that were discussed for lower bounds. (Or do the contextual upper bounds with $d=N$ recover them in the current state of the art?)
  3. Agrawal et al. "A tractable..." 2023 [5] is cited in a few places regarding the problem setting (like in line 186 for Assumptions 1 and 2, line 282 for $v_0$, ...), but is not included in Table 1 or discussed in detail. They achieve an improved regret upper bound dependence on $\kappa$ over prior works using a reportedly tractable algorithm. Agrawal et al. [5] consider a (contextual) combinatorial assortment problem using an MNL model with adversarial contexts, uniform rewards, and $v_0 = 1$.
  4. Choi et al. "Cascading Contextual Assortment Bandits" in NeurIPS 2023 is not cited. How would their results compare? Their regret upper bound (specialized to the non-cascading setting you consider) looks competitive for the $v_0=1$ and uniform reward setting. Another cascading assortment paper by Cao and Sun "Tiered Assortment: Optimization and Online Learning" in Management Science is also not cited, though they acknowledge in their Remark 2 that in the special case of a single tier, their result matches Agrawal et al. (2019) "MNL-bandit..." [8].
  5. The authors consider the parameter vector $\mathbf{w}$ bounded by $1$ instead of bounded by a known constant $S$ as in previous works (such as Agrawal et al. 2023 [5] and Lee et al. 2024 [32]). As pointed out by Lee et al. 2024 [32], "These quantities [$\kappa$] can scale exponentially in S in the worst-case (Faury et al., 2020)." How would the results be impacted by accounting for such a known bound instead of setting it to 1?
  6. A clearer discussion of the computational complexity of the methods listed as "intractable" (in particular Goyal and Perivier 2022 [24]) would be better, both worst-case and (if not included in experiments) a conjecture about where they would have landed in Fig. I.1 (e.g., even for those instances, would they have been so slow as to be off the charts?).

Very Minor

  • [24] cites an arxiv version of a NeurIPS 2022 paper

Questions

(some questions are included above in comments made in “Weaknesses”)

Limitations

Yes

Author Response

Thank you very much for your insightful and valuable feedback. As we start our discussion, we would first like to ask for your patience with our lengthy responses. We have much to share, as we have genuinely appreciated your feedback and look forward to our discussion with you! We will ensure that all of the following discussions are included in our final version.


Regarding Zhang and Luo (2024)

First, we would like to clarify that our setting also considers adversarial contexts and adversarial rewards. In our work, the term "non-uniform rewards" is equivalent to adversarial rewards. Thus, apart from the fact that Zhang and Luo (2024) [48] use a "general value function", the other problem settings in [48] are not necessarily harder than ours.

Even though their tightest regret bound (Corollary 17), $\tilde{\mathcal{O}}(K^{2.5}\sqrt{dNT})$, avoids dependence on $\kappa$ based on the general value function, their regret still scales with $K$ and the number of items $N$. Moreover, as they mentioned in their paper, there is no efficient way to implement the algorithm.

To clarify, in Line 128, our point in citing [48] was to emphasize that previous studies' oversight of $v_0$'s impact on regret lower bounds has caused substantial confusion in the field. They stated that $\Omega(\sqrt{NT})$ in the non-contextual setting is optimal; however, this claim only holds true when $v_0=\Theta(K)$.


Upper bounds for non-contextual MNL bandits

We plan to discuss the upper bounds in the non-contextual setting to clearly illustrate how previous studies have derived regret bounds under varying conditions, such as different $v_0$ values and reward structures. For instance, Agrawal et al. (2019) achieved an $\tilde{\mathcal{O}}(\sqrt{NT})$ upper bound with "$v_0=1$ and non-uniform rewards". However, the known tightest lower bound, $\Omega(\sqrt{NT})$, was proven under the setting of "$v_0=K$ and uniform rewards".


Comparison to Agrawal et al. (2023)

In fact, their paper contains significant technical errors (see Section A in the comments below), rendering a direct comparison to our work meaningless. This is the main reason we initially chose not to compare our results with theirs.

Even if all errors were corrected, our result (Theorem 2) still shows an improvement over theirs in terms of $K$. Agrawal et al. (2023) considered a uniform rewards setting (with $v_0=1$) and achieved a regret of $\tilde{\mathcal{O}}(d\sqrt{T} + d^2/\kappa)$. However, the corrected regret should be $\tilde{\mathcal{O}}(d\sqrt{KT} + d^2/\kappa)$.


Comparison to cascading assortment problems

Recently, both of these works have considered the cascading assortment bandit problem, which encompasses the MNL bandit problem as the special case where the cascading length is $1$. However, these works do not subsume our results, for the following reasons.

  • vs Choi et al. (2023): They consider uniform rewards and achieve a regret upper bound of $\tilde{\mathcal{O}}(d\sqrt{T})$, which avoids dependence on both the cascading length and $\kappa$ in the leading term. When the cascading length is $1$, our result (Theorem 2) improves upon theirs by a factor of $\sqrt{K}$. Moreover, their computation cost per round is $\mathcal{O}(t)$ since they employ the MLE to estimate the parameter.
  • vs Cao and Sun (2023): They consider non-uniform rewards, and achieve a regret upper bound of O~(h2dMT)\tilde{\mathcal{O}}(h^2d\sqrt{MT}), where MM is the cascading length and hp(iS,w)p(iS,w)h \geq \frac{p(i \mid S, \mathbf{w})}{p(i \mid S, \mathbf{w'})} for all w,wW\mathbf{w}, \mathbf{w}' \in \mathcal{W}, SSS\in \mathcal{S}, and iS{0}i \in S \cup \{0\}. However, their regret bound still suffers from a harmful dependence on h2h^2, which can be exponentially large.
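As a minimal illustration of how large $h$ can be, consider a single-item assortment with $v_0 = 1$, $\|x_i\|_2 = 1$, and the hypothetical parameters $\mathbf{w} = B x_i$, $\mathbf{w}' = -B x_i$ (assuming the parameter class $\mathcal{W}$ contains both):

$

h \;\geq\; \frac{p(i \mid \{i\}, B x_i)}{p(i \mid \{i\}, -B x_i)} = \frac{e^{B}/(1+e^{B})}{e^{-B}/(1+e^{-B})} = e^{B},

$

so $h$ grows exponentially in the norm bound $B$ on the parameter class.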

What happens if w2B\\|\mathbf{w}\\|_2 \leq B? (+ problem-dependent bounds)

When w2B\\|\mathbf{w}\\|_2 \leq B (we use BB instead of SS to avoid notational overlap), the confidence bound βt(δ)\beta_t(\delta) is O~(Bd)\tilde{\mathcal{O}}(B \sqrt{d}). Consequently, the regret upper bounds are as follows:

  • Uniform rewards: $\tilde{\mathcal{O}}(B e^{B} \frac{\sqrt{v_0 K}}{v_0 + K} d \sqrt{T})$. It is important to note that one of our main goals is to demonstrate explicitly how the regret depends on $K$ and $v_0$. In deriving such a result, the dependence on $e^B$ is, to the best of our knowledge, unavoidable. Note that the instance-dependent regret bound does not depend on $e^B$ (see below).
  • Non-uniform rewards: O~(BdT)\tilde{\mathcal{O}}(B d \sqrt{T}).

The regret lower bounds remain the same, because we can always construct an instance with $\| \mathbf{w}^\star\|_2 \leq 1$ even if $B \gg 1$. Note that a small norm of $\mathbf{w}^\star$ typically indicates a more challenging instance, as it becomes harder to identify the optimal arm. Overall, our upper and lower bounds in terms of $K$, $d$, and $T$ (and $v_0$) remain tight, maintaining the (near) minimax optimality in those problem parameters, which is our main focus.
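For intuition on the $\frac{\sqrt{v_0 K}}{v_0 + K}$ factor in the uniform-reward bound above (treating $B$ as a constant), note how it interpolates between the two regimes discussed in the paper:

$

\frac{\sqrt{v_0 K}}{v_0+K} = \Theta\!\left(\tfrac{1}{\sqrt{K}}\right) \text{ if } v_0 = \Theta(1), \qquad \frac{\sqrt{v_0 K}}{v_0+K} = \Theta(1) \text{ if } v_0 = \Theta(K),

$

so the uniform-reward upper bound reads $\tilde{\mathcal{O}}(d\sqrt{T/K})$ when $v_0 = 1$ and $\tilde{\mathcal{O}}(d\sqrt{T})$ when $v_0 = \Theta(K)$.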

Moreover (as per other reviewers' requests), it turns out that instance-dependent upper and lower bounds under uniform rewards (without dependence on $e^B$) are achievable with only minor modifications to previous results. See Section B in the comments below for the proofs.


Computational complexity of Goyal and Perivier (2022)

The computational bottleneck of Goyal and Perivier (2022) [24] is to find an optimistic parameter θ~_t=argmax_θC(δ)x_tiθ\tilde{\theta}\_t = \arg\max\_{\theta \in \mathcal{C}(\delta)} x\_{ti}^\top \theta. However, since the confidence set C(δ)\mathcal{C}(\delta) is non-convex, computing θ~t\tilde{\theta}_t is extremely expensive, and the worst-case cost is not well-defined. A possible solution could be a convex relaxation, as suggested by Proposition 3 in Abeille et al. (2021) [4], though exploring this is beyond the scope of our work.
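For contrast, note a standard fact (not specific to [24]): if the confidence set were an ellipsoid of the form $\mathcal{C}(\delta) = \{\theta : \| \theta - \hat{\theta}_t \|_{H_t} \leq \beta_t(\delta)\}$, the optimistic utility would admit the closed form

$

\max_{\theta \in \mathcal{C}(\delta)} x_{ti}^\top \theta = x_{ti}^\top \hat{\theta}_t + \beta_t(\delta) \| x_{ti} \|_{H_t^{-1}},

$

computable in $\mathcal{O}(d^2)$ time per item once $H_t^{-1}$ is maintained. It is precisely the non-convexity of the likelihood-based confidence set in [24] that rules out such a shortcut.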

Comment

$
\begin{aligned}
\left|\sum_{t \in J_1}\nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \right| &= \Bigg|\sum_{t \in J_1} p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) x_{ti}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}} \\
&\qquad - 2 p_t(0 \mid S_t, \bar{\mathbf{w}}_t) \sum_{i \in S_t}p_t(i \mid S_t, \bar{\mathbf{w}}_t) \| x_{ti} \|_{H_t^{-1}} \sum_{j \in S_t}p_t(j \mid S_t, \bar{\mathbf{w}}_t) x_{tj}^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}) \Bigg| \\
&\leq 3 \beta_{T}(\delta) \sum_{t=1}^T \max_{i \in S_t} \| x_{ti} \|_{H_t^{-1}}^2 = \tilde{\mathcal{O}}(d^{3/2}/\kappa).
\end{aligned}
$

Hence, we get 2βT(δ)_tJ1gt(St;w)=O~(d_t=1Tκ_t+RegT+d2/κ)2 \beta_T(\delta) \sum\_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \tilde{\mathcal{O}}\left( d \sqrt{\sum\_{t=1}^T \kappa\_t^\star + \text{Reg}_T} + d^{2}/\kappa \right).

Now, we bound _tJ2gt(St;w)\sum\_{t \in J_2} g_t(S_t; \mathbf{w}^\star).

$
\begin{aligned}
\sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star)
&\leq \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \sqrt{\sum_{t \in J_2} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&\leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \sqrt{ \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti} \|_{H_t^{-1}}^2 } \\
&= \tilde{\mathcal{O}}\left( \sqrt{d} \cdot \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T}  \right),
\end{aligned}
$

where the second inequality holds by the definition of J2J_2 and Lemma 11 in Goyal and Perivier [24]. Hence, we get 2βT(δ)_tJ2gt(St;w)=O~(d_t=1Tκt+RegT)2 \beta_T(\delta) \sum\_{t \in J_2} g_t(S_t; \mathbf{w}^\star) = \tilde{\mathcal{O}}\left( d \sqrt{\sum\_{t=1}^T \kappa_t^\star + \text{Reg}_T} \right). Therefore, by solving RegT=O~(d_t=1Tκt+RegT+d2/κ)\text{Reg}_T = \tilde{\mathcal{O}} \left( d \sqrt{\sum\_{t=1}^T \kappa_t^\star + \text{Reg}_T } + d^2/\kappa \right), we can conclude that RegT=O~(d_t=1Tκt+d2/κ)\text{Reg}_T = \tilde{\mathcal{O}}\left( d \sqrt{\sum\_{t=1}^T \kappa_t^\star} + d^2/\kappa \right).
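For completeness, the final step is the standard self-bounding argument: with $a = \tilde{\mathcal{O}}(d)$, $b = \sum_{t=1}^T \kappa_t^\star$, and $c = \tilde{\mathcal{O}}(d^2/\kappa)$,

$

R \leq a\sqrt{b + R} + c \;\Longrightarrow\; R \leq a\sqrt{b} + a\sqrt{R} + c \;\Longrightarrow\; R = \mathcal{O}\!\left(a\sqrt{b} + a^2 + c\right),

$

and the resulting $a^2 = \tilde{\mathcal{O}}(d^2)$ term is absorbed into $d^2/\kappa$ since $\kappa \leq 1$.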


2. Lower bound
We can also derive a regret lower bound of Ω(dκT)\Omega(d\sqrt{\kappa^\star T}) by applying Theorem 2 from Abeille et al. (2021).

Proof of the lower bound: In the proof of Theorem 1, both the optimal assortment $S^\star$ and an alternative assortment $\tilde{S}_t$ (consisting only of copies of the feature vector that maximizes the utility within $S_t$) contain identical feature vectors within each assortment. Let $S^\star = \{x,\dots,x \}$ and $\tilde{S}_t = \{\hat{x}, \dots, \hat{x}\}$. Then, by Proposition D.1, we have

$

\sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in S_t} p (i \mid S_t, \mathbf{w}^\star) \geq \sum_{i \in S^\star} p(i \mid S^\star, \mathbf{w}^\star) - \sum_{i \in \tilde{S}_t} p (i \mid \tilde{S}_t, \mathbf{w}^\star) =: f(x;\mathbf{w}^\star) - f(\hat{x};\mathbf{w}^\star).

$

Here, f(x;w)=Kexp(xw)v0+Kexp(xw)f(x;\mathbf{w}^\star) = \frac{ K\exp(x^\top\mathbf{w}^\star ) }{v_0 + K \exp(x^\top \mathbf{w}^\star)} maps from Rd\mathbb{R}^d to R\mathbb{R}. Since the input to the function ff is a single vector xx, ff can be regarded as a logistic function. This characteristic allows us to follow the proof of Theorem 2 from Abeille et al. (2021) [4], resulting in a regret lower bound of Ω(dκT)\Omega(d\sqrt{\kappa^\star T}).
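Concretely, writing $\sigma(z) = \frac{e^z}{1+e^z}$ for the sigmoid, $f$ is a shifted logistic function:

$

f(x; \mathbf{w}^\star) = \frac{K \exp(x^\top \mathbf{w}^\star)}{v_0 + K \exp(x^\top \mathbf{w}^\star)} = \sigma\!\left( x^\top \mathbf{w}^\star + \log(K/v_0) \right),

$

which is why the binary-choice lower-bound construction of Abeille et al. (2021) [4] can be followed directly.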

Comment

Thanks to the authors for the detailed response. I have read the rebuttal and decided to increase my score. I do not have further questions.

Comment

We sincerely appreciate your decision to increase the score. Your valuable and constructive comments will greatly contribute to strengthening the final version of our paper. Thank you for the insightful and enjoyable discussion.

Comment

A. Technical errors in Agrawal et al. (2023) [5]

1. Equation (16): αi(XSt,θt,θ)x_ti:=μi(X_Stθ)μi(X_Stθt)\alpha_i(\mathbf{X}_{S_t}, \theta_t, \theta^{\star}) x\_{ti} :=\mu_i(\mathbf{X}\_{S_t}^\top \theta^\star) - \mu_i(\mathbf{X}\_{S_t}^\top \theta_t), where X_St\mathbf{X}\_{S_t} is a design matrix whose columns are the attribute vectors x_tix\_{ti} of the items in the assortment StS_t and μi(X_Stθ)=Pt(iSt,θ)\mu_i(\mathbf{X}\_{S_t}^\top \theta) = P_t(i \mid S_t, \theta).

  • It appears that the authors may have intended to derive this equation using a first-order exact Taylor expansion. However, in MNL bandits, this equation generally does not hold. Consider a counterexample where $x_{ti}=0$, $x_{tj} \neq 0$ for $j \neq i$, and $\theta^\star \neq \theta_t$. In this case, the left-hand side (LHS) equals $0$ (since $x_{ti}=0$), but the right-hand side (RHS) is not $0$, because the denominators of $\mu_i(\mathbf{X}_{S_t}^\top \theta^\star)$ and $\mu_i(\mathbf{X}_{S_t}^\top \theta_t)$ differ (a numerical sketch is given after this list). This equation only holds in special cases, such as when $K=1$, which corresponds to the logistic bandit case. Equation (16) serves as the foundation for the entire proof in their paper; consequently, all subsequent results derived from it are also incorrect.

2. Cauchy-Schwarz inequality in the regret analysis (Page 46)

  • When using the Cauchy-Schwarz inequality on the regret before applying the elliptical potential lemma, they indeed incur an additional $\sqrt{K}$ factor: $\sum_{t=1}^T \min \left(\sum_{i \in S_t}\|x_{ti}\|_{\mathbf{J}_t^{-1}},1 \right) \leq \sqrt{KT} \sqrt{\sum_{t=1}^T \min \left(\sum_{i \in S_t} \|x_{ti}\|_{\mathbf{J}_t^{-1}}^2 ,1 \right)}.$

  • Hence, their regret should actually be O~(dKT+d2/κ)\tilde{\mathcal{O}}(d\sqrt{KT} + d^2/\kappa), which is worse than the result in our Theorem 2 (upper bound under uniform rewards) by a factor of KK.
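The counterexample in item 1 above can be checked numerically; below is a minimal sketch with hypothetical values ($K=2$, $v_0=1$, $d=2$), where the first item's feature vector is zero. It only verifies that the probability difference for item 1 is nonzero while any expression proportional to $x_{t1}$ must vanish.

```python
import numpy as np

def mnl_probs(X, theta, v0=1.0):
    """MNL choice probabilities mu_i(X^T theta) for the items in an assortment.
    X: (K, d) matrix whose rows are the item feature vectors."""
    u = np.exp(X @ theta)          # unnormalized utilities exp(x_i^T theta)
    return u / (v0 + u.sum())      # P(i | S, theta) for i = 1, ..., K

X = np.array([[0.0, 0.0],          # x_{t1} = 0
              [1.0, 0.5]])         # x_{t2} != 0
theta_star = np.array([0.3, -0.2])
theta_t = np.array([-0.5, 0.8])    # theta_star != theta_t

rhs = mnl_probs(X, theta_star) - mnl_probs(X, theta_t)
# Any quantity of the form alpha_1 * x_{t1} is forced to zero because x_{t1} = 0,
# yet the right-hand side for item 1 is nonzero since the normalizing denominators
# of mu_1(X^T theta_star) and mu_1(X^T theta_t) differ.
print(rhs[0])  # approximately -0.034, not 0
```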


B. Problem-dependent bounds

1. Upper bound
By adapting Theorem 2 from Faury et al. (2022) [21] to the multinomial and online parameter setting, we can achieve a regret upper bound of O~(dt=1Tκt+d2/κ)\tilde{\mathcal{O}}\left( d \sqrt{\sum_{t=1}^T \kappa_t^\star} + d^2/\kappa \right), where κt=iStpt(iSt,w)pt(0St,w)\kappa_t^\star = \sum_{i \in S_t^\star} p_t(i \mid S_t^\star, \mathbf{w}^\star)p_t(0 \mid S_t^\star, \mathbf{w}^\star).

Proof of the upper bound: We start the proof from line 748 in the proof of Theorem 2.

$

\sum_{t=1}^T \nabla Q(\mathbf{u}_t^\star)^\top (\mathbf{u}_t - \mathbf{u}_t^\star) \leq 2 \beta_T(\delta) \sum_{t=1}^T \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \| x_{ti} \|_{H_t^{-1}}.

$

Let gt(S;w):=_iSpt(iS,w)pt(0S,w)x_tiHt1g_t(S ; \mathbf{w}):= \sum\_{i \in S} p_t(i \mid S, \mathbf{w})p_t(0 \mid S, \mathbf{w}) \\| x\_{ti} \\|_{H_t^{-1}}. Let J_1 = \\{ t \in [T] \mid \sum\_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star) \geq \sum\_{i \in S_t} p_t(i \mid S_t, \mathbf{w}\_{t+1})p_t(0 \mid S_t, \mathbf{w}\_{t+1})\\\} and J2=[T]J1J_2 = [T] \setminus J_1. Thus, we get

$

\text{RHS} = 2 \beta_T(\delta) \sum_{t=1}^T g_t(S_t; \mathbf{w}^\star) = 2 \beta_T(\delta) \sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) + 2 \beta_T(\delta) \sum_{t \in J_2} g_t(S_t; \mathbf{w}^\star).

$

To bound tJ1gt(St;w)\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star), by the mean value theorem, we get

$

\sum_{t \in J_1} g_t(S_t; \mathbf{w}^\star) = \sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) + \sum_{t \in J_1} \nabla_{\mathbf{w}} g_t(S_t; \bar{\mathbf{w}}_t)^\top (\mathbf{w}^\star - \mathbf{w}_{t+1}),

$

where $\bar{\mathbf{w}}_t$ is a convex combination of $\mathbf{w}^\star$ and $\mathbf{w}_{t+1}$. The first term can be bounded by

$
\begin{aligned}
\sum_{t \in J_1} g_t(S_t; \mathbf{w}_{t+1}) &\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1})} \sqrt{\sum_{t \in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}_{t+1})p_t(0 \mid S_t, \mathbf{w}_{t+1}) \| x_{ti}\|_{H_t^{-1}}^2 } \\
&\leq \sqrt{\sum_{t\in J_1} \sum_{i \in S_t} p_t(i \mid S_t, \mathbf{w}^\star)p_t(0 \mid S_t, \mathbf{w}^\star)} \cdot \tilde{\mathcal{O}}( \sqrt{d} ) \leq \sqrt{\sum_{t=1}^T \kappa_t^\star + \text{Reg}_T} \cdot \tilde{\mathcal{O}}( \sqrt{d} ),
\end{aligned}
$

where the second-to-last inequality holds by the definition of $J_1$ and the elliptical potential lemma in our paper, and the last inequality holds by Lemma 11 in Goyal and Perivier [24] with $\kappa_t^\star := \sum_{i \in S_t^\star} p_t(i \mid S_t^\star, \mathbf{w}^\star)p_t(0 \mid S_t^\star, \mathbf{w}^\star)$. Moreover, the second term can be bounded by

Continued in the next comment.

Author Response

We would like to express our heartfelt thanks to all the reviewers for their constructive feedback. Your insights and suggestions are greatly appreciated.

In response to Reviewer 8Ks7's request, we conducted additional experiments for the logistic case (K=1K=1) and for larger values of KK (K=20,30K=20,30). The results of these experiments are detailed in the attached PDF. These results also support our theoretical claims. Specifically, under uniform rewards, we observe that the regret decreases as KK increases, aligning with Theorems 1 and 2. Conversely, under non-uniform rewards, the regret remains unaffected by changes in KK, consistent with Theorems 3 and 4.

We are pleased to provide additional results that further strengthen the claims in our paper. Thank you very much!
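For reviewers interested in reproducing the setup, below is a minimal sketch of how MNL choice feedback can be simulated for an offered assortment. The parameters here are purely hypothetical; the actual experimental configuration follows the paper and the attached PDF.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_choice(X_S, w_star, v0=1.0):
    """Sample one round of MNL feedback for an offered assortment.
    X_S: (K, d) feature matrix of the offered items.
    Returns 0 for the outside option, or i in {1, ..., K} for item i."""
    u = np.exp(X_S @ w_star)
    probs = np.concatenate(([v0], u)) / (v0 + u.sum())
    return rng.choice(len(probs), p=probs)

d, K = 5, 20
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)                    # ||w*||_2 = 1
X_S = rng.normal(size=(K, d))
X_S /= np.linalg.norm(X_S, axis=1, keepdims=True)   # ||x_i||_2 = 1
print(simulate_choice(X_S, w_star))                 # 0 (outside option) or an item index
```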

Final Decision

This paper explores the problem of contextual multinomial logit (MNL) bandits, an elegant abstraction that captures many real-world decision-making scenarios. Previous research has shown a clear theoretical gap between upper and lower bounds, particularly with respect to the maximum assortment size KK. The authors thoroughly investigate this issue and propose several innovative algorithms with provable regret bounds, demonstrating their results' near-minimax optimality. These findings are very nice, and the reviewers unanimously recognize the value of these results. On the other hand, they also highlight several points for improvement. Key points include addressing experimental issues and providing clearer distinctions between the paper’s contributions and earlier work, such as Zhang and Luo [48] and Zhang and Sugiyama [49]. The authors are encouraged to revise the manuscript to incorporate the reviewers' suggestions and enhance its impact on the field.