True Impact of Cascade Length in Contextual Cascading Bandits
We show that in contextual cascading bandits, regret vanishes as the cascade length grows, with nearly matching upper and lower bounds.
Abstract
Reviews and Discussion
The cascading bandit problem is widely used in recommendation systems, search engines, and online advertising. In this model, an agent selects an ordered list (cascade) of length $K$ from $N$ options (arms), and users sequentially browse the list, stopping at the first appealing item. Traditional studies suggest that regret—defined as the difference in expected rewards between the optimal and selected cascades—increases or remains constant with larger $K$. However, this contradicts the intuition that longer cascades offer more feedback opportunities, potentially reducing regret. The paper challenges this view, proving that in contextual cascading bandits, regret decreases when $K$ is sufficiently large.
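To make the interaction protocol concrete, here is a minimal simulation sketch assuming the standard logistic click model; all names and the naive policy are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def play_cascade(X, theta, cascade):
    """One round: the user scans the ordered list `cascade` and stops at
    the first click; the agent observes feedback up to that position."""
    for pos, arm in enumerate(cascade):
        if rng.random() < sigmoid(X[arm] @ theta):   # click probability
            return 1.0, pos                          # clicked at `pos`
    return 0.0, None                                 # no click among K items

def expected_reward(X, theta, cascade):
    """P(at least one click) = 1 - product of non-click probabilities."""
    p_click = sigmoid(X[list(cascade)] @ theta)
    return 1.0 - np.prod(1.0 - p_click)

# Toy instance: N arms with d-dimensional contexts rescaled to the unit ball.
N, d, K = 20, 5, 4
X = rng.normal(size=(N, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

best = np.argsort(-(X @ theta_star))[:K]              # optimal cascade
chosen = rng.choice(N, size=K, replace=False)         # a naive policy's pick
reward, pos = play_cascade(X, theta_star, chosen)
regret = expected_reward(X, theta_star, best) - expected_reward(X, theta_star, chosen)
print(f"observed reward: {reward}, per-round regret: {regret:.3f}")
```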
Strengths and Weaknesses
Strengths
- The paper challenges existing research on the impact of cascade length on regret in contextual cascading bandits, demonstrating that regret decreases when $K$ is sufficiently large. Specifically, the authors propose a new regret upper bound in which $\bar p$ is the upper bound of the feedback probability, $d$ is the dimension of the context vector, and $T$ is the total number of rounds. This bound shows that regret diminishes with increasing $K$ due to the exponential decay of the $\bar p^{K}$ factor.
- Additionally, the authors derive a matching lower bound in which $\underline p$ is the lower bound of the feedback probability. Together, these bounds fully characterize the true trend of regret with respect to $K$, addressing a theoretical gap in contextual cascading bandits.
- The paper introduces UCB-CLB (Upper Confidence Bound for Cascading Logistic Bandits), an algorithm based on Upper Confidence Bound (UCB) that integrates Online Mirror Descent (OMD) and a cascade restructuring technique (DO-SWAP); a schematic sketch follows this list. Operating under a logistic model, UCB-CLB achieves efficient parameter estimation and cascade selection. By avoiding the high computational cost of traditional maximum likelihood estimation (MLE), UCB-CLB significantly improves efficiency while maintaining theoretically optimal regret bounds. The DO-SWAP technique optimizes the exploration-exploitation trade-off by prioritizing items with higher uncertainty in the cascade.
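The following schematic sketches how such a round could be organized. It is not the authors' exact UCB-CLB: the step size `eta`, the confidence radius `beta`, and the final reordering step are placeholder assumptions (the paper's DO-SWAP swaps specific arms, per the rebuttal below).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def one_round(X, theta, H, beta, K, eta=0.1, feedback=None):
    """Schematic round of a UCB-style cascading logistic bandit.
    X: (N, d) contexts; theta: current estimate; H: (d, d) design matrix;
    beta: confidence radius; feedback: (x, y) pairs observed last round."""
    # 1) Online update: one (preconditioned) gradient step per observed item,
    #    instead of refitting an MLE on the whole history each round.
    #    (The design-matrix update for H is omitted for brevity.)
    if feedback:
        for x, y in feedback:
            g = (sigmoid(x @ theta) - y) * x        # logistic-loss gradient
            theta = theta - eta * np.linalg.solve(H, g)
    # 2) Optimistic (UCB) scores: estimate plus uncertainty width.
    H_inv = np.linalg.inv(H)
    widths = np.sqrt(np.einsum("nd,de,ne->n", X, H_inv, X))
    ucb = X @ theta + beta * widths
    # 3) Keep the top-K arms by optimistic score ...
    cascade = list(np.argsort(-ucb)[:K])
    # 4) ... then reorder so more uncertain items appear earlier, which is
    #    the role DO-SWAP plays (the paper's exact swap rule differs).
    cascade.sort(key=lambda i: -widths[i])
    return cascade, theta
```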
Weaknesses
- The theoretical results rely on specific assumptions, such as Assumption 5.1 (bounded norms of feature vectors and parameters) and Assumption 5.3 (a positive lower bound $\kappa$ in the logistic model). These assumptions may not always hold in practice. For instance, a small $\kappa$ (indicating significant deviation from a linear model) could impact algorithm performance, despite claims that UCB-CLB remains effective in such cases. However, the paper does not thoroughly explore how violations of these assumptions affect performance, potentially limiting the generalizability of its theoretical findings.
- The analysis depends on $\bar p$ being sufficiently small, in which case the regret upper bound decreases monotonically with increasing $K$. When $\bar p$ is large, regret may initially increase before decreasing (as the bound peaks at an intermediate $K$). The paper does not adequately address the implications of large $\bar p$ in practical high-feedback-probability scenarios, which may restrict the applicability of its theoretical results.
- Experiments only test cascade lengths $K=5$ and $K=10$, without exploring a broader range of values.
- The UCB-CLB algorithm relies on the DO-SWAP technique to restructure cascades, prioritizing items with higher uncertainty. While effective, this increases algorithmic complexity, potentially introducing additional computational overhead or implementation challenges in practice. The paper does not provide a detailed analysis of DO-SWAP's computational costs or potential failure cases across different scenarios.
Questions
- The UCB-CLB algorithm leverages Online Mirror Descent (OMD) and DO-SWAP techniques to enhance computational efficiency and optimize regret. Can this approach effectively extend to dynamic environments, such as those with time-varying contextual features?
- The paper introduces the combination of OMD and DO-SWAP as a novel proof technique to address the sequential feedback characteristics of cascading bandits. What new challenges might this integration introduce?
Limitations
N/A
Final Justification
The response from the authors resolves my concerns. I will maintain my positive rating.
Formatting Issues
N/A
We sincerely thank the reviewer for the thoughtful and detailed comments on our work. Your feedback has greatly helped us refine and strengthen the paper.
1. Discussion of Regularity Conditions
Firstly, Assumption 5.1 is without loss of generality: any raw feature vector can be rescaled to lie inside the unit ball before learning, and relaxing the parameter norm bound from $1$ to a general constant $S$ merely multiplies the theoretical regret by a $\mathrm{poly}(S)$ factor.
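As a quick illustration of this rescaling argument, the snippet below maps raw features into the unit ball with a single global scaling, which the parameter vector simply absorbs; it is a sketch, not code from the paper.

```python
import numpy as np

def rescale_to_unit_ball(X):
    """Divide all raw features by one global constant so every row lies in
    the unit ball; theta* absorbs the scale, so nothing is lost."""
    max_norm = np.linalg.norm(X, axis=1).max()
    return X / max_norm if max_norm > 1.0 else X
```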
Secondly, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
Lastly, we conduct additional experiments on the MovieLens-100K benchmark: reducing $\kappa$ leads to no noticeable change in cumulative or average regret (see the table below). $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear.
2. Discussion of Large $\bar p$
We first clarify the meaning of a large $\bar p$. Recall that $\bar p$ is defined as an upper bound on the non-click probability; consequently, a large $\bar p$ corresponds to a high-negative-feedback scenario.
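For concreteness, using the $\sigma$ notation from the proof sketch later in this thread ($\sigma_1$ the sigmoid, $\sigma_0 = 1 - \sigma_1$ the non-click probability), a plausible reconstruction of the elided definition—not the paper's verbatim statement—is:

$$
\sigma_1(u) = \frac{1}{1+e^{-u}}, \qquad \sigma_0(u) = 1 - \sigma_1(u), \qquad \bar p \;\ge\; \max_{t,\,i}\ \sigma_0\!\bigl(x_{t,i}^{\top}\theta^\star\bigr),
$$

so a large $\bar p$ means some items are very likely to be skipped, which is exactly the high-negative-feedback regime described above.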
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the larger bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
Hence, the question is not merely which bound is tighter; the focus of this work is on identifying the true scale of the regret in the contextual cascading bandit problem, supported by both upper and lower bounds.
3. Additional Experiments with Larger $K$
Following your suggestion, we run the MovieLens-100K experiment with two larger cascade lengths. As Table 6 shows, cumulative regret remains sub-linear (indicated by the steadily falling average regret $\mathcal{R}(t)/t$). Crucially, the overall scale of the regret diminishes as $K$ increases, exactly reflecting our theoretical claim that regret decreases once the cascade length is sufficiently long.
4. Time Complexity of DO-SWAP
Contrary to your concern, the DO-SWAP subroutine imposes neither significant computational overhead nor implementation difficulty. With a single partial sort over the cascade, our agent selects the two most uncertain arms within the length-$K$ cascade $C_t$. Our MovieLens-100K experiment with $K=5$ (see the table below) supports this claim, showing that DO-SWAP consistently consumes only a small share of the total cumulative runtime even as the round $t$ grows.
Table legend:
- "Total": the mean cumulative runtime of the full algorithm,
- "DO-SWAP": the mean cumulative time spent in the DO-SWAP subroutine,
- "Ratio": the share of DO-SWAP within the total time.

| $t$ | Total | DO-SWAP | Ratio |
|---|---|---|---|
| | | | % |
| | | | % |
| | | | % |
| | | | % |
| | | | % |
5. Applicability to Time-Varying Contextual Features
We would like to clarify that our current framework already supports time-varying contextual features. Section 3.2 (Problem Setting) explicitly models the environment using time-varying contextual features, confirming that our theoretical analysis already accommodates such dynamics.
Accordingly, the OMD + DO-SWAP algorithm is naturally applicable to dynamic environments where contextual features change over time. In our MovieLens-100K experiment, the agent observes time-varying features at every round $t$: a new user arrives, and the agent chooses among $N$ movies. As reported in Section 7 (Numerical Study), UCB-CLB with DO-SWAP achieves sub-linear regret under this setting, showing in practice that the algorithm still performs well even as the contexts change over time.
6. Integrating OMD and DO-SWAP Introduces No Additional Challenges
OMD is used for parameter estimation, while DO-SWAP handles the cascading selection process. These two components operate independently of each other. For example, one could substitute OMD with an alternative estimator such as MLE, and DO-SWAP would remain fully functional without modification.
Thank you for your response, which resolves my concerns. I will maintain my positive rating.
Thank you for your kind response and for maintaining your positive rating. We truly appreciate your time and consideration.
This paper investigates the role of the cascade length $K$ in contextual cascading bandits—a learning framework relevant to search and recommendation systems. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which uses Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound that diminishes with increasing $K$. They also provide the first matching lower bound in the contextual setting, resolving a long-standing open problem about the impact of cascade length.
Strengths and Weaknesses
Strengths
- The paper challenges and overturns a widely held belief in cascading bandits literature—that regret necessarily increases with cascade length. This is an important and surprising contribution.
- The UCB-CLB algorithm leverages OMD for parameter estimation, significantly improving computational efficiency compared to MLE-based approaches.
- The paper provides both upper and lower bounds that match up to logarithmic factors, a rare accomplishment in bandit theory.
- UCB-CLB is shown to be scalable and effective on real-world datasets like MovieLens, and avoids non-convex optimization bottlenecks.
- The paper is generally well-organized, with a thorough theoretical exposition, detailed algorithms, and compelling empirical results.
Weaknesses
- Some results rely on regularity conditions such as bounded features and logistic noise parameters, which may not hold in all real-world deployments.
- While the MovieLens experiments are convincing, broader validation across datasets or task domains would strengthen the claims.
- The lower bound uses a slightly different constant ($\underline p$) than the upper bound's $\bar p$, though the authors do acknowledge and justify this.
Questions
- Clarification on $\bar p$ vs. $\underline p$: Can the authors comment on how large the gap between $\bar p$ and $\underline p$ might be in practice, and whether it's observable in experiments?
- OMD vs MLE Trade-offs: Could the authors elaborate on potential limitations of OMD versus MLE, especially in terms of statistical efficiency in low-noise regimes?
- Impact of Arm Pool Size $N$: Even though the bounds are independent of $N$, does increasing $N$ affect empirical performance (e.g., convergence speed)?
Limitations
yes
Formatting Issues
None observed. The paper follows NeurIPS formatting guidelines.
Thank you for taking the time to review our work. We sincerely appreciate your constructive feedback and provide our detailed responses below.
1. Discussion of Regularity Conditions
Could you kindly clarify what is meant by the "logistic noise parameters"? We assume this refers either to the parameter norm bound or to the curvature parameter $\kappa$. Accordingly, we will address the boundedness of the three relevant quantities: the feature vector $x$, the true parameter $\theta^\star$, and the curvature parameter $\kappa$. Below, we explicitly show that these boundedness assumptions are made for convenience and can be easily relaxed or reformulated as definitions. Therefore, we believe they should not be viewed as weaknesses of our work.
1) Boundedness assumption on $x$ and $\theta^\star$ (Assumption 5.1): We would like to clarify that this assumption is made purely for convenience and cleaner presentation. If $\|x\|$ and $\|\theta^\star\|$ are bounded by a general constant $S$ (rather than 1), the regret bound simply scales with a polynomial factor of $S$, since the confidence parameter scales accordingly.
2) Discussion about $\kappa$ (Assumption 5.3): We would like to highlight that $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.
2. Additional Experiments: KuaiRand-Pure Click Dataset [2]

As the current NeurIPS rebuttal guidelines prohibit the inclusion of figures, we instead present the results in the table below. $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear. Following the reviewer's suggestion, we ran additional experiments with cascade lengths $K=5$ and $K=10$ on the KuaiRand-Pure dataset.
KuaiRand-Pure is a real-world recommendation benchmark derived from the short-video platform Kuaishou; it contains anonymised user–item interaction logs with click indicators, which we adopt as the binary reward signal for our bandit task. Consistent with our theoretical findings, the regret decreases as $K$ increases, confirming that our method scales favourably with longer cascades and remains robust across datasets.
(i) UCB-CLB with $K=5$
(ii) UCB-CLB with $K=10$
[2] Gao, Chongming, et al. "Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos." Proceedings of the 31st ACM international conference on information & knowledge management. 2022.
3. Clarification on $\bar p$ vs. $\underline p$
While our upper and lower bounds differ due to the gap between $\bar p$ and $\underline p$, these quantities are often overly conservative in practice. As a result, the realized regret typically lies between the two—significantly lower than the upper bound and higher than the lower bound—indicating that our algorithm is robust to the worst-case values of $\bar p$ and $\underline p$. To illustrate this, we provide a table below comparing the theoretical upper bound, theoretical lower bound, and the actual cumulative regret observed in our experiment. In the table:
- "Upper Bound" refers to the theoretical regret upper bound,
- "Lower Bound" refers to the theoretical regret lower bound, and
- "Cum. Regret" refers to the cumulative regret actually incurred by our algorithm on the MovieLens 100K dataset.
| $t$ | | | | |
|---|---|---|---|---|
| Upper Bound | | | | |
| Cum. Regret | | | | |
| Lower Bound | | | | |
Currently, we do not yet know how to close the gap between $\bar p$ and $\underline p$ to achieve minimax optimality. One promising direction is to pursue instance-dependent regret bounds, as in Abeille et al. (2021). For example, by defining an analogous quantity at an optimal arm, one could derive instance-dependent regret bounds instead of relying on worst-case quantities like $\bar p$ and $\underline p$. While this is a meaningful direction, it falls outside the scope of the present work, and we leave it for future research. We thank the reviewer for this valuable and constructive suggestion. Moreover, note that our algorithm does not need to observe or know $\bar p$ or $\underline p$ at any point.
4. OMD vs. MLE Trade-offs in Low-noise Regimes
It remains unclear which method—OMD or MLE—is preferable. In a low-noise regime, click feedback is almost deterministic. Under the logistic model this means the parameter norm must be large (i.e., $\|\theta^\star\|$ is large), reflecting a strong signal or a wide parameter space.
Recall that our regret bound (which uses OMD) scales with the parameter norm bound $S$ and deteriorates as $S$ increases. In contrast, for MLE-based approaches, regret bounds free of this norm dependence have been established in the GLM bandit literature (e.g., Lee et al., 2024 [3]). This suggests that a similar norm-free regret might be achievable in our setting by adopting MLE and applying similar techniques. Alternatively, an adaptive warm-up strategy, as proposed by Lee and Oh [4], could also eliminate the dependence on $S$.
Because both MLE and carefully tuned OMD variants can avoid this explicit dependence, deciding which approach is superior—even in low-noise scenarios—remains unresolved. Further work is required to pin down their precise statistical and computational trade-offs.
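To illustrate the computational side of this trade-off, here is a hedged side-by-side sketch (not the paper's code; `eta` and `lam` are illustrative): a batch MLE refit whose per-round cost grows with the history length, versus a single online update whose cost does not.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mle_refit(X_hist, y_hist, d, lam=1.0):
    """Batch MLE: re-solves a convex problem over the FULL history every
    round, so the per-round cost grows with t."""
    def nll(theta):
        z = X_hist @ theta
        return -np.sum(y_hist * z - np.logaddexp(0.0, z)) + 0.5 * lam * theta @ theta
    return minimize(nll, np.zeros(d), method="L-BFGS-B").x

def omd_step(theta, x, y, H, eta=0.1):
    """Online step: one preconditioned gradient update on the newest
    observation only, so the per-round cost is independent of t."""
    g = (sigmoid(x @ theta) - y) * x
    return theta - eta * np.linalg.solve(H, g)
```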
[3] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[4] Lee, Joongkyu, and Min-hwan Oh. "Improved Online Confidence Bounds for Multinomial Logistic Bandits." Forty-second International Conference on Machine Learning.
5. Impact of Arm Pool Size $N$
Our regret bound—and the algorithm in practice—is $N$-independent; to substantiate this, we conduct additional experiments. Using the MovieLens-100K benchmark with a fixed cascade length, we vary the arm pool size $N$ while keeping the same optimal arm across all settings to ensure a fair comparison. The resulting cumulative regret $\mathcal{R}(t)$ is reported in the table below.
Across all horizons, the largest gap in $\mathcal{R}(t)$ between any two settings of $N$ is a small fraction of the regret scale in those rows. Thus, even as $N$ increases across the tested range, the regret remains almost unchanged, supporting our theoretical result that regret does not depend on $N$.
| $t$ | small $N$ | medium $N$ | large $N$ |
|---|---|---|---|
| | 0.0233 | 0.0245 | 0.0307 |
| | 0.0356 | 0.0342 | 0.0409 |
| | 0.0432 | 0.0421 | 0.0492 |
| | 0.0506 | 0.0510 | 0.0581 |
| | 0.0593 | 0.0612 | 0.0683 |
Dear Reviewer 3rPH,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
Thank you for your response, which has clarified my concerns. I will maintain my original positive evaluation.
This paper addresses a long-standing and misunderstood problem in the field of recommendation systems: how the length of a recommended list (the cascade length $K$) affects learning performance.
Strengths and Weaknesses
Strengths: They propose a new Upper Confidence Bound (UCB) algorithm for cascading logistic bandits. It uses Online Mirror Descent (OMD) for parameter estimation, which makes it computationally efficient and scalable, with per-round costs that are independent of the total time horizon. Notably, they achieve a regret bound that decreases for large $K$, in contrast to prior work.
Weaknesses:
- They assume a lower bound for the variances. This assumption is not so standard. Could the authors discuss this assumption: why is it needed, and could their analysis degrade to deterministic settings where the variances are zero?
Questions
- In the experiments, the $K$ values are only 5 and 10. Could the authors try a larger $K$, such as 50 or 100? Also, it would be interesting and convincing if they could plot regret against a range of $K$ values to validate their theories.
- For $\bar p$ in the bound: $\bar p$ is small when all the arms in the space have a relatively high logit. This seems a little strict, since there may exist $x$ such that $x^\top\theta$ is very small. In this case, the bound only decreases when $K$ is sufficiently large. Could the authors discuss more about the limitations of their result?
Limitations
no
Final Justification
The authors solve my questions, so I will raise my score accordingly.
Formatting Issues
no
We appreciate your thoughtful feedback on our work. Below, we provide detailed responses to each comment and question.
1. Variance Lower-Bound Assumption
Before addressing the reviewer’s concern, we would first like to clarify the intended meaning of “variance” in the review. We assume the reviewer is referring to the parameter $\kappa$. In that case, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
To further elaborate on $\kappa$: this parameter captures the degree of non-linearity of the sigmoid function. Specifically, $\kappa$ becomes smaller when the agent selects actions across a wider range of the sigmoid. As such, $\kappa$ quantifies the difficulty of learning induced by the non-linearity of the logistic model. This dependence is not unique to our work—it has been unavoidable in much of the existing literature (Faury et al., 2020, 2021, 2022; Lee and Oh, 2024, 2025; Vial et al., 2022; Liu et al., 2024), whether it appears in the leading term or in non-leading constants of the regret bounds.
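For reference, one common way to formalize this, in the lower-bound-on-the-derivative convention that this thread's Assumption 5.3 suggests (Faury et al., 2020 work with its reciprocal), is:

$$
\kappa \;:=\; \inf_{\lVert x\rVert \le 1,\ \lVert\theta\rVert \le S} \dot{\sigma}_1\!\bigl(x^{\top}\theta\bigr) \;>\; 0,
$$

which is strictly positive whenever features and parameters are bounded, since $\dot{\sigma}_1(u) = \sigma_1(u)\,\sigma_0(u) \ge \dot{\sigma}_1(S)$ for all $|u| \le S$.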
In response to the reviewer’s question regarding the deterministic setting: such a setting corresponds to a degenerate case where each item is either (a) always clicked or (b) never clicked. Neither case is interesting from a learning or decision-making perspective. In case (a), the learner could simply select any $K$ items at random and incur zero regret, since every selected item leads to a click. In case (b), the learner also incurs zero regret, regardless of the selection, as the optimal reward is itself zero. Thus, the deterministic setting is uninformative and trivial in terms of both exploration and regret analysis.
[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.
2. Additional Experiments: Larger $K$
In accordance with this year’s NeurIPS rebuttal guidelines, we are not permitted to include plots; thus, we present the results in tabular form below. $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear.
As per the reviewer's request, we have conducted additional MovieLens benchmark experiments with larger $K$ to evaluate the robustness of our method. Consistent with our theoretical findings, the regret decreases as $K$ increases, demonstrating that our method scales favorably with the cascade length. The scale difference is substantial: for every round in the table, $\mathcal{R}(t)$ at the largest $K$ is many times smaller than at the smallest $K$.
3. Large $\bar p$ Case
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
In the same spirit, our newly established bounds are optimal in the small-$\bar p$ regime, while for large $\bar p$, the existing bounds remain optimal. We will incorporate this clarification into the final revision.
Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar p$ in practice. Importantly, the actual performance depends on the no-click probabilities of the items in the selected set (i.e., $\sigma_0(x_{t,i}^\top\theta^\star)$ for $i \in C_t$), rather than on the worst-case value $\bar p$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar p$—the worst-case analysis can be overly conservative.
The following table illustrates that even when $\bar p$ is large, the observed regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.
Dear Reviewer Rqrd,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
This paper considers contextual cascading bandits, where an agent selects ordered lists of $K$ arms from $N$ total arms, receiving sequential feedback until a stopping criterion is met.
The authors propose a UCB algorithm with online mirror descent for cascading logistic bandits, which requires only lightweight per-round computation, independent of $T$.
This work is the first to prove that longer cascades can improve performance.
Strengths and Weaknesses
Strengths:
- With the help of the OMD subroutine, the proposed algorithm is computationally efficient.
- It is the first to prove that longer cascades can improve performance.
- It provides the first lower bound derivation for contextual cascading bandits.
Weaknesses:
- There is still a gap between the upper and lower bounds. It would be better if the authors discussed how to close this gap in future work.
- The regret bound increases when $K$ is relatively small, especially when $\bar p$ is large.
Questions
Q1. In the description of Algorithm 2 on page 5, there is no explanation of how to choose $i_t^{(1)}$ and $i_t^{(2)}$.
Q2. When $K$ is not large enough, how can you argue that your regret bound is the tightest compared to previous results (on Page 2)?
Limitations
Yes
Final Justification
The authors addressed most of my concerns. I will keep my score.
Formatting Issues
NA
Thank you for your time to review our paper and for your valuable feedback. Here are our responses to each comment and question:
1. Gap between the Upper and Lower Bounds
As the reviewer pointed out, closing the gap between the current upper and lower bounds—i.e., matching $\bar p$ and $\underline p$—is indeed an important direction for future research. At present, we do not yet know how to achieve minimax optimality in this regime. One potential approach is to pursue instance-dependent regret bounds, as done by Abeille et al. (2021). For example, by defining an analogous quantity at an optimal arm, one could try to derive instance-dependent regret bounds instead of relying on worst-case probabilities ($\bar p$ and $\underline p$). While promising, this line of work is beyond the current scope and we leave it for future work. We thank the reviewer for this valuable and constructive suggestion.
2. Discussion of Regret Bound for Small $K$ and Large $\bar p$
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
In the same spirit, our newly established bounds become much tighter than existing bounds for contextual cascading bandits as $K$ increases, and are optimal in the large-$K$ regime. We will incorporate this clarification into the final revision.
Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar p$ in practice. Importantly, the actual performance depends on the no-click probabilities of the items in the selected set (i.e., $\sigma_0(x_{t,i}^\top\theta^\star)$ for $i \in C_t$), rather than on the worst-case value $\bar p$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar p$—the worst-case analysis can be overly conservative.
The following table illustrates that even when $\bar p$ is large, the observed cumulative regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.
3. Selection Rule for $i_t^{(1)}$ and $i_t^{(2)}$
Thank you for pointing out this omission. The exact rule for choosing the two swap arms is stated on page 7, line 226:
$$
i_t^{(1)} = \arg\max_{i\in C_t}\ \dot\sigma_1\bigl(x_{t,i}^{\top}\theta_t\bigr)\,\lVert x_{t,i}\rVert_{H_t^{-1}}, \qquad i_t^{(2)} = \arg\max_{i\in C_t}\ \lVert x_{t,i}\rVert_{H_t^{-1}}.
$$
To make this clear in the final version, we will (i) assign an explicit equation number to the formula above, and (ii) add a reference to the equation directly inside Algorithm 2 (page 5). We appreciate your careful reading and for helping us improve the clarity.
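As an illustration, the displayed rule transcribes directly into a few lines of code. This is a sketch assuming a precomputed $H_t^{-1}$; the function and variable names are ours, not the paper's.

```python
import numpy as np

def dot_sigma(u):
    """Derivative of the sigmoid: sigma(u) * (1 - sigma(u))."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def do_swap_pair(X, theta, H_inv, cascade):
    """Pick the two swap arms of the displayed rule inside the cascade C_t:
    a curvature-weighted uncertainty argmax and a plain uncertainty argmax."""
    Xc = X[list(cascade)]
    widths = np.sqrt(np.einsum("kd,de,ke->k", Xc, H_inv, Xc))  # ||x||_{H^{-1}}
    i1 = cascade[int(np.argmax(dot_sigma(Xc @ theta) * widths))]
    i2 = cascade[int(np.argmax(widths))]
    return i1, i2
```

Both argmax scans touch only the $K$ items of $C_t$, consistent with the lightweight-cost claim in the rebuttal above.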
Dear Reviewer C937,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
Thanks the authors for their detailed response. I will keep my score. Please make sure to include these discussions in the final version.
Motivated by some of the reviewers’ insightful comments and questions regarding the regret behavior in the large-$\bar p$, small-$K$ regime, we are pleased to share that our analysis can be naturally extended to yield even stronger results with only minor modifications.
Specifically, we can show that the regret does not increase with $K$ at all, even in the large-$\bar p$, small-$K$ regime. By leveraging standard techniques from prior works [2, 3], we establish that our algorithm (without any modification) also satisfies an instance-independent bound that does not grow with $K$. As a result, we obtain a refined upper bound given by the minimum of the two bounds, along with a matching lower bound up to the gap between $\bar p$ and $\underline p$ and logarithmic factors.
Please note that these additional results are obtained through a minor refinement of our existing proof techniques. We provide a brief sketch of the argument below and will include a formal corollary and full proof in the final version of the paper.
1. Supplementary Regret Upper Bound of UCB-CLB
Step 1: Introduce a worst-curvature intermediary parameter.
We adopt the intermediary parameter $\tilde{\theta}_{t,i}$ inspired by Lee et al. [1]. This ensures that the Gram matrix $H_t$ satisfies the following lower bound:
$$
H_t \succeq \sum_{\tau=1}^{t-1} \sum_{i \in O_\tau} \dot{\sigma}_1(x_{\tau,i}^{\top} \tilde{\theta}_{\tau,i})\, x_{\tau,i} x_{\tau,i}^{\top} + \lambda I_d =: L_t. \tag{1}
$$
Step 2: Control the deviation of the derivative.
Combining Lemma E.3 of Lee et al. [1] with our concentration result (Corollary 5.5), we get:
$$
\dot{\sigma}_1(x_{t,i}^{\top} \theta^\star) \le \dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) + \frac{\beta_{T,K}(\delta)}{4} \Vert x_{t,i} \Vert_{H_t^{-1}}. \tag{2}
$$
Step 3: Upper bound on the prediction error via the curvature-weighted norm.
Starting from a second-order Taylor expansion and applying Equations (1) and (2), we obtain the following:
$$
\sigma_1(u_{t,i}) - \sigma_1(x_{t,i}^{\top} \theta^\star) \le 2\beta_T(\delta)\,\dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) \Vert x_{t,i} \Vert_{L_t^{-1}} + \frac{7 \beta_{T,K}^2(\delta)}{10} \Vert x_{t,i} \Vert_{H_t^{-1}}^2. \tag{3}
$$
To simplify notation, we define $\zeta_{t,i} := 2\beta_T(\delta)\,\dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) \Vert x_{t,i} \Vert_{L_t^{-1}}$, which corresponds to the first term on the right-hand side of the inequality.
Step 4: Applying Equation (3) in the regret decomposition.
Following the regret decomposition of our Theorem 5.3 and substituting Equation (3) for the per-item prediction error leads immediately to the following inequality for the regret $\mathcal{R}(T)$:
$$
\mathcal{R}(T) \le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^{\top} \theta^\star)\, \zeta_{t,i_{tk}} \prod_{m=k+1}^{K} \sigma_0(x_{t,i_{tm}}^{\top} \theta^\star)\right]}_{\text{term 1}} + (\text{2nd-order term}).
$$
Step 5: $K$-free control of term 1.
Directly leveraging the proof technique of Lemma 19 in Liu et al. [2], we obtain the bound on term 1 as follows:
$$
\text{term 1} \le \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \frac{\prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^{\top} \theta^\star)\, \zeta_{t,i_{tk}}^2}{\sigma_0(x_{t,i_{tk}}^{\top} \theta^\star)\,\sigma_1(x_{t,i_{tk}}^{\top} \theta^\star)}\right]}.
$$
Then, substituting the definition of $\zeta_{t,i}$ and using a high-probability inequality that follows from the concentration result in Corollary 5.5 yields the intermediate bound.
Step 6: Applying the triggering-probability equivalence technique.
Next, a direct application of the triggering-probability equivalence (TPE) technique introduced by Liu et al. [3]—which allows us to replace the summation over the cascade $C_t$, weighted by the product of non-click probabilities, with an equivalent expectation over the set of observed arms $O_t$—yields the final bound, where the last equality arises from applying the well-known elliptical potential lemma.
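For reference, the elliptical potential lemma invoked here is standard (e.g., Lemma 11 of Abbasi-Yadkori et al., 2011): for any sequence $x_1, \dots, x_T$ with $\lVert x_t \rVert \le 1$ and $V_t = \lambda I_d + \sum_{\tau \le t} x_\tau x_\tau^{\top}$,

$$
\sum_{t=1}^{T} \min\Bigl\{1,\ \lVert x_t \rVert_{V_{t-1}^{-1}}^{2}\Bigr\} \;\le\; 2d \log\Bigl(1 + \frac{T}{d\lambda}\Bigr).
$$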
2. Lower Bound of Contextual Cascading Bandits
By the formulation of our instance in Appendix D.1, it follows that the worst-case regret admits a lower bound of the same scale. Together with our upper bound, the two match up to the gap between $\bar p$ and $\underline p$.
[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.
[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.
We hope that these additional results help clarify the improvements in our regret bounds over prior works and address the reviewer’s concerns regarding the strength of our results. We would greatly appreciate it if these clarifications could be considered in your final assessment. Thank you again for your time and effort in serving as a reviewer.
We sincerely thank all reviewers for their helpful feedback and unanimously positive assessments. We especially appreciate the thoughtful discussions during the review period, which helped clarify and strengthen our work. As we wrap up this rebuttal phase, we would like to highlight the main contributions of our paper.
Summary of Our Work
Cascading bandits have a rich theoretical history and are increasingly used in practice (many mobile social-media recommenders adopt a cascading interface); however, in the contextual setting, there has been a fundamental gap in understanding how the cascade length $K$ governs both algorithms and problem difficulty.
Our paper presents a complete package of results: to our knowledge, it is the first to characterize how regret scales with $K$; it pins down the precise $K$- and $\bar p$-dependence with nearly matching upper and lower bounds; and it delivers a practical, computationally efficient algorithm that exploits this structure. We believe these results close a longstanding gap that has existed since contextual cascading bandits were first introduced, and they provide algorithmic design principles for practical cascade systems (as commonly seen on today’s social-media platforms).
Key Contributions
1. Large-cascade-friendly algorithm:
- We propose UCB-CLB, the first contextual cascading bandit algorithm to show that regret can decrease with larger cascade length $K$, challenging prior beliefs.
- We identify the fundamental factors behind the regret dynamics. Our upper bound explicitly depends on both the cascade length $K$ and the worst-case non-click probability $\bar p$.
2. Lower Bound Matching the Upper Bound's Scaling in $K$:
- We provide the first lower bound for contextual cascading bandits, matching our upper bound up to the gap between $\bar p$ and $\underline p$, indicating that UCB-CLB achieves a nearly tight regret upper bound.
3. Efficiency in computation and cascade selection:
- Beyond improved regret, UCB-CLB is computationally efficient.
- UCB-CLB uses OMD for parameter estimation, requiring only lightweight computation per round — independent of $T$ and significantly lighter than MLE.
- We also introduce an optimistic expected reward that allows selecting the cascade from only the top-$K$ arms, instead of exhaustively examining all combinations (see the identity below).
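This works because, under the standard cascading objective, the expected reward

$$
R(C) \;=\; 1 - \prod_{i \in C}\bigl(1 - \sigma_1(x_i^{\top}\theta)\bigr)
$$

is order-independent and monotone increasing in each item's click probability, so maximizing an optimistic version of it over all $\binom{N}{K}$ subsets reduces to keeping the $K$ arms with the largest optimistic scores. (This is a property of the model class; the paper's exact optimistic reward may differ in details.)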
Further Improvements upon Reviewers' Feedback
During the review and discussion, insightful questions were raised regarding the regret behavior in the large-$\bar p$, small-$K$ regime.
By refining our analysis with standard techniques from [1, 2, 3] and only minor adjustments to our proofs, we confirm that our regret bound never exceeds a $K$-independent baseline, even in this regime.
Moreover, we retain a structure-aware regret upper bound together with a matching lower bound, up to the gap between $\bar p$ and $\underline p$ and logarithmic factors.
These results will be included as a formal corollary in the final version.
[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.
[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.
This paper investigates the role of the cascade length $K$ in contextual cascading bandits. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that the regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which utilizes Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound that can diminish with $K$. They also provide an almost matching lower bound. Experimental results on a real-world dataset are also demonstrated.
As the reviewers have pointed out, this paper has made several major contributions. To the best of our knowledge, this is the first paper proving that a longer cascade length can improve the performance of a contextual cascading bandit. This is important and surprising. Moreover, the proposed algorithm leverages OMD for parameter estimation, which significantly improves computational efficiency compared to more standard MLE-based approaches. Overall, the paper is well-written.
All reviewers are on the acceptance side for this paper. After reading the paper, the reviews, and the discussions, I agree with the reviewers and recommend accepting this paper.