PaperHub — NeurIPS 2025
Overall: 7.8 / 10 · Poster · 4 reviewers (ratings 4, 5, 5, 5; min 4, max 5, std 0.4)
Confidence: 2.3 · Novelty: 3.0 · Quality: 3.5 · Clarity: 3.0 · Significance: 3.3

True Impact of Cascade Length in Contextual Cascading Bandits

OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We show that in contextual cascading bandits, regret vanishes as the cascade length grows, with nearly matching upper and lower bounds.

Abstract

We revisit the contextual cascading bandit, where a learning agent recommends an ordered list (cascade) of items, and a user scans the list sequentially, stopping at the first attractive item. Although cascading bandits underpin various applications including recommender systems and search engines, the role of the cascade length $K$ in shaping regret has remained unclear. Contrary to prior results that regret grows with $K$, we prove that regret actually decreases once $K$ is large enough. Leveraging this insight, we design a new upper-confidence-bound algorithm built on online mirror descent that attains the sharpest known regret upper bound, $\tilde{\mathcal{O}}\bigl(\min \lbrace K\bar{p}^{K-1}, 1 \rbrace d \sqrt{T}\bigr)$ for contextual cascading bandits. To complement this new regret upper bound, we provide a nearly matching lower bound of $\Omega \bigl(\min \lbrace K\underline{p}^{K-1}, 1 \rbrace d \sqrt{T}\bigr)$, where $0 \leq \underline{p} \leq \bar{p} < 1$. Together, these results fully characterize how regret truly scales with $K$, thereby closing the theoretical gap for contextual cascading bandits. Finally, comprehensive experiments validate our theoretical results and show the effectiveness of our proposed method.
Keywords

contextual bandit · cascading bandit · upper confidence bound · online learning

Reviews and Discussion

Review (Rating: 4)

The cascading bandit problem is widely used in recommendation systems, search engines, and online advertising. In this model, an agent selects an ordered list (cascade) of length $K$ from $N$ options (arms), and users sequentially browse the list, stopping at the first appealing item. Traditional studies suggest that regret—defined as the difference in expected rewards between the optimal and selected cascades—increases or remains constant with larger $K$. However, this contradicts the intuition that longer cascades offer more feedback opportunities, potentially reducing regret. The paper challenges this view, proving that in contextual cascading bandits, regret decreases when $K$ is sufficiently large.

Strengths and Weaknesses

Strengths

  1. The paper challenges existing research on the impact of cascade length $K$ on regret in contextual cascading bandits, demonstrating that regret decreases when $K$ is sufficiently large. Specifically, the authors propose a new regret upper bound of $\mathcal{O}(K \bar{p}^{K-1} d \sqrt{T})$, where $\bar{p}$ is the upper bound of the feedback probability, $d$ is the dimension of the context vector, and $T$ is the total number of rounds. This bound shows that regret diminishes with increasing $K$ due to the exponential decay of $\bar{p}^{K-1}$ (since $\bar{p} < 1$).

  2. Additionally, the authors derive a matching lower bound of $\Omega(K \underline{p}^{K-1} d \sqrt{T})$, where $\underline{p}$ is the lower bound of the feedback probability. Together, these bounds fully characterize the true trend of regret with respect to $K$, addressing a theoretical gap in contextual cascading bandits.

  3. The paper introduces UCB-CLB (Upper Confidence Bound for Cascading Logistic Bandits), an algorithm based on Upper Confidence Bound (UCB) that integrates Online Mirror Descent (OMD) and a cascade restructuring technique (DO-SWAP). Operating under a logistic model, UCB-CLB achieves efficient parameter estimation and cascade selection. By avoiding the high computational cost of traditional maximum likelihood estimation (MLE), UCB-CLB significantly improves efficiency while maintaining theoretically optimal regret bounds. The DO-SWAP technique optimizes the exploration-exploitation trade-off by prioritizing items with higher uncertainty in the cascade.

Weaknesses

  1. The theoretical results rely on specific assumptions, such as Assumption 5.1 (bounded norms of feature vectors and parameters) and Assumption 5.3 (a positive lower bound $\kappa$ in the logistic model). These assumptions may not always hold in practice. For instance, a small $\kappa$ (indicating significant deviation from a linear model) could impact algorithm performance, despite claims that UCB-CLB remains effective in such cases. However, the paper does not thoroughly explore how violations of these assumptions affect performance, potentially limiting the generalizability of its theoretical findings.

  2. The analysis depends on $\bar{p} \leq 1/e \approx 0.37$, where the regret upper bound decreases monotonically with increasing $K$. When $\bar{p} > 1/e$, regret may initially increase before decreasing (as $K \bar{p}^{K-1}$ peaks at $K = \frac{1}{\log(1/\bar{p})}$). The paper does not adequately address the implications of $\bar{p} > 1/e$ in practical high-feedback-probability scenarios, which may restrict the applicability of its theoretical results.

  3. Experiments only test cascade lengths $K = 5$ and $K = 10$, without exploring a broader range of $K$ values.

  4. The UCB-CLB algorithm relies on the DO-SWAP technique to restructure cascades, prioritizing items with higher uncertainty. While effective, this increases algorithmic complexity, potentially introducing additional computational overhead or implementation challenges in practice. The paper does not provide a detailed analysis of DO-SWAP’s computational costs or potential failure cases across different scenarios.

Questions

  1. The UCB-CLB algorithm leverages Online Mirror Descent (OMD) and DO-SWAP techniques to enhance computational efficiency and optimize regret. Can this approach effectively extend to dynamic environments, such as those with time-varying contextual features?

  2. The paper introduces the combination of OMD and DO-SWAP as a novel proof technique to address the sequential feedback characteristics of cascading bandits. What new challenges might this integration introduce?

Limitations

N/A

Final Justification

The response from the authors resolves my concerns. I will maintain my positive rating.

Formatting Issues

N/A

Author Response

We sincerely thank the reviewer for the thoughtful and detailed comments on our work. Your feedback has greatly helped us refine and strengthen the paper.

1. Discussion of Regularity Conditions

Firstly, Assumption 5.1 is without loss of generality: any raw feature vector can be rescaled to lie inside the unit ball before learning, and relaxing the parameter norm bound to a general constant $S$ merely multiplies the theoretical regret bound by an $S^{3/2}$ factor.

Secondly, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
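
For concreteness, one convenient way to write such a definition—a paraphrase consistent with this thread's convention that $\kappa$ lower-bounds the sigmoid derivative, with the precise form given as Assumption 5.3 in the paper—is

$$\kappa := \inf_{x \in \mathcal{X},\, \theta \in \Theta} \dot{\sigma}(x^{\top}\theta) > 0,$$

which is strictly positive whenever $x$ and $\theta$ range over bounded sets, since $\dot{\sigma}(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr)$ is continuous and positive on any compact interval of $z$.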

Lastly, we conduct additional experiments on the MovieLens-100K benchmark: reducing $\kappa$ from 0.1 to 0.01 leads to no noticeable change in cumulative or average regret (see the tables below). $R(t)$ denotes the cumulative regret up to round $t$, whereas $R(t)/t$ is the average regret at round $t$. We report $R(t)/t$ to check whether cumulative regret is sublinear.

$\kappa = 0.1$:

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 2000 | 10.97 | $5.40 \times 10^{-3}$ |
| 4000 | 15.80 | $3.90 \times 10^{-3}$ |
| 6000 | 19.83 | $3.30 \times 10^{-3}$ |
| 8000 | 23.84 | $2.90 \times 10^{-3}$ |
| 10000 | 27.28 | $2.70 \times 10^{-3}$ |

$\kappa = 0.01$:

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 2000 | 12.66 | $6.30 \times 10^{-3}$ |
| 4000 | 18.46 | $4.60 \times 10^{-3}$ |
| 6000 | 22.94 | $3.80 \times 10^{-3}$ |
| 8000 | 26.94 | $3.30 \times 10^{-3}$ |
| 10000 | 30.54 | $3.00 \times 10^{-3}$ |

2. Discussion of $\bar{p}$

We first clarify the meaning of a large $\bar{p}$. Recall that $\bar{p}$ is defined as an upper bound on the non-click probability: $\bar{p} = \sup_{x}\sigma_0(x^{\top}\theta^{\star})$. Consequently, a large $\bar{p}$ corresponds to a high-negative-feedback scenario.

Even though our regret bound may increase in the small-$K$ regime with particularly large $\bar{p}$, we emphasize that this is the first regret bound to explicitly capture and improve with increasing $K$, with explicit dependence on both $K$ and $\bar{p}$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of a loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar{p}$ and $\underline{p}$), which further supports our theoretical findings and implications. That is, in worst-case instances with small $K$ and particularly large $\bar{p}$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
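
To make this shape concrete, the following minimal script (illustrative only, not part of our implementation) evaluates the factor $K\bar{p}^{K-1}$ from the bound: for $\bar{p} \le 1/e$ it decreases monotonically in $K$, while for larger $\bar{p}$ it peaks near $K = 1/\log(1/\bar{p})$ and then decays exponentially.

```python
import math

def bound_factor(K: int, p_bar: float) -> float:
    """K-dependent factor K * p_bar**(K-1) appearing in the regret upper bound."""
    return K * p_bar ** (K - 1)

for p_bar in (0.3, 0.5, 0.8):
    peak = 1.0 / math.log(1.0 / p_bar)  # maximizer of K * p_bar**(K-1) over real K
    vals = ", ".join(f"K={K}: {bound_factor(K, p_bar):.3f}" for K in (1, 2, 4, 8, 16))
    print(f"p_bar={p_bar} (peak near K={peak:.2f}) -> {vals}")
```

For $\bar{p} = 0.8$, for example, the factor rises until $K \approx 4$ and then decays, matching the worst-case behavior described above.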

Furthermore, we would like to highlight that in the larger bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\tilde{\mathcal{O}}(d\sqrt{T})$, and (2) $\tilde{\mathcal{O}}(\sqrt{dT \log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N = \mathcal{O}(e^d)$), the bound $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ is tighter. Conversely, when $N$ is large, $\tilde{\mathcal{O}}(d\sqrt{T})$ becomes the better guarantee.

Hence, the question is not merely which bound is tighter; the focus of this work is the true scale of the regret in the contextual cascading bandit problem, supported by both upper and lower bounds.

3. Additional Experiments with Larger $K$

Following your suggestion, we run the MovieLens-100K experiment with $K = 50$ and $K = 100$. As the tables below show, cumulative regret remains sub-linear (indicated by the steadily falling average regret $R(t)/t$). Crucially, the overall scale of the regret $R(t)$ diminishes as $K$ increases, exactly reflecting our theoretical claim that regret decreases once the cascade length is sufficiently long.

$K = 50$:

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 1000 | $6.09 \times 10^{-17}$ | $6.09 \times 10^{-20}$ |
| 2000 | $9.49 \times 10^{-17}$ | $4.74 \times 10^{-20}$ |
| 3000 | $13.04 \times 10^{-17}$ | $4.35 \times 10^{-20}$ |
| 4000 | $15.29 \times 10^{-17}$ | $3.82 \times 10^{-20}$ |
| 5000 | $17.54 \times 10^{-17}$ | $3.51 \times 10^{-20}$ |

$K = 100$:

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 1000 | $0.88 \times 10^{-32}$ | $8.80 \times 10^{-36}$ |
| 2000 | $1.59 \times 10^{-32}$ | $7.96 \times 10^{-36}$ |
| 3000 | $2.14 \times 10^{-32}$ | $7.14 \times 10^{-36}$ |
| 4000 | $2.50 \times 10^{-32}$ | $6.26 \times 10^{-36}$ |
| 5000 | $2.88 \times 10^{-32}$ | $5.75 \times 10^{-36}$ |

4. Time Complexity of DO-SWAP

Contrary to your concern, the DO-SWAP subroutine imposes neither significant computational overhead nor implementation difficulty. With a single partial sort that runs in $O(K)$ time, our agent selects the two most uncertain arms within the length-$K$ cascade $C_t$. Our MovieLens-100K experiment with $K=5$ (see the table below) supports this claim, showing that DO-SWAP consistently consumes only $\approx 0.7\%$ of total cumulative runtime even as the round $t$ grows.

Table legend:

  • "Total": the mean cumulative runtime of the full algorithm,
  • "DO-SWAP": the mean cumulative time spent in the DO-SWAP subroutine,
  • "Ratio": the share of DO-SWAP within the total time.

| $t$ | Total | DO-SWAP | Ratio |
|---|---|---|---|
| 1000 | 1.825 | 0.012 | 0.7% |
| 2000 | 3.446 | 0.024 | 0.7% |
| 3000 | 4.910 | 0.035 | 0.7% |
| 4000 | 6.309 | 0.045 | 0.7% |
| 5000 | 7.687 | 0.056 | 0.7% |
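
For intuition on the $O(K)$ cost, both maxima can be found in a single linear pass over the cascade. The sketch below is illustrative only (DO-SWAP applies two different uncertainty criteria, as in the selection rule quoted later in this thread), but the selection pattern is the same:

```python
def top_two_indices(scores):
    """Return indices of the two largest entries in one O(K) pass."""
    best, second = None, None
    for i, s in enumerate(scores):
        if best is None or s > scores[best]:
            best, second = i, best
        elif second is None or s > scores[second]:
            second = i
    return best, second

print(top_two_indices([0.3, 0.9, 0.1, 0.7, 0.5]))  # -> (1, 3)
```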

5. Applicability to Time-Varying Contextual Features

We would like to clarify that our current framework already supports time-varying contextual features. Section 3.2 (Problem Setting) explicitly models the environment using time-varying contextual features, confirming that our theoretical analysis already accommodates such dynamics.

Accordingly, the OMD + DO-SWAP algorithm is naturally applicable to dynamic environments where contextual features change over time. In our MovieLens-100K experiment, the agent observes time-varying features $x_{t,i}$ at every round $t$: a new user $u_t$ arrives, and the agent chooses among 1,642 movies $i$. As reported in Section 7 (Numerical Study), UCB-CLB with DO-SWAP achieves sub-linear regret under this setting, showing in practice that the algorithm still performs well even as the contexts change over time.

6. Integrating OMD and DO-SWAP Introduces No Additional Challenges

OMD is used for parameter estimation, while DO-SWAP handles the cascading selection process. These two components operate independently of each other. For example, one could substitute OMD with an alternative estimator such as MLE, and DO-SWAP would remain fully functional without modification.

Comment

Thank you for your response, which resolves my concerns. I will maintain my positive rating.

Comment

Thank you for your kind response and for maintaining your positive rating. We truly appreciate your time and consideration.

Review (Rating: 5)

This paper investigates the role of the cascade length $K$ in contextual cascading bandits—a learning framework relevant to search and recommendation systems. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which uses Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound of $\tilde{\mathcal{O}}(K\bar{p}^{K-1} d \sqrt{T})$ that diminishes with increasing $K$. They also provide the first matching lower bound in the contextual setting, resolving a long-standing open problem about the impact of cascade length.

Strengths and Weaknesses

Strengths

  • The paper challenges and overturns a widely held belief in the cascading bandits literature—that regret necessarily increases with cascade length. This is an important and surprising contribution.
  • The UCB-CLB algorithm leverages OMD for parameter estimation, significantly improving computational efficiency compared to MLE-based approaches.
  • The paper provides both upper and lower bounds that match up to logarithmic factors, a rare accomplishment in bandit theory.
  • UCB-CLB is shown to be scalable and effective on real-world datasets like MovieLens, and avoids non-convex optimization bottlenecks.
  • The paper is generally well-organized, with a thorough theoretical exposition, detailed algorithms, and compelling empirical results.

Weaknesses

  • Some results rely on regularity conditions such as bounded features and logistic noise parameters, which may not hold in all real-world deployments.
  • While the MovieLens experiments are convincing, broader validation across datasets or task domains would strengthen the claims.
  • The lower bound uses a slightly different constant $\underline{p}$ than the upper bound's $\bar{p}$, though the authors do acknowledge and justify this.

Questions

  1. Clarification on $\underline{p}$ vs. $\bar{p}$: Can the authors comment on how large the gap between $\underline{p}$ and $\bar{p}$ might be in practice, and whether it's observable in experiments?
  2. OMD vs MLE Trade-offs: Could the authors elaborate on potential limitations of OMD versus MLE, especially in terms of statistical efficiency in low-noise regimes?
  3. Impact of Arm Pool Size $N$: Even though the bounds are independent of $N$, does increasing $N$ affect empirical performance (e.g., convergence speed)?

Limitations

yes

Formatting Issues

None observed. The paper follows NeurIPS formatting guidelines.

Author Response

Thank you for taking the time to review our work. We sincerely appreciate your constructive feedback and provide our detailed responses below.

1. Discussion of Regularity Conditions

Could you kindly clarify what is meant by the "logistic noise parameters"? We assume this refers either to $\theta^\star$ or to the curvature parameter $\kappa$. Accordingly, we will address the boundedness of the three relevant quantities: the feature vector $x$, the true parameter $\theta^\star$, and the curvature parameter $\kappa$. Below, we explicitly show that these boundedness assumptions are made for convenience and can be easily relaxed or reformulated as definitions. Therefore, we believe they should not be viewed as weaknesses of our work.

1) Boundedness assumption on $x$ and $\theta^\star$ (Assumption 5.1): We would like to clarify that this assumption is made purely for convenience and cleaner presentation. If $x$ and $\theta^\star$ are bounded by a general constant $S$ (rather than 1), the regret bound simply scales with $\text{poly}(S)$. For example, if $\Vert x \Vert_{2} \le 1$ and $\Vert \theta^{\star} \Vert_2 \leq S$, the regret becomes $\tilde{\mathcal{O}}\bigl( S^{3/2} K \bar{p}^{K-1} d \sqrt{T} \bigr)$, since the confidence parameter scales as $\beta_{t,k}(\delta) = \tilde{\mathcal{O}}(S^{3/2} \sqrt{d})$.

2) Discussion about $\kappa$ (Assumption 5.3): We would like to highlight that $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.

[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.

2. Additional Experiments: KuaiRand-Pure Click Dataset [2]

As the current NeurIPS rebuttal guidelines prohibit the inclusion of figures, we instead present the results in the tables below. $R(t)$ denotes the cumulative regret up to round $t$, whereas $R(t)/t$ is the average regret at round $t$. We report $R(t)/t$ to check whether cumulative regret is sublinear. Following the reviewer's suggestion, we ran additional experiments with cascade lengths $K = 5$ and $K = 10$ on the KuaiRand-Pure dataset.

KuaiRand-Pure is a real-world recommendation benchmark derived from the short-video platform Kuaishou; it contains anonymised user–item interaction logs with click indicators, which we adopt as the binary reward signal for our bandit task. Consistent with our theoretical findings, the regret decreases as KK increases, confirming that our method scales favourably with longer cascades and remains robust across datasets.

(i) UCB-CLB with $K=5$

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 1000 | 13.15 | 0.013 |
| 2000 | 16.14 | 0.008 |
| 3000 | 18.57 | 0.006 |
| 4000 | 20.48 | 0.005 |
| 5000 | 21.96 | 0.004 |

(ii) UCB-CLB with $K=10$

| $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|
| 1000 | 0.92 | $9.19 \times 10^{-4}$ |
| 2000 | 1.06 | $5.31 \times 10^{-4}$ |
| 3000 | 1.15 | $3.84 \times 10^{-4}$ |
| 4000 | 1.20 | $3.00 \times 10^{-4}$ |
| 5000 | 1.24 | $2.47 \times 10^{-4}$ |

[2] Gao, Chongming, et al. "Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos." Proceedings of the 31st ACM international conference on information & knowledge management. 2022.

3. Clarification on $\underline{p}$ vs. $\bar{p}$

While our upper and lower bounds differ due to the gap between $\bar{p}$ and $\underline{p}$, these quantities are often overly conservative in practice. As a result, the realized regret typically lies between the two—significantly lower than the upper bound and higher than the lower bound—indicating that our algorithm is robust to the worst-case values of $\bar{p}$ and $\underline{p}$. To illustrate this, we provide a table below comparing the theoretical upper bound, theoretical lower bound, and the actual cumulative regret observed in an experiment with $\bar{p} = 0.8$ and $\underline{p} = 0.1$. In the table:

  • "Upper Bound" refers to the theoretical regret upper bound,
  • "Lower Bound" refers to the theoretical regret lower bound, and
  • "Cum. Regret" refers to the cumulative regret actually incurred by our algorithm on the MovieLens 100K dataset.
| | $K=4$ | $K=6$ | $K=8$ | $K=10$ |
|---|---|---|---|---|
| Upper Bound | 1619.09 | 1554.32 | 1326.36 | 1061.08 |
| Cum. Regret | 18.36 | 5.15 | 1.39 | 0.36 |
| Lower Bound | $3.16 \times 10^{0}$ | $4.74 \times 10^{-2}$ | $6.32 \times 10^{-4}$ | $7.91 \times 10^{-6}$ |

Currently, we do not yet know how to close the gap between $\bar{p}$ and $\underline{p}$ to achieve minimax optimality. One promising direction is to pursue instance-dependent regret bounds, as in Abeille et al. (2021). For example, by defining $p^\star = \sigma_0((x^\star)^\top \theta^\star)$, where $x^\star$ is an optimal arm, one could derive $p^\star$-dependent regret bounds instead of relying on worst-case quantities like $\bar{p}$ and $\underline{p}$. While this is a meaningful direction, it falls outside the scope of the present work, and we leave it for future research. We thank the reviewer for this valuable and constructive suggestion. Moreover, note that our algorithm does not need to observe or know $\bar{p}$ or $\underline{p}$ at any point.

4. OMD vs. MLE Trade-offs in Low-noise Regimes

It remains unclear which method—OMD or MLE—is preferable. In a low-noise regime, click feedback is almost deterministic. Under the logistic model this means the parameter norm $S$ must be large (i.e., $\Vert \theta \Vert_2 \leq S$), reflecting a strong signal or a wide parameter space.

Recall that when $\Vert\theta^\star\Vert_2 \leq S$, our regret bound (which uses OMD) scales as $\tilde{\mathcal{O}}\bigl(S^{3/2} K \bar{p}^{K-1} d \sqrt{T} \bigr)$, which deteriorates as $S$ increases. In contrast, for MLE-based approaches, $S$-free regret bounds have been established in the GLM bandit literature (e.g., Lee et al., 2024 [3]). This suggests that a similar $S$-free regret might be achievable in our setting by adopting MLE and applying similar techniques. Alternatively, an adaptive warm-up strategy, as proposed by Lee and Oh [4], could also eliminate the dependence on $S$.

Because both MLE and carefully tuned OMD variants can avoid explicit $S$ dependence, deciding which approach is superior—even in low-noise scenarios—remains unresolved. Further work is required to pin down their precise statistical and computational trade-offs.

[3] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.

[4] Lee, Joongkyu, and Min-hwan Oh. "Improved Online Confidence Bounds for Multinomial Logistic Bandits." Forty-second International Conference on Machine Learning.

5. Impact of Arm Pool Size $N$

Our regret bound—and the algorithm in practice—is $N$-independent; to substantiate this, we conduct additional experiments. Using the MovieLens-100K benchmark with cascade length $K=5$, we vary the arm pool over $N=2000, 2500, 3000$ movies while keeping the same optimal arm across all settings to ensure a fair comparison. The resulting cumulative regret $R(t)$ is reported in the table below.

Across all horizons, the largest gap between any two $R(t)$ values is 0.009 (at $t=5000$), which is approximately 1% of the regret scale in those rows. Thus, even when $N$ increases from 2000 to 3000, the regret remains almost unchanged, supporting our theoretical result that regret does not depend on $N$.

| $t$ | $R(t)$, $N=2000$ | $R(t)$, $N=2500$ | $R(t)$, $N=3000$ |
|---|---|---|---|
| 1000 | 0.0233 | 0.0245 | 0.0307 |
| 2000 | 0.0356 | 0.0342 | 0.0409 |
| 3000 | 0.0432 | 0.0421 | 0.0492 |
| 4000 | 0.0506 | 0.0510 | 0.0581 |
| 5000 | 0.0593 | 0.0612 | 0.0683 |

Comment

Dear Reviewer 3rPH,

Thanks a lot for reviewing this paper!

It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?

Thanks!

Comment

Thank you for your response, which has clarified my concerns. I will maintain my original positive evaluation.

Review (Rating: 5)

This paper addresses a long-standing and misunderstood problem in the field of recommendation systems: how the length of a recommended list (the cascade length) shapes the achievable regret.

Strengths and Weaknesses

Strengths: They propose a new Upper Confidence Bound (UCB) algorithm for cascading logistic bandits. It uses Online Mirror Descent (OMD) for parameter estimation, which makes it computationally efficient and scalable, with per-round costs that are independent of the total time horizon. Notably, they achieve a regret bound that decreases for large $K$, in contrast to prior results.

Weaknesses:

  1. They assume a lower bound for the variances. This assumption is not so standard. Could the authors discuss this assumption—why do they need it, and does their analysis degrade in deterministic settings where variances are zero?

Questions

  1. In the experiments, the $K$ values are only 5 and 10. Could the authors try a larger $K$, such as 50 or 100? Also, it would be interesting and convincing if they could plot regret against a range of $K$ to validate their theory.

  2. Regarding $\bar{p}$ in the bound: $\bar{p}$ is small only when all the arms in the space have a relatively high logit. This seems a little strict, since there may exist $x$ such that $x^\top\theta$ is very small. In this case, the bound only decreases when $K$ is sufficiently large. Could the authors discuss more about the limitations of their result?

Limitations

no

Final Justification

The authors resolved my questions, so I will raise my score accordingly.

Formatting Issues

no

Author Response

We appreciate your thoughtful feedback on our work. Below, we provide detailed responses to each comment and question.

1. Variance Lower-Bound Assumption

Before addressing the reviewer's concern, we would first like to clarify the intended meaning of "variance" in the review. We assume the reviewer is referring to the parameter $\kappa$. In that case, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.

To further elaborate on $\kappa$: this parameter captures the degree of non-linearity of the sigmoid function. Specifically, $\kappa$ becomes smaller when the agent selects actions across a wider range of the sigmoid. As such, $\kappa$ quantifies the difficulty of learning induced by the non-linearity of the logistic model. This dependence is not unique to our work—it has been unavoidable in much of the existing literature (Faury et al., 2020, 2021, 2022; Lee and Oh, 2024, 2025; Vial et al., 2022; Liu et al., 2024), whether it appears in the leading term or in non-leading constants of the regret bounds.

In response to the reviewer's question regarding the deterministic choice setting: such a setting corresponds to a degenerate case where each item is either (a) always clicked or (b) never clicked. Neither case is interesting from a learning or decision-making perspective. In case (a), the learner could simply select any $K$ items at random and incur zero regret, since every selected item leads to a click. In case (b), the learner also incurs zero regret, regardless of the selection, as the optimal reward is itself zero. Thus, the deterministic setting is uninformative and trivial in terms of both exploration and regret analysis.

[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.

2. Additional Experiments: Larger $K$

In accordance with this year's NeurIPS rebuttal guidelines, we are not permitted to include plots; thus, we present the results in tabular form below. $R(t)$ denotes the cumulative regret up to round $t$, whereas $R(t)/t$ is the average regret at round $t$. We report $R(t)/t$ to check whether cumulative regret is sublinear.

As per the reviewer's request, we have conducted additional MovieLens benchmark experiments with larger $K = 50, 100$ to evaluate the robustness of our method. Consistent with our theoretical findings, the regret decreases as $K$ increases, demonstrating that our method scales favorably with the cascade length. The scale difference is enormous: for every round in the table, $R(t)$ at $K = 100$ is roughly $10^{15} \sim 10^{16}$ times smaller than at $K = 50$.

| $K$ | $t$ | $R(t)$ | $R(t)/t$ |
|---|---|---|---|
| 50 | 1000 | $6.09 \times 10^{-17}$ | $6.09 \times 10^{-20}$ |
| 50 | 2000 | $9.49 \times 10^{-17}$ | $4.74 \times 10^{-20}$ |
| 50 | 3000 | $13.04 \times 10^{-17}$ | $4.35 \times 10^{-20}$ |
| 50 | 4000 | $15.29 \times 10^{-17}$ | $3.82 \times 10^{-20}$ |
| 50 | 5000 | $17.54 \times 10^{-17}$ | $3.51 \times 10^{-20}$ |
| 100 | 1000 | $0.88 \times 10^{-32}$ | $8.80 \times 10^{-36}$ |
| 100 | 2000 | $1.59 \times 10^{-32}$ | $7.96 \times 10^{-36}$ |
| 100 | 3000 | $2.14 \times 10^{-32}$ | $7.14 \times 10^{-36}$ |
| 100 | 4000 | $2.50 \times 10^{-32}$ | $6.26 \times 10^{-36}$ |
| 100 | 5000 | $2.88 \times 10^{-32}$ | $5.75 \times 10^{-36}$ |

3. Large $\bar{p}$ Case

Even though our regret bound may increase in the small-$K$ regime with particularly large $\bar{p}$, we emphasize that this is the first regret bound to explicitly capture and improve with increasing $K$, with explicit dependence on both $K$ and $\bar{p}$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of a loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar{p}$ and $\underline{p}$), which further supports our theoretical findings and implications. That is, in worst-case instances with small $K$ and particularly large $\bar{p}$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.

Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\tilde{\mathcal{O}}(d\sqrt{T})$, and (2) $\tilde{\mathcal{O}}(\sqrt{dT \log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N = \mathcal{O}(e^d)$), the bound $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ is tighter. Conversely, when $N$ is large, $\tilde{\mathcal{O}}(d\sqrt{T})$ becomes the better guarantee.

In the same spirit, our newly established bounds are optimal in the small-$\bar{p}$ regime, while for large $\bar{p}$, the existing bound of $\tilde{\mathcal{O}}(d\sqrt{T})$ remains optimal. We will incorporate this clarification into the final revision.

Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar{p}$ in practice. Importantly, the actual performance depends on the no-click probabilities $\sigma_0(x_{t,i}^\top \theta^\star)$ for items in the selected set $C_t$ (i.e., $i \in C_t$), rather than on the worst-case value $\bar{p} = \sup_x \sigma_0(x^\top \theta^\star)$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar{p}$—the worst-case analysis can be overly conservative.

The following table illustrates that even when $\bar{p}$ is large, the observed regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.

| $K$ | $\bar{p} = 0.37$ | $\bar{p} = 0.74$ | $\bar{p} = 0.95$ |
|---|---|---|---|
| 2 | $1.712 \times 10^{1}$ | $1.884 \times 10^{1}$ | $2.103 \times 10^{1}$ |
| 4 | $3.965 \times 10^{-1}$ | $4.833 \times 10^{-1}$ | $5.183 \times 10^{-1}$ |
| 6 | $1.089 \times 10^{-2}$ | $1.322 \times 10^{-2}$ | $2.942 \times 10^{-2}$ |
| 8 | $4.042 \times 10^{-4}$ | $4.928 \times 10^{-4}$ | $1.147 \times 10^{-3}$ |
| 10 | $1.992 \times 10^{-5}$ | $2.409 \times 10^{-5}$ | $4.615 \times 10^{-5}$ |

Comment

Dear Reviewer Rqrd,

Thanks a lot for reviewing this paper!

It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?

Thanks!

Review (Rating: 5)

This paper considers contextual cascading bandits, where an agent selects ordered lists of $K$ arms from $N$ total arms, receiving sequential feedback until a stopping criterion is met.

The authors propose UCB with online mirror descent for cascading logistic bandits, which requires only $O(1)$ per-round computation, independent of $T$.

This work is the first to prove that longer cascades can improve performance.

Strengths and Weaknesses

Strengths:

  • With the help of OMD subroutine, the proposed algorithm is computationally efficient.
  • The first to prove that longer cascades can improve performance
  • Provides first lower bound derivation for contextual cascading bandits

Weaknesses:

  • There is still a gap between the upper and lower bounds. It would be better if the authors discussed more about how to close the gap in future work.

  • The regret bound increases when $K$ is relatively small, especially when $\bar{p}$ is large.

Questions

Q1. In the description of Algorithm 2 on page 5, there is no explanation of how to choose $i_t^{(1)}$ and $i_t^{(2)}$.

Q2. When $K$ is not large enough, how can you argue that your regret bound is the tightest compared to previous results (on page 2)?

Limitations

Yes

Final Justification

The authors addressed most of my concerns. I will keep my score.

Formatting Issues

NA

Author Response

Thank you for taking the time to review our paper and for your valuable feedback. Here are our responses to each comment and question:

1. Gap between the Upper and Lower Bounds

As the reviewer pointed out, closing the gap between the current upper and lower bounds—i.e., matching $\bar{p}$ and $\underline{p}$—is indeed an important direction for future research. At present, we do not yet know how to achieve minimax optimality in this regime. One potential approach is to pursue instance-dependent regret bounds, as done by Abeille et al. (2021). For example, by defining $p^\star = \sigma_0((x^\star)^\top \theta^\star)$, where $x^\star$ is an optimal arm, one could try to derive $p^\star$-dependent regret bounds instead of relying on worst-case probabilities ($\bar{p}$ and $\underline{p}$). While promising, this line of work is beyond the current scope and we leave it for future work. We thank the reviewer for this valuable and constructive suggestion.

2. Discussion of Regret Bound for Small $K$ and Large $\bar{p}$

Even though our regret bound may increase in the small-$K$ regime with particularly large $\bar{p}$, we emphasize that this is the first regret bound to explicitly capture and improve with increasing $K$, with explicit dependence on both $K$ and $\bar{p}$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of a loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar{p}$ and $\underline{p}$), which further supports our theoretical findings and implications. That is, in worst-case instances with small $K$ and particularly large $\bar{p}$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.

Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\tilde{\mathcal{O}}(d\sqrt{T})$, and (2) $\tilde{\mathcal{O}}(\sqrt{dT \log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N = \mathcal{O}(e^d)$), the bound $\tilde{\mathcal{O}}(\sqrt{dT \log N})$ is tighter. Conversely, when $N$ is large, $\tilde{\mathcal{O}}(d\sqrt{T})$ becomes the better guarantee.

In the same spirit, our newly established bounds become much tighter than existing bounds for contextual cascading bandits as $K$ increases, and are optimal in the large-$K$ regime. We will incorporate this clarification into the final revision.

Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar{p}$ in practice. Importantly, the actual performance depends on the no-click probabilities $\sigma_0(x_{t,i}^\top \theta^\star)$ for items in the selected set $C_t$ (i.e., $i \in C_t$), rather than on the worst-case value $\bar{p} = \sup_x \sigma_0(x^\top \theta^\star)$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar{p}$—the worst-case analysis can be overly conservative.

The following table illustrates that even when $\bar{p}$ is large, the observed cumulative regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.

| $K$ | $\bar{p} = 0.37$ | $\bar{p} = 0.74$ | $\bar{p} = 0.95$ |
|---|---|---|---|
| 2 | $1.712 \times 10^{1}$ | $1.884 \times 10^{1}$ | $2.103 \times 10^{1}$ |
| 4 | $3.965 \times 10^{-1}$ | $4.833 \times 10^{-1}$ | $5.183 \times 10^{-1}$ |
| 6 | $1.089 \times 10^{-2}$ | $1.322 \times 10^{-2}$ | $2.942 \times 10^{-2}$ |
| 8 | $4.042 \times 10^{-4}$ | $4.928 \times 10^{-4}$ | $1.147 \times 10^{-3}$ |
| 10 | $1.992 \times 10^{-5}$ | $2.409 \times 10^{-5}$ | $4.615 \times 10^{-5}$ |

3. Selection Rule for $i_t^{(1)}$ and $i_t^{(2)}$

Thank you for pointing out this omission. The exact rule for choosing the two swap arms is stated on page 7, line 226:

$$
i_t^{(1)} = \arg\max_{i\in C_t} \dot\sigma_1 \bigl(x_{t,i}^{\top}\theta_t\bigr) \lVert x_{t,i}\rVert_{H_t^{-1}}, \qquad i_t^{(2)} = \arg\max_{i\in C_t} \lVert x_{t,i}\rVert_{H_t^{-1}}.
$$

To make this clear in the final version, we will (i) assign an explicit equation number to the formula above, and (ii) add a reference to the equation directly inside Algorithm 2 (page 5). We appreciate your careful reading and for helping us improve the clarity.
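
For completeness, here is a minimal NumPy sketch of this rule (illustrative pseudocode only, assuming $\sigma_1$ is the logistic function; the names `X`, `theta`, and `H_inv` are placeholders for the cascade features $x_{t,i}$, the estimate $\theta_t$, and $H_t^{-1}$, not our actual implementation):

```python
import numpy as np

def select_swap_arms(X, theta, H_inv):
    """Pick the two swap arms from the current cascade C_t per the rule above.

    X     : (K, d) features of the items in C_t
    theta : (d,)   current estimate theta_t
    H_inv : (d, d) inverse Gram matrix H_t^{-1}
    """
    sigma = 1.0 / (1.0 + np.exp(-(X @ theta)))               # sigma_1(x^T theta_t)
    sigma_dot = sigma * (1.0 - sigma)                        # derivative sigma_dot_1
    widths = np.sqrt(np.einsum("kd,de,ke->k", X, H_inv, X))  # ||x||_{H_t^{-1}}
    i1 = int(np.argmax(sigma_dot * widths))  # curvature-weighted uncertainty
    i2 = int(np.argmax(widths))              # raw uncertainty
    return i1, i2
```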

Comment

Dear Reviewer C937,

Thanks a lot for reviewing this paper!

It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?

Thanks!

Comment

Thanks to the authors for their detailed response. I will keep my score. Please make sure to include these discussions in the final version.

Comment

Motivated by some of the reviewer's insightful comments and questions regarding the regret behavior in the large-$\bar{p}$, small-$K$ regime, we are pleased to share that our analysis can be naturally extended to yield even stronger results with only minor modifications.

Specifically, we can show that the regret does not increase with $K$ at all, even in the large-$\bar{p}$, small-$K$ regime, i.e., when $K\bar{p}^{K-1} > 1$. By leveraging standard techniques from prior works [2, 3], we establish that our algorithm (without any modification) also satisfies the instance-independent bound $\tilde{\mathcal{O}}(d\sqrt{T})$. As a result, we obtain the refined instance-dependent upper bound $\mathcal{R}(T)=\tilde{\mathcal O}\bigl(\min\lbrace K\bar{p}^{K-1},1\rbrace \cdot d\sqrt T\bigr)$, along with a matching lower bound of $\Omega \bigl(\min\lbrace K\underline{p}^{K-1},1\rbrace \cdot d\sqrt{T}\bigr)$ up to the $\bar{p}$-versus-$\underline{p}$ gap and logarithmic factors.

Please note that these additional results are obtained through a minor refinement of our existing proof techniques. We provide a brief sketch of the argument below and will include a formal corollary and full proof in the final version of the paper.

1. Supplementary Regret Upper Bound of UCB-CLB

Step 1: Introduce a worst-curvature parameter $\tilde{\theta}_{t,i}$.

We adopt the intermediary parameter $\tilde{\theta}_{t,i}$ inspired by Lee et al. [1] as follows:

$$\tilde{\theta}_{t,i} := \arg\min_{\theta \in \bigcup_{\tau \in [t,T],\, k \in [K]} \mathcal{C}_{\tau,k}^{\mathrm{ON}}(\delta)} \dot{\sigma}_1(x_{t,i}^\top \theta), \quad \forall t \in [T],\ i \in \mathcal{I}.$$

This ensures that the Gram matrix $H_t$ satisfies the following lower bound:

$$H_t \succeq \sum_{\tau=1}^{t-1} \sum_{i \in O_\tau} \dot{\sigma}_1(x_{\tau,i}^\top \tilde{\theta}_{\tau,i})\, x_{\tau,i} x_{\tau,i}^\top + \lambda I_d =: L_t. \tag{1}$$

Step 2: Control the deviation of the derivative.

Combining Lemma E.3 of Lee et al. [1] with our concentration result (Corollary 5.5), we get:

$$\dot{\sigma}_1(x_{t,i}^\top \theta^\star) \le \dot{\sigma}_1(x_{t,i}^\top \tilde{\theta}_{t,i}) + \frac{\beta_{T,K}(\delta)}{4}\, \Vert x_{t,i} \Vert_{H_t^{-1}}. \tag{2}$$

Step 3: Upper bound on the prediction error via the curvature-weighted norm.

Starting from a second-order Taylor expansion and applying Equations (1) and (2), we obtain:

$$\sigma_1(u_{t,i}) - \sigma_1(x_{t,i}^\top \theta^\star) \le 2\beta_T(\delta)\,\dot{\sigma}_1(x_{t,i}^\top \tilde{\theta}_{t,i})\, \Vert x_{t,i} \Vert_{L_t^{-1}} + \frac{7 \beta_{T,K}^2(\delta)}{10}\, \Vert x_{t,i} \Vert_{H_t^{-1}}^2. \tag{3}$$

To simplify notation, we define $\zeta_{t,i} := 2 \beta_T(\delta)\, \dot{\sigma}_1(x_{t,i}^\top \tilde{\theta}_{t,i})\, \Vert x_{t,i} \Vert_{L_t^{-1}}$, which corresponds to the first term on the right-hand side of the inequality.

Step 4: Applying Equation (3) in the Regret Decomposition.

Following the regret decomposition of our Theorem 5.3 and substituting Equation (3) for the bound on $\sigma_1(u_{t,i})-\sigma_1(x_{t,i}^\top \theta^\star)$ leads immediately to the following inequality for the regret $\mathcal{R}(T)$:

$$\mathcal{R}(T) \le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^\top \theta^\star)\, \zeta_{t,i_{tk}} \prod_{m=k+1}^{K} \sigma_0(x_{t,i_{tm}}^\top \theta^\star)\right]}_{\text{term 1}} + \text{second-order term}.$$

Step 5: $K$-free control of term 1.

Directly leveraging the proof technique of Lemma 19 in Liu et al. [2], we obtain the following bound on term 1:

$$\text{term 1} \le \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \frac{\prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^\top \theta^\star)\, \zeta_{t,i_{tk}}^2}{\sigma_0(x_{t,i_{tk}}^\top \theta^\star)\,\sigma_1(x_{t,i_{tk}}^\top \theta^\star)}\right]}.$$

Then, substituting the definition of $\zeta_{t,i}$ and using the fact that $\dot{\sigma}_1(x_{t,i}^\top \tilde{\theta}_{t,i}) \le \dot{\sigma}_1(x_{t,i}^\top \theta^\star)$ holds with high probability, due to the concentration result in Corollary 5.5, yields the following:

$$\text{term 1} \le 2 \beta_{T}(\delta) \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^\top \theta^\star)\, \dot{\sigma}_1(x_{t,i_{tk}}^\top \tilde{\theta}_{t,i_{tk}})\, \Vert x_{t,i_{tk}} \Vert_{L_t^{-1}}^2\right]}.$$
Comment

Step 6: Applying the triggering-probability equivalence technique

Next, a direct application of the triggering-probability equivalence (TPE) technique introduced by Liu et al. [3], which allows us to replace the summation over the cascade $C_t$ weighted by the product of non-click probabilities with an equivalent expectation over the set of observed arms $O_t$, yields:

$$\text{term 1} \le 2 \beta_{T}(\delta) \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in O_t} \dot{\sigma}_1(x_{t,i_{tk}}^\top \tilde{\theta}_{t,i_{tk}})\, \Vert x_{t,i_{tk}} \Vert_{L_t^{-1}}^2\right]} = \tilde{\mathcal{O}}(d\sqrt{T}),$$

where the last equality arises from applying the well-known elliptical potential lemma.

2. Lower Bound of Contextual Cascading Bandits

By the formulation of our instance in Appendix D.1, $\underline{p} := \inf_x \sigma_0(x^\top \theta_v) \le \frac{1}{2}$, hence $K \underline{p}^{K-1}\le 1$ for all $K\ge 1$. It follows that the worst-case regret admits $\Omega \bigl(\min \lbrace K \underline{p}^{K-1}, 1\rbrace \cdot d\sqrt{T}\bigr)$. Together with our upper bound $\tilde{\mathcal O} \bigl(\min\lbrace K\bar{p}^{K-1},1\rbrace \cdot d\sqrt{T}\bigr)$, the bounds match up to the gap between $\underline{p}$ and $\bar{p}$.
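
As a quick sanity check, the inequality $K\underline{p}^{K-1} \le 1$ for $\underline{p} \le \frac{1}{2}$ follows from

$$K\,\underline{p}^{K-1} \le K \left(\tfrac{1}{2}\right)^{K-1} \le 1 \quad \text{for all integers } K \ge 1,$$

since $K \le 2^{K-1}$ holds at $K = 1, 2$ and propagates by induction ($K+1 \le 2K \le 2^{K}$ for $K \ge 1$).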


[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.

[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.

[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.


We hope that these additional results help clarify the improvements in our regret bounds over prior works and address the reviewer’s concerns regarding the strength of our results. We would greatly appreciate it if these clarifications could be considered in your final assessment. Thank you again for your time and effort in serving as a reviewer.

Comment

We sincerely thank all reviewers for their helpful feedback and unanimously positive assessments. We especially appreciate the thoughtful discussions during the review period, which helped clarify and strengthen our work. As we wrap up this rebuttal phase, we would like to highlight the main contributions of our paper.


Summary of Our Work

Cascading bandits have a rich theoretical history and are increasingly used in practice (many mobile social-media recommenders adopt a cascading interface); however, in the contextual setting, there has been a fundamental gap in understanding how the cascade length $K$ governs both algorithms and problem difficulty.

Our paper presents a complete package of results: to our knowledge, it is the first to characterize how regret scales with $K$; it pins down the precise $K$- and $p$-dependence with nearly matching upper and lower bounds; and it delivers a practical, computationally efficient algorithm that exploits this structure. We believe these results close a longstanding gap since contextual cascading bandits were first introduced and provide algorithmic design principles for practical cascade systems (as commonly seen on today's social-media platforms).


Key Contributions

1. Large-cascade-friendly algorithm:

  • We propose UCB-CLB, the first contextual cascading bandit algorithm to show that regret can decrease with larger cascade length $K$, challenging prior beliefs.
  • We identify the fundamental factor governing the regret dynamics: our upper bound explicitly depends on both the cascade length $K$ and the worst-case non-click probability $\bar{p}$.

2. Lower Bound Matching the Upper Bound's Scaling in $K$, $p$, $d$:

  • We provide the first lower bound for contextual cascading bandits, matching our upper bound up to the gap between $\bar{p}$ and $\underline{p}$, indicating that UCB-CLB achieves a nearly tight regret upper bound.

3. Efficiency in computation and cascade selection:

  • Beyond improved regret, UCB-CLB is computationally efficient.
  • UCB-CLB uses OMD for parameter estimation, requiring only $\mathcal{O}(K^2 d^3)$ computation per round — independent of $T$ and significantly lighter than MLE.
  • We also introduce an optimistic expected reward that allows selecting the cascade from only the top-$K$ arms, instead of exhaustively examining all $\binom{N}{K}$ combinations (a minimal sketch of this selection pattern follows below).
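
The sketch below is illustrative only, with placeholder names (the actual optimistic scores come from our confidence bounds, not shown here); it demonstrates how a top-$K$ selection replaces enumeration of $\binom{N}{K}$ subsets:

```python
import numpy as np

def select_cascade(optimistic_scores, K):
    """Form a length-K cascade from N per-arm optimistic scores by taking the
    top-K arms (O(N) via partial partition) and ordering them by score, rather
    than searching all C(N, K) subsets."""
    top_k = np.argpartition(optimistic_scores, -K)[-K:]
    return top_k[np.argsort(optimistic_scores[top_k])[::-1]]

rng = np.random.default_rng(0)
print(select_cascade(rng.random(1000), K=5))  # indices of the 5 highest-scoring arms
```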

Further Improvements upon Reviewers' Feedback

During the review and discussion, insightful questions were raised regarding the regret behavior in the large-$\bar{p}$, small-$K$ regime.

By refining our analysis with standard techniques from [1, 2, 3] and making only minor adjustments to our proofs, we confirm that our regret bound never exceeds $\tilde{\mathcal{O}}(d\sqrt{T})$, even in this regime.

Moreover, we retain the following structure-aware regret upper bound: $\tilde{\mathcal{O}}\bigl(\min\lbrace K\bar{p}^{K-1}, 1\rbrace \cdot d\sqrt{T}\bigr)$, with a matching lower bound $\Omega\bigl(\min\lbrace K\underline{p}^{K-1}, 1\rbrace \cdot d\sqrt{T}\bigr)$.

These results will be included as a formal corollary in the final version.


[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.

[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.

[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.

Final Decision

This paper investigates the role of the cascade length $K$ in contextual cascading bandits. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that the regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which utilizes Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound that can diminish with $K$. They also provide an almost matching lower bound. Experimental results on a real-world dataset are also presented.

As the reviewers have pointed out, this paper has made several major contributions. To the best of our knowledge, this is the first paper proving that a longer cascade length can improve the performance of a contextual cascading bandit. This is important and surprising. Moreover, the proposed algorithm leverages OMD for parameter estimation, which significantly improves computational efficiency compared to more standard MLE-based approaches. Overall, the paper is well-written.

All reviewers are on the acceptance side for this paper. After reading the paper, the reviews, and the discussions, I agree with the reviewers and recommend accepting this paper.