True Impact of Cascade Length in Contextual Cascading Bandits
We show that in contextual cascading bandits, regret vanishes as the cascade length grows, with nearly matching upper and lower bounds.
Abstract
Reviews and Discussion
The cascading bandit problem is widely used in recommendation systems, search engines, and online advertising. In this model, an agent selects an ordered list (cascade) of length $K$ from $N$ options (arms), and users sequentially browse the list, stopping at the first appealing item. Traditional studies suggest that regret—defined as the difference in expected rewards between the optimal and selected cascades—increases or remains constant with larger $K$. However, this contradicts the intuition that longer cascades offer more feedback opportunities, potentially reducing regret. The paper challenges this view, proving that in contextual cascading bandits, regret decreases when $K$ is sufficiently large.
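To make the interaction protocol concrete, here is a minimal simulation sketch assuming the standard logistic click model; all names and the naive policy are illustrative, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def play_cascade(X, theta, cascade):
    """One round: the user scans the ordered list `cascade` and stops at
    the first click; the agent observes feedback up to that position."""
    for pos, arm in enumerate(cascade):
        if rng.random() < sigmoid(X[arm] @ theta):   # click probability
            return 1.0, pos                          # clicked at `pos`
    return 0.0, None                                 # no click among K items

def expected_reward(X, theta, cascade):
    """P(at least one click) = 1 - product of non-click probabilities."""
    p_click = sigmoid(X[list(cascade)] @ theta)
    return 1.0 - np.prod(1.0 - p_click)

# Toy instance: N arms with d-dimensional contexts rescaled to the unit ball.
N, d, K = 20, 5, 4
X = rng.normal(size=(N, d))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

best = np.argsort(-(X @ theta_star))[:K]              # optimal cascade
chosen = rng.choice(N, size=K, replace=False)         # a naive policy's pick
reward, pos = play_cascade(X, theta_star, chosen)
regret = expected_reward(X, theta_star, best) - expected_reward(X, theta_star, chosen)
print(f"observed reward: {reward}, per-round regret: {regret:.3f}")
```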
Strengths and Weaknesses
Strengths
- The paper challenges existing research on the impact of cascade length on regret in contextual cascading bandits, demonstrating that regret decreases when $K$ is sufficiently large. Specifically, the authors propose a new regret upper bound in which $\bar p$ is the upper bound of the feedback probability, $d$ is the dimension of the context vector, and $T$ is the total number of rounds. This bound shows that regret diminishes with increasing $K$ due to the exponential decay of the $\bar p^{K}$ factor.
- Additionally, the authors derive a matching lower bound in which $\underline p$ is the lower bound of the feedback probability. Together, these bounds fully characterize the true trend of regret with respect to $K$, addressing a theoretical gap in contextual cascading bandits.
- The paper introduces UCB-CLB (Upper Confidence Bound for Cascading Logistic Bandits), an algorithm based on Upper Confidence Bound (UCB) that integrates Online Mirror Descent (OMD) and a cascade restructuring technique (DO-SWAP); a schematic sketch follows this list. Operating under a logistic model, UCB-CLB achieves efficient parameter estimation and cascade selection. By avoiding the high computational cost of traditional maximum likelihood estimation (MLE), UCB-CLB significantly improves efficiency while maintaining theoretically optimal regret bounds. The DO-SWAP technique optimizes the exploration-exploitation trade-off by prioritizing items with higher uncertainty in the cascade.
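The following schematic sketches how such a round could be organized. It is not the authors' exact UCB-CLB: the step size `eta`, the confidence radius `beta`, and the final reordering step are placeholder assumptions (the paper's DO-SWAP swaps specific arms, per the rebuttal below).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def one_round(X, theta, H, beta, K, eta=0.1, feedback=None):
    """Schematic round of a UCB-style cascading logistic bandit.
    X: (N, d) contexts; theta: current estimate; H: (d, d) design matrix;
    beta: confidence radius; feedback: (x, y) pairs observed last round."""
    # 1) Online update: one (preconditioned) gradient step per observed item,
    #    instead of refitting an MLE on the whole history each round.
    #    (The design-matrix update for H is omitted for brevity.)
    if feedback:
        for x, y in feedback:
            g = (sigmoid(x @ theta) - y) * x        # logistic-loss gradient
            theta = theta - eta * np.linalg.solve(H, g)
    # 2) Optimistic (UCB) scores: estimate plus uncertainty width.
    H_inv = np.linalg.inv(H)
    widths = np.sqrt(np.einsum("nd,de,ne->n", X, H_inv, X))
    ucb = X @ theta + beta * widths
    # 3) Keep the top-K arms by optimistic score ...
    cascade = list(np.argsort(-ucb)[:K])
    # 4) ... then reorder so more uncertain items appear earlier, which is
    #    the role DO-SWAP plays (the paper's exact swap rule differs).
    cascade.sort(key=lambda i: -widths[i])
    return cascade, theta
```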
Weaknesses
- The theoretical results rely on specific assumptions, such as Assumption 5.1 (bounded norms of feature vectors and parameters) and Assumption 5.3 (a positive lower bound $\kappa$ in the logistic model). These assumptions may not always hold in practice. For instance, a small $\kappa$ (indicating significant deviation from a linear model) could impact algorithm performance, despite claims that UCB-CLB remains effective in such cases. However, the paper does not thoroughly explore how violations of these assumptions affect performance, potentially limiting the generalizability of its theoretical findings.
- The analysis depends on $\bar p$ being sufficiently small, in which case the regret upper bound decreases monotonically with increasing $K$. When $\bar p$ is large, regret may initially increase before decreasing (as the bound peaks at an intermediate $K$). The paper does not adequately address the implications of large $\bar p$ in practical high-feedback-probability scenarios, which may restrict the applicability of its theoretical results.
- Experiments only test cascade lengths $K=5$ and $K=10$, without exploring a broader range of values.
- The UCB-CLB algorithm relies on the DO-SWAP technique to restructure cascades, prioritizing items with higher uncertainty. While effective, this increases algorithmic complexity, potentially introducing additional computational overhead or implementation challenges in practice. The paper does not provide a detailed analysis of DO-SWAP's computational costs or potential failure cases across different scenarios.
Questions
- The UCB-CLB algorithm leverages Online Mirror Descent (OMD) and DO-SWAP techniques to enhance computational efficiency and optimize regret. Can this approach effectively extend to dynamic environments, such as those with time-varying contextual features?
- The paper introduces the combination of OMD and DO-SWAP as a novel proof technique to address the sequential feedback characteristics of cascading bandits. What new challenges might this integration introduce?
Limitations
N/A
Final Justification
The response from the authors resolves my concerns. I will maintain my positive rating.
Formatting Issues
N/A
We sincerely thank the reviewer for the thoughtful and detailed comments on our work. Your feedback has greatly helped us refine and strengthen the paper.
1. Discussion of Regularity Conditions
Firstly, Assumption 5.1 is without loss of generality: any raw feature vector can be rescaled to lie inside the unit ball before learning, and relaxing the parameter norm bound from $1$ to a general constant $S$ merely multiplies the theoretical regret by a $\mathrm{poly}(S)$ factor.
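As a quick illustration of this rescaling argument, the snippet below maps raw features into the unit ball with a single global scaling, which the parameter vector simply absorbs; it is a sketch, not code from the paper.

```python
import numpy as np

def rescale_to_unit_ball(X):
    """Divide all raw features by one global constant so every row lies in
    the unit ball; theta* absorbs the scale, so nothing is lost."""
    max_norm = np.linalg.norm(X, axis=1).max()
    return X / max_norm if max_norm > 1.0 else X
```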
Secondly, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
Lastly, we conduct additional experiments on the MovieLens-100K benchmark: reducing $\kappa$ leads to no noticeable change in cumulative or average regret (see the table below). $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear.
2. Discussion of Large $\bar p$
We first clarify the meaning of a large $\bar p$. Recall that $\bar p$ is defined as an upper bound on the non-click probability; consequently, a large $\bar p$ corresponds to a high-negative-feedback scenario.
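For concreteness, using the $\sigma$ notation from the proof sketch later in this thread ($\sigma_1$ the sigmoid, $\sigma_0 = 1 - \sigma_1$ the non-click probability), a plausible reconstruction of the elided definition—not the paper's verbatim statement—is:

$$
\sigma_1(u) = \frac{1}{1+e^{-u}}, \qquad \sigma_0(u) = 1 - \sigma_1(u), \qquad \bar p \;\ge\; \max_{t,\,i}\ \sigma_0\!\bigl(x_{t,i}^{\top}\theta^\star\bigr),
$$

so a large $\bar p$ means some items are very likely to be skipped, which is exactly the high-negative-feedback regime described above.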
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the larger bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
Hence, the question is not merely which bound is tighter; the focus of this work is on identifying the true scale of the regret in the contextual cascading bandit problem, supported by both upper and lower bounds.
3. Additional Experiments with Larger $K$
Following your suggestion, we run the MovieLens-100K experiment with two larger cascade lengths. As Table 6 shows, cumulative regret remains sub-linear (indicated by the steadily falling average regret $\mathcal{R}(t)/t$). Crucially, the overall scale of the regret diminishes as $K$ increases, exactly reflecting our theoretical claim that regret decreases once the cascade length is sufficiently long.
4. Time Complexity of DO-SWAP
Contrary to your concern, the DO-SWAP subroutine imposes neither significant computational overhead nor implementation difficulty. With a single partial sort over the cascade, our agent selects the two most uncertain arms within the length-$K$ cascade $C_t$. Our MovieLens-100K experiment with $K=5$ (see the table below) supports this claim, showing that DO-SWAP consistently consumes only a small share of the total cumulative runtime even as the round $t$ grows.
Table legend:
- "Total": the mean cumulative runtime of the full algorithm,
- "DO-SWAP": the mean cumulative time spent in the DO-SWAP subroutine,
- "Ratio": the share of DO-SWAP within the total time.

| $t$ | Total | DO-SWAP | Ratio |
|---|---|---|---|
| | | | % |
| | | | % |
| | | | % |
| | | | % |
| | | | % |
5. Applicability to Time-Varying Contextual Features
We would like to clarify that our current framework already supports time-varying contextual features. Section 3.2 (Problem Setting) explicitly models the environment using time-varying contextual features, confirming that our theoretical analysis already accommodates such dynamics.
Accordingly, the OMD + DO-SWAP algorithm is naturally applicable to dynamic environments where contextual features change over time. In our MovieLens-100K experiment, the agent observes time-varying features at every round $t$: a new user arrives, and the agent chooses among $N$ movies. As reported in Section 7 (Numerical Study), UCB-CLB with DO-SWAP achieves sub-linear regret under this setting, showing in practice that the algorithm still performs well even as the contexts change over time.
6. Integrating OMD and DO-SWAP Introduces No Additional Challenges
OMD is used for parameter estimation, while DO-SWAP handles the cascading selection process. These two components operate independently of each other. For example, one could substitute OMD with an alternative estimator such as MLE, and DO-SWAP would remain fully functional without modification.
Thank you for your response, which resolves my concerns. I will maintain my positive rating.
Thank you for your kind response and for maintaining your positive rating. We truly appreciate your time and consideration.
This paper investigates the role of the cascade length $K$ in contextual cascading bandits—a learning framework relevant to search and recommendation systems. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which uses Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound that diminishes with increasing $K$. They also provide the first matching lower bound in the contextual setting, resolving a long-standing open problem about the impact of cascade length.
Strengths and Weaknesses
Strengths
- The paper challenges and overturns a widely held belief in cascading bandits literature—that regret necessarily increases with cascade length. This is an important and surprising contribution.
- The UCB-CLB algorithm leverages OMD for parameter estimation, significantly improving computational efficiency compared to MLE-based approaches.
- The paper provides both upper and lower bounds that match up to logarithmic factors, a rare accomplishment in bandit theory.
- UCB-CLB is shown to be scalable and effective on real-world datasets like MovieLens, and avoids non-convex optimization bottlenecks.
- The paper is generally well-organized, with a thorough theoretical exposition, detailed algorithms, and compelling empirical results.
Weaknesses
- Some results rely on regularity conditions such as bounded features and logistic noise parameters, which may not hold in all real-world deployments.
- While the MovieLens experiments are convincing, broader validation across datasets or task domains would strengthen the claims.
- The lower bound uses a slightly different constant ($\underline p$) than the upper bound's $\bar p$, though the authors do acknowledge and justify this.
Questions
- Clarification on $\bar p$ vs. $\underline p$: Can the authors comment on how large the gap between $\bar p$ and $\underline p$ might be in practice, and whether it's observable in experiments?
- OMD vs MLE Trade-offs: Could the authors elaborate on potential limitations of OMD versus MLE, especially in terms of statistical efficiency in low-noise regimes?
- Impact of Arm Pool Size $N$: Even though the bounds are independent of $N$, does increasing $N$ affect empirical performance (e.g., convergence speed)?
Limitations
yes
Formatting Issues
None observed. The paper follows NeurIPS formatting guidelines.
Thank you for taking the time to review our work. We sincerely appreciate your constructive feedback and provide our detailed responses below.
1. Discussion of Regularity Conditions
Could you kindly clarify what is meant by the "logistic noise parameters"? We assume this refers either to the parameter norm bound or to the curvature parameter $\kappa$. Accordingly, we will address the boundedness of the three relevant quantities: the feature vector $x$, the true parameter $\theta^\star$, and the curvature parameter $\kappa$. Below, we explicitly show that these boundedness assumptions are made for convenience and can be easily relaxed or reformulated as definitions. Therefore, we believe they should not be viewed as weaknesses of our work.
1) Boundedness assumption on $x$ and $\theta^\star$ (Assumption 5.1): We would like to clarify that this assumption is made purely for convenience and cleaner presentation. If $\|x\|$ and $\|\theta^\star\|$ are bounded by a general constant $S$ (rather than 1), the regret bound simply scales with a polynomial factor of $S$, since the confidence parameter scales accordingly.
2) Discussion about $\kappa$ (Assumption 5.3): We would like to highlight that $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.
2. Additional Experiments: KuaiRand-Pure Click Dataset [2]

As the current NeurIPS rebuttal guidelines prohibit the inclusion of figures, we instead present the results in the table below. $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear. Following the reviewer's suggestion, we ran additional experiments with cascade lengths $K=5$ and $K=10$ on the KuaiRand-Pure dataset.
KuaiRand-Pure is a real-world recommendation benchmark derived from the short-video platform Kuaishou; it contains anonymised user–item interaction logs with click indicators, which we adopt as the binary reward signal for our bandit task. Consistent with our theoretical findings, the regret decreases as $K$ increases, confirming that our method scales favourably with longer cascades and remains robust across datasets.
(i) UCB-CLB with $K=5$
(ii) UCB-CLB with $K=10$
[2] Gao, Chongming, et al. "Kuairand: An unbiased sequential recommendation dataset with randomly exposed videos." Proceedings of the 31st ACM international conference on information & knowledge management. 2022.
3. Clarification on $\bar p$ vs. $\underline p$
While our upper and lower bounds differ due to the gap between $\bar p$ and $\underline p$, these quantities are often overly conservative in practice. As a result, the realized regret typically lies between the two—significantly lower than the upper bound and higher than the lower bound—indicating that our algorithm is robust to the worst-case values of $\bar p$ and $\underline p$. To illustrate this, we provide a table below comparing the theoretical upper bound, theoretical lower bound, and the actual cumulative regret observed in our experiment. In the table:
- "Upper Bound" refers to the theoretical regret upper bound,
- "Lower Bound" refers to the theoretical regret lower bound, and
- "Cum. Regret" refers to the cumulative regret actually incurred by our algorithm on the MovieLens 100K dataset.
| $t$ | | | | |
|---|---|---|---|---|
| Upper Bound | | | | |
| Cum. Regret | | | | |
| Lower Bound | | | | |
Currently, we do not yet know how to close the gap between $\bar p$ and $\underline p$ to achieve minimax optimality. One promising direction is to pursue instance-dependent regret bounds, as in Abeille et al. (2021). For example, by defining an analogous quantity at an optimal arm, one could derive instance-dependent regret bounds instead of relying on worst-case quantities like $\bar p$ and $\underline p$. While this is a meaningful direction, it falls outside the scope of the present work, and we leave it for future research. We thank the reviewer for this valuable and constructive suggestion. Moreover, note that our algorithm does not need to observe or know $\bar p$ or $\underline p$ at any point.
4. OMD vs. MLE Trade-offs in Low-noise Regimes
It remains unclear which method—OMD or MLE—is preferable. In a low-noise regime, click feedback is almost deterministic. Under the logistic model this means the parameter norm must be large (i.e., $\|\theta^\star\|$ is large), reflecting a strong signal or a wide parameter space.
Recall that our regret bound (which uses OMD) scales with the parameter norm bound $S$ and deteriorates as $S$ increases. In contrast, for MLE-based approaches, regret bounds free of this norm dependence have been established in the GLM bandit literature (e.g., Lee et al., 2024 [3]). This suggests that a similar norm-free regret might be achievable in our setting by adopting MLE and applying similar techniques. Alternatively, an adaptive warm-up strategy, as proposed by Lee and Oh [4], could also eliminate the dependence on $S$.
Because both MLE and carefully tuned OMD variants can avoid this explicit dependence, deciding which approach is superior—even in low-noise scenarios—remains unresolved. Further work is required to pin down their precise statistical and computational trade-offs.
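To illustrate the computational side of this trade-off, here is a hedged side-by-side sketch (not the paper's code; `eta` and `lam` are illustrative): a batch MLE refit whose per-round cost grows with the history length, versus a single online update whose cost does not.

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mle_refit(X_hist, y_hist, d, lam=1.0):
    """Batch MLE: re-solves a convex problem over the FULL history every
    round, so the per-round cost grows with t."""
    def nll(theta):
        z = X_hist @ theta
        return -np.sum(y_hist * z - np.logaddexp(0.0, z)) + 0.5 * lam * theta @ theta
    return minimize(nll, np.zeros(d), method="L-BFGS-B").x

def omd_step(theta, x, y, H, eta=0.1):
    """Online step: one preconditioned gradient update on the newest
    observation only, so the per-round cost is independent of t."""
    g = (sigmoid(x @ theta) - y) * x
    return theta - eta * np.linalg.solve(H, g)
```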
[3] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[4] Lee, Joongkyu, and Min-hwan Oh. "Improved Online Confidence Bounds for Multinomial Logistic Bandits." Forty-second International Conference on Machine Learning.
5. Impact of Arm Pool Size $N$
Our regret bound—and the algorithm in practice—is $N$-independent; to substantiate this, we conduct additional experiments. Using the MovieLens-100K benchmark with a fixed cascade length, we vary the arm pool size $N$ while keeping the same optimal arm across all settings to ensure a fair comparison. The resulting cumulative regret $\mathcal{R}(t)$ is reported in the table below.
Across all horizons, the largest gap in $\mathcal{R}(t)$ between any two settings of $N$ is a small fraction of the regret scale in those rows. Thus, even as $N$ increases across the tested range, the regret remains almost unchanged, supporting our theoretical result that regret does not depend on $N$.
| $t$ | small $N$ | medium $N$ | large $N$ |
|---|---|---|---|
| | 0.0233 | 0.0245 | 0.0307 |
| | 0.0356 | 0.0342 | 0.0409 |
| | 0.0432 | 0.0421 | 0.0492 |
| | 0.0506 | 0.0510 | 0.0581 |
| | 0.0593 | 0.0612 | 0.0683 |
Dear Reviewer 3rPH,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
Thank you for your response, which has clarified my concerns. I will maintain my original positive evaluation.
This paper addresses a long-standing and misunderstood problem in the field of recommendation systems: how the length of a recommended list (the cascade length $K$) affects learning performance.
Strengths and Weaknesses
Strengths: They propose a new Upper Confidence Bound (UCB) algorithm for cascading logistic bandits. It uses Online Mirror Descent (OMD) for parameter estimation, which makes it computationally efficient and scalable, with per-round costs that are independent of the total time horizon. Notably, they achieve a regret bound that decreases for large $K$, in contrast to prior work.
Weaknesses:
- They assume a lower bound for the variances. This assumption is not so standard. Could the authors discuss this assumption: why is it needed, and could their analysis degrade to deterministic settings where the variances are zero?
Questions
- In the experiments, the $K$ values are only 5 and 10. Could the authors try a larger $K$, such as 50 or 100? Also, it would be interesting and convincing if they could plot regret against a range of $K$ values to validate their theories.
- For $\bar p$ in the bound: $\bar p$ is small when all the arms in the space have a relatively high logit. This seems a little strict, since there may exist $x$ such that $x^\top\theta$ is very small. In this case, the bound only decreases when $K$ is sufficiently large. Could the authors discuss more about the limitations of their result?
Limitations
no
Final Justification
The authors solve my questions, so I will raise my score accordingly.
Formatting Issues
no
We appreciate your thoughtful feedback on our work. Below, we provide detailed responses to each comment and question.
1. Variance Lower-Bound Assumption
Before addressing the reviewer’s concern, we would first like to clarify the intended meaning of “variance” in the review. We assume the reviewer is referring to the parameter $\kappa$. In that case, we strongly believe that the existence of a non-zero $\kappa$ should NOT be viewed as a weakness of our paper, since $\kappa$ is always non-zero as long as the features and parameter vectors are bounded. In the logistic bandit literature, it is standard practice to define $\kappa$ explicitly—rather than assume it—based on problem parameters (e.g., Equation (2) of Faury et al., 2020 [1]). Therefore, we believe it is more appropriate to formally define $\kappa$ rather than assume its non-zeroness. For clarity, we will revise Assumption 5.3 to be a Definition in our final version.
To further elaborate on $\kappa$: this parameter captures the degree of non-linearity of the sigmoid function. Specifically, $\kappa$ becomes smaller when the agent selects actions across a wider range of the sigmoid. As such, $\kappa$ quantifies the difficulty of learning induced by the non-linearity of the logistic model. This dependence is not unique to our work—it has been unavoidable in much of the existing literature (Faury et al., 2020, 2021, 2022; Lee and Oh, 2024, 2025; Vial et al., 2022; Liu et al., 2024), whether it appears in the leading term or in non-leading constants of the regret bounds.
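For reference, one common way to formalize this, in the lower-bound-on-the-derivative convention that this thread's Assumption 5.3 suggests (Faury et al., 2020 work with its reciprocal), is:

$$
\kappa \;:=\; \inf_{\lVert x\rVert \le 1,\ \lVert\theta\rVert \le S} \dot{\sigma}_1\!\bigl(x^{\top}\theta\bigr) \;>\; 0,
$$

which is strictly positive whenever features and parameters are bounded, since $\dot{\sigma}_1(u) = \sigma_1(u)\,\sigma_0(u) \ge \dot{\sigma}_1(S)$ for all $|u| \le S$.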
In response to the reviewer’s question regarding the deterministic setting: such a setting corresponds to a degenerate case where each item is either (a) always clicked or (b) never clicked. Neither case is interesting from a learning or decision-making perspective. In case (a), the learner could simply select any $K$ items at random and incur zero regret, since every selected item leads to a click. In case (b), the learner also incurs zero regret, regardless of the selection, as the optimal reward is itself zero. Thus, the deterministic setting is uninformative and trivial in terms of both exploration and regret analysis.
[1] Faury, Louis, et al. "Improved optimistic algorithms for logistic bandits." International Conference on Machine Learning. PMLR, 2020.
2. Additional Experiments: Larger $K$
In accordance with this year’s NeurIPS rebuttal guidelines, we are not permitted to include plots; thus, we present the results in tabular form below. $\mathcal{R}(t)$ denotes the cumulative regret up to round $t$, whereas $\mathcal{R}(t)/t$ is the average regret at round $t$. We report $\mathcal{R}(t)/t$ to check whether cumulative regret is sublinear.
As per the reviewer's request, we have conducted additional MovieLens benchmark experiments with larger $K$ to evaluate the robustness of our method. Consistent with our theoretical findings, the regret decreases as $K$ increases, demonstrating that our method scales favorably with the cascade length. The scale difference is substantial: for every round in the table, $\mathcal{R}(t)$ at the largest $K$ is many times smaller than at the smallest $K$.
3. Large $\bar p$ Case
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
In the same spirit, our newly established bounds are optimal in the small-$\bar p$ regime, while for large $\bar p$, the existing bounds remain optimal. We will incorporate this clarification into the final revision.
Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar p$ in practice. Importantly, the actual performance depends on the no-click probabilities of the items in the selected set (i.e., $\sigma_0(x_{t,i}^\top\theta^\star)$ for $i \in C_t$), rather than on the worst-case value $\bar p$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar p$—the worst-case analysis can be overly conservative.
The following table illustrates that even when $\bar p$ is large, the observed regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.
Dear Reviewer Rqrd,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
This paper considers contextual cascading bandits, where an agent selects ordered lists of $K$ arms from $N$ total arms, receiving sequential feedback until a stopping criterion is met.
The authors propose a UCB algorithm with online mirror descent for cascading logistic bandits, which requires only lightweight per-round computation, independent of $T$.
This work is the first to prove that longer cascades can improve performance.
Strengths and Weaknesses
Strengths:
- With the help of the OMD subroutine, the proposed algorithm is computationally efficient.
- It is the first to prove that longer cascades can improve performance.
- It provides the first lower bound derivation for contextual cascading bandits.
Weaknesses:
- There is still a gap between the upper and lower bounds. It would be better if the authors discussed how to close this gap in future work.
- The regret bound increases when $K$ is relatively small, especially when $\bar p$ is large.
Questions
Q1. In the description of Algorithm 2 on page 5, there is no explanation of how to choose $i_t^{(1)}$ and $i_t^{(2)}$.
Q2. When $K$ is not large enough, how can you argue that your regret bound is the tightest compared to previous results (on Page 2)?
Limitations
Yes
Final Justification
The authors addressed most of my concerns. I will keep my score.
Formatting Issues
NA
Thank you for your time to review our paper and for your valuable feedback. Here are our responses to each comment and question:
1. Gap between the Upper and Lower Bounds
As the reviewer pointed out, closing the gap between the current upper and lower bounds—i.e., matching $\bar p$ and $\underline p$—is indeed an important direction for future research. At present, we do not yet know how to achieve minimax optimality in this regime. One potential approach is to pursue instance-dependent regret bounds, as done by Abeille et al. (2021). For example, by defining an analogous quantity at an optimal arm, one could try to derive instance-dependent regret bounds instead of relying on worst-case probabilities ($\bar p$ and $\underline p$). While promising, this line of work is beyond the current scope and we leave it for future work. We thank the reviewer for this valuable and constructive suggestion.
2. Discussion of Regret Bound for Small $K$ and Large $\bar p$
Even though our regret bound may increase with $K$ in the small-$K$ regime when $\bar p$ is particularly large, we emphasize that this is the first regret bound to explicitly capture, and improve with, increasing $K$, with explicit dependence on both $K$ and $\bar p$. We believe this constitutes a significant and meaningful contribution to the community. This theoretical result is NOT just an artifact of loose analysis: we establish a lower bound of the same scale (modulo the difference between $\bar p$ and $\underline p$), which further supports our theoretical findings and implications. That is, in the worst-case scenario, in the small-$K$ regime with a particularly large $\bar p$, the regret may initially increase with $K$ but eventually decreases once $K$ becomes sufficiently large, although such scenarios may not appear in practice. We sincerely believe that this finding should NOT be considered a weakness. It is rather the nature of the problem, which our work presents for the first time as the true scaling.
Furthermore, we would like to highlight that in the bandit literature, it is common for different regret bounds to coexist, with each being tighter under different conditions. For example, in linear bandits, two widely recognized optimal regret bounds are: (1) $\widetilde{O}(d\sqrt{T})$, and (2) $\widetilde{O}(\sqrt{dT\log N})$, which additionally depends on the number of arms $N$. When the number of arms is relatively small (e.g., $N \le 2^{d}$), the $\widetilde{O}(\sqrt{dT\log N})$ bound is tighter. Conversely, when $N$ is large, $\widetilde{O}(d\sqrt{T})$ becomes the better guarantee.
In the same spirit, our newly established bounds become much tighter than existing bounds for contextual cascading bandits as $K$ increases, and are optimal in the large-$K$ regime. We will incorporate this clarification into the final revision.
Additionally, we conduct an experiment to demonstrate the robustness of our algorithm to the value of $\bar p$ in practice. Importantly, the actual performance depends on the no-click probabilities of the items in the selected set (i.e., $\sigma_0(x_{t,i}^\top\theta^\star)$ for $i \in C_t$), rather than on the worst-case value $\bar p$ used in the regret analysis. As a result, the realized regret does NOT typically scale with $\bar p$—the worst-case analysis can be overly conservative.
The following table illustrates that even when $\bar p$ is large, the observed cumulative regret decreases monotonically as $K$ increases, highlighting the practical stability of our method.
3. Selection Rule for $i_t^{(1)}$ and $i_t^{(2)}$
Thank you for pointing out this omission. The exact rule for choosing the two swap arms is stated on page 7, line 226:
$$
i_t^{(1)} = \arg\max_{i\in C_t}\ \dot\sigma_1\bigl(x_{t,i}^{\top}\theta_t\bigr)\,\lVert x_{t,i}\rVert_{H_t^{-1}}, \qquad i_t^{(2)} = \arg\max_{i\in C_t}\ \lVert x_{t,i}\rVert_{H_t^{-1}}.
$$
To make this clear in the final version, we will (i) assign an explicit equation number to the formula above, and (ii) add a reference to the equation directly inside Algorithm 2 (page 5). We appreciate your careful reading and for helping us improve the clarity.
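As an illustration, the displayed rule transcribes directly into a few lines of code. This is a sketch assuming a precomputed $H_t^{-1}$; the function and variable names are ours, not the paper's.

```python
import numpy as np

def dot_sigma(u):
    """Derivative of the sigmoid: sigma(u) * (1 - sigma(u))."""
    s = 1.0 / (1.0 + np.exp(-u))
    return s * (1.0 - s)

def do_swap_pair(X, theta, H_inv, cascade):
    """Pick the two swap arms of the displayed rule inside the cascade C_t:
    a curvature-weighted uncertainty argmax and a plain uncertainty argmax."""
    Xc = X[list(cascade)]
    widths = np.sqrt(np.einsum("kd,de,ke->k", Xc, H_inv, Xc))  # ||x||_{H^{-1}}
    i1 = cascade[int(np.argmax(dot_sigma(Xc @ theta) * widths))]
    i2 = cascade[int(np.argmax(widths))]
    return i1, i2
```

Both argmax scans touch only the $K$ items of $C_t$, consistent with the lightweight-cost claim in the rebuttal above.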
Dear Reviewer C937,
Thanks a lot for reviewing this paper!
It seems that you have not responded to the authors' rebuttal. Given there are fewer than three days remaining for the author-reviewer discussion period, could you respond to it soon?
Thanks!
Thanks the authors for their detailed response. I will keep my score. Please make sure to include these discussions in the final version.
Motivated by some of the reviewers’ insightful comments and questions regarding the regret behavior in the large-$\bar p$, small-$K$ regime, we are pleased to share that our analysis can be naturally extended to yield even stronger results with only minor modifications.
Specifically, we can show that the regret does not increase with $K$ at all, even in the large-$\bar p$, small-$K$ regime. By leveraging standard techniques from prior works [2, 3], we establish that our algorithm (without any modification) also satisfies an instance-independent bound that does not grow with $K$. As a result, we obtain a refined upper bound given by the minimum of the two bounds, along with a matching lower bound up to the gap between $\bar p$ and $\underline p$ and logarithmic factors.
Please note that these additional results are obtained through a minor refinement of our existing proof techniques. We provide a brief sketch of the argument below and will include a formal corollary and full proof in the final version of the paper.
1. Supplementary Regret Upper Bound of UCB-CLB
Step 1: Introduce a worst-curvature intermediary parameter.
We adopt the intermediary parameter $\tilde{\theta}_{t,i}$ inspired by Lee et al. [1]. This ensures that the Gram matrix $H_t$ satisfies the following lower bound:
$$
H_t \succeq \sum_{\tau=1}^{t-1} \sum_{i \in O_\tau} \dot{\sigma}_1(x_{\tau,i}^{\top} \tilde{\theta}_{\tau,i})\, x_{\tau,i} x_{\tau,i}^{\top} + \lambda I_d =: L_t. \tag{1}
$$
Step 2: Control the deviation of the derivative.
Combining Lemma E.3 of Lee et al. [1] with our concentration result (Corollary 5.5), we get:
$$
\dot{\sigma}_1(x_{t,i}^{\top} \theta^\star) \le \dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) + \frac{\beta_{T,K}(\delta)}{4} \Vert x_{t,i} \Vert_{H_t^{-1}}. \tag{2}
$$
Step 3: Upper bound on the prediction error via the curvature-weighted norm.
Starting from a second-order Taylor expansion and applying Equations (1) and (2), we obtain the following:
$$
\sigma_1(u_{t,i}) - \sigma_1(x_{t,i}^{\top} \theta^\star) \le 2\beta_T(\delta)\,\dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) \Vert x_{t,i} \Vert_{L_t^{-1}} + \frac{7 \beta_{T,K}^2(\delta)}{10} \Vert x_{t,i} \Vert_{H_t^{-1}}^2. \tag{3}
$$
To simplify notation, we define $\zeta_{t,i} := 2\beta_T(\delta)\,\dot{\sigma}_1(x_{t,i}^{\top} \tilde{\theta}_{t,i}) \Vert x_{t,i} \Vert_{L_t^{-1}}$, which corresponds to the first term on the right-hand side of the inequality.
Step 4: Applying Equation (3) in the regret decomposition.
Following the regret decomposition of our Theorem 5.3 and substituting Equation (3) for the per-item prediction error leads immediately to the following inequality for the regret $\mathcal{R}(T)$:
$$
\mathcal{R}(T) \le \underbrace{\mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^{\top} \theta^\star)\, \zeta_{t,i_{tk}} \prod_{m=k+1}^{K} \sigma_0(x_{t,i_{tm}}^{\top} \theta^\star)\right]}_{\text{term 1}} + (\text{2nd-order term}).
$$
Step 5: $K$-free control of term 1.
Directly leveraging the proof technique of Lemma 19 in Liu et al. [2], we obtain the bound on term 1 as follows:
$$
\text{term 1} \le \sqrt{T \cdot \mathbb{E}\left[\sum_{t=1}^{T} \sum_{i_{tk} \in C_t} \frac{\prod_{l=1}^{k-1} \sigma_0(x_{t,i_{tl}}^{\top} \theta^\star)\, \zeta_{t,i_{tk}}^2}{\sigma_0(x_{t,i_{tk}}^{\top} \theta^\star)\,\sigma_1(x_{t,i_{tk}}^{\top} \theta^\star)}\right]}.
$$
Then, substituting the definition of $\zeta_{t,i}$ and using a high-probability inequality that follows from the concentration result in Corollary 5.5 yields the intermediate bound.
Step 6: Applying the triggering-probability equivalence technique.
Next, a direct application of the triggering-probability equivalence (TPE) technique introduced by Liu et al. [3]—which allows us to replace the summation over the cascade $C_t$, weighted by the product of non-click probabilities, with an equivalent expectation over the set of observed arms $O_t$—yields the final bound, where the last equality arises from applying the well-known elliptical potential lemma.
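For reference, the elliptical potential lemma invoked here is standard (e.g., Lemma 11 of Abbasi-Yadkori et al., 2011): for any sequence $x_1, \dots, x_T$ with $\lVert x_t \rVert \le 1$ and $V_t = \lambda I_d + \sum_{\tau \le t} x_\tau x_\tau^{\top}$,

$$
\sum_{t=1}^{T} \min\Bigl\{1,\ \lVert x_t \rVert_{V_{t-1}^{-1}}^{2}\Bigr\} \;\le\; 2d \log\Bigl(1 + \frac{T}{d\lambda}\Bigr).
$$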
2. Lower Bound of Contextual Cascading Bandits
By the formulation of our instance in Appendix D.1, it follows that the worst-case regret admits a lower bound of the same scale. Together with our upper bound, the two match up to the gap between $\bar p$ and $\underline p$.
[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.
[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.
We hope that these additional results help clarify the improvements in our regret bounds over prior works and address the reviewer’s concerns regarding the strength of our results. We would greatly appreciate it if these clarifications could be considered in your final assessment. Thank you again for your time and effort in serving as a reviewer.
We sincerely thank all reviewers for their helpful feedback and unanimously positive assessments. We especially appreciate the thoughtful discussions during the review period, which helped clarify and strengthen our work. As we wrap up this rebuttal phase, we would like to highlight the main contributions of our paper.
Summary of Our Work
Cascading bandits have a rich theoretical history and are increasingly used in practice (many mobile social-media recommenders adopt a cascading interface); however, in the contextual setting, there has been a fundamental gap in understanding how the cascade length $K$ governs both algorithms and problem difficulty.
Our paper presents a complete package of results: to our knowledge, it is the first to characterize how regret scales with $K$; it pins down the precise $K$- and $\bar p$-dependence with nearly matching upper and lower bounds; and it delivers a practical, computationally efficient algorithm that exploits this structure. We believe these results close a longstanding gap that has existed since contextual cascading bandits were first introduced, and they provide algorithmic design principles for practical cascade systems (as commonly seen on today’s social-media platforms).
Key Contributions
1. Large-cascade-friendly algorithm:
- We propose UCB-CLB, the first contextual cascading bandit algorithm to show that regret can decrease with larger cascade length $K$, challenging prior beliefs.
- We identify the fundamental factors behind the regret dynamics. Our upper bound explicitly depends on both the cascade length $K$ and the worst-case non-click probability $\bar p$.
2. Lower Bound Matching the Upper Bound's Scaling in $K$:
- We provide the first lower bound for contextual cascading bandits, matching our upper bound up to the gap between $\bar p$ and $\underline p$, indicating that UCB-CLB achieves a nearly tight regret upper bound.
3. Efficiency in computation and cascade selection:
- Beyond improved regret, UCB-CLB is computationally efficient.
- UCB-CLB uses OMD for parameter estimation, requiring only lightweight computation per round — independent of $T$ and significantly lighter than MLE.
- We also introduce an optimistic expected reward that allows selecting the cascade from only the top-$K$ arms, instead of exhaustively examining all combinations (see the identity below).
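This works because, under the standard cascading objective, the expected reward

$$
R(C) \;=\; 1 - \prod_{i \in C}\bigl(1 - \sigma_1(x_i^{\top}\theta)\bigr)
$$

is order-independent and monotone increasing in each item's click probability, so maximizing an optimistic version of it over all $\binom{N}{K}$ subsets reduces to keeping the $K$ arms with the largest optimistic scores. (This is a property of the model class; the paper's exact optimistic reward may differ in details.)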
Further Improvements upon Reviewers' Feedback
During the review and discussion, insightful questions were raised regarding the regret behavior in the large-$\bar p$, small-$K$ regime.
By refining our analysis with standard techniques from [1, 2, 3] and only minor adjustments to our proofs, we confirm that our regret bound never exceeds a $K$-independent baseline, even in this regime.
Moreover, we retain a structure-aware regret upper bound together with a matching lower bound, up to the gap between $\bar p$ and $\underline p$ and logarithmic factors.
These results will be included as a formal corollary in the final version.
[1] Lee, Junghyun, Se-Young Yun, and Kwang-Sung Jun. "A Unified Confidence Sequence for Generalized Linear Models, with Applications to Bandits." Advances in Neural Information Processing Systems 37 (2024): 124640-124685.
[2] Liu, Xutong, et al. "Batch-size independent regret bounds for combinatorial semi-bandits with probabilistically triggered arms or independent arms." Advances in Neural Information Processing Systems 35 (2022): 14904-14916.
[3] Liu, Xutong, et al. "Contextual combinatorial bandits with probabilistically triggered arms." International Conference on Machine Learning. PMLR, 2023.
This paper investigates the role of the cascade length $K$ in contextual cascading bandits. Contrary to previous works suggesting that regret increases with $K$, the authors theoretically and empirically demonstrate that the regret can decrease as $K$ grows large. They propose a novel UCB-based algorithm, UCB-CLB, which utilizes Online Mirror Descent (OMD) for efficient parameter estimation, achieving a regret upper bound that can diminish with $K$. They also provide an almost matching lower bound. Experimental results on a real-world dataset are also demonstrated.
As the reviewers have pointed out, this paper has made several major contributions. To the best of our knowledge, this is the first paper proving that a longer cascade length can improve the performance of a contextual cascading bandit. This is important and surprising. Moreover, the proposed algorithm leverages OMD for parameter estimation, which significantly improves computational efficiency compared to more standard MLE-based approaches. Overall, the paper is well-written.
All reviewers are on the acceptance side for this paper. After reading the paper, the reviews, and the discussions, I agree with the reviewers and recommend accepting this paper.