Combinatorial Reinforcement Learning with Preference Feedback
We study combinatorial reinforcement learning with multinomial logistic (MNL) preference feedback and propose a computationally efficient algorithm that establishes the first regret bound in this framework.
Abstract
Reviews and Discussion
This paper studies combinatorial reinforcement learning (Combinatorial RL) under Multinomial Logit (MNL) preference feedback, where the agent selects a subset of items (an "assortment") at each step, and user choices follow an MNL model. Unlike traditional MNL bandits, which optimize single-step rewards, this work extends the problem to reinforcement learning (RL) settings, where decisions impact long-term rewards through state transitions.
Questions for Authors
- The action space in combinatorial reinforcement learning grows exponentially, even with Online Sensitivity Sub-sampling. Can the proposed method efficiently compute the optimal policy? While the paper claims to avoid exponential computation, does it truly achieve polynomial-time optimization?
- Your method is designed for recommendation systems, yet all experiments use synthetic data. Why did you not test MNL-VQL on real-world datasets like MovieLens or Amazon Reviews?
- The paper claims Online Sensitivity Sub-sampling improves efficiency. What is the formal computational complexity of MNL-VQL?
- How well does MNL-VQL scale to large action spaces?
Claims and Evidence
Most of the paper’s key claims are well-supported by theoretical analysis and synthetic experiments. Some claims, particularly those regarding computational efficiency and variance-weighting, require additional empirical validation.
Methods and Evaluation Criteria
The proposed methods are well-motivated and aligned with the problem setting. While Online Sensitivity Sub-sampling is proposed to reduce computation, there is no formal complexity analysis comparing its efficiency with baselines like LSVI-UCB.
Theoretical Claims
The theoretical claims and proofs are mostly correct, with well-supported regret bounds and sound reasoning behind optimism-pessimism switching.
Some extensions (e.g., the generalized Eluder dimension) require further justification or comparison, for instance comparing regret bounds obtained with the standard versus the generalized Eluder dimension, since it is unclear whether this extension significantly improves regret bounds compared to existing Eluder dimension results.
Experimental Design and Analysis
- No empirical validation for the optimistic-pessimistic strategy; perhaps add an ablation study comparing optimistic-only, pessimistic-only, and alternating strategies.
- The experiments only use synthetic data; the authors should also run experiments on real-world datasets to test robustness under realistic conditions.
- The action space in the experiments seems relatively small compared to real-world combinatorial RL settings.
- The paper introduces multiple novel techniques (e.g., variance-weighted Q-learning, Online Sensitivity Sub-sampling, alternating optimism-pessimism) but does not test their individual contributions. Does each component improve performance, or is the gain from just one or two?
- The results lack confidence intervals or statistical significance tests.
Supplementary Material
The supplementary material consists of code, which I did not check thoroughly.
Relation to Broader Scientific Literature
The work extends MNL bandits (Multinomial Logit Bandits), a well-studied framework for recommendation systems and assortment optimization. Instead of focusing on single-step rewards (as in MNL bandits), the paper extends the model to long-term reinforcement learning settings.
The paper builds on literature in function approximation for RL, particularly when Q-values are learned in complex, structured action spaces.
Essential References Not Discussed
The paper generalizes Eluder dimension but does not compare its new regret bounds with prior function approximation methods in RL.
The paper proposes Online Sensitivity Sub-sampling but does not compare it with deep RL’s structured exploration techniques.
Other Strengths and Weaknesses
Other Strengths:
- The use of preference feedback in long-term decision-making is a meaningful contribution, as most prior MNL studies focus on single-step decision-making (bandits) rather than RL settings.
- The paper establishes tight upper and lower regret bounds.
Other Weaknesses:
- The experiments are conducted only in a synthetic online shopping environment, without real-world datasets.
- The results largely build on existing work on combinatorial RL/bandits under choice models and on RL with function approximation. How does this approach go beyond simply combining prior methods?
- The paper lacks intuitive explanations for certain formulas, for example: (a) Why is the generalized Eluder dimension more suitable for this problem than the standard Eluder dimension? (b) How do the key mathematical principles of variance-weighted Q-learning influence its convergence?
Other Comments or Suggestions
- The proof outlines failure cases where a purely optimistic approach can lead to biased Q-value estimation due to state transitions. The theoretical argument is strong, but an empirical ablation study would reinforce this claim.
- The model assumes fixed user preferences, which is unrealistic in many real-world settings. Many RL-based recommenders address evolving user interests, but this paper does not discuss or experiment with preference drift.
- The condition for switching between optimistic and pessimistic utilities is not clearly motivated. Add intuition behind this condition.
- Algorithm 1 pseudocode: the step descriptions are too compact to follow easily.
Thank you very much for your positive review! Below, we provide our responses to your comments and concerns.
Experiment
- Additional real-world data experiments: In response to the reviewer’s suggestion, we conducted experiments on the real-world MovieLens dataset (Harper & Konstan, 2015). Due to space limitations, we refer the reader to our response to reviewer ofJq for details on the experimental setup.
- A link to the results is provided here: [Link]
- Request for additional ablations: While we appreciate the reviewer’s suggestion regarding ablation studies, we believe they may NOT be essential, for the following reasons:
  - Optimistic- (or pessimistic-) only variants: Given the theoretical nature of our work, we believe it is sufficient to empirically evaluate algorithms that are either provably guaranteed or previously proposed, such as LSVI-UCB and Myopic (OFU-MNL+). However, if the reviewer strongly believes that ablation studies on "optimistic-only" or "pessimistic-only" variants are necessary to support our theoretical findings, we are open to conducting these experiments and will include the results in our final response.
  - Variance-weighted Q-learning / Online sensitivity sub-sampling: We believe there may be a misunderstanding: these techniques are NOT original contributions of our work. As stated in Section 4 and Appendix B, we adopted them from prior literature (e.g., Agarwal et al., 2023; Kong et al., 2021), adapting them to our problem setting. Therefore, we see no strong motivation to perform ablation studies specifically on these components.
- Confidence intervals: In Figure 1, the shaded area represents the standard deviation.
Computational efficiency
Our algorithm enjoys polynomial computational complexity with respect to the number of items, making it highly efficient both computationally and statistically.
The per-episode computational complexity consists of two terms: the first arises from the assortment selection (see Remark 4.2), and the second comes from the online sensitivity sub-sampling procedure (see Proposition B.2). For further details regarding the computational cost of the sub-sampling, please refer to Theorem 1 in Kong et al. (2021), as space is limited here.
Moreover, we'd like to clarify that the computational cost of the online sensitivity sub-sampling does NOT scale with the action space size nor with the number of items. (The generalized Eluder dimension may depend on the number of items, though not on the action space size.) Therefore, even if the action space grows exponentially, the method remains highly effective.
Additionally, we reported the runtime per round in Table G.1 (and provided results for larger item sets at [Link]), which shows that the runtime of our algorithm indeed scales polynomially with the number of items, and is significantly more efficient than LSVI-UCB, whose computational cost scales with the exponentially large combinatorial action space.
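For illustration, below is a minimal sketch of LP-based assortment selection under an MNL choice model, using the standard cardinality-constrained formulation. This is a simplified stand-alone example (all names are illustrative, the optimistic/pessimistic utility switching is omitted, and it is not our actual implementation):

```python
import numpy as np
from scipy.optimize import linprog

def select_assortment_lp(values, utilities, max_size):
    """Choose an assortment maximizing the expected value under an MNL model.

    values:    length-(N+1) array; values[0] is the (possibly nonzero) outside-option value.
    utilities: length-(N+1) array of MNL utilities; utilities[0] is the outside option.
    max_size:  cardinality constraint on the assortment.
    """
    v = np.exp(utilities)                       # MNL attraction weights
    N = len(values) - 1
    c = -np.asarray(values, dtype=float)        # linprog minimizes, so negate to maximize

    # Decision variables w[0..N] are the choice probabilities induced by the assortment.
    A_eq = np.ones((1, N + 1))                  # probabilities sum to one
    b_eq = np.array([1.0])

    A_ub = np.zeros((N + 1, N + 1))
    for i in range(1, N + 1):                   # w[i]/v[i] <= w[0]/v[0] for every offered item
        A_ub[i - 1, i] = 1.0 / v[i]
        A_ub[i - 1, 0] = -1.0 / v[0]
    A_ub[N, 1:] = 1.0 / v[1:]                   # cardinality: sum_i w[i]/v[i] <= max_size * w[0]/v[0]
    A_ub[N, 0] = -max_size / v[0]
    b_ub = np.zeros(N + 1)

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (N + 1), method="highs")
    return {i for i in range(1, N + 1) if res.x[i] > 1e-9}

rng = np.random.default_rng(0)
print(select_assortment_lp(rng.uniform(size=6), rng.normal(size=6), max_size=3))
```

Solving this LP is polynomial in the number of items, which is the property the complexity discussion above relies on; extracting the assortment from a possibly non-vertex optimal solution is glossed over in this sketch.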
Technical novelties
Our key technical contributions are as follows:
- A novel assortment selection strategy that alternates between optimistic and pessimistic utility estimates.
- The first to incorporate both unknown item values (rewards) and nonzero outside-option item values in the MNL choice model.
- An improvement over the naïve summation of bandit regrets.
- A fine-grained regret analysis for general function approximation.
For further details, please refer to Section 5.2.
Generalized Eluder dimension
The generalized Eluder dimension is applicable to weighted regression, while the standard Eluder dimension applies only to unweighted regression. Thus, the generalized Eluder dimension is the appropriate choice for our setting. If we were to use unweighted regression with the standard Eluder dimension, it is straightforward to see that this would yield a looser regret bound for general function approximation, similar to what arises in the (unweighted) LSVI algorithm of Wang et al. (2020). For a detailed comparison between the two Eluder dimensions, we refer the reader to Theorem 4.6 in Zhao et al. (2023), as mentioned in Line 184.
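For readers less familiar with the weighted notion, one common form of the generalized (weighted) Eluder dimension, following Agarwal et al. (2023) (notation simplified here), is

$$
\dim_{\alpha,K}(\mathcal{F}) = \sup_{\substack{z_1,\dots,z_K \\ \sigma_1,\dots,\sigma_K \ge \alpha}} \sum_{k=1}^{K} \min\left(1, \frac{1}{\sigma_k^2} D^2_{\mathcal{F}}\big(z_k; z_{[k-1]}, \sigma_{[k-1]}\big)\right),
\qquad
D^2_{\mathcal{F}}\big(z; z_{[k-1]}, \sigma_{[k-1]}\big) = \sup_{f_1, f_2 \in \mathcal{F}} \frac{\big(f_1(z) - f_2(z)\big)^2}{\sum_{s=1}^{k-1} \frac{1}{\sigma_s^2}\big(f_1(z_s) - f_2(z_s)\big)^2 + \lambda}.
$$

Setting all weights $\sigma_k$ to a constant recovers an unweighted analogue, which is why the variance-adaptive weighted regression pairs naturally with this dimension.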
Others
- Explanation of variance-weighted Q-learning: This method adaptively adjusts regression weights using variance estimates, giving more weight to low-variance (i.e., more informative) samples. This allows for tighter confidence intervals and stronger theoretical guarantees (see Agarwal et al., 2023). A minimal weighted-regression sketch is provided after this list.
- Fixed user preferences: We do NOT assume fixed user preferences; instead, the state can capture dynamic factors like satisfaction and interaction history, allowing preferences to evolve over time.
- Intuition behind switching utilities: We design an optimistic strategy in which the estimated value is, with high probability, greater than the true value.
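As mentioned above, here is a minimal illustrative sketch of the variance-weighted regression step (linear features, inverse-variance weights with a clipping floor; all names are illustrative and this is not our actual implementation):

```python
import numpy as np

def variance_weighted_fit(phi, targets, var_estimates, sigma_min=1.0, reg=1.0):
    """Weighted ridge regression for Q-targets: samples with smaller estimated
    variance receive larger weight, which tightens the resulting confidence set."""
    sigma2 = np.maximum(var_estimates, sigma_min ** 2)   # clip to avoid exploding weights
    w = 1.0 / sigma2                                     # inverse-variance weights
    # Solve min_theta  sum_i w_i (phi_i^T theta - y_i)^2 + reg * ||theta||^2.
    A = phi.T @ (w[:, None] * phi) + reg * np.eye(phi.shape[1])
    b = phi.T @ (w * targets)
    return np.linalg.solve(A, b)

# Toy usage: low-variance samples dominate the fit.
rng = np.random.default_rng(1)
phi = rng.normal(size=(100, 4))
theta_true = np.array([1.0, -2.0, 0.5, 0.0])
var = rng.uniform(0.1, 4.0, size=100)
y = phi @ theta_true + rng.normal(scale=np.sqrt(var))
print(variance_weighted_fit(phi, y, var))
```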
The authors claim that they propose "a novel assortment selection strategy that alternates between optimistic and pessimistic utility estimates." However, the explanation provided in the "Intuition behind switching utilities" section feels odd. At the very least, it fails to convince me of why switching between the two is necessary, as opposed to consistently adhering to either an optimistic or a pessimistic strategy.
Thank you for your prompt response to our rebuttal. In our initial reply, we focused primarily on addressing your main concerns—such as the additional experiments and computational efficiency—which we hope were satisfactorily resolved, as no further questions were raised on those points. Unfortunately, due to space limitations, we were unable to provide a detailed explanation of the "intuition behind the switching utilities." We're happy to address that now as an additional comment!
We believe the reviewer may be unclear about which specific quantity the adjectives "optimistic" or "pessimistic" refer to. In our paper, we consider three types of quantities:
(1) MNL utilities,
(2) item-level Q-values, and
(3) Q-values.
Here, the item-level Q-values represent the expected cumulative reward given a state and a base action (a single item) at a given horizon. In contrast, the Q-values represent the expected cumulative reward given a state and an assortment at a given horizon. When we say that the strategy "alternates between optimistic and pessimistic utility estimates," the relevant quantity is the (1) MNL utilities. In contrast, in our initial response regarding the "intuition behind switching utilities," the relevant quantity was the (3) Q-values.
The key idea is that by alternating between optimistic and pessimistic MNL utility estimates, we can induce optimism in the Q-value estimates. Specifically, the Q-value estimate (as defined right below Equation (8)) is greater than the true Q-value with high probability. In other words, our switching MNL utilities strategy serves as a sufficient condition for ensuring optimism in the Q-value estimates. This is the key intuition.
The switching technique is essential for guaranteeing optimism in the Q-value estimates, as we do not have access to the true item-level Q-values. Moreover, the item-level Q-value for the outside option (i.e., not choosing any item) can be non-zero—and potentially even the largest among all options—which contrasts with the standard MNL bandit setting. In MNL bandits, it is typically assumed that the item-level values (often referred to as revenue parameters) are known, and that the value for the outside option is zero. Therefore, our setting is strictly more general than the standard MNL bandit setting and introduces additional challenges.
To address these challenges, we first optimistically estimate the item-level Q-values, incorporating their uncertainty. Based on these estimates, we then choose to use either optimistic or pessimistic MNL utilities, according to the rule in Equation (7), to ensure optimism in the Q-value estimates.
- Case 1: When the optimistic item-level Q-value estimate for the outside option is not the largest, that is, some offered item has a larger estimate for the given state, then using optimistic MNL utilities is sufficient to ensure optimism in the Q-value estimates. The proof is quite technical and difficult to present in this limited space; for formal details and a complete justification, please refer to Lemma D.15 (Optimism) and the supporting results in Lemmas D.5 and D.14.
- Case 2: When the outside option's estimated item-level Q-value is the largest among all options, using optimistic MNL utilities alone is not sufficient. In this case, we find that using pessimistic MNL utilities can actually induce optimism in the Q-value estimates. This case is more intuitive: since the outside option is always included in the assortment and has the highest estimated item-level Q-value, increasing its MNL choice probability, while reducing the choice probabilities of the other items, results in a higher Q-value estimate for the assortment. This can be achieved by using pessimistic MNL utilities: the pessimistic MNL utility is smaller than its true value (with high probability), which increases the choice probability of the outside option (which has the highest item-level Q-value) and thus leads to a higher Q-value estimate for the assortment. For a detailed analysis, please refer again to Lemma D.15; a small numerical illustration is also given below.
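As a small numerical illustration of Case 2 (purely illustrative numbers; the item-level Q-estimates are held fixed for simplicity, so this is a simplification rather than our actual algorithm), lowering the items' MNL utilities raises the outside option's choice probability and hence the assortment-level Q-value estimate:

```python
import numpy as np

def assortment_q(utilities, item_q, outside_q):
    """Assortment-level Q-value under an MNL model: a choice-probability-weighted
    average of item-level Q-values, with the outside option having attraction weight 1."""
    w = np.exp(utilities)
    denom = 1.0 + w.sum()
    return (w / denom) @ item_q + (1.0 / denom) * outside_q

item_q = np.array([0.3, 0.2])        # estimated item-level Q-values of the offered items
outside_q = 0.9                      # outside option has the largest estimate (Case 2)
true_u = np.array([0.5, 0.1])        # true MNL utilities
print("true       :", assortment_q(true_u, item_q, outside_q))
print("optimistic :", assortment_q(true_u + 0.3, item_q, outside_q))   # falls below the true value
print("pessimistic:", assortment_q(true_u - 0.3, item_q, outside_q))   # rises above the true value
```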
We sincerely hope that these additional clarifications help convey the novelty of our MNL utility switching technique. To the best of our knowledge, this is the first work to consider unknown Q-value estimates (or revenue parameters in the MNL bandit setting) within the existing literature on bandits (or RL) with MNL preference models. We believe that both the algorithmic design and the proof techniques are novel in guaranteeing optimism of the Q-value estimates, and we hope this work contributes meaningfully to the community.
The paper addresses the problem of combinatorial reinforcement learning with preference feedback, where a learning agent offers an assortment of multiple items (an action) to a user, whose preferences follow a multinomial logistic model. This framework is particularly relevant for applications like recommender systems and online advertising, where long-term user engagement is key. The paper identifies two main challenges: the unknown value of each item and the difficulty of ensuring optimism while maintaining tractable assortment selection in the combinatorial action space.
To tackle these challenges, the authors propose an algorithm called MNL-VQL, which is both computationally and statistically efficient. The algorithm estimates optimistic item values using point-wise optimism under general function approximation and selects assortments that ensure sufficient exploration and optimism. The paper also establishes the first regret lower bound for linear MDPs with MNL preference feedback and shows that MNL-VQL achieves nearly minimax-optimal regret.
Main contributions from the paper are:
- Introduction of MNL-VQL, an algorithm that addresses the challenges of combinatorial RL with preference feedback, ensuring computational and statistical efficiency.
- A regret upper bound for MNL-VQL and, for linear MDPs, a nearly minimax-optimal guarantee stated in terms of the feature dimension of the linear MDP, together with a matching regret lower bound establishing the minimax-optimality of the algorithm.
- Development of analytical techniques to prove optimism and related results, avoiding naive combinatorial enumeration by reformulating the optimization problem as a linear program.
Questions for Authors
see Other Strengths And Weaknesses
Claims and Evidence
Most of the claims reported in the paper are supported by clear and convincing evidence. Nevertheless, there are a few areas where the evidence could be strengthened.
**"Fully" Supported Claims**
- Introduction of the MNL-VQL algorithm: The paper provides a detailed description of the MNL-VQL algorithm, including its steps and theoretical foundations. The algorithm is well-supported by the mathematical formulations and the step-by-step explanation provided in the paper.
- Regret upper bound: The paper claims that MNL-VQL achieves a regret upper bound. This claim is supported by Theorem 5.1 and the accompanying proof in Appendix D. The theoretical analysis appears rigorous and thorough.
- Minimax-optimal regret for linear MDPs: The claim that MNL-VQL achieves nearly minimax-optimal regret for linear MDPs is supported by Theorem 5.2 and the corresponding proof. The paper provides a detailed comparison with existing work to highlight the novelty and effectiveness of the approach.
** "Weaker" Claims **
- First theoretical guarantee in combinatorial RL with preference feedback: The paper claims to be the first to provide statistical guarantees in combinatorial RL with preference feedback. While the paper does provide strong theoretical results, it would benefit from a more comprehensive review of existing literature to ensure that no prior work has addressed this problem. The claim could be seen as overstated without a thorough literature review.
- Empirical validation: The paper focuses heavily on theoretical contributions and provides limited empirical validation. While the theoretical results are strong, more extensive empirical experiments demonstrating the practical effectiveness of the MNL-VQL algorithm would strengthen the claims. The lack of more extensive empirical results may obscure the practical benefits of the proposed approach.
Methods and Evaluation Criteria
Yes, the proposed methods and evaluation criteria in the paper make sense for the problem at hand.
Theoretical Claims
No
Experimental Design and Analysis
The paper primarily focuses on theoretical contributions and provides only limited empirical analyses.
Supplementary Material
Yes. The appendix
Relation to Broader Scientific Literature
The key contributions of the paper are closely related to several areas within the broader scientific literature, particularly in the fields of reinforcement learning (RL), multi-armed bandits, and preference modelling.
Combinatorial RL involves selecting combinations or subsets of actions from a set of possible actions. Concerning prior work on the topic, previous studies have addressed problems within this setting, particularly in deep RL (e.g., Sunehag et al., 2015; He et al., 2016; Swaminathan et al., 2017; Metz et al., 2017; Ryu et al., 2019; Ie et al., 2019; Delarue et al., 2020; McInerney et al., 2020; Vlassis et al., 2021; Chaudhari et al., 2024). However, the authors of the paper formalize the concept of combinatorial RL with preference feedback, which had not been theoretically defined in prior work.
Essential References Not Discussed
Yes
Other Strengths and Weaknesses
Overall, the paper makes significant and original contributions to the field of combinatorial RL with preference feedback. The theoretical foundations are strong, and the proposed algorithm addresses important challenges.
Strengths
- Novel framework: the paper introduces a novel framework for combinatorial reinforcement learning (RL) with preference feedback, which had not been theoretically defined before. The development of the MNL-VQL algorithm is a notable contribution, addressing specific challenges in combinatorial RL with preference feedback that were not previously tackled. Such a framework is especially relevant for applications such as recommender systems and online advertising.
- Theoretical contributions: the paper provides strong theoretical contributions, including the first regret lower bound for linear MDPs with MNL preference feedback and nearly minimax-optimal regret bounds.
- Detailed explanations: the paper provides detailed explanations of the algorithm steps, theoretical foundations, and proofs. This clarity helps readers understand the complex concepts and methods used.
**Weaknesses**
- Lack of experiments: the main issue of the paper is the lack of an extensive empirical validation of the proposed algorithm. Including experiments on benchmark datasets and comparisons with more baseline methods would strengthen the claims and demonstrate the practical effectiveness of the algorithm.
Other Comments or Suggestions
see Other Strengths And Weaknesses
We appreciate your time to review our paper and your valuable feedback.
It seems that your main concern is the lack of empirical validation for the proposed algorithm. However, we would like to emphasize that this is a theoretical paper, submitted to the theory category. Our work introduces a new framework—combinatorial RL with preference feedback—in a principled way, supports non-linear function approximation, and achieves near-optimal regret guarantees in linear MDPs. We believe these contributions represent a substantial advancement in the foundations of RL and, on their own, are sufficient to merit acceptance.
Nevertheless, in response to the reviewer’s request, we have also conducted experiments on the large-scale, real-world MovieLens dataset (Harper & Konstan, 2015) to provide additional empirical validation.
Additional real-world data experiments
The MovieLens dataset contains 25 million ratings on a 5-star scale for 62,000 movies (the base items) provided by 162,000 users. We define the state as the number of movies a user has watched after entering the system, i.e., the number of movies watched during the current session. We interpret the ratings as representing MNL utilities.
In each episode, a user is randomly sampled and arrives at the recommender system, initializing the state. The agent offers a set of items up to a maximum assortment size. If the user clicks on an item, a reward is received and the user transitions to the next state. If no item is clicked, no reward is received and the user remains in the current state. In addition, certain junk items, such as those with a provocative title and poster but poor content, can cause users to leave the system immediately. This is modeled as a transition to an absorbing state, where no further rewards are received and the state remains unchanged regardless of future actions. We believe the presence of such junk items is quite natural and reflective of real-world recommendation environments.
For our experiments, we use a subset of the dataset containing a fixed number of users and a varying number of movies. To construct MNL features, we follow an experimental setup similar to that of [1], employing low-rank matrix factorization. For linear MDP features, we apply the same approach as used in our synthetic-data experiments. The experimental parameters include the number of states (including the absorbing state), the MNL feature dimension, the linear MDP feature dimension, the number of base items $N$, and the action space size $|\mathcal{A}| = \sum_{m=1}^{M-1} \binom{N}{m}$, where $M-1$ is the maximum assortment size; a fixed proportion of the items are junk items.
- We provide an anonymous link to the experimental results here: [Link]
The results demonstrate the superior performance of our algorithm, highlighting its practicality for real-world scenarios.
[1] Shuai Li, Tor Lattimore, and Csaba Szepesvari. Online learning to rank with features. In International Conference on Machine Learning, pages 3856–3865, 2019.
Comparison to other baselines in the experiments
The reviewer also suggested including comparisons with more baselines. However, to the best of our knowledge, our framework is the first of its kind, and there are NO directly comparable baselines available. We believe it is sufficient to compare against the state-of-the-art myopic algorithm (OFU-MNL+, Lee & Oh, 2024) and a representative linear MDP algorithm (LSVI-UCB, Jin et al., 2020), and to demonstrate the limitations of these existing approaches within our more practical and general setting.
First theoretical guarantee in combinatorial RL with preference feedback
We strongly believe that our claim of providing the "first theoretical regret guarantee" is NOT overstated and therefore should not be considered a weakness of the paper. This is because, to the best of our knowledge, no prior theoretical work has addressed the framework we propose—combinatorial RL with preference feedback—which underscores the novelty of our contribution.
Although there have been significant recent advances in RL theory, we are not aware of any existing study that considers this framework. The most closely related line of research is cascading RL (Du et al., 2024); however, that setting involves presenting items sequentially, one at a time, rather than simultaneously as sets, which is the key distinction of our framework.
We sincerely believe that we have addressed all of your concerns. If you agree, we would greatly appreciate it if you could kindly reconsider your initial evaluation. Additionally, please feel free to reach out if you have any further questions or require clarification on any point.
This paper considers combinatorial RL with an MNL preference distribution, where, given a combinatorial action (assortment), the final action is sampled from a linear MNL model. In this setting, the learner needs to estimate both the underlying MNL model parameter and the transition dynamics, as in the standard RL task. The authors provide an algorithm that combines a weighted regression routine for value estimation with online mirror descent for learning the MNL parameters. This algorithm works for general non-linear value function classes with finite generalized Eluder dimension. Corresponding lower bound results are also provided in the linear value approximation setting to illustrate the optimality.
Questions for Authors
I don't have additional questions.
Claims and Evidence
Yes. Most statements made in this paper are clear and well-supported.
Methods and Evaluation Criteria
The proposed algorithm is well-founded; the authors also provide an LP formulation for assortment optimization to address computational challenges, making the algorithm applicable to real-world problems.
Theoretical Claims
I didn't go through the proof in detail since it is too long to check. However, the theoretical guarantee provided for the algorithm makes sense to me, as both its value estimation block and its MNL parameter estimation block are well supported by existing literature. The final regret bound can be decomposed into the sum of these two parts.
Experimental Design and Analysis
No.
Supplementary Material
No.
Relation to Broader Scientific Literature
The topic studied in this paper is novel to both combinatorial RL and MNL bandits; I believe it contributes to both areas.
Essential References Not Discussed
No.
Other Strengths and Weaknesses
Other Strengths:
It is somewhat surprising that the authors directly establish a relatively complete set of results in a newly proposed setting, which allows for non-linear function approximation and achieves nearly optimal regret. This makes the contributions of this paper solid.
Other Comments or Suggestions
No.
We sincerely appreciate your positive support and recognition of the value of our work! We truly hope this research helps to shed light on a new direction for the RL community, particularly in the area of combinatorial RL. Please don’t hesitate to reach out if you have any further questions.
This paper studies a combinatorial reinforcement learning setting in which an agent repeatedly offers a subset (assortment) of items and observes the user’s choice according to a multinomial logistic (MNL) model. Key challenges include (1) learning long-term (multi-step) item values rather than merely single-step rewards, and (2) preserving computational efficiency in selecting an assortment optimistically from an exponentially large set. To address these, the authors propose MNL-VQL, an algorithm that uses a linear (contextual) MNL model for user preferences and general function approximation for item values. They provide theoretical guarantees on regret, including a nearly minimax-optimal bound for the linear MDP case.
Questions for Authors
(1) Is the exploration method (Line 20) effective compared to other classic exploration methods like UCB and TS?
(2) The following works are also related to combinatorial bandits; can the authors add them to the related work as well?
[1] Multi-facet contextual bandits: A neural network perspective. KDD'21
[2] Combinatorial neural bandits. ICML'23
Claims and Evidence
The authors claim "This is the first work to provide statistical guarantees in combinatorial RL with preference feedback." Based on the literature review, this claim appears to be true, although some important related works are missing.
Methods and Evaluation Criteria
This paper proposes the algorithm MNL-VQL, which learns both the MNL preference parameters (how likely each item is to be chosen) and the long-term value of each item using value-based RL techniques with function approximation. It ensures computational feasibility by carefully separating MNL-utility estimation from state-value estimation. The key novelty is how it maintains "optimistic" and "pessimistic" estimates of item Q-values under unknown item values, and then constructs optimistic MNL utilities in a way that still admits polynomial-time optimization over the combinatorial action set.
Theoretical Claims
I didn't carefully check the proof.
Experimental Design and Analysis
The paper includes an experiment in a synthetic “online shopping with budget” environment. Each state represents a user’s budget level; the agent recommends an assortment, observes which item is selected (if any), and the user’s budget transitions based on the purchase. However, the experiments do not include real-world datasets or bandit-based baselines.
Supplementary Material
No.
Relation to Broader Scientific Literature
Provides new theoretical insights into the combinatorial RL/bandit literature.
Essential References Not Discussed
The following two works also study combinatorial action spaces in bandits and provide theoretical guarantees. The authors should include them in the related work.
[1] Multi-facet contextual bandits: A neural network perspective. KDD'21
[2] Combinatorial neural bandits. ICML'23
Other Strengths and Weaknesses
The algorithm’s regret bounds are clearly laid out, with a special focus on minimax-optimal results in linear settings. The paper avoids intractable enumeration of subsets by formulating the assortment selection via an LP-based approach, making it feasible for moderate I. However, the empirical evaluation is relatively weak, given the synthetic data and the small number of baselines.
Other Comments or Suggestions
No
Thank you for acknowledging the value of our work and providing a positive evaluation! We will address your questions below.
Additional real-world data experiments
As per the reviewer’s request for real-world experiments, we have additionally conducted experiments on the large-scale, real-world MovieLens dataset (Harper & Konstan, 2015) to provide further empirical validation.
The MovieLens dataset contains 25 million ratings on a 5-star scale for 62,000 movies (the base items) provided by 162,000 users. We define the state as the number of movies a user has watched after entering the system, i.e., the number of movies watched during the current session. We interpret the ratings as representing MNL utilities.
In each episode, a user is randomly sampled and arrives at the recommender system, initializing the state. The agent offers a set of items up to a maximum assortment size. If the user clicks on an item, a reward is received and the user transitions to the next state. If no item is clicked, no reward is received and the user remains in the current state. In addition, certain junk items, such as those with a provocative title and poster but poor content, can cause users to leave the system immediately. This is modeled as a transition to an absorbing state, where no further rewards are received and the state remains unchanged regardless of future actions. We believe the presence of such junk items is quite natural and reflective of real-world recommendation environments.
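For concreteness, here is a minimal sketch of one interaction step of the environment described above (a simplified stand-alone simulator with illustrative names and a unit click reward; whether a clicked junk item still yields the click reward before absorption is a simplification on our part, and this is not our actual experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)
ABSORBING = -1  # the user has left the system

def mnl_choice(utilities):
    """Sample the user's choice: an index into the assortment, or None for the outside option."""
    w = np.exp(utilities)
    probs = np.append(w, 1.0) / (1.0 + w.sum())      # last slot is the outside option
    idx = rng.choice(len(probs), p=probs)
    return None if idx == len(utilities) else idx

def step(state, assortment, utilities, junk):
    """One interaction: the user picks an item from the offered assortment (or nothing)."""
    if state == ABSORBING:
        return state, 0.0
    choice = mnl_choice(utilities[assortment])
    if choice is None:
        return state, 0.0                            # no click: no reward, same state
    item = assortment[choice]
    if junk[item]:
        return ABSORBING, 0.0                        # junk item: the user leaves for good
    return state + 1, 1.0                            # click: reward, one more movie watched

# Toy usage with 6 movies, one of which is junk.
utilities = rng.normal(size=6)
junk = np.array([False, False, True, False, False, False])
print(step(state=0, assortment=np.array([1, 2, 4]), utilities=utilities, junk=junk))
```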
For our experiments, we use a subset of the dataset containing a fixed number of users and a varying number of movies. To construct MNL features, we follow an experimental setup similar to that of [1], employing low-rank matrix factorization. Specifically, we randomly split the user set into two subsets, and the rating matrix from users in the first subset is used to derive user and movie embeddings via singular value decomposition (SVD), with fixed user and movie feature dimensions. We define the user-movie feature as the (vectorized) outer product of the corresponding user and movie embeddings, which serves as the MNL feature vector. For linear MDP features, we apply the same approach as used in our synthetic-data experiments.
The experimental parameters include the number of states (including the absorbing state), the MNL feature dimension, the linear MDP feature dimension, the number of base items $N$, and the action space size $|\mathcal{A}| = \sum_{m=1}^{M-1} \binom{N}{m}$, where $M-1$ is the maximum assortment size; a fixed proportion of the items are junk items.
- We provide a link to the experimental results here: [Link]
The results demonstrate the superior performance of our algorithm, highlighting its practicality for real-world scenarios.
[1] Shuai Li, Tor Lattimore, and Csaba Szepesvari. Online learning to rank with features. In International Conference on Machine Learning, pages 3856–3865, 2019.
Additional related works
We will ensure that combinatorial bandits are discussed in the Related Work section and that the papers suggested by the reviewer are included in the final version of the paper. Specifically, we plan to add the following discussion:
"Our work is related to prior studies on combinatorial bandits with semi-bandit or cascading feedback [appropriate references]. However, our framework differs significantly from these approaches. In the semi-bandit and cascading settings, user feedback on a specific item depends only on that item’s context, independent of other items offered simultaneously. Thus, these approaches do not account for substitution effects among items. In contrast, our work utilizes an MNL choice model where user feedback depends explicitly on the entire assortment presented, leading to additional analytical challenges."
Question
Is the exploration method (Line 20) effective compared to other classic exploration methods like UCB and TS?
We can evaluate the effectiveness of the exploration method (Eqn.(9)) from two perspectives: statistical and computational efficiency.
- Statistical efficiency: The proposed exploration strategy ensures that overly optimistic item-level Q-estimates (which have larger bonuses) are used infrequently, resulting in tighter regret bounds compared to classical methods.
- Computational efficiency: Since our exploration method employs a non-Markovian policy, it requires computations involving the entire history up to the current horizon. Consequently, its computational cost is greater than that of UCB or TS by a factor roughly on the order of the horizon length.
All reviewers appreciate the novel framework and the theoretical results derived in this work. Details of the algorithms and proofs are provided, which makes this a solid theoretical machine learning work. While the theoretical contribution may suffice for a conference paper, I would highly encourage the authors to apply the proposed algorithm to some real tasks in the next version of this work (perhaps a journal paper).