PaperHub
Overall rating: 5.8 / 10 (Poster; 4 reviewers; min 4, max 7, std 1.1)
Individual ratings: 6, 4, 7, 6
Confidence: 3.5 · Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.3
NeurIPS 2024

Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

OpenReview · PDF
Submitted: 2024-04-23 · Updated: 2024-11-06
TL;DR

We propose randomized exploration model-based RL algorithms with multinomial logistic function approximation.

Abstract

We study reinforcement learning with _multinomial logistic_ (MNL) function approximation where the underlying transition probability kernel of the _Markov decision processes_ (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, $RRL-MNL$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency and establish that $RRL-MNL$ is both _statistically_ and _computationally_ efficient, achieving a $\tilde{\mathcal{O}}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $RRL-MNL$, its regret bound scales with $\kappa^{-1}$, which is potentially large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $ORRL-MNL$, which estimates the value function using local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{\mathcal{O}}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve both computational and statistical efficiency. Numerical experiments demonstrate the superior performance of the proposed algorithms.
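
For readers unfamiliar with the model, the MNL transition kernel assigns each reachable next state a softmax probability of the inner product between a state-action-next-state feature and the transition core. A minimal illustrative sketch (variable names are ours, not the paper's):

```python
import numpy as np

def mnl_transition_probs(theta, features):
    """Softmax of feature-core inner products over the reachable next
    states of a single state-action pair.

    theta    : (d,) transition core
    features : (n, d) rows phi(s, a, s') for the n reachable next states
    """
    logits = features @ theta
    logits = logits - logits.max()   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()
```
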
Keywords
Reinforcement learning · Function approximation · Multinomial logistic regression · Regret analysis

Reviews and Discussion

Review
Rating: 6

This work proposes two randomized RL algorithms, RRL-MNL and ORRL-MNL, for the MNL transition model that, for the first time, achieve both computational and statistical efficiency without assuming stochastic optimism. The regret bound of ORRL-MNL has a better dependence on the potentially large problem-dependent constant $\kappa^{-1}$ than that of RRL-MNL.

Strengths

MNL parameterization is important and convenient since the parameterized transition kernel is always a probability distribution. This work is computationally more efficient than the existing work [35] on MNL parameterization.

Weaknesses

Some details of the algorithms are not easily understood and would thus benefit from intuitive explanations, as mentioned in questions 2-4 and 6-7 below. The regrets of both proposed algorithms are higher than the $\widetilde{O}(dH^{\frac{3}{2}}\sqrt{T})$ of [35]. Experiment reproducibility could be improved, as noted in question 9 below.

Questions

(1) Assumption 4 is equivalent to $\inf_{\theta \in \mathcal{B}_d(L_{\theta})} P_{\theta}(s'|s,a) \ge \sqrt{\kappa}, \ \forall s,a,s'$, since $\inf_{s',\widetilde{s}} P_{\theta}(s'|s,a) P_{\theta}(\widetilde{s}|s,a) = [\inf_{s'} P_{\theta}(s'|s,a)]^2$.

(2) In eq. (2), how is $L_{\theta}$ selected? Does eq. (2) aim to minimize the loss function $\sum_{k=1}^N \ell_{k,h}(\theta)$? Why do you use a Newton step instead of a first-order algorithm?

(3) Could you explain the intuition of eq. (3)? Is it inspired by the Hessian of the loss $\ell_{k,h}(\theta)$?

(4) In the paragraphs about "Stochastically optimistic value function" after eq. (3), it seems that most words are about the drawbacks of existing works on optimistic estimated value functions, after which you adapt the optimistic sampling technique instead. It would be clearer to me if you introduced your eq. (4) first with more explanation, for example, what adjustments are made to the original optimistic sampling technique [7, 52, 37, 36] and why, why $A_{j,k}^{-1}$ is used instead of the Euclidean norm in the $\hat{s}$ definition, and then listed its advantages over existing works (maybe with fewer words than the current version?). Also, is your Remark 2 an advantage over relevant works like [35]? If yes, you may claim this advantage.

[35] Taehyun Hwang and Min-hwan Oh. Model-based reinforcement learning with multinomial logistic function approximation. In Proceedings of the AAAI conference on artificial intelligence, pages 7971–7979, 2023.

(5) In eqs. (4) and (7), is it tighter to truncate at level $H-h+1$ instead of $H$?

(6) In eq. (6), is there an intuition why $\nabla^2\ell_{k,h}$ has coefficient $\eta$ while $\nabla^2\ell_{i,h}$ ($i=1,\ldots,k-1$) have coefficient 1?

(7) Is there an intuitive explanation of $\nu_{k,h}^{\rm rand}(s,a)$ defined after eq. (7)?

(8) You could also define $T$ in Theorem 1 to facilitate readers. The total number of steps is $T=KH$, yes? If yes, is it better to use the equivalent sample complexity $\kappa^{-1}d^{\frac{3}{2}}H^2\sqrt{K}$?

(9) To ensure reproducible experiments, some hyperparameters are missing, such as $M$, $\lambda$, $\kappa$, $\eta$, $\sigma_k$, $\beta_k$, etc.

(10) Could you compare your work with contemporary work [a]?

[a] Li, Long-Fei, et al. "Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation." ArXiv:2405.17061 (2024).

(11) Is it possible to extend your work to stochastic policies and unknown rewards? For example, will an $\epsilon$-greedy policy also encourage exploration more than the greedy action?

(12) Typo at the beginning of Section 2: $\{P_h\}_{h=1}^H$ instead of $\{P\}_{h=1}^H$.

Limitations

The final appendix section mentions the limitation of the realizability assumption. I agree with the authors' claim that there are no negative societal impacts because this work focuses on theoretical aspects.

Comment

[Q11] Extension to stochastic policy and unknown reward

We believe that designing a stochastic policy in MNL-MDPs is beyond the scope of this paper. However, note that although our algorithms do not output an explicit probability distribution over actions in each state, the estimated value function itself is randomized, so there is inherent randomness in how actions are selected at each step. The epsilon-greedy algorithm is one of the simplest randomized algorithms used in the bandit and RL literature; however, it is not statistically efficient because it continues to explore with a fixed probability even as learning progresses.

For the unknown reward setting, we may apply an eluder-dimension-based general function approximation algorithm [61] to estimate the reward function. Estimating a general reward function while learning the MNL transition model is an interesting research topic for future work, but we believe it is beyond the scope of this paper. Note that if the reward function has a parametric form, such as a linear function of features, we can extend our algorithm by adding a step of optimistic reward estimation, similar to LinUCB. This would generate an extra $\tilde{O}(d \sqrt{T})$ regret, which is a lower-order term compared to the current regret bounds. We remind you that the known reward $r$ assumption is widely used in the model-based RL literature [69, 9, 70, 79, 35].
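
To illustrate the kind of extension we have in mind, here is a rough sketch only (the function name, the linear reward model, and the bonus radius $\beta$ are illustrative, not from the paper):

```python
import numpy as np

def optimistic_reward(phi_sa, theta_r_hat, A_r_inv, beta):
    """LinUCB-style optimistic estimate for a hypothetical linear reward
    model r(s, a) ~ phi(s, a)^T theta_r: point estimate plus an
    elliptical-norm exploration bonus."""
    return phi_sa @ theta_r_hat + beta * np.sqrt(phi_sa @ A_r_inv @ phi_sa)
```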


[Q12] Typo

Thank you for pointing out the typo. We will correct it in the revised version.

Comment

Hello authors,

Reviewer u5d2 is satisfied with the authors' response and would like to increase the rating to 6. If convenient, you may add the intuitive algorithm explanations from your response to your revised paper.

Reviewer u5d2

Comment

Thank you so much for your response to our rebuttal and your support! With the help of your feedback, we believe that our revised version will be strengthened. If there are any additional questions or comments, we would be more than happy to address them.

Comment

[Q6] Coefficient $\eta$

The main reason for multiplying $\eta$ by $\nabla^2 \ell_{k,h}(\tilde{\theta}_h^k)$ is to utilize the properties of implicit online mirror descent (OMD) (Kulis & Bartlett, 2010; Campolongo & Orabona, 2020). Note that the update rule $\tilde{\theta}_{h}^{k+1} = \arg\min_{\theta \in \mathcal{B}_d(L_{\theta})} \frac{1}{2 \eta} \|\theta - \tilde{\theta}_h^k\|_{\tilde{\mathbf{B}}_{k,h}}^2 + \theta^\top \nabla \ell_{k,h}(\tilde{\theta}_h^k)$ (Eq. (6)) is known as standard OMD, and the rule $\tilde{\theta}_{h}^{k+1} = \arg\min_{\theta \in \mathcal{B}_d(L_{\theta})} \frac{1}{2 \eta} \|\theta - \tilde{\theta}_h^k\|_{\mathbf{B}_{k,h}}^2 + \ell_{k,h}(\theta)$ is referred to as implicit OMD, which updates on the original loss function instead of its linearized approximation.

By the definition of $\tilde{\mathbf{B}}_{k,h}$, Eq. (6) can be represented in the implicit OMD form $\tilde{\theta}_{h}^{k+1} = \arg\min_{\theta \in \mathcal{B}_d(L_{\theta})} \frac{1}{2 \eta} \|\theta - \tilde{\theta}_h^k\|_{\mathbf{B}_{k,h}}^2 + \tilde{\ell}_{k,h}(\theta)$, where $\tilde{\ell}_{k,h}(\theta)$ is the second-order approximation of the original loss $\ell_{k,h}(\theta)$. We can then use the property of implicit OMD (Proposition 4.1 of Campolongo & Orabona, 2020) to derive the proposed results.
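
For illustration only, the linearized step with a quadratic regularizer admits a simple closed form before projection; a rough Python sketch (using a Euclidean-ball projection for simplicity, whereas the constrained update in Eq. (6) is with respect to the matrix norm):

```python
import numpy as np

def linearized_omd_step(theta, grad, B, eta, L_theta):
    """One step of the linearized update: minimize
    (1/(2*eta)) * ||x - theta||_B^2 + x^T grad over x.
    The unconstrained minimizer is theta - eta * B^{-1} grad; here we
    simply rescale back onto the Euclidean ball of radius L_theta."""
    theta_new = theta - eta * np.linalg.solve(B, grad)
    norm = np.linalg.norm(theta_new)
    if norm > L_theta:
        theta_new = theta_new * (L_theta / norm)
    return theta_new
```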

Campolongo and Orabona. "Temporal variability in implicit online learning." NeurIPS 2020.

Kulis and Bartlett. "Implicit online learning." ICML 2010.


[Q7] Intuitive explanation of $\nu^{\text{rand}}_{k,h}$

In Eq.(7), r(s,a)+_sS_s,aP_θ~hk(ss,a)V~k_h+1(s)r(s,a) + \sum\_{s’ \in \mathcal{S}\_{s,a}} P\_{\tilde{\mathbf{\theta}}^k_h} (s’ | s,a) \tilde{V}^k\_{h+1} (s’) represents the action value function induced by the estimated transition core θ~hk\tilde{\mathbf{\theta}}^k_h. The randomized bonus term νk,hrand\nu_{k,h}^{\text{rand}} perturbs this estimated action value. Note that, as shown in Lemma 18, the optimistic sampling technique ensures the randomized bonus term makes the optimistic randomized value function more optimistic than the true optimal value function with at least a constant probability.
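
As a purely illustrative sketch of such an optimistic-sampling bonus (the exact scale $\sigma_k$, the sample count $M$, and the norm follow the definition after Eq. (7); the function below is hypothetical):

```python
import numpy as np

def randomized_bonus(phi_hat, A_inv, sigma_k, M, rng):
    """Hypothetical optimistic-sampling bonus: the maximum of M zero-mean
    Gaussian draws, scaled by the elliptical norm ||phi_hat||_{A^{-1}},
    so the perturbed value dominates the plug-in estimate with a
    probability that increases with M."""
    elliptical_norm = np.sqrt(phi_hat @ A_inv @ phi_hat)
    xi = rng.normal(0.0, sigma_k, size=M)
    return xi.max() * elliptical_norm

# usage (illustrative): randomized_bonus(phi_hat, A_inv, 1.0, 10, np.random.default_rng(0))
```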


[Q8] Total number of steps $T$

The total number of steps is $T = KH$, as summarized in Appendix B. We will clearly denote this in the main paper. We would like to remind you that representing the regret bound in terms of the total number of interactions the agent has with the environment is a common convention in the related literature [41, 70, 71, 37, 9, 35], and we have followed it.


[Q9] Hyperparameters

The hyperparameters are specified in the supplementary material within the "algorithms.py" file, which allows for reproducibility. However, for the sake of clarity, we will additionally list these hyperparameters in the main text or appendix of the revised version.


[Q10] Contemporary work

The policy of NeurIPS related to comparison to concurrent work states that "Papers appearing less than two months before the submission deadline are generally considered concurrent to NeurIPS submissions. Authors are not expected to compare to work that appeared only a month or two before the deadline." After checking the arXiv submission history of Li et al. (2024), we found that their paper was uploaded to arXiv after the NeurIPS main paper deadline. Therefore, we remind you that our NeurIPS submission was completed before their paper was made publicly available. Nevertheless, briefly comparing our work with the concurrent works, Li et al. focus on the analysis of deterministic algorithms, while our research encompasses a broader scope, including both deterministic and randomized algorithms.

Author Response

Thank you for taking the time to review our paper and for your valuable feedback. Here are our responses to each comment and question:


[W1] Regret bound

It is generally well-known that the regret bound of randomized algorithms is higher than that of optimism-based algorithms. For example,

  • in the linear bandit setting: UCB $\tilde{O}(d \sqrt{T})$ [1] vs. TS $\tilde{O}(d^{3/2} \sqrt{T})$ [2];
  • in the MNL-bandit setting: UCB $\tilde{O}(d \sqrt{T})$ [53] vs. TS $\tilde{O}(d^{3/2} \sqrt{T})$ [52];
  • in linear MDPs: UCB $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$ [41] vs. TS $\tilde{O}(d^2 H^2 \sqrt{T})$ [71].

This gap originates from the fact that randomized algorithms require the perturbations to have sufficient variance to guarantee optimism of the estimated reward (or value) with a constant probability. Note that although randomized algorithms have higher regret bounds than UCB-based algorithms, randomized exploration not only exhibits superior performance in empirical evaluations but also necessitates a more rigorous proof technique to obtain a theoretical guarantee. Hence, proving a frequentist regret bound in particular for TS algorithms in any given bandit or RL problem has been widely recognized as a significant contribution to both the RL and bandit literature.


[Q1] Assumption 4

We intended to define $\kappa$ for $s' \ne \tilde{s}$. In this case, Assumption 4 is not equivalent to what you mentioned. We will clarify this in the revised version.


[Q2] Eq. (2)

  • $L_\theta$: $L_\theta$ is specified in Assumption 2, which is common in the literature on RL with function approximation [41, 70, 71, 37, 35], to ensure the regret bounds are scale-free.
  • Eq. (2): Eq. (2) is designed as an online update scheme to find an approximate solution for minimizing the cumulative loss function $\sum_{k=1}^K \ell_{k,h}(\theta)$, since computing the exact MLE solution can be inefficient in terms of time and space complexity.
  • Newton step: We believe there may be a misunderstanding. The optimization method used in Eq. (2) is not an exact Newton method because there is no calculation of the Hessian matrix. Instead, we use a lower bound of the Hessian matrix (i.e., $\lambda \mathbf{I}_d + \sum_{i=1}^{k-1} \nabla^2 \ell_{i,h}(\theta) \succeq \mathbf{A}_{k,h}$), which is why we mention that we use a variation of the online Newton method. Note that this approach is inspired by online algorithms for logistic bandits [72] and MNL contextual bandits [53]. We applied these techniques to estimate the transition core in MNL-MDPs and, as a result, provide both computationally and statistically efficient randomized algorithms for MNL-MDPs.

[Q3] Eq. (3)

Eq. (3) takes the typical form of a Gram matrix, which is widely used in the existing feature-based bandit and RL literature [cite here]. It aggregates the feature information based on the decisions made by the algorithm in previous episodes. Once more, we would like to note that the Gram matrix in Eq. (3) is a lower bound of the regularized sum of Hessian matrices.
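
A minimal sketch of this accumulation (the exact weighting of the outer products in Eq. (3), e.g. a constant lower-bounding the Hessian, follows the paper; only the generic rank-one structure is shown):

```python
import numpy as np

def init_gram(d, lam):
    """Regularized Gram matrix before the first episode: lambda * I_d."""
    return lam * np.eye(d)

def update_gram(A, episode_features):
    """Accumulate rank-one outer products of the features observed in one
    episode. Only the running matrix is kept, so per-episode cost does not
    grow with the number of past episodes."""
    for phi in episode_features:
        A += np.outer(phi, phi)
    return A
```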


[Q4] Stochastically optimistic value function

To help understand the concept of a stochastically optimistic value function, we will provide additional explanations. In this paragraph, our intention is not merely to list the drawbacks of related literature. Instead, we introduce the key challenges of regret analysis for randomized algorithms, explain how previous works have overcome these challenges, and then describe why the techniques from previous works cannot be applied to MNL-MDPs. In the revised version, we will clarify this further to make it easier to understand.

For the definition of $\hat{s}$: at the end of the proof of Lemma 4, the prediction error can be upper bounded by the maximal inner product between the feature and the difference of the estimators, i.e., $\Delta^k_h(s,a) \le H \max_{s' \in \mathcal{S}_{s,a}} | \varphi_{s,a,s'}^\top (\theta^k_h - \theta^*_h)|$. We apply the Cauchy-Schwarz inequality with respect to $\|\cdot\|_{A_{k,h}}$ because we have the concentration result for $\theta^k_h$ with respect to $\|\cdot\|_{A_{k,h}}$ in Lemma 1. That is why the dominating feature $\hat{\varphi}(s,a)$ is defined with respect to the $\|\cdot\|_{A_{k,h}^{-1}}$ norm rather than the $\|\cdot\|_2$-norm.
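
For completeness, the weighted Cauchy-Schwarz inequality used here reads, for any positive definite $A_{k,h}$,

$$\big| \varphi_{s,a,s'}^\top (\theta^k_h - \theta^*_h) \big| \;\le\; \| \varphi_{s,a,s'} \|_{A_{k,h}^{-1}} \, \| \theta^k_h - \theta^*_h \|_{A_{k,h}},$$

so a concentration bound on the parameter error in the $A_{k,h}$-weighted norm pairs naturally with the $A_{k,h}^{-1}$-weighted norm of the feature.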

For Remark 2, as the reviewer mentioned, it explains why RRL-MNL is computationally efficient compared to relevant works in MNL-MDPs [35]. We respectfully remind you that we clearly state our proposed algorithms are both computationally and statistically efficient in the main contributions.


[Q5] Clipping in Equations (4) and (7)

It is possible to truncate $Q^k_h$ at $H-h+1$, but it does not change our analysis or result in a tighter regret bound.


Due to the space limit, we will leave responses to the remaining comments in the following official comment.

Review
Rating: 4

The paper considers the MNL MDP setting and provides several regret bounds. The MNL setting is one where the transition probabilities are modeled as a soft-max over a linear transformation of state-action-next-state features. The work improves the dependence on an important problem parameter. The main results are presented using randomized exploration, but a deterministic exploration scheme is also included in the appendix. Concretely, $\kappa$ is a problem-dependent constant that lower bounds any individual transition probability estimate. The authors explain that it relates to information gain in the transition kernel regression problem through the Fisher information matrix.

优点

  1. The new algorithm does not need to know $\kappa$ (this is not the case in previous works).
  2. The obtained regret bound is improved from $\approx \kappa^{-1}\sqrt{K}$ to $\approx \kappa^{-1} + \sqrt{K}$.
  3. The memory and computational requirements per episode are reduced from linear to constant in the number of episodes.

缺点

Setting: While the MNL model is quite intuitive, it has several shortcomings that, in my opinion, limit its usefulness.

  1. It does not seem to allow for a compact characterization of the value or policy. This is because these depend strongly on the size of the next-state sets $\mathcal{S}_{s,a}$. While instances where these sets are small exist, this is still very limiting.
  2. The sets $\mathcal{S}_{s,a}$ must be known to the algorithm.
  3. It is very hard to imagine a setting where $\kappa^{-1}$ is not exponential in natural problem parameters.

Presentation:

  1. The paper is written in such a way that makes it hard to appreciate its contributions beyond the objective improvement to the regret bound. There are some discussions of what makes the analysis more involved compared to standard TS and MNL bandits, but I find them vague. Moreover, there are no proof sketches of any kind, which contributes to this issue. Overall, this makes it hard to appreciate the soundness of the paper and whether it contains meaningful technical contributions or not.

Results:

  5. Most importantly, while the paper improves computational efficiency, it seems to me that it still requires at least $|\mathcal{U}|^H$ computations per round (to calculate the Q function along the policy trajectory). This exists in past algorithms as well and makes all of them inefficient (not polynomial in the natural problem parameters). Is this the case?

  6. I do not see the randomized exploration as a significant contribution. It both complicates the analysis significantly and deteriorates the regret. I appreciate the potential practical appeal, but I do not think this strengthens the paper significantly. I personally think the focus of the paper should have been on the improved regret with the standard deterministic bonuses. This would also simplify the presentation.

  7. There is a recent concurrent paper ([1]) on this topic that seems to obtain similar results. While it is concurrent, I would appreciate it if you could provide a comparison in your response.

[1] Li, Long-Fei, et al. "Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation." arXiv preprint arXiv:2405.17061 (2024).

Questions

See above.

Limitations

N/a

Author Response

Thank you for taking the time to review our paper and for your valuable feedback. Here are our responses to each comment and question:


[W1] Compact characterization of the value or policy

We believe that the lack of a compact characterization of the value or policy should NOT be considered a weakness of our work. In almost every generic model-based setting [9, 21, 4, 35], and even in tabular RL or similar settings like linear mixture MDPs, a compact characterization of the value or policy is not always achievable.


[W2] Known reachable states

We respectfully disagree that this assumption constitutes a weakness when compared to other function approximation approaches such as linear MDPs and general function classes. While we acknowledge this requirement, it's important to note that other approaches also come with their own inherent limitations. Recent studies have highlighted fundamental limitations of linear MDPs:

  1. As introduced in [35], a linear function approximating the transition model must ensure that the function output is within [0, 1] and that the probabilities over all possible next states sum to 1. This restrictive assumption limits the set of feature representations of state-action pairs that are admissible for the transition model (Proposition 1 in [35]).

  2. Additionally, Theorem 1 from Lee & Oh (2024) shows a fundamental limitation of linear MDPs. In linear MDPs, the ambient feature dimension $d$ is lower bounded by $|\mathcal{S}|/\mathcal{U}$. This implies that unless the size of the reachable states is proportional to the total number of states, i.e., $\mathcal{U} = \Theta(|\mathcal{S}|)$, the feature dimension inherently depends on the size of the state space.

Furthermore, to the best of our knowledge, there are no computationally tractable algorithms that can handle the general function classes [46, 40, 19, 21, 28, 42, 18] (See Line 640).

In contrast, MNL-MDPs do not require the restrictive linearity assumption present in linear MDPs and can be effectively addressed with computationally tractable algorithms, such as those we propose in our paper. For these reasons, MNL-MDPs have received significant attention within the community, leading to several follow-up works (e.g., Park & Lee, 2024; Li et al., 2024). We note that under this well-established problem setup, our main contribution is to present both computationally and statistically efficient RL algorithms.

Lee and Oh. "Demystifying Linear MDPs and Novel Dynamics Aggregation Framework." ICLR 2024.

Li et al. "Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation." arXiv 2024.

Park and Lee. "Reinforcement Learning for Infinite-Horizon Average-Reward MDPs with Multinomial Logistic Function Approximation." arXiv 2024.


[W3] Discussion on kappa

This is the exact motivation behind our second algorithm. To address this issue, we proposed ORRL-MNL (Algorithm 2) and UCRL-MNL+ (Algorithm 3), each achieving a frequentist regret guarantee with improved dependence on $\kappa$. These algorithms can mitigate the worst-case exponential dependence on $\kappa$.


[W4] Presentation

Due to page limitations, we could not include the proof sketch in the main text. Instead, we have provided the main theorem and the necessary lemmas step-by-step in the appendix, with detailed and clear proofs. We hope this will be helpful for your understanding. In the revised version, we will include the proof sketch in the main text.

For the proof sketch: the frequentist regret analysis of the proposed algorithm follows the flow of linear MDPs once stochastic optimism is guaranteed. Specifically, we decompose the regret into an estimation part and a pessimism part, and then bound each part. The estimation part is upper bounded by the Bellman error of the sample trajectory generated by the agent's policy in each episode. The Bellman error is further bounded by the prediction error and the bonus term due to randomized exploration. Here, we use the elliptical potential lemma to bound the estimation part (Lemmas 10 & 21). For the pessimism part, as shown in Lemma 11, the pessimism term can be upper bounded by the estimation term times the inverse of the probability that the estimated value function is optimistic. This is why we need to ensure that the probability that the estimated value function is optimistic is lower bounded by a constant, regardless of the problem instance. As shown in Lemmas 6 & 18, the probability that the estimated value function is more optimistic than the true optimal value function is lower bounded by $\Phi(-1)/2$, which is an absolute constant. Combining all the results, we obtain the proposed regret bound.
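
Schematically (with $\tilde{V}^k_1$ denoting the randomized value estimate and $\pi_k$ the policy in episode $k$; notation slightly simplified), the decomposition is

$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \Big( V^*_1(s^k_1) - V^{\pi_k}_1(s^k_1) \Big) = \underbrace{\sum_{k=1}^{K} \Big( V^*_1(s^k_1) - \tilde{V}^k_1(s^k_1) \Big)}_{\text{pessimism}} + \underbrace{\sum_{k=1}^{K} \Big( \tilde{V}^k_1(s^k_1) - V^{\pi_k}_1(s^k_1) \Big)}_{\text{estimation}},$$

where the estimation term is controlled via Bellman errors and the elliptical potential lemma, and the pessimism term via the constant lower bound on the optimism probability.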


[W5] Computational efficiency

We believe there may be a misunderstanding. Our proposed algorithms do not need to calculate the Q function along the policy trajectory; rather, we calculate the Q function before proceeding with the episode and choose the greedy action with respect to the estimated Q function. Therefore, the computational cost is polynomial in the natural parameters of the problem and constant per episode. Moreover, computing these Q functions is not necessarily more expensive than in linear MDPs. For instance, Eq. (3) in [9] includes an integral over all states to estimate the action value of a given state-action pair. Therefore, except in the case where the set of reachable states for each state-action pair exactly matches the entire state space, it is difficult to conclude that the computational complexity of our algorithms is higher than that of other linear model algorithms.
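
A minimal sketch of this planning step for a single state-action pair (illustrative names; the actual Eq. (7) additionally adds the randomized bonus and clips at $H$), showing that the cost scales with the number of reachable next states rather than with $|\mathcal{S}|$:

```python
import numpy as np

def q_backup(r_sa, theta, reach_features, v_next):
    """One-step backup under the estimated MNL model for one (s, a):
    cost is O(|S_{s,a}| * d), independent of the total number of states.

    r_sa           : scalar reward r(s, a)
    theta          : (d,) estimated transition core
    reach_features : (|S_{s,a}|, d) rows phi(s, a, s') over reachable s'
    v_next         : (|S_{s,a}|,) value estimates of the reachable next states
    """
    logits = reach_features @ theta
    logits = logits - logits.max()    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return r_sa + probs @ v_next      # expected next-step value
```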

Due to the space limit, we will leave responses to the remaining comments in the following official comment.

Comment

[W6] Contribution of randomized exploration

We also believe that achieving a $\kappa$-improved regret guarantee using optimistic exploration alone (Algorithm 3, Corollary 1) should be considered a significant contribution. Considering that randomized exploration generally requires more complex regret analysis, we believe our contribution to randomized algorithms should be seen as an additional merit, NOT a weakness.

As the reviewer acknowledged, compared to UCB, randomized exploration not only exhibits superior performance in empirical evaluations but also necessitates a more rigorous proof technique to obtain a theoretical guarantee. Proving a frequentist regret bound in particular for randomized algorithms in any given bandit or RL problem has been widely considered an open problem, as the extension to frequentist regret is non-trivial. As a result, randomized algorithms have been widely recognized as making significant contributions in both the RL and bandit literature, despite seeming similarities in the algorithmic setting after the introduction of UCB-based exploration algorithms. For example,

  • In linear bandits: UCB (Li et al., 2010) → TS (Agrawal & Goyal, ICML 2013)
  • In GLM bandits: UCB (Filippi et al., 2010) → TS (Kveton et al., AISTATS 2019)
  • In neural bandits: UCB (Zhou et al., 2020) → TS (Zhang et al., ICLR 2021; Jia et al., ICLR 2022)
  • In tabular MDPs: UCB (Jaksch et al., 2010) → TS (Agrawal & Jia, NeurIPS 2017)
  • In linear MDPs: UCB (Jin et al., 2020) → TS (Zanette et al., AISTATS 2020; Ishfaq et al., ICML 2021)
  • In MNL-MDPs: UCB (Hwang & Oh, 2023) → TS (This work)

We strongly believe that our work provides more than sufficient contribution. This is not just because we fill in the gap (which itself is a non-trivial milestone) but also because our adaptation of TS-based exploration (even for the first algorithm) in MNL-MDPs is much more complicated and involved than that of TS methods in other bandit or RL problems.


[W7] Comparison with concurrent paper

The policy of NeurIPS related to comparison to concurrent work states that "Papers appearing less than two months before the submission deadline are generally considered concurrent to NeurIPS submissions. Authors are not expected to compare to work that appeared only a month or two before the deadline." After checking the arXiv submission history of Li et al. (2024), we found that their paper was uploaded to arXiv after the NeurIPS main paper deadline. Therefore, we remind you that our NeurIPS submission was completed before their paper was made publicly available. Nevertheless, briefly comparing our work with the concurrent works, Li et al. focus on the analysis of deterministic algorithms, while our research encompasses a broader scope, including both deterministic and randomized algorithms.

Comment

We truly appreciate your valuable feedback and comments. We hope we have addressed your questions and provided the needed clarification in our responses. With our recently posted comment ("Key Contributions") in mind, we would like to know if you have any additional questions or comments that we can address. If there are any further questions or comments, we would be more than happy to address them. If our responses have sufficiently addressed your concerns, based on the points you highlighted and the strengths you mentioned, we sincerely hope you will reconsider your assessment, reflecting the value of our work. Thank you.

Comment

Thank you for the comprehensive response. Unfortunately, my concerns regarding the paper remain mostly unchanged. I will refer here only to the key points that mostly affect my decision:

[W2] Known reachable states: I reviewed the provided references (Proposition 1 in [35] and Theorem 1 from Lee & Oh (2024)). I do not agree with your interpretation of these results, which borders on misleading. Proposition 1 in [35] states that there exists an instance where the claim happens. This is crucially different from saying that any linear MDP cannot express a tabular MDP unless the condition holds. As for Theorem 1 from Lee & Oh (2024), their claim is quite weak, again saying that for a specific algorithm there might be a hidden dependency on $|S|$. Again, the claim is not that this happens for every instance and algorithm. Overall, while I am also unsure regarding the expressive power of linear MDPs, the provided claims do not convince me of their limitations. The known reachable states assumption thus remains a significant limitation.

[W3] Kappa: Notice that the regret bound is non-trivial only for $T \ge (d^{3/2}H^{1/2} + \kappa^{-1} d^2 H)^2$. This means that the exponential dependence on $\kappa$ is still pretty bad. I still appreciate the improved dependence on $\kappa$, but it's still essentially exponential most of the time. This is not a major contributor to my decision.

[W5] Computational efficiency: This is my biggest concern. I am fully aware that, given the entire $Q$ function, there are only constant additional computations. Computing the entire $Q$ function is exactly the problem, as this depends polynomially on the size of the entire state space $|S|$. Even computing only the necessary parts of $Q$ would still require at least $|\mathcal{U}|^H$ computations per episode. Linear MDPs avoid this dependence by storing all past samples and thus have computational cost scaling linearly with the number of samples. This is not ideal but is regarded as computationally efficient (unlike scaling with $|S|$, which can be arbitrarily large, or $|\mathcal{U}|^H$, which is exponential). If I missed something and you can convince me that $Q$ can be computed efficiently, then I might reevaluate my assessment.

[W7] Thanks for the clarification regarding the difference between the works. As I mentioned in my review, I'm treating this as concurrent work, i.e., it does not affect my evaluation.

I hope that you have enough time to reply. Regardless, I will be happy to engage in additional discussion with the other reviewers during the discussion period.

Comment

[W5]

The reviewer states that computational efficiency "is [their] biggest concern" in the evaluation. We strongly dispute this evaluation because computational efficiency is NOT an inherent disadvantage of our method in particular. Rather, the computational cost originates from planning, which is common to all model-based RL. Note that planning refers to the process of using a model of the environment to compute a value function. Therefore, computing value functions (or action-value functions) for a given model instance is a common and inherent characteristic of a wide range of the model-based RL literature. To list just a few examples (and there are many more),

  • Yang and Wang, 2020 [70] published at ICML 2020
  • Ayoub et al., 2020 [9] published at ICML 2020
  • Sun et al., 2019 [a] published at COLT 2019
  • Agarwal et al., 2020 [b] published at NeurIPS 2020
  • Zhou et al., 2021 [79] published at COLT 2021
  • Uehara et al., 2021 [c] published at ICLR 2022
  • Li et al., 2022 [d] published at NeurIPS 2022

To the best of our knowledge, there is NO efficient method for planning that avoids dependency on $|S|$ or that is not exponential in $H$.

Therefore, if the reviewer underestimates our paper due to concerns about the computational cost of planning, a common issue in all model-based algorithms, we believe this would be an unfair evaluation. Additionally, if the reviewer can suggest a model-based algorithm that enables more efficient planning, it would be extremely helpful for us in improving our paper.

Moreover, it seems the reviewer is comparing our (model-based) algorithm to a model-free algorithm, such as LSVI-UCB (Jin et al., 2020 [41]). We believe this comparison is invalid because each problem setup has its own unique characteristics. For instance, in LSVI-UCB, the algorithm needs to recalculate the previous targets ($r_h(x^\tau_h, a^\tau_h) + \max_a Q^{k}_{h+1}(x^\tau_{h+1}, a)$ for $\tau = 1, \dots, k-1$) based on the current $Q$-function to estimate the parameters (please refer to Line 5 in Algorithm 1 of Jin et al., 2020; note that while their $Q$-functions do not explicitly display a $k$ index, a careful examination of the algorithm reveals that the $k$ index is implicitly incorporated within the $Q$-functions). This recalculation results in a computational cost of $\mathcal{O}(k)$ and makes online parameter updates impossible. In contrast, our approach allows for online parameter updates because the target (the transition response variable $y^k_h$) remains unchanged.

To sum up, the reviewer's major concern about the planning cost applies to all model-based RL. It is an inherent and unavoidable characteristic of all aforementioned model-based approaches. If one treats this as a major drawback and reason for a negative evaluation, then the rich history/body of model-based RL would be entirely disregarded. It is crucial to distinguish between the limitations specific to our paper and those inherent to the entire model-based framework. With this in mind, we respectfully and kindly request that the reviewer reconsider the evaluation of our paper.

[a] Sun, Wen, et al. "Model-based rl in contextual decision processes: Pac bounds and exponential improvements over model-free approaches." Conference on learning theory. PMLR, 2019.

[b] Agarwal, Alekh, et al. "Flambe: Structural complexity and representation learning of low rank mdps." Advances in neural information processing systems 33 (2020): 20095-20107.

[c] Uehara, Masatoshi, Xuezhou Zhang, and Wen Sun. "Representation learning for online and offline rl in low-rank mdps." International Conference on Learning Representations (2022).

[d] Li, Gene, et al. "Exponential family model-based reinforcement learning via score matching." Advances in Neural Information Processing Systems 35 (2022): 28474-28487.

[e] Szita, István, and Csaba Szepesvári. "Model-based reinforcement learning with nearly tight exploration complexity bounds." Proceedings of the 27th International Conference on Machine Learning (ICML-10). 2010.


Since the author-reviewer discussion is coming to an end, if you have any last-minute questions, we are more than willing to address them. Please don't hesitate to reach out to us.

Comment

Thank you for your response to our rebuttal and for the opportunity to address your concerns.


[W2]

First, we would like to clarify some misunderstandings in the reviewer’s interpretation of our rebuttal about the limitations of linear MDPs.

Proposition 1 in [35] states that there exists an instance where the claim happens. This is crucially different than saying that any linear MDP cannot express a tabular MDP unless the condition holds.

We do NOT make any claims about the expressiveness of linear MDPs relative to tabular MDPs. Proposition 1 in Hwang & Oh [35] states that "For an arbitrary set of features of state and actions of an MDP, there exists no linear transition model that can induce a proper probability distribution over next states". This means that there exists a state-action feature map that cannot induce linear MDPs. The issue is that when a feature map is provided, it is impossible to determine whether a linear MDP can be constructed using that feature map. Furthermore, the reviewer may also refer to Proposition 2 in Hwang & Oh (2023) [35], which states that such restrictive linearity assumptions on the transition model can impact the regret analysis of algorithms for RL with linear function approximation. This is indeed a crucial limitation of linear MDPs, as well explained in [35].

As for Theorem 1 from Lee & Oh (2024), their claim is quite weak, again saying that for a specific algorithm there might be a hidden dependency on $|\mathcal{S}|$.

We believe there may be a misunderstanding regarding Theorem 1 in Lee & Oh (2024). Theorem 1 in Lee & Oh (2024) states that "For any given linear MDP, the dimension of the feature map $d$ is lower bounded by $|\mathcal{S}| / \mathcal{U}$." As a corollary of this theorem, "unless the size of the reachable states is proportional to the total number of states, i.e., $\mathcal{U} = \Theta(|\mathcal{S}|)$, the feature dimension inherently depends on the size of the state space." (Corollary 1 in Lee & Oh (2024))

We believe this argument clearly highlights the inherent limitations of linear MDPs. Can you think of any real-world scenario where a single state can transition to all other states, i.e., $|\mathcal{S}| = \mathcal{U}$? Lee & Oh (2024) provide examples demonstrating that in most real-world environments, $\mathcal{U}$ is much smaller than $|\mathcal{S}|$. This implies that in linear MDPs for most real-world problems, the dimension of the feature map $d$ must be proportional to $|\mathcal{S}|$. Thus, there are very few cases where algorithms for linear MDPs are statistically efficient.

The known reachable states assumption thus remains a significant limitation.

We respectfully disagree with the reviewer's assessment that having prior knowledge about reachable states is a significant limitation. In practice, it is common to have prior knowledge of the reachable states ("single-step" reachability) for a given state-action pair, as seen in examples like gridworlds, Atari games, Go and navigation tasks. This prior knowledge about reachable states makes MNL-MDPs valid for any arbitrary state-action feature sets. More importantly, even if one considers prior knowledge about reachable states a limitation of MNL-MDPs, we strongly believe this should NOT be viewed as a weakness of our algorithms. We believe the reviewer understands the difference between problem settings (MNL-MDPs) and algorithms (RRL-MNL, ORRL-MNL, UCRL-MNL+). Note that the main focus of our paper is to provide algorithms that are both statistically and computationally efficient for the MNL-MDPs, which are well-established in Hwang and Oh, 2023 [35].


[W3]

Please note that the reviewer's argument that the given regret bound is valid only for a specific range of $T$ applies not only to the bandit literature with $\kappa$-improved regret bounds [Faury et al., 2020 [23], Abeille et al., 2021 [3], Perivier and Goyal, 2022 [59], Agrawal et al., 2023 [6], Zhang and Sugiyama, 2023 [74], Lee and Oh, 2024 [48]], but also to cases where $\kappa$ is a multiplicative factor in $T$ [Filippi et al., 2010 [26], Li et al., 2017 [49], Oh and Iyengar, 2019 [52], Amani and Thrampoulidis, 2021 [8], Oh and Iyengar, 2021 [53]]. Moreover, this argument has NOT been viewed as a weakness in the publication of previous works at venues like ICML and NeurIPS. In all fairness, we strongly believe that the more seriously the reviewer takes $\kappa$, the more of a contribution we provide! Our work eliminates the previous dependence on $\kappa$ in the leading term, which we see as a significant contribution. We respectfully disagree with the notion that this contribution of reducing the $\kappa$-dependence (while we even perform online computation with constant compute per round!) should somehow be penalized (just because it was deemed insufficient?).

Review
Rating: 7

This paper studies reinforcement learning of an MDP with multinomial logistic (MNL) function approximation, which approximates the unknown transition kernel of the underlying MDP. The setting is the finite horizon episodic MDPs. Two algorithms are proposed. The first algorithm is statistically efficient with a provable frequentist regret bound, and is computationally efficient with constant-time computational cost per episode. To further improve the dependence on problem constant of the first algorithm, the second algorithm utilizes gradient information of the MNL model of the unknown transition. These two algorithms are the first randomized algorithms for the MNL transition model with both statistical and computational efficiency. Furthermore, numerical simulations show the good performance of the proposed algorithms.

Strengths

Significance:

1. The MNL transition model is a reasonable model for approximation to the unknown transition kernel and thus has potential practical usage.

2. The two algorithms are provably statistically efficient, which is further corroborated by numerical simulations.

Originality: This paper has originality. It proposes to use multinomial distribution for modeling the transition kernel of MDP, and proposes the first efficient randomized algorithms for the MNL transition model.

Clarity: the main text of the paper is clearly presented: the algorithms are explained well (e.g., the intuition behind the stochastically optimistic value function is provided); the theorems are presented with an in-depth discussion of their relation to existing works and a comparison of the proof techniques with other algorithmic guarantees in the area.

Weaknesses

I did not find any major errors or technical flaws. However, the following can be improved.

Weaknesses

1. The theoretical analysis, especially the appendix, is a bit hard to read. This is caused by two issues: (i) there are many displayed equations that are very long; (ii) some derivations/steps are missing a proper explanation.

2. The formatting of the equations in the appendix should be improved. Many lines of equations significantly exceed the line-length limit. The authors should consider breaking these equations into combinations of shorter ones with proper explanation.

Other Minor issues

1. This paper so far only demonstrates the efficacy of the multinomial logistic model for approximating a synthetic MDP environment. Although the result from the numerical simulation is convincing, it would be very interesting to see how it works when applied to more realistic environments. This might be an interesting future direction.

Questions

My concerns are included in the Weaknesses section.

Limitations

Yes.

Author Response

Thank you for taking the time to review our paper and for your valuable feedback. Here are our responses to each comment and question:


[W1] Presentation

there are many displayed equations that are very long

The comprehensive nature of our analysis required the inclusion of detailed proofs, resulting in lengthy equations. Some of these equations slightly exceeded the line limit to enhance readability. In the revised version, we will ensure these equations are displayed more clearly. Additionally, we have organized the definitions and notations used in our analysis in Appendix B for convenience, so please refer to it.

some derivations/steps are missing a proper explanation

If there are specific derivations or steps that were not clearly explained, we will address these in the revised version. We would greatly appreciate it if you could point out any particular areas that were unclear, so we can improve them.


[W2] Format

We will pay closer attention to the formatting in the revised version, breaking long equations into shorter ones with proper explanations. Thank you for pointing this out.


[M1] Experiment on realistic environment

Among dozens (if not hundreds) of papers on provable RL with function approximation, there are very few papers with numerical experiments. The vast majority of the papers do not contain any experiments at all. To the best of our knowledge, the ones with the experiments (with regret guarantees) are only [9, 35, 37] -- we would be more than happy to add to this list if the reviewer knows of any other related theory RL work that contains more extensive experiments. Hence, we would like to point out that we conducted a significant number of experiments compared to the existing theoretical RL papers [41, 66, 69, 70]. As this is a theoretical paper, we respectfully request that it be mainly assessed based on its theoretical merit. Yet, as the number of states increases, our proposed methods are the ones that remain competitive and superior to the existing methods, showing the efficacy of MNL function approximation and empirical evidence for our theoretical claims for our algorithms.

Comment

We truly appreciate your valuable feedback and comments. We hope we have addressed your questions and provided the needed clarification in our responses. With our recently posted comment ("Key Contributions") in mind, we would like to know if you have any additional questions or comments that might support increasing your rating. If there are any further questions or comments, we would be more than happy to address them. If our responses have sufficiently addressed your concerns, based on the points you highlighted and the strengths you mentioned, we sincerely hope you will reconsider your assessment, reflecting the value of our work. Thank you.

Comment

I thank the authors for their response. Please make sure to address the aforementioned issues in the paper revision.

Best, Reviewer

Comment

Thank you so much for your response to our rebuttal and your support! With the help of your feedback, we believe that our revised version will be strengthened. We will incorporate your valuable suggestions in the revised version of the paper. If there are any additional questions or comments, we would be more than happy to address them.

Review
Rating: 6

The paper introduces a computationally efficient randomized algorithm for MNL-MDPs, addressing limitations of previous exploration algorithms. It presents the RRL-MNL algorithm, focusing on online transition core estimation and stochastically optimistic value function design. This paper provides a frequentist regret analysis for MNL-MDPs with theoretical comparisons to other algorithms.

Strengths

  1. The paper is well-structured and effectively communicates complex theoretical concepts.

  2. The proposed algorithm, ORRL-MNL, is interesting, and its regret bound shows an attractive improvement.

Weaknesses

  1. The focus on multinomial logistic function approximation and specific types of state transitions (inhomogeneous) may restrict the generalizability of the proposed algorithms.

  2. The paper lacks discussion of the MNL-MDP setting to show that it is worth studying.

Questions

  1. Can you give a more detailed comparison between the MNL-MDP and linear MDP settings? Can you give a detailed discussion of the MNL-MDP setting and show that it is worth studying? What is the connection between this setting and generalized linear MDPs?

  2. Can you summarize the theoretical challenges of randomized algorithms in MNL-MDPs compared with that in Linear MDPs?

  3. Can you explain the necessity of Assumption 4, which seems to lack a realizability motivation? I find that you mention this assumption is also used in generalized linear bandits; can you compare the assumptions in generalized linear bandits and in MNL-MDPs? Also, how about generalized linear MDPs?

  4. The regret bound of ORRL-MNL is interesting. Do you have any new theoretical analysis techniques to obtain this?

Limitations

This paper lacks discussion of the MNL-MDP setting to show that it is worth studying. I hope the authors can answer my questions.

Author Response

Thank you for taking the time to review our paper and for your valuable feedback. Here are our responses to each comment and question:


[W1] Generalizability of the proposed algorithms

We respectfully disagree that the stated point is a weakness of our work. For instance, in the contextual bandit setting, the logistic bandit assumes that the expected reward for an action is given by a logistic function of the action's context features. The logistic bandit addresses a specific type of reward setting but can also handle problem instances that the linear bandit cannot. Additionally, substantial research [3, 23, 24, 49, 43, Lee et al. (2024)] has been conducted to develop improved algorithms tailored to this specific logistic reward setting. We would like to emphasize that our main contribution in this paper is also to present an improved algorithm for MNL-MDPs, overcoming several limitations of linear MDPs, which we will detail in the second bullet point.

Lee et al. "Improved Regret Bounds of (Multinomial) Logistic Bandits via Regret-to-Confidence-Set Conversion." AISTATS 2024.


[W2 & Q1] Discussion on MNL-MDPs & Comparison with Linear MDPs

We are happy to address this issue. Although improved algorithms for linear MDPs continue to be proposed, linear MDPs inherently have fundamental limitations:

  1. As introduced in [35], a linear function approximating the transition model must ensure that the function output is within [0, 1] and that the probabilities over all possible next states sum to 1. This restrictive assumption limits the set of feature representations of state-action pairs that are admissible for the transition model (Proposition 1 in [35]).

  2. Additionally, Theorem 1 from Lee & Oh (2024) shows a fundamental limitation of linear MDPs. In linear MDPs, the ambient feature dimension $d$ is lower bounded by $|\mathcal{S}|/\mathcal{U}$. This implies that unless the size of the reachable states is proportional to the total number of states, i.e., $\mathcal{U} = \Theta(|\mathcal{S}|)$, the feature dimension inherently depends on the size of the state space.

To overcome these limitations, [35] proposed a new class of MDPs (MNL-MDPs) using multinomial logistic function approximation. MNL-MDPs have garnered significant attention in the RL community, leading to many follow-up works (Park & Lee, 2024; Li et al., 2024) and numerous open research questions.

If the generalized linear MDPs you mentioned refer to [67], they assume the Bellman backup of any value function to be a generalized linear function of feature mapping and propose a model-free algorithm under this assumption. In contrast, in MNL-MDPs, the state transition model is given by a multinomial logistic model based on state-action features, and we propose a model-based algorithm.

Lee and Oh. "Demystifying Linear MDPs and Novel Dynamics Aggregation Framework." ICLR 2024.

Li et al. "Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation." arXiv 2024.

Park and Lee. "Reinforcement Learning for Infinite-Horizon Average-Reward MDPs with Multinomial Logistic Function Approximation." arXiv 2024.


[Q2] Theoretical challenges of randomized exploration in MNL-MDPs

  • Absence of a closed-form expression of the value function in MNL-MDPs: Unlike linear MDPs, where the value function is linear in the feature map, in MNL-MDPs the value function is no longer linearly parameterized. Therefore, we cannot control the perturbation of the estimated value function in MNL-MDPs by perturbing the estimated parameter directly, as in linear MDPs. However, as shown in Lemma 4 (RRL-MNL) and Lemma 16 (ORRL-MNL), we are able to express the prediction error in terms of the estimation error of the transition parameter, providing a basis for controlling the perturbation of the value function.
  • Ensuring the stochastic optimism of the estimated value function: The main technical challenge in analyzing the frequentist regret of randomized algorithms is ensuring that the estimated value function is optimistic with sufficient frequency. However, the substitution effect of the next-state transition in the MNL model makes ensuring stochastic optimism much more challenging. If the probability of the estimated value function being optimistic at horizon $h$ is denoted as $p$, the probability that the estimated value function at the initial state is optimistic would be on the order of $p^H$, implying that the regret can increase exponentially with the horizon length $H$. Instead, we establish that the difference between the estimated value and the optimal value is expressed by the sum of Bellman errors along the sample path obtained by the optimal policy (Lemma 4) and adapt the optimistic sampling technique to ensure stochastic optimism with sufficient frequency (Lemmas 6 & 18). Please note that while [37] presented a frequentist regret analysis for general function classes using the eluder dimension under the assumption of stochastic optimism, our work deals with non-linear function approximation and directly establishes stochastic optimism without any additional assumptions.

Due to the space limit, we will leave responses to the remaining comments in the following official comment.

Comment

[Q3] Necessity of Assumption 4

First, we respectfully request additional clarification regarding the term "realizability motivation".

As far as we know, Assumption 4 is a standard regularity assumption in the generalized linear bandit and MNL bandit literature to ensure that the Fisher information matrix is invertible. Please refer to Appendix A in [52] for a detailed discussion of this assumption. In fact, we would like to clarify that it is more natural to describe the existence of $\kappa$ in Assumption 4 as a definition rather than an assumption. In other words, we define $\kappa$ as the specific value $\kappa := \inf_{\theta \in \mathcal{B}_d(L_\theta)} \inf_{s,a,s',\tilde{s}} P_{\theta}(s' \mid s, a) P_{\theta}(\tilde{s} \mid s, a)$. Note that by the definition of the MNL transition model, for any reachable state $\tilde{s}$ of a given state-action pair $(s,a)$ and for any parameter $\theta$, $P_\theta(\tilde{s} \mid s,a) \ne 0$, which ensures the positivity of $\kappa$. As in the previous generalized linear bandit [26, 49, 23, 3, 24] and MNL bandit literature [52, 8, 53, 59, 6, 74, 48], our first algorithm also requires prior knowledge of $\kappa$ to estimate the transition core $\theta^*$. However, we would like to emphasize that, as mentioned in Remark 4, our second algorithm does not require prior knowledge of $\kappa$ and achieves a regret bound with a better dependence on $\kappa$ (i.e., ORRL-MNL does not require Assumption 4).
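
For illustration only, the quantity being bounded can be computed for one fixed parameter as follows (Assumption 4 additionally takes the infimum over the parameter ball; the function and its inputs are illustrative):

```python
import numpy as np
from itertools import combinations

def kappa_for_fixed_theta(prob_tables):
    """Smallest product P(s'|s,a) * P(s_tilde|s,a) over all state-action
    pairs and all pairs of distinct reachable next states, for one fixed
    transition core (the assumption also takes an infimum over theta)."""
    worst = np.inf
    for probs in prob_tables:                      # one array per (s, a)
        for i, j in combinations(range(len(probs)), 2):
            worst = min(worst, probs[i] * probs[j])
    return worst
```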

For the generalized linear MDPs [67], they assume the Bellman backup of any value function is given by a generalized linear function (denoted by $f$ in [67]) of the feature map and impose a regularity condition on the GLM link function $f$; prior knowledge of $\kappa$ is required to ensure the regret bound of their proposed algorithm.


[Q4] New theoretical analysis techniques for ORRL-MNL

We appreciate your interest in our regret analysis. Below are our new theoretical ingredients:

  • Establishing κ\kappa-independent upper bound of prediction error:
    With the technical difficulties of randomized exploration in MNL-MDPs, the technical challenges for achieving statistically improved regret bound in MNL-MDPs involve integrating the κ\kappa-independent concentration result from MNL contextual bandits with Bellman error while ensuring that the regret bound is unaffected by the size of the set of reachable states U\mathcal{U}. As introduced in Appendix D.2, directly applying recent κ\kappa improved MNL contextual bandit technique to MNL-MDPs leads to a loose regret by a factor of U\sqrt{\mathcal{U}}. By adapting the feature centralization technique [48], the Hessian of the per-round loss is represented by the centralized feature and we establish that the prediction error is upper bounded without depending on κ\kappa (Lemma 16). We believe that the reviewer understands that the involved regret analysis results not from a single new technique but from the careful incorporation of several techniques.

  • Ensuring stochastic optimism:
    Based on these upper bounds, it still remains to determine how, and to what extent, to perturb the value function to ensure stochastic optimism of the estimated value function. Since the feature of each reachable state affects the prediction error (the first term on the right-hand side of Lemma 16), the probability of stochastic optimism can be exponentially small not only in the horizon $H$ but also in the number of reachable states $|\mathcal{U}|$. However, as shown in Lemma 18, this challenge is overcome by using a sample size $M$ that increases only logarithmically with $|\mathcal{U}|$. We emphasize that the overall regret analysis results not from a single new technique but from the careful incorporation of several techniques.
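To make the feature centralization step concrete, the following is a minimal sketch (our own illustration, not the paper's code; the feature matrix `varphi` and the helper names are hypothetical, and any outside-option normalization in the exact MNL parameterization is ignored). It writes the per-round Hessian of the MNL negative log-likelihood as a probability-weighted sum of outer products of centralized features, which is the representation that avoids an explicit $\kappa^{-1}$ factor in the prediction-error bound:

```python
import numpy as np

def mnl_probs(varphi, theta):
    """Softmax (MNL) probabilities over the reachable next states."""
    logits = varphi @ theta
    logits -= logits.max()
    w = np.exp(logits)
    return w / w.sum()

def centralized_hessian(varphi, theta):
    """Per-round Hessian of the MNL negative log-likelihood written with
    centralized features:
        H = sum_i p_i (phi_i - phi_bar)(phi_i - phi_bar)^T,
    where phi_bar = sum_i p_i phi_i is the probability-weighted mean feature."""
    p = mnl_probs(varphi, theta)                 # (n,)
    phi_bar = p @ varphi                         # (d,)
    centered = varphi - phi_bar                  # (n, d)
    return (centered * p[:, None]).T @ centered  # (d, d)
```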

Comment

We truly appreciate your valuable feedback and comments. We hope our responses have addressed your questions and provided the needed clarification. With our recently posted comment ("Key Contributions") in mind, we would like to know if you have any additional questions or comments; we would be more than happy to address them. If our responses have sufficiently addressed your concerns, then, based on the points and strengths you highlighted, we sincerely hope you will reconsider your assessment to reflect the value of our work. Thank you.

Comment

I thank the authors for the detailed response, which addresses my questions and concerns. After further consideration, I have decided to raise my score to 6. I suggest the authors add some discussion based on the rebuttal to make the paper clearer in the revised version. Thanks.

Comment

Thank you so much for your response to our rebuttal and for your support! With the help of your feedback, we believe our revised version will be strengthened. We will incorporate your valuable suggestions in the revised version of the paper. If there are any additional questions or comments, we would be more than happy to address them.

Comment

We sincerely appreciate the valuable feedback and comments from all reviewers, as well as the time and effort you invested in reviewing our paper and participating in the discussion. We hope our responses have adequately addressed your main concerns. If you have any further questions or comments, we would be more than happy to address them. In the meantime, we kindly wish to highlight some key points:


  • Both computationally & statistically efficient algorithms in MNL-MDPs:

    Beyond the restrictive linearity assumption inherent in linear MDPs, MNL-MDPs have recently been introduced to overcome such limitations and have garnered significant attention in the RL community. For these non-linear parametric MDPs, we propose computationally tractable randomized algorithms (RRL-MNL & ORRL-MNL) that achieve both computational and statistical efficiency.

  • First frequentist regret analysis for RL with non-linear function approximation without assuming stochastic optimism:

    Ensuring stochastic optimism with sufficient frequency is a crucial challenge in the frequentist regret analysis of randomized algorithms. Previous randomized algorithms for RL with non-linear function approximation either assume stochastic optimism [37] or are computationally intractable [4, 5, 73]. In contrast, we directly derive the stochastic optimism of our proposed algorithms without any additional assumptions. We believe these technical ingredients will be of interest for randomized algorithms in RL with non-linear function approximation.

  • Statistically improved algorithms (ORRL-MNL & UCRL-MNL+) for MNL-MDPs:

    Improving the dependence on the problem-dependent constant $\kappa$ is a research topic that has recently garnered significant attention in logistic bandits and MNL bandits. However, due to the fundamental differences between the bandit and RL settings, the techniques from previous work [59, 74] result in sub-optimal regret in MNL-MDPs. By establishing a $\kappa$-independent upper bound on the prediction error, we demonstrated that our proposed algorithms achieve a $\kappa$-improved regret bound without requiring prior knowledge of $\kappa$.

  • Superiority of our algorithms in experiments:

    Note that only a few papers on provable RL with function approximation include numerical experiments. While our work is mainly theoretical, our extensive experiments, more thorough than those in many theoretical RL papers, validate our methods. As the number of states increases, our algorithms demonstrate superior statistical and computational performance in MNL-MDPs.


Given the significance of our work, we strongly believe that our submission well surpasses the threshold for a sufficient contribution. With your constructive feedback and the contents of the main rebuttal in mind, we kindly hope you will reconsider your score to more accurately reflect the potential and impact of our work.

Final Decision

A summary of the strengths and weaknesses based on the reviews and rebuttal (including the follow-up discussion and that among the reviewers) is provided below:

STRENGTHS

  1. This paper proposes novel randomized algorithms for MNL-MDPs that are efficient statistically.

  2. The authors have provided a regret analysis and an improved regret bound for their proposed algorithms without assuming stochastic optimism.

  3. The main paper is well-organized and clearly presented.

WEAKNESSES

  1. As mentioned by Reviewer rYza, we agree that it is not correct for the authors to claim that their method is computationally efficient; this is misleading. As the authors have said, "there is NO efficient method for planning that avoids dependency on $|S|$ or that is not exponential in $H$", though anytime algorithms have been used to mitigate the latter issue. In such a case, we would like the authors to remove all claims that their method is computationally efficient, since it is not.

  2. This paper initially lacks a discussion of the positioning and challenges (in particular, the theoretical analysis) of its proposed work w.r.t. linear MDPs. The authors have adequately addressed this concern in the rebuttal.

  3. Multiple reviewers agreed that the readability of the derivations and proofs has room for improvement, e.g., by reducing lengthy equations with shorthand notation, stating the results used at each step of a proof or derivation, and providing proof sketches.

Nonetheless, the pros of this work outweigh the cons.

Also, the authors are requested to make clear the following points, as requested by the reviewers during the discussion:

  • The authors explained that Eq. (2) aims to minimize $\sum_{k=1}^N \ell_{k,h}(\theta)$; why is only $\nabla \ell_{k-1,h}$ used? A similar question applies to Eq. (5).

  • Also, the new notation si1,,siSkmhs_{i_1},\ldots,s_{i_{\mathcal{S}_{kmh}}} in line 170 should be defined.

As a minor comment, while we agree that the main contributions of this work are theoretical and the authors have provided numerical experiments (unlike several prior works), the authors can ask themselves why stop at that and not go beyond to more realistic environments (is it simply because no other work has done it?). Why not be one of the first few works to move the needle by demonstrating the practicality of model-based RL? If not, the authors can discuss what stands in their way to be able to achieve this and their position on whether there is any hope of doing so by pursuing this line of work. We would strongly urge the authors to reconsider their position here.

The authors are encouraged to revise their paper based on the above feedback and that of the reviewers.