PaperHub
Overall rating: 6.3/10 (Poster; 4 reviewers; lowest 6, highest 7, std. dev. 0.4)
Individual ratings: 6, 6, 7, 6
Confidence: 3.3 · Correctness: 3.0 · Contribution: 3.3 · Presentation: 2.3
NeurIPS 2024

Offline Oracle-Efficient Learning for Contextual MDPs via Layerwise Exploration-Exploitation Tradeoff

OpenReview | PDF
Submitted: 2024-05-15 · Updated: 2024-12-21
TL;DR

This paper presents a statistical and computational reduction from CMDPs to offline density estimation with little overhead.

Abstract

Motivated by the recent discovery of a statistical and computational reduction from contextual bandits to offline regression \citep{simchi2020bypassing}, we address the general (stochastic) Contextual Markov Decision Process (CMDP) problem with horizon $H$ (also known as a CMDP with $H$ layers). In this paper, we introduce a reduction from CMDPs to offline density estimation under the realizability assumption, i.e., a model class $\mathcal{M}$ containing the true underlying CMDP is provided in advance. We develop an efficient, statistically near-optimal algorithm requiring only $O(H \log T)$ calls to an offline density estimation algorithm (or oracle) across all $T$ rounds. This number can be further reduced to $O(H \log \log T)$ if $T$ is known in advance. Our results mark the first efficient and near-optimal reduction from CMDPs to offline density estimation without imposing any structural assumptions on the model class. A notable feature of our algorithm is the design of a layerwise exploration-exploitation tradeoff tailored to address the layerwise structure of CMDPs. Additionally, our algorithm is versatile and applicable to pure exploration tasks in reward-free reinforcement learning.
Keywords
reinforcement learning, contextual MDP, offline density estimation, computational efficiency

Reviews and Discussion

Review (Rating: 6)

In this paper, the authors study the problem of learning stochastic contextual MDPs using a realizable and finite model class $\mathcal{M}$ accessed via an offline density estimation oracle, which has the guarantee of minimizing the expected error measured by the squared Hellinger distance. Under the assumption that the oracle is efficient, the presented algorithm is efficient in terms of both running time and oracle call complexity. It achieves a regret bound of $\widetilde{O}(\sqrt{H^7 S^4 A^3 T \log(|\mathcal{M}|/\delta)})$, where $T$ is the number of episodes, $H$ is the horizon, and $S, A$ are the finite cardinalities of the state and action spaces, respectively. In this bound, the dependence on $T, |\mathcal{M}|$ is optimal, while the dependence on $H, S, A$ is non-optimal. Their approach is a direct extension of Inverse-Gap-Weighting (IGW) [Foster and Rakhlin (2020), Simchi-Levi and Xu (2021)] to CMDPs, using the value function and a policy cover instead of the immediate reward to define the approximated suboptimality gap. For the model approximation, they use an offline density estimation oracle (rather than the regression oracle used in previous works on CMAB).

The authors also show that their method applies to the reward-free setting. This is possible because their approach applies strong exploration, which comes at the cost of a high dependence on $H, S, A$.

Strengths

  1. This paper mainly discusses the question of deriving regret bounds for stochastic CMDPs using offline oracles, which is an interesting open question in RL theory.
  2. The proposed algorithm successfully extends the IGW technique to derive a rate-optimal regret bound for CMDPs. Due to the strong exploration property of their method, the authors were also able to show that it performs well for the problem of reward-free exploration.

Weaknesses

  1. In my opinion, the writing requires some improvements. For instance, the algorithm is hard to understand due to some definitions that appear after it (instead of before it). An intuitive explanation of the use of the trusted occupancy measures set is missing; the use of them to derive a multiplicative guarantee is not trivial and not well explained.

  2. The authors focus on their logarithmic oracle complexity as their main novelty, whereas in CMDPs, as the previous literature shows, an oracle complexity of $O(T)$ is considered efficient and acceptable, and the focus should be on using the minimal assumptions regarding the oracle and function class.

  3. Continuing point 2, in this work the authors assume the oracle has a convergence guarantee w.r.t. the expected squared Hellinger distance applied to the CMDP class directly, and in Appendix C.1 they provide an implementation using maximum likelihood estimation, which can be hard to optimize for many function classes. Also, throughout the paper, the oracle's efficiency is not discussed at all, and no examples of implementations of it for specific easy classes such as linear functions, tabular MDPs, and more are given. My main concern is that this oracle is not practical, whereas the truly interesting question in stochastic CMDPs is to derive regret bounds using oracles that can potentially be implemented efficiently (at least for simple classes such as linear functions). Moreover, the squared Hellinger distance has a strong relation to the log loss (as established by Foster et al. 2021), and no reference to or use of that relation appears in this paper. Instead, the authors use an oracle whose efficiency is unclear and not discussed. I would appreciate it if the authors would explain why they chose to use a convergence guarantee w.r.t. the squared Hellinger distance rather than other convex loss functions (such as the log loss, for instance) that clearly yield an efficient oracle for simple function classes such as linear functions.

  4. Regarding the computational complexity: in each round the authors compute a policy cover for each layer, state, and action. This has a significant cost that is completely ignored. I would expect the authors to present their algorithm's running-time complexity, excluding the oracle's complexity.

  5. There are recent, relevant works on CMDPs that are not mentioned in the related literature review. See [1,2,3,4] for references. I also think that a comparison with the results of [2,3,4] might be relevant.

[1] Contextual Markov Decision Processes. Hallak et al. (2015).

[2] Contextual Markov Decision Processes using Generalized Linear Models. Modi and Tewari. (2019).

[3] Efficient rate optimal regret for adversarial contextual MDPs using online function approximation. Levy et al. (2023).

[4] Eluder-based Regret for Stochastic Contextual MDPs. Levy et al. (2024).

Questions

  1. In line 4 of Algorithm 1, the segment loop is unclear to me. Can you please explain what it is used for? As I understand it, the segment loop runs over the CMDP's layers and hence should be inside the rounds loop.
  2. The oracle guarantee in Definition 2.1 is w.r.t. the predicted model $\hat{M}$ and the true model $M_\star$. It is unclear to me how it directly implies Lemma 3.2, unless the oracle is actually more powerful than described in Definition 2.1 and is assumed to simultaneously optimize the squared Hellinger distance of both a reward function class and a dynamics function class. If this is indeed the situation, it is unclear to me, as the rewards do not define a distribution, and the squared Hellinger distance is defined for distributions. Also, in this case, it is unclear to me why the squared Hellinger distance is an appropriate loss for the reward approximation. Previous works on CMAB (e.g., Simchi-Levi and Xu 2021, Foster and Rakhlin 2020) show that least squares (which is less sensitive than the squared Hellinger distance in [0,1]) is an appropriate loss choice for the reward approximation.
  3. Please refer to the concerns regarding the oracle raised in the weaknesses, points 2 and 4. Specifically, I would appreciate an explanation of why you chose this specific oracle and whether it can be implemented efficiently.
  4. I do not understand how the authors derived the fully multiplicative lower bound over the trusted occupancy measures. As I understand it, there is a missing additive term. I would appreciate it if the authors could explain that point, since if it does not hold true, the whole analysis is incorrect.

Limitations

  1. The regret bound is not tight in $H, S, A$.
  2. The oracle might be inefficient.
Author Response

In my opinion, the writing requires some improvements. For instance, the algorithm is hard to understand due to some definitions that appear after it (instead of before it). An intuitive explanation of the use in the trusted occupancy measures set is missing, the use of them to derive a multiplicative guarantee is not trivial and not well explained.

Thanks for the suggestions! We deferred the definition due to its length, but we will add an informal definition upfront for clarity. At a very high level, the intuition behind the multiplicative guarantee is as follows: when estimating a Bernoulli variable, if the empirical mean is greater than O(1/sample size), then with high probability, the empirical mean can be upper bounded by a constant multiple of the true mean.
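
(Illustrative aside, not part of the authors' rebuttal: the Bernoulli intuition above can be checked with a quick Monte Carlo simulation. The cutoff $8\log(1/\delta)/n$ and the factor 4 below are arbitrary illustrative constants, not the ones used in the paper's proof.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, trials = 500, 0.01, 20000
cutoff = 8 * np.log(1 / delta) / n   # an O(log(1/delta)/n) threshold (illustrative constant)

for p in [0.001, 0.01, 0.05, 0.2]:
    # Empirical means of n Bernoulli(p) samples, repeated over many independent trials.
    phat = rng.binomial(n, p, size=trials) / n
    # "Bad" event: the empirical mean clears the cutoff yet exceeds 4 times the true mean.
    bad = np.mean((phat >= cutoff) & (phat > 4 * p))
    print(f"p={p:<6} P[phat >= cutoff and phat > 4p] ~= {bad:.4f}")
```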

The authors focus on their logarithmic oracle complexity as their main novelty, whereas in CMDPs, as the previous literature shows, an oracle complexity of $O(T)$ is considered efficient and acceptable, and the focus should be on using the minimal assumptions regarding the oracle and function class.

We respectfully disagree on this point. There are two significant advantages to using an offline oracle. First, we would like to emphasize again the exponential difference between $O(T)$ and $O(\log T)$. With $T$ known beforehand, the LOLIPOP algorithm achieves an oracle complexity of $O(\log\log T)$. We believe this is a substantial improvement, since the gap can be doubly exponential. The second advantage is the type of oracle required. Previously, to achieve near-optimal regret, algorithms required an online oracle. This is disadvantageous since online oracles are less straightforward to implement and necessitate updating the prediction at each round, which can be costly in practice. The offline oracle that the LOLIPOP algorithm uses can be implemented by ERM on the log loss. This is beneficial in practice, since regression is much better understood than online updates, and the learner can choose to update only when a sufficient amount of new data has been gathered. Overall, we believe our algorithm is more practical.

Continuing point 2, in this work the authors assume the oracle has a convergence guarantee w.r.t. the expected squared Hellinger distance applied to the CMDP class directly, and in Appendix C.1 they provide an implementation using maximum likelihood estimation, which can be hard to optimize for many function classes. Also, throughout the paper, the oracle's efficiency is not discussed at all, and no examples of implementations of it for specific easy classes such as linear functions, tabular MDPs, and more are given. My main concern is that this oracle is not practical, whereas the truly interesting question in stochastic CMDPs is to derive regret bounds using oracles that can potentially be implemented efficiently (at least for simple classes such as linear functions).

We thank the reviewer for sharing their view. Continuing our argument, the MLE oracle is essentially ERM on the log loss, making it more implementable and efficient than the online oracles typically considered in the literature. Additionally, due to the "online-to-batch" conversion, any online oracle can be converted to an offline one.

We are happy to discuss further the efficiency of offline oracles versus online ones. However, we note that this topic is somewhat outside the scope of this paper, as the efficiency and implementation of MLE is a classical statistical issue [5] rather than a decision-making one. Regarding linear functions, we are unsure what the reviewer means, since we are focused on density estimation. The most popular class, the logistic loss corresponding to the softmax distribution over states, is convex and thus allows for an efficient implementation.

[5] Maximum likelihood estimation: Logic and practice, SR Eliason, 1993

Moreover, the squared Hellinger distance has a strong relation to the log loss (as established by Foster et al. 2021), and no reference to or use of that relation appears in this paper. Instead, the authors use an oracle whose efficiency is unclear and not discussed. I would appreciate it if the authors would explain why they chose to use a convergence guarantee w.r.t. the squared Hellinger distance rather than other convex loss functions (such as the log loss, for instance) that clearly yield an efficient oracle for simple function classes such as linear functions.

The Hellinger distance has become quite standard in considerations related to MDPs since Foster et al. (2021), as referenced by all the compared works. We are happy to add a citation crediting Foster et al. (2021). We chose the Hellinger distance for the same reason as the references: it can be effectively controlled through the log loss, specifically by MLE (log-loss regression), for our purpose.

Comment

I thank the authors for their rebuttal. However, my concerns below have not been addressed, and I would be happy if the authors could elaborate more.

  1. Oracle:
  • I would be happy if the authors could provide an example of a function class of MDPs, as specified in Definition 2.1, for which the oracle can be implemented efficiently.
  2. Regarding Lemma 3.4: I have not claimed there is a mistake, but rather wish to understand intuitively how such a result is possible. I read the proof in the appendix, and it was hard to follow. For that reason, I could not point out any specific fault. It is very non-trivial and non-intuitive to derive a multiplicative lower bound over occupancy measures that holds w.p. 1 based on a guarantee that holds in expectation over the function approximation. At the least, I would expect a logarithmic term that follows from a martingale-related concentration bound. Also, as the context is stochastic and the oracle guarantee is in expectation over the contexts, the conditions in this lemma are non-trivial. I would be happy to have the authors' explanation on that point.

Best, The reviewer

Comment

We thank the reviewer for the further clarification! It helps us greatly in understanding the questions!

I would be happy if the authors could provide an example of a function class of MDPs, as specified in Definition 2.1, for which the oracle can be implemented efficiently.

Regarding this question, we would like to progressively mention three points (a bit repetitive following our rebuttal, but bear with us):

  1. Since any online estimation oracle can be transferred to an offline one efficiently, even if one only possesses an efficient online oracle, our result still improves the number of oracle calls from $O(T)$ to $O(\log T)$, which is already interesting.

  2. Online estimation oracles for two examples are presented in [1], i.e., the multinomial logit model and linear combinations of MDPs. Following point 1, our algorithm immediately improves on the number of oracle calls.

  3. For the two aforementioned examples, direct optimization methods (rather than the online optimization algorithm considered in [1]) can be applied on a batch to objective (5) in [1], since we only need offline guarantees. Since objective (5) is smooth and strongly convex, numerous efficient optimization methods are known and ready to apply.

[1] Contextual Markov Decision Processes using Generalized Linear Models, Aditya Modi, Ambuj Tewari.

Regarding Lemma 3.4: I have not claimed there is a mistake, but rather wish to understand intuitively how such a result is possible. I read the proof in the appendix, and it was hard to follow. For that reason, I could not point out any specific fault. It is very non-trivial and non-intuitive to derive a multiplicative lower bound over occupancy measures that holds w.p. 1 based on a guarantee that holds in expectation over the function approximation. At the least, I would expect a logarithmic term that follows from a martingale-related concentration bound.

Also, as the context is stochastic and the oracle guarantee is in expectation over the contexts, the conditions in this lemma are non-trivial. I would be happy to have the authors' explanation on that point.

Thanks for your clarification! These are two great questions!

We answer the second question first. Notice that the statement of Lemma 3.4 actually has an assumption on the context $c$ under consideration; that is, the multiplicative bounds are only true for contexts where the expected Hellinger distance given the context is small. For contexts where this assumption is not satisfied, in line 514, we bound the divergence in value functions by 1.

The first question is quite non-trivial, so we hope to provide enough intuition to convince you. Line 479 coincides with the concentration inequality, with the Hellinger distance being the additive term. The intuition here is that if the Hellinger distance is small, as required by the assumption in Lemma 3.4, say as small as $\log(M)/n$, then if we require $\hat{P}$ to be larger than $(H+1)\log(M)/n$, we would have $\hat{P} \leq (1+1/H)^2 P_\star$. And indeed, the trusted transitions are those larger ones. This argument serves only as intuition, because we cannot have per-state-action-pair control over the Hellinger distance, even with the assumption in Lemma 3.4, due to the weighting from the occupancy measure. To address this issue, we developed a proof by contradiction from line 480 to line 488. This proof by contradiction is also non-trivial; we are happy to elaborate further to convince you, or to walk you through it step by step at the poster if we get in.

Best regards, Authors

Comment

I thank the authors for their response. I would be happy to engage with them some more on the above points.

Regarding reference [1], unfortunately this does not answer my question. As far as I understand the paper, it does not take an oracle approach. In [1], the authors derive optimistic algorithms for GLM models (which apply to the cases you specified) by defining confidence bounds based on principles similar to linear regression. They have not used or proved a convergence guarantee as in your definition. This brings me back to my original question.

Regarding Lemma 3.4: I would be happy if the authors could elaborate more. In particular, if the oracle is used as a black box, how can a guarantee such as $\hat{P} \leq (1+1/H)^2 P_\star$ be promised without intervening in the oracle's choice?

I thank the authors in advance for the interesting discussion.

Comment

Please refer to the concerns regarding the oracle raised in weakness, points 2 and 4. Specifically, I would appreciate an explanation of why you chose this specific oracle and whether it can be implemented efficiently.

Thanks for the question. Please refer to our rebuttal corresponding to points 2 and 4.

I do not understand how the authors derived the fully multiplicative lower bound over the trusted occupancy measures. As I understand it, there is a missing additive term. I would appreciate it if the authors could explain that point as if it does not hold true, the whole analysis is incorrect.

Could you please elaborate on your question? We are not sure we understand what you mean when you say we are missing an additive term. This is a serious accusation, so we ask you to reconsider and point out exactly where you believe the flaw is, if there is one. For a technical overview, please refer to our general rebuttal and the rebuttal to Reviewer 2.

Comment

Regarding the computational complexity: in each round the authors compute a policy cover for each layer, state, and action. This has a significant cost that is completely ignored. I would expect the authors to present their algorithm's running-time complexity, excluding the oracle's complexity.

We do mention the total computational cost right after our main result, Theorem 3.2, in a paragraph titled "Computational efficiency" in line 147: "Thus, the computational complexity is O(log T) oracle calls over T rounds, with an additional per-round cost of O(poly(H, S, A, log T))." We also mention right after Lemma 3.1, in line 185, that "The computation for the policy π^{h(t),s,a}_{m(t),c_t} for any t, s, a can be computed in poly(H, S, A, log T) time by formulating it as a linear fractional programming problem. We defer the details to Appendix G." We also have Appendix G for the detailed computation. We will try to emphasize these points more.

There are recent, relevant works on CMDPs that are not mentioned in the related literature review. See [1,2,3,4] for references. I also think that a comparison with the results of [2,3,4] might be relevant.

Thanks for the reminder. We have mentioned [1] in line 21, as it proposed the concept of CMDPs. We thank the reviewer for bringing up [2, 3, 4]; we will add comparisons to them. In short, [2] is developed for the generalized linear model class, [3] requires an online oracle and needs O(T) oracle calls, and the regret bound in [4] scales with the eluder dimension of the model class.

In line 4 of Algorithm 1, the segment loop is unclear to me. Can you please explain what it is used for? As I understand it, the segment loop runs over the CMDP's layers and hence should be inside the rounds loop.

The data collected during the h-th segment is used to estimate the model at layer h. In each round of the h-th segment, the learner interacts with the environment through a full H-step interaction, generating a trajectory.

The oracle guarantee in Definition 2.1 is w.r.t. the predicted model $\hat{M}$ and the true model $M_\star$. It is unclear to me how it directly implies Lemma 3.2, unless the oracle is actually more powerful than described in Definition 2.1 and is assumed to simultaneously optimize the squared Hellinger distance of both a reward function class and a dynamics function class. If this is indeed the situation, it is unclear to me, as the rewards do not define a distribution, and the squared Hellinger distance is defined for distributions. Also, in this case, it is unclear to me why the squared Hellinger distance is an appropriate loss for the reward approximation. Previous works on CMAB (e.g., Simchi-Levi and Xu 2021, Foster and Rakhlin 2020) show that least squares (which is less sensitive than the squared Hellinger distance in [0,1]) is an appropriate loss choice for the reward approximation.

In our setting, we assume the reward follows a distribution; see line 82 ("reward distribution") and line 95 ("$r_t \sim R$"). Then, since the distribution of a trajectory includes the distribution of the reward (lines 97-98: "to denote the distribution of the trajectory $c_1, \pi_1, s^1_1, a^1_1, r^1_1, \ldots, s^H_1, a^H_1, r^H_1$"), the squared Hellinger distance between two models includes the squared Hellinger distance between the reward distributions. And yes, the squared loss is less sensitive than the squared Hellinger distance in [0,1]. However, our purpose is not to find the most accurate reward approximation but to illustrate how to deal with the main difficulty, which is the uncertainty of the transition kernel. In fact, it is standard to adapt our proof by replacing the Hellinger distance between the reward distributions with the squared loss, where the offline oracle for the reward function is replaced by a squared-loss regression guarantee. We are happy to include a remark on this fact.

Comment

Regarding the first question:

We apologize for bringing up the discussion of the methods of [1]. You are right; they are not online-oracle-based algorithms. Let us rephrase the two points of interest:

  1. Since any online estimation oracle can be transferred to an offline one efficiently, even if one only possesses an efficient online oracle, our result still improves the number of oracle calls from $O(T)$ to $O(\log T)$, which is already interesting.

  2. For the multinomial logit model in [1], we claim there is an efficient implementation of an offline oracle using the following simple facts. Note that we know MLE is an offline oracle through our Appendix C. The MLE for the multinomial logit model amounts to solving an ERM on the log loss of the form, for each $s,a$: $\arg\min_{W_{sa,1},\ldots,W_{sa,S}} -\sum_{i=1}^n \log\left(\frac{\exp(W_{sa,s_i}^\top c_i)}{\sum_{s'=1}^S \exp(W_{sa,s'}^\top c_i)}\right)$, where the loss function is convex (a minimal implementation sketch is given below).
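
(Illustrative aside, not from the paper or from [1]: the convex log-loss ERM above can be implemented with plain gradient descent. The sketch below fits the weights $W_{sa,1},\ldots,W_{sa,S}$ for one fixed state-action pair from (context, next-state) samples; the dimensions, step size, and iteration count are arbitrary choices for the example.)

```python
import numpy as np

def fit_multinomial_logit(contexts, next_states, n_states, lr=0.5, iters=2000):
    """MLE for a multinomial logit next-state model P(s' | s, a, c) ∝ exp(W_{s'}^T c),
    for one fixed (s, a) pair, via gradient descent on the convex negative log-likelihood.

    contexts:    (n, d) array of context vectors c_i
    next_states: (n,) array of observed next-state indices s_i
    """
    n, d = contexts.shape
    W = np.zeros((n_states, d))
    onehot = np.eye(n_states)[next_states]            # (n, S) indicators of observed s_i
    for _ in range(iters):
        logits = contexts @ W.T                        # (n, S)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)      # softmax over next states
        W -= lr * ((probs - onehot).T @ contexts / n)  # gradient of the average log loss
    return W

# Tiny synthetic usage example (hypothetical sizes).
rng = np.random.default_rng(1)
C = rng.normal(size=(1000, 5))
W_true = rng.normal(size=(4, 5))
P = np.exp(C @ W_true.T); P /= P.sum(axis=1, keepdims=True)
S_next = np.array([rng.choice(4, p=row) for row in P])
W_hat = fit_multinomial_logit(C, S_next, n_states=4)
```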

Regarding Lemma 3.4:

The offline oracle is used as a black box. We do not intervene in its choice. It is only required to satisfy $\mathbb{E}_{c \sim \mathcal{D}, \pi \sim p(c)}\left[D_{\mathrm{H}}^2\left(\widehat{M}(\pi, c), M_{\star}(\pi, c)\right)\right] \leq \mathcal{E}_{\mathcal{M}, \delta}(n)$.

Lemma 3.4 states that if the context $c$ in epoch $m$ satisfies, for all $h$, $\mathbb{E}_{\pi \sim p_m^h(c)} \mathbb{E}^{M_{\star}, \pi, c}\left[D_{\mathrm{H}}^2\left(P_m^h(s_1^h, a_1^h; c), P_{\star}^h(s_1^h, a_1^h; c)\right)\right] \leq H/\gamma_m$, then the multiplicative bound $\widehat{P}(\cdot\,; c) \leq (1+1/H)^2 P_\star(\cdot\,; c)$ holds for this specific context $c$.

We are not sure why any intervention in the choice of the oracle would be needed. Again, we are happy to elaborate, but we have the feeling that we are confusing you. Do you now see why the additive term can be absorbed into a multiplicative guarantee? Or are you referring to some other additive term? In particular, we do not use a concentration argument over the distribution of contexts. Could you elaborate on your question again?

Comment

I thank the authors for their response.

Regarding 1: To the best of my knowledge, the general formulation of online estimation oracles (as specified by Foster et al. 2021) implies an oracle whose implementation is unclear. This is the reason I insisted on an actual example. The multinomial model for linear combinations of MDPs seems to satisfy that. I recommend the authors state this example, and others, in the paper.

Regarding 2: I understand that from the oracle's guarantee you have a promise in expectation over the contexts and policy. I understand that in the analysis you separate the context set into contexts for which the assumptions of Lemma 3.4 hold and then prove the multiplicative lower bound (I do not fully understand how you can derive that bound without any concentration inequalities). Then, for the other contexts, you bound the Hellinger distance trivially by 1. The thing I am still not convinced about is as follows. Intuitively, to derive a non-trivial regret bound (even if it only holds in expectation over the contexts), you need the mass of the contexts for which the assumptions of Lemma 3.4 hold to be significant. How can you lower bound the probability that a context is suitable for Lemma 3.4? This probability depends both on the distribution over the contexts and on the oracle itself. Otherwise, how do you combine the two context sets into one conclusion regarding the regret?

Thank you in advance for the further clarifications.

Comment

We thank the reviewer for the further clarification!

Regarding 1: We will make sure to include a more detailed discussion of the practical examples.

Regarding 2:

I understand that in the analysis you separate the context set into contexts for which the assumptions of Lemma 3.4 hold and then prove the multiplicative lower bound (I do not fully understand how you can derive that bound without any concentration inequalities).

To illustrate why we don't need concentration: the inequality in line 479 concerning the Hellinger distance holds by definition, through arithmetic (rather than by conditioning on good events). In particular, a straightforward arithmetic intuition goes as follows: for any $0 \leq p, q \leq 1$, we have $p = (\sqrt{p} - \sqrt{q} + \sqrt{q})^2 = q + (\sqrt{p} - \sqrt{q})^2 + 2(\sqrt{p} - \sqrt{q})\sqrt{q} \leq (1+1/H)q + (H+1)^2(\sqrt{p} - \sqrt{q})^2$. Taking $p = \hat{P}$ and $q = P_\star$, we obtain $\hat{P} \leq (1+1/H)P_\star + (H+1)D_{\mathrm{H}}^2(\hat{P}, P_\star)$ with one more data processing inequality. We are sorry we wrote "Line 479 coincides with the concentration inequality"; this was confusing, since mathematically line 479 does not come from any concentration inequality. The clarification then continues as follows:

If we have $D_{\mathrm{H}}^2(\hat{P}, P_\star) \le \frac{\log M}{n}$ and $\hat{P} \ge (H+1)\log M/n$, then by the inequality above we have $(H+1)\log M/n \le \hat{P} \le (1+1/H)P_\star + \log M/n$, and thus $(1/H)(1+1/H)P_\star \geq \log M/n$.

Plugging this back into our inequality, we get $\hat{P} \le (1+1/H)P_\star + \log M/n \le (1+1/H)P_\star + (1/H)(1+1/H)P_\star = (1+1/H)^2 P_\star$.
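
(Illustrative aside, not part of the authors' rebuttal: the scalar inequality driving this chain can be checked numerically. The sketch below verifies $p \le (1+1/H)\,q + (H+1)(\sqrt{p} - \sqrt{q})^2$ on a grid of $p, q \in [0,1]$, which is the coefficient appearing in the displayed conclusion and implies the weaker $(H+1)^2$ form a fortiori.)

```python
import numpy as np

# Check p <= (1 + 1/H) * q + (H + 1) * (sqrt(p) - sqrt(q))**2 on a grid of p, q in [0, 1].
ps = np.linspace(0, 1, 401)
P, Q = np.meshgrid(ps, ps)
for H in [1, 2, 5, 10, 100]:
    rhs = (1 + 1 / H) * Q + (H + 1) * (np.sqrt(P) - np.sqrt(Q)) ** 2
    assert np.all(P <= rhs + 1e-12), f"violated for H={H}"
print("inequality holds on the grid for all tested H")
```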

The thing I am still not convinced about is as follows. Intuitively, to derive a non-trivial regret bound (even if it only holds in expectation over the contexts), you need the mass of the contexts for which the assumptions of Lemma 3.4 hold to be significant. How can you lower bound the probability that a context is suitable for Lemma 3.4?

Thanks a lot for the clarification! This is very helpful!

Our aim is to show, as in the first inequality in line 515, that for any context $c$, whether or not the context satisfies the assumption in Lemma 3.4, the following inequality holds: $|\hat{V}(c) - V_\star(c)| \leq \frac{1}{20}\widehat{\mathrm{reg}}(c) + 76e\sqrt{H^6 S^4 A^3 \cdot \mathcal{E}_m} + \frac{2\gamma_m}{H}\,\mathbb{E}_{\pi \sim p_m^h(c)} \mathbb{E}^{M_\star, \pi, c}\left[D_{\mathrm{H}}^2\left(P_m^h(s_1^h, a_1^h; c), P_\star^h(s_1^h, a_1^h; c)\right)\right]$.

For contexts that satisfy the assumption in Lemma 3.4, the multiplicative bounds apply, and we are done. If not, it is actually an easier case, since $\frac{2\gamma_m}{H}\,\mathbb{E}_{\pi \sim p_m^h(c)} \mathbb{E}^{M_\star, \pi, c}\left[D_{\mathrm{H}}^2\left(P_m^h(s_1^h, a_1^h; c), P_\star^h(s_1^h, a_1^h; c)\right)\right] \geq 1 \geq |\hat{V}(c) - V_\star(c)|$, as stated in line 514.

The second inequality in line 515 is a typo (and we apologize if this caused the confusion). We actually need to take the expectation immediately, in order to bound $\frac{2\gamma_m}{H}\,\mathbb{E}_{c \sim \mathcal{D}} \mathbb{E}_{\pi \sim p_m^h(c)} \mathbb{E}^{M_\star, \pi, c}\left[D_{\mathrm{H}}^2\left(P_m^h(s_1^h, a_1^h; c), P_\star^h(s_1^h, a_1^h; c)\right)\right] \leq \gamma_m \cdot \mathcal{E}_m$ by the offline guarantee, so as to derive line 517.

Comment

I appreciate the authors' response and thank them very much. I recommend the authors make the paper's technical parts clearer, especially the proofs of key lemmas. I will postpone my final assessment of the paper until after the discussion with the other reviewers and the AC.

Comment

We sincerely appreciate the reviewer's clarification of the questions. In response, we will include in the main body a section/paragraph that describes the key technical challenges and lemmas raised in the discussion. As a final remark, we acknowledge that the paper is indeed technical, and we have made significant efforts to clearly present in the main body the main results, their consequences, all parts of the algorithm, a proof sketch, and an interesting application to the exploration task. We kindly ask that you take this into consideration when making your final decision.

Comment

For clarification, we elaborate on the aforementioned intuition.

First, by an information-theoretic inequality and some algebra such as AM-GM (lines 477-478), we have $\hat{P} \leq (1+1/H)P_\star^h + (H+1)D_{\mathrm{H}}^2(\hat{P}, P_\star)$.

This inequality comes from pure mathematical analysis and has nothing to do with the oracle. Thus, in order to have $\hat{P} \le (1+1/H)^2 P_\star$, we need the Hellinger distance to be small for this multiplicative bound to hold.

In other words, if we have $D_{\mathrm{H}}^2(\hat{P}, P_\star) \le \frac{\log M}{n}$, then for all $\hat{P}$ such that $\hat{P} \ge (H+1)\log M/n$, by the inequality above we have $(H+1)\log M/n \le \hat{P} \le (1+1/H)P_\star + \log M/n$, and thus $(1/H)(1+1/H)P_\star \geq \log M/n$.

Plugging this back into our inequality, we get $\hat{P} \le (1+1/H)P_\star + \log M/n \le (1+1/H)P_\star + (1/H)(1+1/H)P_\star = (1+1/H)^2 P_\star$.

Therefore, our multiplicative bound holds. Here $M = |\mathcal{M}|$ is the model class complexity.

Review (Rating: 6)

This work studies low-rank contextual Markov decision processes (CMDPs) in offline settings. The authors introduce a novel algorithm that leverages the structure of low-rank CMDPs to achieve efficient learning with limited data. The proposed method, O-LRCDP, is designed to minimize the dependence on large-scale data by utilizing an oracle to provide efficient estimations. The paper presents theoretical guarantees for the performance of O-LRCDP and demonstrates its effectiveness through rigorous analysis and experimental results.

Strengths

  1. The paper introduces a novel algorithm (LOLIPOP) for efficiently learning optimal policies in contextual Markov decision processes (CMDPs) with near-optimal regret guarantees.

  2. The rigorous theoretical analysis provides strong regret bounds of $O(\sqrt{H^7 S^4 A^3 T \log|\mathcal{M}|})$ that match lower bounds up to polynomial factors in H, S, and A. The algorithm achieves computational efficiency by requiring only $O(H \log T)$ or $O(H \log\log T)$ calls to an offline density estimation oracle, which is a significant improvement over prior work requiring $O(T)$ oracle calls.

  3. The paper provides a clear comparison to current algorithms and highlights the key advantages of LOLIPOP over existing ones (Table 1). The algorithm and analysis are versatile, extending beyond regret minimization to the reward-free reinforcement learning setting with near-optimal sample complexity.

  4. This paper has a well-organized structure. It adequately discusses relevant work and provides an informative introduction to the basics. I find understanding the main parts of this work quite smooth.

  5. The technical approach introducing "trusted occupancy measures" to address the multi-layer structure of CMDPs is novel and insightful.

Weaknesses

  1. While the theoretical guarantees are strong, experimental results demonstrating the practical performance and computational efficiency would strengthen the paper significantly, even in a synthetic setting.

  2. The discussion of the algorithm's potential limitations or failure cases is limited.

  3. The paper uses a standard realizability assumption (Assumption 2.1), which is common in theoretical RL work. While this assumption is reasonable for analysis, a brief discussion of its practical implications and potential ways to verify or approximately satisfy it in real-world scenarios could enhance the paper's applicability.

  4. While the extension to reward-free RL is interesting, this section feels somewhat disconnected from the main focus of the paper. The motivation, connection to previous sections, and practical implications of this part could be elaborated further.

  5. The presentation of technical details, particularly in Section 3.2, is quite dense and may impede overall readability. While the mathematical rigor is appreciated, the main text could benefit from focusing more on intuitive ideas and key insights that directly address the technical challenges. Consider moving detailed lemmas and heavy mathematical derivations to the appendix. To enhance accessibility and understanding, it would be beneficial to incorporate more high-level explanations and possibly graphical illustrations in the main body.

Questions

  1. Have the authors had any chance to perform any empirical investigations comparing LOLIPOP to existing algorithms like E2D or CMDP-VR (even in a synthetic setting)? If so, what were the key findings?

  2. The algorithm relies on an offline density estimation oracle. How sensitive is the performance to the choice of this oracle? Are there particular implementations you would recommend in practice?

  3. As for the regret bound, are there particular aspects of these dependencies that you think could potentially be improved, or do you believe they are essentially tight given current techniques? Are there specific challenges in tightening the bounds with respect to the state space size S or the model class size |M|?

  4. How well do you expect LOLIPOP to perform in settings where the realizability assumption is violated? Are there natural extensions to handle model misspecification?

  5. The extension to the reward-free RL setting is interesting. Do the authors see potential applications or connections to other decision-making problems as well (e.g., constrained RL)?

Limitations

The authors adequately address the limitations of their work.

Author Response

The presentation of technical details, particularly in Section 3.2, is quite dense and may impede overall readability. While the mathematical rigor is appreciated, the main text could benefit from focusing more on intuitive ideas and key insights that directly address the technical challenges. Consider moving detailed lemmas and heavy mathematical derivations to the appendix. To enhance accessibility and understanding, it would be beneficial to incorporate more high-level explanations and possibly graphical illustrations in the main body.

Thank you for your valuable suggestions. Please refer to the general rebuttal for a technical overview of our methodology.

Have the authors had any chance to perform any empirical investigations comparing LOLIPOP to existing algorithms like E2D or CMDP-VR (even in a synthetic setting)? If so, what were the key findings?

This paper is purely theoretical, but we hope to see empirical investigations in a possible extension to a journal version or in future work.

The algorithm relies on an offline density estimation oracle. How sensitive is the performance to the choice of this oracle? Are there particular implementations you would recommend in practice?

The first thing to try would definitely be the MLE oracle, as shown in Example C.1 in Appendix C, in order to enjoy theoretical guarantees. Concretely, since the MLE is indeed ERM on the log loss (i.e., the cross-entropy loss), the first thing to try is probably ERM on the cross-entropy loss.

As for the regret bound, are there particular aspects of these dependencies that you think could potentially be improved, or do you believe they are essentially tight given current techniques? Are there specific challenges in tightening the bounds with respect to the state space size S or the model class size |M|?

The current technique needs to build the trusted occupancy measure, which unfortunately brings in the dependence on the state space S. For a problem with a large state space, the guarantee is not ideal, but this could be alleviated through representation learning if the CMDP under consideration is intrinsically low-dimensional, e.g., [1]. The dependence on the model class size |M| is already logarithmic, which, to our knowledge, is state of the art. However, further structure in the model class might be helpful. We hope to explore these directions in future work.

[1] Contextual Markov Decision Processes using Generalized Linear Models. Modi and Tewari. (2019).

How well do you expect LOLIPOP to perform in settings where the realizability assumption is violated? Are there natural extensions to handle model misspecification?

It is not obvious how misspecification would affect the performance. On the positive side, LOLIPOP follows the stream of algorithms that adapt to misspecification, as seen in [2] and [3], which show that IGW (a base component of LOLIPOP) is robust against misspecification. On the negative side, the proof technique requires the trusted occupancy measure to be multiplicatively upper-bounded by the true occupancy measure. It might be non-trivial to maintain this guarantee under misspecification. Overall, we hope to see extensions in this regard in the future.

[2] Bypassing the monster: A faster and simpler optimal algorithm for contextual bandits under realizability D Simchi-Levi, Y Xu

[3] Adapting to Misspecification in Contextual Bandits. Dylan J. Foster, Claudio Gentile, Mehryar Mohri, Julian Zimmert

The extension to the reward-free RL setting is interesting. Do the authors see potential applications or connections to other decision-making problems as well (e.g., constrained RL)?

At a high level, the LOLIPOP algorithm promotes layerwise exploration-exploitation tradeoffs and planning with trusted transitions; it is quite conceivable that it has implications for related topics. Specifically for constrained RL, one could imagine performing the exploration-exploitation tradeoff within a constrained policy set. The LOLIPOP algorithm should enjoy similar regret guarantees compared to the optimal policy in the constrained policy set.

Comment

Thanks for the response. While layerwise exploration-exploitation tradeoffs are interesting and can be relevant to several topics, my concerns about their limited applicability remain. I tend to keep my current score.

Review (Rating: 7)

The paper studies the stochastic contextual MDP problem under an offline function approximation oracle. Concretely, the authors assume access to an offline density estimation oracle for a (realizable) class of CMDPs. Under this (minimal) assumption, they prove a rate-optimal regret bound, i.e., one that scales with $\sqrt{T \log|\mathcal{M}|}$, where $T$ is the number of episodes and $|\mathcal{M}|$ is the size of the CMDP class. The algorithm is based on a combination of Inverse Gap Weighting (IGW), policy covers, and trusted occupancies and transitions. The latter is where most of the novelty lies. The regret's dependence on H, S, A seems suboptimal but, importantly, is polynomial in these parameters. Additionally, the regret is obtained with a small number of oracle calls ($\log T$ or $\log\log T$, depending on assumptions). Finally, the authors present an application of their algorithm to reward-free learning of CMDPs, where they show that using $\epsilon^{-2}$ samples they can obtain a model such that, given a reward function, the value of any policy can be estimated to $\epsilon$ accuracy.

Strengths

  1. The paper resolves an important open problem (learning CMDPs efficiently with offline oracles and rate optimal regret)
  2. The general idea of trusted occupancies is interesting and may find use elsewhere.
  3. The paper has non-trivial technical novelty.

Weaknesses

  1. The offline density oracle is a bit vague. Can it be implemented using ERM on the log loss as in previous works? If not, how is it different than previous works such as Foster et al, Levy et al?
  2. The paper is very technical. While not a weakness itself, this complicates the presentation, especially given the limited space of the conference format. It seems that the authors made significant effort to explain the various moving parts in their work. Nonetheless, I have a few comments:
  • Why do you need to separate the data of each horizon for the model estimation?
  • Can you give some intuition regarding the definition of the trusted transitions?
  • Related to this, can you explain how the policy cover is chosen?
  • While the application to reward free CMDP is nice, I think it could be mentioned in a paragraph with the details deferred to the appendix.
  • In general, the explanations are rather technical and made it hard to build some intuition as to why things work. Even after reading a significant portion of the appendix I only have some technical understanding of how the result is obtained. I'm not sure how to address this, but it makes the paper less accessible. Perhaps showing where standard analysis would fail without components such as the trusted occupancies could be helpful.

Questions

See above.

Limitations

N/a

Author Response

We greatly appreciate your comments and have written a detailed technical overview as a general rebuttal. We believe it would be helpful to read it before the detailed rebuttal below.

The offline density oracle is a bit vague. Can it be implemented using ERM on the log loss as in previous works? If not, how is it different than previous works such as Foster et al, Levy et al?

Yes, it can be implemented using ERM on the log loss. In fact, since ERM on log loss is MLE, we demonstrate in Example C.1 in appendix C that MLE is an offline oracle. This differs from Foster et al. and Levy et al. since they require online log loss guarantee.

Why do you need to separate the data of each horizon for the model estimation?

Could you clarify your question? We read two possible questions here:

The first one is: why not use all the data for model estimation, say at the end of each epoch? For this question, we separate the data so as to satisfy the i.i.d. input-data assumption of the offline oracle. To perform the layerwise exploration-exploitation tradeoff, the trajectory data generated in one epoch are NOT i.i.d., due to the adaptively chosen policies in different segments.

The second one is: why, for each layer h \in [H], are the estimators \widehat{P}^h and \widehat{R}^h generated using separate data? This relates to the core challenge of our problem. If the learner only plans according to \widehat{d}, the empirical estimate of the occupancy measure, the divergence between \widehat{d} and the true occupancy measure will accumulate exponentially with respect to the horizon. To avoid such exponential blow-ups, we introduce the trusted transitions in a layerwise fashion, so that \widetilde{d}, the trusted occupancy measure, is upper bounded by the true occupancy measure up to a constant. To ensure this property, at each layer the model has to be estimated according to a policy that depends on the trusted occupancy measure from the previous layer.

Can you give some intuition regarding the definition of the trusted transitions?

Recall the core problem mentioned above: we would like the trusted occupancy measure to be upper-bounded by the true occupancy measure up to a constant. For simplicity, assume the regret is always 0, then trusted transitions, according to our definition, are the ones that are most visited. Intuitively, such transitions are the ones that can be estimated most accurately. Indeed, such transitions can be estimated up to a multiplicative factor as we demonstrate in the proof of Lemma 3.4, which results in the trusted occupancy measure being upper-bounded by the true occupancy measure up to a constant.

Related to this, can you explain how the policy cover is chosen?

For simplicity, assume regret is 0, then the policies in the policy cover maximize trusted occupancy measure which corresponds to exploration. When the regret is not 0, it balances the exploration-exploitation tradeoff.

While the application to reward free CMDP is nice, I think it could be mentioned in a paragraph with the details deferred to the appendix.

Thanks for the suggestion. We will try to improve the structure of our paper.

In general, the explanations are rather technical and made it hard to build some intuition as to why things work. Even after reading a significant portion of the appendix I only have some technical understanding of how the result is obtained. I'm not sure how to address this, but it makes the paper less accessible. Perhaps showing where standard analysis would fail without components such as the trusted occupancies could be helpful.

At a very high level, one intuitive chain of technical reasoning goes as follows: since we would like to use an offline guarantee, we need to use the guarantee in the form d D_H^2(\widehat{P}, P). But the learner only knows \widehat{d} and can bound the divergence in value functions by \widehat{d} D_H^2(\widehat{P}, P). One immediate thought is to bound \widehat{d} D_H^2(\widehat{P}, P) by d D_H^2(\widehat{P}, P) + |d - \widehat{d}|. However, this suffers from exponentially accumulating errors, since |\widehat{d} - d| can only be written as a summation of occupancy measures from the previous layers. So, to avoid such an exponential explosion, we turn to multiplicative guarantees between d and \widetilde{d}.

Comment

Thank you for the response. I do not have additional questions at this time.

Review (Rating: 6)

This paper studies a statistical and computational reduction from the general Contextual Markov Decision Process (CMDP) problem to offline density estimation. The authors propose an efficient algorithm called LOLIPOP, which achieves near-optimal regret with minimal oracle calls, $\mathcal{O}(H \log\log T)$ if $T$ is known in advance, leveraging a layerwise exploration-exploitation tradeoff. In addition, their algorithm is applicable to pure exploration tasks in reward-free reinforcement learning.

Strengths

Their proposed LOLIPOP achieves near-optimal regret while minimizing the number of oracle calls. It requires only $O(H \log T)$ calls to an offline density estimation oracle, which can be further reduced to $O(H \log\log T)$ if the total number of rounds $T$ is known in advance. This is the first algorithm that achieves both statistical and computational efficiency for general (stochastic) CMDPs.

Weaknesses

  • Some minor typos, e.g. line 466

  • This paper is purely theoretical. It would be interesting if the authors provide some experimental results to demonstrate the performance of the proposed algorithm.

Questions

  • In Eq. (3), to construct the trusted transitions, how do you compute $\widehat{\operatorname{reg}}_{m-1}(\pi, c)$? From my understanding, you need to compute the optimal value function $\widehat{V}_{m-1}^1(c)$; how do you implement it?

  • Is there any lower bound on the number of calls for this problem?

  • Compared to contextual bandits, can the authors explain more about the technical challenges of CMDPs?

Limitations

Yes

Author Response

This paper is purely theoretical. It would be interesting if the authors provide some experimental results to demonstrate the performance of the proposed algorithm.

This paper presents a purely theoretical result. We look forward to seeing experiments in future works.

In Eq. (3), to construct the trusted transitions, how do you compute $\widehat{\operatorname{reg}}_{m-1}(\pi, c)$? From my understanding, you need to compute the optimal value function $\widehat{V}_{m-1}^1(c)$; how do you implement it?

The value $\widehat{V}_{m-1}^1(c)$ is the optimal value for the dynamics $\widehat{P}_{m-1}(c)$ and the reward function $\widehat{R}_{m-1}(c)$, which are known at the beginning of epoch m. Thus, it can be calculated through value iteration [1].

[1] Markov decision processes: discrete stochastic dynamic programming ML Puterman, 2014
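
(Illustrative aside, not the paper's code: the value-iteration computation mentioned above amounts to finite-horizon backward induction over known per-layer transitions and expected rewards. The array shapes and toy sizes below are assumptions made for the example.)

```python
import numpy as np

def optimal_value(P, R, s0=0):
    """Finite-horizon optimal value via backward induction.

    P: array (H, S, A, S), P[h, s, a, s'] = probability of s' given (s, a) at layer h
    R: array (H, S, A), expected reward at layer h
    Returns V^1(s0), the optimal value from state s0 at the first layer.
    """
    H, S, A, _ = P.shape
    V = np.zeros(S)                  # V^{H+1} is identically 0
    for h in range(H - 1, -1, -1):   # layers H, H-1, ..., 1
        Q = R[h] + P[h] @ V          # Q[s, a] = expected reward + expected next-layer value
        V = Q.max(axis=1)            # maximize over actions
    return V[s0]

# Toy usage with a random model (hypothetical sizes H=3, S=4, A=2).
rng = np.random.default_rng(0)
H, S, A = 3, 4, 2
P = rng.random((H, S, A, S)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((H, S, A))
print(optimal_value(P, R))
```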

Is there any lower bound on the number of calls for this problem?

There is a lower bound of order $\Omega(\log\log T)$ on the switching cost [2], where the switching cost is the number of switches in the learner's randomized policy. A reasonable learner with an oracle would only switch its randomized policy after an oracle call. Thus, the number of oracle calls is at least the number of switches, which implies an $\Omega(\log\log T)$ lower bound on the number of oracle calls. We are happy to include a remark on this fact. However, rigorously stating this result requires more information-theoretic constraints, and we leave such a result to future work.

[2] Zihan Zhang, Yuhang Jiang, Yuan Zhou, and Xiangyang Ji. Near-optimal regret bounds for multi-batch reinforcement learning. Advances in Neural Information Processing Systems, 35: 24586–24596, 2022.

Compared to contextual bandits, can the authors explain more about the technical challenges of CMDPs?

The main technical difficulty, in general, in extending any result from contextual bandits to CMDPs is the unknown transition kernel. In fact, even if the reward function is given (which trivializes the contextual bandit problem, since all rewards are known), CMDPs are still hard to learn. Concretely, since the transition kernels are different in each layer, the learner has to ensure sufficient coverage of (nearly) ALL state-action pairs to estimate the corresponding transition kernels accurately. A more refined challenge in our setting is that if the learner only plans according to \widehat{d}, the empirical estimate of the occupancy measure, the divergence between \widehat{d} and the true occupancy measure will accumulate exponentially with respect to the horizon. To avoid such exponential blow-ups, we introduce the trusted transitions in a layerwise fashion, so that \widetilde{d}, the trusted occupancy measure, is upper bounded by the true occupancy measure up to a constant.

For a detailed version, we refer to our general rebuttal.

Author Response

General rebuttal:

We thank the reviewers for their positive reviews and will make sure to address the writing issues mentioned. Here, we clarify the technical challenges and our methodology. We will include a paragraph/section regarding this in the updated version of this paper.

Technical challenge and methodology:

The main technical difficulty in extending any result from contextual bandits to CMDPs is the unknown transition kernel. In fact, even if the reward function is given (which trivializes the contextual bandit problem since all rewards are known), CMDPs are still hard to learn. Concretely, since the transition kernels are different in each layer, the learner has to ensure sufficient coverage of (nearly) all state-action pairs to accurately estimate the corresponding transition kernel.

A more refined challenge in our setting is that if the learner only plans according to $\widehat{d}$, the empirical estimate of the occupancy measure, the divergence between $\widehat{d}$ and the true occupancy measure will accumulate exponentially with respect to the horizon. The exponential blow-up arises for the following reason: since we would like to use an offline guarantee, we need to use the guarantee in the form $d\, D_{\mathrm{H}}^2(\widehat{P}, P)$. However, the learner only knows $\widehat{d}$ and can bound the divergence in value functions by $\widehat{d}\, D_{\mathrm{H}}^2(\widehat{P}, P)$. One immediate thought is to bound $\widehat{d}\, D_{\mathrm{H}}^2(\widehat{P}, P)$ by $d\, D_{\mathrm{H}}^2(\widehat{P}, P) + |d - \widehat{d}|$. However, this suffers from exponentially accumulating errors, since $|\widehat{d} - d|$ can only be written as a summation of occupancy measures from the previous layers.

To avoid such an exponential explosion, we turn to multiplicative guarantees between $d$ and $\widetilde{d}$. Concretely, we introduce the trusted transitions in a layerwise fashion so that $\widetilde{d}$, the trusted occupancy measure, is upper bounded by the true occupancy measure up to a constant. To ensure this property, at each layer the model has to be estimated according to a policy that depends on the trusted occupancy measure from the previous layer. Intuitively, assume for simplicity that the regret is always zero; then the trusted transitions, according to our definition, are the ones that are most visited. Such transitions are the ones that can be estimated most accurately, up to a constant. At a very high level, when estimating a Bernoulli variable, if the empirical mean is greater than O(1/sample size), then with high probability the empirical mean can be upper-bounded by a constant multiple of the true mean. Indeed, such transitions can be estimated up to a multiplicative factor, as we demonstrate in the proof of Lemma 3.4, which results in the trusted occupancy measure being upper-bounded by the true occupancy measure up to a constant. Meanwhile, to ensure sufficient coverage of all state-action pairs, we use the policy cover. For simplicity, assume the regret is zero; then the policies in the policy cover maximize the trusted occupancy measure, which corresponds to exploration. When the regret is not zero, the policy cover balances exploration and exploitation.

Final Decision

The paper explores a method that reduces the general Contextual Markov Decision Process (CMDP) problem to an offline density estimation task. The authors introduce an algorithm named LOLIPOP, which efficiently achieves near-optimal regret with a minimal number of oracle calls. The algorithm can also be applied to pure exploration tasks within reward-free reinforcement learning scenarios. The technical novelty of the paper is commendable; it resolves an important open problem of learning CMDPs efficiently with offline oracles and rate-optimal regret. Some empirical evaluation would have made the submission stronger, though. Please incorporate all the reviewers' feedback in the final version.