PaperHub
Rating: 5.8 / 10 · Poster · 5 reviewers (min 5, max 7, std 0.7)
Individual ratings: 5, 6, 5, 7, 6
Confidence: 3.0 · Correctness: 3.2 · Contribution: 2.4 · Presentation: 3.0
NeurIPS 2024

Local Linearity: the Key for No-regret Reinforcement Learning in Continuous MDPs

OpenReview · PDF
Submitted: 2024-05-13 · Updated: 2024-11-06
TL;DR

We extend and improve several theoretical results about RL in continuous state-action spaces

Abstract

Achieving the no-regret property for Reinforcement Learning (RL) problems in continuous state and action-space environments is one of the major open problems in the field. Existing solutions either work under very specific assumptions or achieve bounds that are vacuous in some regimes. Furthermore, many structural assumptions are known to suffer from a provably unavoidable exponential dependence on the time horizon $H$ in the regret, which makes any possible solution unfeasible in practice. In this paper, we identify _local linearity_ as the feature that makes Markov Decision Processes (MDPs) both _learnable_ (sublinear regret) and _feasible_ (regret that is polynomial in $H$). We define a novel MDP representation class, namely _Locally Linearizable MDPs_, generalizing other representation classes like Linear MDPs and MDPs with low inherent Bellman error. Then, i) we introduce **Cinderella**, a no-regret algorithm for this general representation class, and ii) we show that all known learnable and feasible MDP families are representable in this class. We first show that all known feasible MDPs belong to a family that we call _Mildly Smooth MDPs_. Then, we show how any mildly smooth MDP can be represented as a Locally Linearizable MDP by an appropriate choice of representation. This way, **Cinderella** is shown to achieve state-of-the-art regret bounds for all previously known (and some new) continuous MDPs for which RL is learnable and feasible.
Keywords
RL · regret · smoothness

Reviews and Discussion

Official Review
Rating: 5

The paper proposes two novel classes of MDPs in the function approximation setting:

  • Mildly Smooth MDPs: in which the Bellman optimality operator outputs smooth functions of degree $\nu$.
  • Locally Linearizable MDPs: in which there exist a state-action feature mapping into $\mathbb R^d$ and a finite partition of the state-action space that give rise to a Q-function approximator class which is linear in the feature mapping inside each element of the partition (the vector defining the linear mapping is allowed to change between partitions).

An online algorithm (CINDERELLA) is presented and a proof of sublinear regret is given for the class of Locally Linearizable MDPs. Subsequently, it is shown that for any Mildly Smooth MDP, there exist a feature mapping and a partition (of size exponential in $d$) that constitute a Locally Linearizable MDP. As a result, a sublinear regret bound of CINDERELLA is derived for the class of Mildly Smooth MDPs: $\tilde O\big(H \nu_*^d K^{\frac{\nu + 2d}{2\nu + 2d}} + H^{\frac{2\nu + 2d}{\nu}}\big)$,

where $\nu_* := \lceil \nu - 1 \rceil$.
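
For intuition, here is a minimal sketch of the representation described above (the feature map, the partition, and all names are illustrative toy choices, not the paper's definitions): a Q-value is computed by pairing a shared feature map with a region-specific parameter vector.

```python
import numpy as np

# Illustrative locally linear Q-function: a single feature map phi shared
# across the space, with one parameter vector per region of the partition.

def phi(s, a):
    # toy polynomial features of (s, a); here d = 4
    return np.array([1.0, s, a, s * a])

def region(s, a):
    # toy partition of [0, 1]^2 into 4 axis-aligned boxes
    return 2 * int(s >= 0.5) + int(a >= 0.5)

theta = {r: np.zeros(4) for r in range(4)}  # one parameter vector per region

def q_value(s, a):
    return theta[region(s, a)] @ phi(s, a)

print(q_value(0.3, 0.7))  # 0.0 with the zero initialization above
```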

Strengths

  • This work proposes a new class of continuous state-action MDPs in which sublinear regret is achievable
  • In particular, this new class is currently the most general one that does not suffer from lower bounds exponential in $H$

Weaknesses

  • The regret bound has factors that are exponential in $d$, and sublinearity in $K$ is weak (as is to be expected at this level of generality).
  • Similar to other algorithms in this line of work, CINDERELLA is not computationally efficient and is far from being a practical algorithm.
  • I am not sure what we really learn from such a work; the technical details are complicated even at the level of exposition, and the results are very weak (as is to be expected at this level of generality).

Questions

  • Definition 1 does not actually contain any structural assumption (for any MDP, one can pick a partitioning and feature maps if we don't require them to satisfy anything), so it doesn't make sense to call such MDPs (that have no structural properties) linearizable.
  • Line 88, "(iii) we show that all families for which learnable and feasible RL is possible are included in a novel family of Mildly Smooth MDPs defined in this paper": this statement seems way too general; where did you show this?
  • Does NET-Q-LEARNING indeed obtain a regret bound that is independent of $\nu$ in the Strongly Smooth MDP setting? In this case, is it strictly better than CINDERELLA for this MDP class?

Limitations

NA

Author Response

Q: The regret bound has factors that are exponential in $d$.

Lower bounds show that exponential dependence on $d$ cannot be avoided. If we look at Gaussian bandits with the squared-exponential kernel, a problem family that is strictly contained in the Strongly Smooth class, [1] ensures an $\exp(d)$ regret lower bound (see Table 1 of [1]). Therefore, exponential scaling in $d$ should not be considered a limitation of our work: it is well known that RL problems on continuous state-action spaces can be really hard even for moderately sized $d$.

While in linear MDP problems it is well known that $d$ may be very large, almost of the order of $K$, the same does not apply here. In fact, the lower bound, even for the simple case $\nu=1, H=1$, entails that $d$ must be much smaller than $\log(K)$:
$$R_K=\Omega\big(K^{\frac{d+1}{d+2}}\big)\quad \text{and} \quad d\approx c\log(K) \implies R_K=\Omega\big(K^{\frac{c\log(K)+1}{c\log(K)+2}}\big)=\Omega(K).$$
This means that, unless $d\ll \log(K)$, the problem is provably unlearnable.
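
As a quick numerical illustration of this argument (a sketch, not from the paper): plugging $d \approx c\log(K)$ into the lower-bound exponent $\frac{d+1}{d+2}$ shows the exponent approaching $1$, i.e. near-linear regret.

```python
import math

K = 10**6
for c in (0.5, 1.0, 2.0):
    d = c * math.log(K)
    exponent = (d + 1) / (d + 2)  # lower-bound exponent for nu = 1, H = 1
    print(f"c={c}: d≈{d:.1f}, exponent≈{exponent:.3f}")  # approaches 1 as c grows
```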

Q: sublinearity in $K$ is weak

Regarding $K$, we do not only have sublinearity: importantly, for $\nu\to \infty$ one recovers a regret of order $\sqrt K$, as for tabular or linear MDPs. In fact, $\nu\rightarrow +\infty$ is very common in physical scenarios: the presence of Gaussian noise in the transition function is sufficient to guarantee this condition. On the other hand, "only sublinearity", meant as a regret order that is slightly less than one, is what competing works obtain (see NET-Q-LEARNING).
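
For concreteness, a small numerical sketch (not from the paper) of how the exponent $\frac{\nu+2d}{2\nu+2d}$ approaches $1/2$ as $\nu$ grows, e.g. with $d=3$:

```python
d = 3
for nu in (1, 2, 5, 10, 100, 1000):
    print(nu, (nu + 2 * d) / (2 * nu + 2 * d))
# 0.875, 0.8, 0.6875, 0.615..., 0.514..., 0.5015... -> tends to 1/2
```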

Q: I am not sure what we really learn from such a work; the technical details are complicated even at the level of exposition.

The idea of using a feature map that splits across a finite number of regions of the state space is rather intuitive, in our opinion.

Before this work, there were mainly two approaches to continuous MDPs. 1) Discretization-based algorithms (e.g., NET-Q-LEARNING) cannot fully exploit smoothness: the regret scales as $K^{\frac{1+d}{2+d}}$ even for $\nu\gg 1$, which makes them suboptimal. 2) Linearization-based approaches [2], which are able to exploit the parameter $\nu$, achieving regret $\sqrt K$ for $\nu \to \infty$, but fail to give a sublinear regret bound for $d>2\nu$. The reason can be found in a well-known problem of linear models with misspecification [3].

Q: Definition 1 does not actually contain any structural assumption (...), so it doesn't make sense calling such MDPs (that have no structural properties) linearizable.

As the reviewer correctly states, Definition 1 does not contain any structural assumptions. In fact, as stated in the paper, potentially every MDP belongs to this class. The point is that the guarantees are "non-trivial" only if the approximation error $\mathcal I$, which is built based on Definition 1, is small. Having Definition 1 plus a small $\mathcal I$ corresponds to the intuitive idea of a Locally Linearizable MDP. We will modify the presentation to just define $\mathcal I$ and then call, heuristically, Locally Linearizable MDPs the ones for which this quantity is small, similarly to what was done in [4].

Our novelty lies in combining the two approaches by splitting the parameters of the feature map (linearization) across different regions (discretization). In this way, we bypass the problem in [3], while retaining the benefits of linearization approaches.

This technique partially solves a long-standing open problem in the Kernelized MDP community (see COLT 2024 Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning), as kernelized MDPs are a subclass of our setting.

Q: Line 88 "(iii) we show that all families for which learnable and feasible RL is possible are included in a novel family of Mildly Smooth MDPs defined in this paper": this statement seems way too general; where did you show this?

As we will clarify in the final version, this sentence does not mean "all families that exist" but rather "all families that have been defined in the continuous RL literature, to the best of our knowledge". This sentence draws from results in [2] showing that all the parametric families (LQRs, Linear MDPs, Kernelized MDPs, etc.) with a feasible regret bound are contained in the Strongly Smooth class, which is itself contained in the Mildly Smooth one.

Q: Does NET-Q-LEARNING indeed obtain a regret bound that is independent of $\nu$ in the Strongly Smooth MDP setting? In this case, is it strictly better than CINDERELLA for this MDP class?

We remark that, as the lower bound decreases with $\nu$, having a regret guarantee which improves for higher $\nu$ is a desirable feature. Yes, the regret of NET-Q-LEARNING is independent of $\nu$ (for $\nu\ge 1$), but this is a massive drawback w.r.t. our algorithm.

In fact, if one ignores the smoothness of the process, no regret bound better than $K^{\frac{d+1}{d+2}}$, the one of NET-Q-LEARNING, can be obtained. But ignoring all the smoothness of the process proves suboptimal: the higher the $\nu$, the lower the regret. Algorithms like ours, or the ones proposed in [2], as well as the ones coming from the Kernelized RL literature, are able to approach regret order $\sqrt K$ for $\nu \to \infty$, while NET-Q-LEARNING on MDPs cannot do better than $K^{3/4}$ (as $d\ge 2$).

Furthermore, note that many real settings are characterized by a very high value of $\nu$: if the transition of the MDP is affected by Gaussian noise, $\nu=+\infty$. Indeed, the convolution of any function with a Gaussian density is infinitely differentiable.

[1] Vakili et al., On Information Gain and Regret Bounds in Gaussian Process Bandits.

[2] Maran et al., No-Regret Reinforcement Learning in Smooth MDPs.

[3] Lattimore et al., Learning with Good Feature Representations in Bandits and in RL with a Generative Model.

[4] Zanette et al., Learning Near Optimal Policies with Low Inherent Bellman Error.

Comment

Dear qoL4, please try to address the author's response and elaborate if it changed your opinion and score.

Thank you.

Comment

Thank you for the detailed rebuttal.

To be clear, I do not contend the results in the paper did not require technical novelty. In addition, my complaints about the inefficiency and weak guarantees were not to say you could have improved them.

My concerns are all with regards to the fact that the problem studied involves many technical details adding a layer of complication over a setting which is already quite niche, one that fundamentally leads to inefficient algorithms and weak guarantees.

Hence, I will maintain my score (but certainly do not oppose acceptance).

Official Review
Rating: 6

This paper introduces a notion of local linearization, which is then applied to smooth MDPs. It generalizes the "Eleanor" algorithm into "Cinderella," which, by avoiding a "cardinality of N" term on the suboptimality with respect to the inherent Bellman error, gets sublinear regret for all classes of smooth problems.

EDIT: I appreciate the authors' detailed feedback. My score remains unchanged.

Strengths

This paper is written very clearly, exposes all the strengths and weaknesses of its contributions, and makes connections to prior work in a thorough and honest fashion. The paper explains, in particular, why Eleanor is inadequate, and how the parameter learning is decoupled in a way that permits efficient learning. This requires additional technical subtlety in the analysis, but with it comes a bound which would be intractable otherwise.

Weaknesses

This paper is well executed and technically solid, but I believe its main weakness is that the results are not particularly surprising to an expert in the area. It seems clear that one can decompose "smooth" MDPs into locally linear ones (this has been done, e.g., with zooming bandits and zooming MDPs), and it is rather unsurprising that a careful way of handling the analysis makes these objects learnable. I don't see any particularly new techniques or insights, and so it is hard for me to be incredibly excited about the result. Still, the paper is executed commendably and for that I lean towards acceptance.

There are also some limitations with presentation. Chief among them, the authors should do more to emphasize the algorithmic differences between Eleanor and Cinderella (as they did so well with Eleanor's limitations for regret). The extent of the discussion seems limited to the sentence "Difference with ELEANOR stays in the fact that parameters relative to different regions are learned separately", which (a) bears elaboration, (b) is somewhat hidden in the mass of text, and (c) is not grammatically correct. I would encourage the authors to explain what techniques and ideas are needed to update the algorithm and analysis for the resultant guarantees.

Questions

What, if any, are the new analysis techniques developed for this setting?

Limitations

Novelty, somewhat incremental. Moreover, this paper has the computational inefficiency challenges associated with many algorithms in the field, and, like any non-parametric style method, solves exp(dimension, horizon). Lastly, there do not seem to be any lower bounds that characterize the correct exponents.

Author Response

Q: The extent of the discussion seems limited to the sentence "Difference with ELEANOR stays in the fact that parameters relative to different regions are learned separately" ... What, if any, are the new analysis techniques developed for this setting?

We thank the reviewer for giving us the opportunity to clarify this point. Section 3.1 does not explain the novelty of our approach, which is somewhat hidden in the appendix; we explain it here and will add it to the final version.

Before this work, there were two main approaches to continuous MDPs.

  1. Discretization-based algorithms (e.g., the Zooming algorithm mentioned by the reviewer) cannot fully exploit smoothness: due to the "zero-order approximation", the regret scales as $K^{\frac{1+d}{2+d}}$ even for $\nu\gg 1$, which makes them suboptimal.
  2. Linearization-based approaches [3], which are able to exploit the parameter $\nu$, achieving regret $\sqrt K$ for $\nu \to \infty$, but fail to give a sublinear regret bound for $d>2\nu$. The reason can be found in a well-known problem of linear models with misspecification [1].

Our novelty lies in combining the two approaches by splitting the parameters of the feature map (linearization) across different regions (discretization). In this way, we are able to get the best of both worlds: using our local linear approximation, we move from the zeroth-order approximation of Zooming/Net-Q-Learning to an arbitrary order $\nu$, which gives sharper bounds for smooth functions. At the same time, the locality of the feature map bypasses the problem in [1], since it allows us to avoid tools from linear algebra (like the "optimal design" argument) [1,2], which have led to the same regret as [3]. This feature of Cinderella shows that those techniques can be replaced with the "pigeonhole argument" widely employed in the field of tabular MDPs. As a result, we get the first regret bound for this setting that achieves $\sqrt K$ for $\nu \to \infty$ while always being sublinear.

In particular, we think that bridging these two approaches - which do not seem to have anything in common - and finding a strategy to elude the lower bound presented in [1] are relevant technical contributions to the learning theory community.
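
To make the "parameters learned separately per region" idea concrete, here is a minimal sketch (purely illustrative; the actual algorithm performs optimistic planning on top of such estimates, and all names below are hypothetical): each region keeps its own regression statistics, so data from one region never constrains another.

```python
import numpy as np

d, n_regions, lam = 4, 8, 1.0
A = [lam * np.eye(d) for _ in range(n_regions)]  # per-region Gram matrices
b = [np.zeros(d) for _ in range(n_regions)]      # per-region regression targets

def update(r, feat, target):
    # data observed in region r updates that region's statistics only
    A[r] += np.outer(feat, feat)
    b[r] += target * feat

def theta_hat(r):
    # ridge estimate for region r, independent of all other regions
    return np.linalg.solve(A[r], b[r])

update(3, np.array([1.0, 0.2, 0.5, 0.1]), target=0.7)
print(theta_hat(3), theta_hat(0))  # region 0 is unaffected
```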

Q: this paper has the computational inefficiency challenges associated with many algorithms in the field, and, like any non-parametric style method, solves exp(dimension, horizon).

As the reviewer correctly states, the same computational challenges exist also for celebrated algorithms which work at a similar level of generality. We remark that [4, 5] require access to an optimization oracle over the space of candidate $Q$ functions. This means that if one is interested in approximating a $C^\nu$ function, as in our case, an optimization over a space of $O(2^{(1/\varepsilon)^d})$ functions is needed (if one is satisfied with an $\varepsilon$-cover, otherwise $+\infty$).

Q: Lastly, there do not seem to be any lower bounds that characterize the correct exponents.

As we write in Appendix E.7 (Lower bounds), matching the lower bound for this setting is a challenging open problem. The best-known lower bound for the order in $K$ is $\frac{\nu+d}{2\nu+d}$, which is inherited from smooth bandits. Nonetheless, even achieving it for Kernelized MDPs with the Matérn kernel - a problem which is a subfamily of Strongly Smooth MDPs, which are in turn a subfamily of our setting - is an open problem (see the COLT 2024 Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning). Note that, for the former class of processes, a regret order of $\frac{\nu+2d}{2\nu+2d}$ was never proved before this paper; therefore, our result is novel even when restricted to the family of Kernelized MDPs.

[1] Lattimore et al., Learning with Good Feature Representations in Bandits and in RL with a Generative Model.

[2] Zanette et al., Learning Near Optimal Policies with Low Inherent Bellman Error.

[3] Maran et al., No-Regret Reinforcement Learning in Smooth MDPs.

[4] Jin et al., Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms.

[5] Du et al., Bilinear Classes: A Structural Framework for Provable Generalization in RL.

Comment

I thank the authors for their detailed feedback. I still don't feel that the novelty warrants a score increase, but I do appreciate the authors' clarification.

Official Review
Rating: 5

The paper introduces the concept of Locally Linearizable MDPs, a class of MDPs that generalizes existing ones like Linear MDPs and MDPs with low inherent Bellman error. In this model, the state-action space is partitioned into $N$ regions, where the Q-functions belong to a class that allows the result of the Bellman optimality operator to be well approximated by a linear function within each region, up to some "inherent Bellman error". The authors propose CINDERELLA, a no-regret (computationally inefficient) algorithm designed for this class, achieving a regret bound of $Nd\sqrt{K} + \sqrt{d}\,\mathcal{I}K$, where $N$ is the number of partitions, $d$ is the dimension of the state-action feature space, $K$ is the number of episodes, and $\mathcal{I}$ is the inherent Bellman error.

Next, they present the class of "Mildly Smooth MDPs". This class is similar to, yet slightly less general than, "Weakly Smooth MDPs", which assume that the output of the Bellman operator on a smooth function remains smooth. Still, Mildly Smooth MDPs are more general than "Strongly Smooth MDPs" and are thus positioned somewhere between the two. The authors show that Mildly Smooth MDPs are Locally Linearizable for some partition and inherent Bellman error, and thereby achieve a regret of $H \nu^d K^{\frac{\nu+2d}{2\nu+2d}} + H^{\frac{2\nu + 2d}{\nu}}$, where $\nu$ is the smoothness parameter of the class.
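
For intuition on how the second bound can emerge from the first, here is a heuristic numerical sketch under the assumption (ours, not quoted from the paper) that a Taylor-type representation at scale $\varepsilon$ yields roughly $N \approx \varepsilon^{-d}$ regions and inherent Bellman error $\mathcal I \approx \varepsilon^{\nu}$: balancing the two terms of $Nd\sqrt K + \sqrt d\,\mathcal I K$ over $\varepsilon$ then reproduces the $K^{\frac{\nu+2d}{2\nu+2d}}$ exponent.

```python
import numpy as np

nu, d = 2.0, 3.0

def min_regret(K):
    # assumed scalings (heuristic): N ~ eps^(-d) regions, error I ~ eps^nu
    eps = np.logspace(-4, 0, 4000)
    return np.min(eps ** (-d) * np.sqrt(K) + eps ** nu * K)

K1, K2 = 10.0**8, 10.0**10
slope = np.log(min_regret(K2) / min_regret(K1)) / np.log(K2 / K1)
print(slope, (nu + 2 * d) / (2 * nu + 2 * d))  # both close to 0.8
```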

Strengths

The paper is well-crafted and clearly positions itself within the existing literature. It employs a well-structured pedagogical approach. It first presents a straightforward solution for Locally Linearizable MDPs and proceeds to point out that this naive approach's regret bounds are insufficient for deducing a non-trivial regret bound for Mildly Smooth MDPs. This leads to the introduction of their algorithm, which offers improved regret bounds and leads to their main result.

The authors conduct a comprehensive analysis of the regret bounds, discussing each component to assess its necessity and provide a comparison with prior work. For the most part, the paper is transparent and addresses its limitations.

Weaknesses

The nature and definition of Mildly Smooth MDPs are not very natural, and it is hard to agree that learning becomes substantially more feasible compared to Weakly Smooth MDPs. Specifically, the authors mention that in Weakly Smooth MDPs the regret bound must scale exponentially with $H$, which they consider "statistically unfeasible". However, the proposed approach results in a regret bound that scales exponentially with $d$, the dimension of the feature space. This raises similar concerns about the practical feasibility of the setting. The authors claim that for $\nu \to \infty$ the regret becomes of order $\sqrt{K}$, but this is not clearly derived from the provided regret bounds, leaving it unclear whether the result generalizes existing results on Strongly Smooth MDPs.

Questions

  • Does [24] also have a similar dependency on $d_{\nu_*}$? If not, I wouldn't consider your regret "improved regret bounds for Strongly Smooth MDPs".
  • Could you clarify how you achieve $\sqrt{K}$ regret in the limit $\nu \to \infty$?

Limitations

For the most part, the paper addresses its limitations.

Author Response

Q: However the proposed approach results in a regret bound that scales exponentially with $d$, the dimension of the feature space. This raises similar concerns about the practical feasibility of the setting. (...) Does [24] also have a similar dependency on $d_{\nu_*}$? If not, I wouldn't consider your regret "improved regret bounds for Strongly Smooth MDPs".

Yes, [24] also has a regret that is exponential in $d$. This can be seen either from the proof of Theorem 11, where the regret scales as $\sqrt{\mathrm{Vol}(S\times A)}=2^{d/2}$, or from the fact that the length of their feature map is $\binom{N+d}{N}\approx N^d$ (see the last paragraph before Section 4.3).
In fact, this exponential dependence cannot be avoided. Indeed, if we look at Gaussian bandits with the squared-exponential kernel, a problem family that is strictly contained in the Strongly Smooth class, [1] ensures an $\exp(d)$ regret lower bound (see Table 1 of [1]).

Regarding the comparison between our regret bound and the one of [24], we remark that not only is our regret order always strictly better, since

$$\frac{\nu+\frac{3}{2}d}{2\nu+d}>\frac{\nu+2d}{2\nu+2d}\qquad \forall\, d,\nu>0,$$

but also, and most importantly, our result is always sublinear, while the bound of [24] becomes vacuous for $d>2\nu$. Therefore, our improvement over their regret bounds is not limited to avoiding the exponential dependence on $H$.
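
A quick numerical check of the two claims above (illustrative only): the strict inequality between exponents, and the fact that the $\frac{\nu+\frac{3}{2}d}{2\nu+d}$ exponent exceeds $1$ (a vacuous bound) exactly when $d > 2\nu$.

```python
def ours(nu, d):  return (nu + 2 * d) / (2 * nu + 2 * d)
def prior(nu, d): return (nu + 1.5 * d) / (2 * nu + d)  # exponent attributed to [24]

for nu, d in [(1, 1), (2, 3), (1, 4), (0.5, 5)]:
    print(nu, d, round(ours(nu, d), 3), round(prior(nu, d), 3), prior(nu, d) > 1)
# ours < prior in every case, and prior > 1 (vacuous) exactly when d > 2*nu
```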

Q: Could you clarify how you achieve regret $\sqrt K$ in the limit $\nu\to \infty$?

For $\nu\to +\infty$, the regret order satisfies $$\lim_{\nu\to \infty}\frac{\nu+2d}{2\nu+2d}=\frac{1}{2}.$$

In general, this is the condition that papers on continuous spaces / RKHSs seek, as it shows that whenever functions tend to be infinitely smooth, the optimal order is recovered.

[1] Vakili et al., On Information Gain and Regret Bounds in Gaussian Process Bandits.

Comment

The first term in your regret bound scales as $H d_{\nu_*} K^{\frac{\nu+2d}{2\nu+2d}}$.

$\lim_{\nu \to \infty} d_{\nu_*} = \infty$, so how do you achieve regret $\sqrt{K}$ in the limit $\nu\to\infty$?

Comment

As the reviewer correctly points out, the constant in the bound diverges as $\nu \to +\infty$.

To overcome this issue, we note that, since the smooth spaces satisfy

$$\nu_1<\nu_2\implies C^{\nu_2}\subset C^{\nu_1},$$

we can limit the magnitude of $\nu$ in case it is much bigger than $K$, and apply the regret bound for a lower value $\nu'$. Specifically, if $\nu=+\infty$, we can apply the bound for $\nu':=\log(K)<\nu$, which results in

$$R_K\le \widetilde{\mathcal O}\big(d_{\nu'}K^{\frac{\nu'+2d}{2\nu'+2d}}\big)\le \widetilde{\mathcal O}\big(\log(K)^d K^{1/2+\frac{d}{d+\log(K)}}\big)\to \widetilde{\mathcal O}(K^{1/2}),$$

where we have used line 253 to bound $d_{\nu'}$. The appearance of $\log(K)^d$ should not be alarming: this term is necessary even in the simpler case of bandits with the SE kernel, as the lower bound in [1] shows.

Before the reviewer's comment, we only considered $1/2$ to be the limit of the order, i.e. the exponent of $K$, as done in most works in the field. Still, we think that the result of this discussion is valuable, and we are going to add it to the final version of the paper.
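
A numerical illustration of the truncation argument above (a sketch that only evaluates the exponent $\frac{\nu'+2d}{2\nu'+2d}$ with $\nu'=\log K$, e.g. for $d=3$):

```python
import math

d = 3
for K in (10**4, 10**8, 10**16, 10**32):
    nu_p = math.log(K)
    print(f"K=1e{round(math.log10(K))}: exponent ≈ {(nu_p + 2 * d) / (2 * nu_p + 2 * d):.3f}")
# ≈ 0.623, 0.570, 0.538, 0.520 -> tends to 1/2 as K grows
```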

[1] Vakili et al., On Information Gain and Regret Bounds in Gaussian Process Bandits.

Official Review
Rating: 7

The paper discusses the concept of locally linear MDPs, a general representation class of MDPs that extends previous works on learnable (sublinear regret in $K$) and feasible (polynomial regret in $H$) episodic MDPs.

Strengths

The paper considers important complexity questions associated with continuous episodic MDPs, and is very well-written. Compared to the much more well-studied tabular episodic case, the complexity of continuous episodic MDPs is much more challenging to characterize. The authors identify a class of Mildly Smooth MDPs and show that one can achieve no-regret learning in this regime. The regret bound still has an exponential dependence on $H$, but the authors demonstrated through a continuous bandit special case that this exponential dependence is unavoidable. The class of Mildly Smooth MDPs is larger than the previously studied classes, such as Kernelized MDPs and Strongly Smooth MDPs, where no-regret learning is possible.

Weaknesses

Even though the theoretical development is sound and complete, I think the authors could do better in further highlighting their contributions, which I will use the questions below to elaborate on.

Questions

Can you give some concrete examples (either purely constructed examples or from real-world applications) of Mildly Smooth MDPs that are not Strongly Smooth? Since I'm not very familiar with the literature, this could help me further understand the additional new problem instances for which you are able to characterize no-regret learning.

Moreover, can you discuss what the current gap of your $\tilde{O}(Hd_{\nu_*}K^{\frac{\nu+2d}{2\nu+2d}}+H^{\frac{2\nu+2d}{\nu}})$ bound is, especially in terms of $K$?

Limitations

There is no obvious limitation in this paper.

Author Response

We point out that the regret of our algorithm does not depend exponentially on $H$. Exponential dependence cannot be avoided for the Lipschitz class, and this work was also born out of the effort to find a class of MDPs able to elude this undesirable phenomenon.

Q: Can you give some concrete examples (can be purely constructed examples or from real-world applications) of mildly smooth MDPs that are not strongly smooth?

To get an idea of an MDP that is Mildly Smooth but not Strongly Smooth, one can think about transition functions of the form

$$s'=f(s,a)+U,\qquad U\sim \mathrm{Unif}(b_1,b_2).$$

Indeed, the density function of the uniform noise is not continuous, which makes this class of MDPs not Strongly Smooth. Still, we argue that if $f(\cdot,\cdot)$ is not itself discontinuous, this class of MDPs is Mildly Smooth. For more details, see the example at the end of page 29 and the whole of page 30. Of course, this is true not only for uniform noise, but for any noise admitting a bounded density function, even a discontinuous one. We argue that this family of processes is rather realistic and, therefore, the generalization has practical relevance.
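
A toy instance of the transition described above (the drift $f$ and all constants are illustrative choices, not taken from the paper): a smooth drift plus uniform noise, whose density is bounded but discontinuous.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(s, a):
    # a smooth drift, chosen arbitrarily for illustration
    return 0.9 * s + 0.1 * np.sin(a)

def step(s, a, b1=-0.05, b2=0.05):
    # uniform noise: bounded but discontinuous density
    return f(s, a) + rng.uniform(b1, b2)

s = 0.0
for _ in range(5):
    s = step(s, a=1.0)
    print(s)
```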

Q: Moreover, can you discuss what the current gap of your $\widetilde O(Hd_{\nu_*}K^{\frac{\nu+2d}{2\nu+2d}}+H^{\frac{2\nu+2d}{\nu}})$ bound is, especially in terms of $K$?

We do not understand which "gap" the reviewer is interested in, so we kindly ask the reviewer to clarify.

Comment

Thanks for the clarifications. By "gap" I mean the gap between the current bound and the information-theoretic lower bound on the regret.

Comment

We thank the reviewer for this question, which allows us to highlight some future goals of this line of research on MDPs with continuous state and action spaces.

Matching the lower bound for this setting is a challenging open problem. The best-known lower bound for the order in $K$ is $\frac{\nu+d}{2\nu+d}$, which is inherited from smooth bandits. Nonetheless, even achieving it for Kernelized MDPs with the Matérn kernel - a problem which is a subfamily of Strongly Smooth MDPs, which are in turn a subfamily of our setting - is an open problem (see the COLT 2024 Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning). Note that, for the former class of processes, a regret order of $\frac{\nu+2d}{2\nu+2d}$ was never proved before this paper; therefore, our result is novel even when restricted to the family of Kernelized MDPs.

Why is this problem so hard? The root of the issue lies in the so-called "covering arguments", which regularly appear in the analysis of regret for value-based methods; for an example, see Lemma D.6 of the seminal paper [1] on Linear MDPs. Indeed, as a value-based algorithm needs to estimate the value function at each step $h$ from the same function at step $h+1$, the corresponding regression problem has a stochastic target. When solving a regression problem of this type, where the random target is not independent of the samples used for estimation, we pay a penalization term which depends on the covering number of the function class.

Unfortunately, the function class used in our problem - or in Kernelized MDPs - is very big, and this has a pernicious effect on the final regret bound. In previous papers, this also led to superlinear regret bounds (Table 1), while here, with the idea of local linearity, we are able to get $K^{\frac{\nu+2d}{2\nu+2d}}$. Doing better than this would mean finding a way to avoid the covering argument, but we do not know if this is possible: again, the lower bound is proved for continuous-armed bandits, where covering arguments are not needed.
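
For a concrete sense of this gap (a numerical sketch only): the achieved exponent $\frac{\nu+2d}{2\nu+2d}$ versus the best-known lower-bound exponent $\frac{\nu+d}{2\nu+d}$.

```python
def upper(nu, d): return (nu + 2 * d) / (2 * nu + 2 * d)  # achieved exponent
def lower(nu, d): return (nu + d) / (2 * nu + d)          # best-known lower bound

for nu, d in [(1, 1), (2, 2), (4, 2), (10, 3)]:
    print(nu, d, round(lower(nu, d), 3), round(upper(nu, d), 3))
# e.g. nu=2, d=2: 0.667 vs 0.75; both exponents tend to 1/2 as nu grows
```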

[1] Jin et al., Provably Efficient Reinforcement Learning with Linear Function Approximation. Conference on Learning Theory.

Comment

Thank you for the answer!

Official Review
Rating: 6

This paper studies a structural property called local linearity that makes continuous MDPs learnable. In particular, local linearity means that the continuous state-action space can be partitioned into multiple regions and, in each region, the Q-function is linear w.r.t. a unique (different) parameter. The paper first proposes an algorithm to learn MDPs with local linearity and proves that the regret is sublinear. Then, the paper shows that Mildly Smooth MDPs, where the Bellman operator is smooth, are locally linearizable, thereby demonstrating the broadness of local linearity.

Strengths

  1. The studied problem is important.
  2. The new concept of local linearity is interesting and indeed captures a broad class of MDPs according to the paper.

Weaknesses

  1. The proposed algorithm is computationally inefficient. It needs to run over all regions, while the number of regions might be exponentially large according to Section 4.

Questions

Please see the weaknesses part.

Limitations

No further limitations need to be addressed.

Author Response

Q: The proposed algorithm is computationally inefficient. It needs to run over all regions, while the number of regions might be exponentially large according to Section 4.

It is true that the number of regions is exponential in $d$. This makes the computational complexity exponential in this parameter. Still, we argue that this is not a downside w.r.t. the state of the art.

  1. State of the art: other algorithms for very general continuous RL settings require access to an optimization oracle [1, 2] over the space of candidate $Q$ functions. This means that if one is interested in approximating a $C^\nu$ function, as in our case, an optimization over a space of $O(2^{(1/\varepsilon)^d})$ functions is needed (if one is satisfied with an $\varepsilon$-cover, otherwise $+\infty$).

  2. Magnitude of $d$: while in linear MDP problems it is well known that $d$ may be very large, almost of the order of $K$, the same does not apply here. In fact, the lower bound, even for the simple case $\nu=1, H=1$, entails that $d$ must be much smaller than $\log(K)$: $$R_K=\Omega\big(K^{\frac{d+1}{d+2}}\big)\quad \text{and} \quad d\approx c\log(K) \implies R_K=\Omega\big(K^{\frac{c\log(K)+1}{c\log(K)+2}}\big)=\Omega(K).$$ This means that, unless $d\ll \log(K)$, the problem is provably unlearnable. Therefore, an exponential dependence on $d$ is not as bad as it may seem.

[1] Jin et al., Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms.

[2] Du et al., Bilinear Classes: A Structural Framework for Provable Generalization in RL.

Comment

Thanks for the response. The second point seems reasonable. I would raise my score to 6.

Final Decision

This paper investigates regret minimization in continuous and smooth MDPs. The authors first study a linear MDP setting where the model parameter $\theta$ may change across different partitions of the state space, and the feature map $\phi$ is fixed throughout the space. They design a new algorithm, Cinderella, which generalizes an existing algorithm for the linear MDP setting (Eleanor). The authors then apply their generalization to the continuous and smooth MDP setting by defining the feature space as polynomials, leveraging Taylor's expansion for smooth functions. Their results improve upon prior works for the Mildly Smooth MDP setting (Definition 3, which assumes Bellman operator smoothness). Additionally, they derive regret bounds for Cinderella in the Strongly Smooth and Kernelized MDP settings.

As pointed out by the reviewers, this work has some limitations:

  1. Although the paper is well-executed, it is unclear which new analysis tools the authors introduce and how other researchers can utilize these tools to enhance their results or develop new algorithms based on these findings.
  2. Cinderella lacks computational efficiency, which is a common issue with optimistic algorithms.

Despite these limitations, I believe this paper makes a significant contribution by formalizing an intuitive idea and advancing our understanding of RL algorithms for continuous MDPs, hence I recommend accepting this paper. I encourage the authors to incorporate the reviewers' comments in the final draft and include the analysis mentioned in their discussion with Reviewer 1ViU.