Horizon-Free Regret for Linear Markov Decision Processes
We present an algorithm that achieves a regret bound of $\tilde{O}(\mathrm{poly}(d)\sqrt{K})$ for the horizon-free linear MDP problem.
Abstract
Reviews and Discussion
This paper introduces an algorithm for linear MDPs that achieves a regret bound independent of the horizon length $H$. It specifically analyzes the regret by grouping indices with the same maximal total variance for the optimal value function, allowing value functions within the same group, even if the values are inhomogeneous, to share samples. Despite its computational inefficiency, this work addresses the research question of whether it is possible to obtain a horizon-independent regret bound in linear MDPs, extending prior research on horizon-free regret analysis in linear mixture MDPs.
Strengths
In a linear MDP (simply assuming a known reward function), by setting the transition kernel parameter appropriately, the action value for each state-action pair is linear in the feature, so finding the optimal action at step $h$ is the same as solving a linear bandit problem. Based on this perspective, it seems like an interesting approach to group these bandit problems and share samples to get an $H$-independent regret bound.
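A worked version of this observation (a sketch in notation introduced here, using the linear-MDP factorization $P_{s,a}v=\phi(s,a)^{\top}\mu^{\top}v$):
$$Q_h^*(s,a) \;=\; r(s,a) + P_{s,a}V_{h+1}^* \;=\; r(s,a) + \phi(s,a)^{\top}\underbrace{\mu^{\top}V_{h+1}^*}_{=:\;\theta_h},$$
so, with a known reward, picking the optimal action at step $h$ amounts to maximizing a linear function of the feature with an unknown $d$-dimensional parameter $\theta_h$, i.e., one linear bandit problem per step.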
Weaknesses
- While the analysis of horizon-free regret in linear MDPs, as an extension of the algorithm for linear mixture MDPs, is intriguing, the technical novelty in this paper seems somewhat limited. As mentioned in the related work, Zhang et al. (2021b) initially proposed a horizon-free algorithm for linear mixture MDPs, and Zhou & Gu (2022) introduced a computationally efficient horizon-free algorithm. Although it is acknowledged that linear MDPs become fundamentally more challenging because the dimension of the unknown parameters grows with the size of the state space, given that the order of the minimax regret bound with $H$ dependence is the same in both linear MDPs and linear mixture MDPs (Hu et al., 2023), it is regrettable that this paper only presents a computationally inefficient algorithm despite the existence of computationally efficient $H$-free regret bounds for linear mixture MDPs.
(Hu et al., “Nearly minimax optimal reinforcement learning with linear function approximation”, ICML 2022)
- Providing a slightly clearer explanation of the method in the main text would be beneficial. For instance, as mentioned in Technique 3, an upper bound for one quantity is required, yet a bound for a different quantity is what is given. A more explicit explanation of how these two bounds are connected would enhance clarity.
Questions
- While the set of all possible features is not precisely defined, if we define it as $\Phi=\{\phi(s,a): (s,a)\in\mathcal{S}\times\mathcal{A}\}$, then the convexity of $\Phi$ implies that for all $\phi_1,\phi_2\in\Phi$ and $\lambda \in [0,1]$, $\lambda\phi_1+(1-\lambda)\phi_2\in\Phi$. This appears to be a highly restrictive assumption. It would be beneficial to explicitly state it as a key assumption like Assumptions 1-3 if it is essential, especially considering it is not used in the existing linear MDP literature (e.g., Jin et al. 2020b).
- On page 7, could you please provide an explanation for the statement that $\mathbb{V}(P_{s,a},v)$ is a linear function of the matrix $B(v)$?
- Could you clarify how the last inequality in equation (11) holds?
- Typo?: It seems that while $l_h$ is defined below equation (10), $l_h^*$ is not defined. Please provide a definition for $l_h^*$.
- What is the meaning of "size of transition model"?
Details of Ethics Concerns
N/A
Thanks for your detailed review! Please find our discussions on the computational complexity and the significance of this paper in the common response. For other questions, please find our response below.
About the connection between the two bounds in Technique 3 (we use slightly different notation here so that the LaTeX editor renders correctly):
We try to give a more detailed explanation as follows. Our initial target is to bound
$$\sum_{k=1}^K \min\left\{ \sum_{h=1}^H \sqrt{ (\phi_h^k)^{\top} (\Lambda^k(V_{h+1}^*))^{-1}\phi_h^k } ,1\right\}.$$
Following the naive maximal-variance approach, we would need to bound the latter term, which is still hard (as shown by the counterexample). Instead, we turn to bound
$$\max_{h'\in [H]}\sum_{k=1}^K \sum_{h=1}^H \mathbb{V}(P_{s_h^k,a_h^k},V_{h'+1}^*)\leq \sum_{i=1}^{\log_2(H)+1}\sum_{k=1}^K \min\left\{ \sum_{h\in H_i}\sqrt{ (\phi_h^k)^{\top}(\Lambda^k(V_{h+1}^*))^{-1}\phi_h^k } ,1\right\},$$
where we can use the bound for the first term.
About the convexity assumption on the feature space:
Indeed, for any feature space $\Phi$, we can realize any feature in the convex hull of $\Phi$. Suppose there are two actions $a_1$ and $a_2$ with features $\phi_1$ and $\phi_2$ respectively; then we can create a new (virtual) action with feature $\lambda\phi_1+(1-\lambda)\phi_2$ by playing $a_1$ with probability $\lambda$ and $a_2$ with probability $1-\lambda$. Therefore this assumption is without loss of generality.
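To make the construction concrete, here is a short check (a sketch assuming both the reward and the transition are linear in the feature, with reward parameter $\theta$ and transition parameter $\mu$ as in the setup above): for any value function $v$, the randomized action satisfies
$$\mathbb{E}\big[r + P v\big] \;=\; \lambda\big(\phi_1^{\top}\theta + \phi_1^{\top}\mu^{\top}v\big) + (1-\lambda)\big(\phi_2^{\top}\theta + \phi_2^{\top}\mu^{\top}v\big) \;=\; \big(\lambda\phi_1+(1-\lambda)\phi_2\big)^{\top}\big(\theta + \mu^{\top}v\big),$$
so, in expectation, it is indistinguishable from an action whose feature is the convex combination $\lambda\phi_1+(1-\lambda)\phi_2$.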
About the statement "$\mathbb{V}(P_{s,a},v)$ is a linear function of the matrix $B(v)$" on page 7:
We can write $\mathbb{V}(P_{s,a},v) = \mathrm{Trace}(L^{\top}(s,a)B(v))$, where
$$L(s,a)= \left[\begin{array}{cc} \phi(s,a)\phi^{\top}(s,a) & \phi(s,a)\\ \phi^{\top}(s,a) & 1 \end{array}\right] \quad \text{and} \quad B(v)= \left[\begin{array}{cc} -\theta(v)\theta^{\top}(v) & \theta(v^2)/2\\ \theta^{\top}(v^2)/2 & 0 \end{array}\right].$$
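As a quick check of this identity (a short derivation added here, using the linear-MDP identities $P_{s,a}v=\phi(s,a)^{\top}\theta(v)$ and $P_{s,a}v^2=\phi(s,a)^{\top}\theta(v^2)$, which we assume match the paper's notation), expanding the block product and taking the trace gives
$$\mathrm{Trace}\big(L^{\top}(s,a)B(v)\big) \;=\; -\big(\phi^{\top}(s,a)\theta(v)\big)^2 + \tfrac{1}{2}\phi^{\top}(s,a)\theta(v^2) + \tfrac{1}{2}\phi^{\top}(s,a)\theta(v^2) \;=\; P_{s,a}v^2 - \big(P_{s,a}v\big)^2 \;=\; \mathbb{V}(P_{s,a},v).$$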
For fixed $(s,a)$, $\mathbb{V}(P_{s,a},v)$ is a linear function of the matrix $B(v)$. In Equation (42), with Lemma 5 we derive that
$$\begin{aligned} \sum_{k=1}^K\sum_{h\in H_i}\max_{v\in V_i}\mathbb{V}(P_{s_h^k,a_h^k},v) & = \sum_{k=1}^K\sum_{h\in H_i}\max_{v\in V_i}\mathrm{Trace}(L^{\top}(s_h^k,a_h^k)B(v)) \\ & \leq 2(d+1)^2\max_{v\in V_i}\sum_{k=1}^K\sum_{h\in H_i}\mathrm{Trace}(L^{\top}(s_h^k,a_h^k)B(v))\\ &=2(d+1)^2\max_{v\in V_i}\sum_{k=1}^K\sum_{h\in H_i}\mathbb{V}(P_{s_h^k,a_h^k},v). \end{aligned}$$

**About the last inequality in Eq. (11):** We have that
$$\max_{\phi\in\Phi} 2d\phi^{\top}\sum_{h=1}^{H-1}\mu^{\top} (V_{h+1}-V_{h+2}) =2d \max_{\phi\in \Phi}\phi^{\top}\mu^{\top}V_{2} =2d\max_{s,a}\phi_{s,a}^{\top}\mu^{\top}V_2,$$
where the first equality holds because the sum telescopes to $V_2$ (using $V_{H+1}=0$). Continuing the computation, we have
$$\max_{\phi\in\Phi} 2d\phi^{\top}\sum_{h=1}^{H-1}\mu^{\top} (V_{h+1}-V_{h+2}) =2d\max_{s,a}P_{s,a}^{\top}V_{2}\leq 2d\max_{s}V_1(s)\leq 2d.$$

**About $l_h^{*}$:** Thank you for pointing this out. It should be $l_h$ rather than $l_h^*$.

**About "size of transition model":** It means the number of parameters needed to fully describe the transition model.

Thank you for your time and efforts in reviewing our work. We have provided detailed clarifications to address the issues raised in your comments. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.
If you have any additional questions or comments, we would be happy to have further discussions.
Thanks,
The authors
I thank the authors for their response. I am keeping my score.
This paper considers regret bounds for linear MDPs with only a mild dependency on the horizon $H$. The main results are stated in Theorem 1 (2nd page). The transition kernel is uniform over timesteps, the Q function is undiscounted, and the time horizon $H$ is fixed. The challenge here is that the optimal action may depend on the remaining time (i.e., should the model explore more or not), which is reflected in the step index $h$. The process is sequential: before the beginning of each episode $k$, we declare a policy, play it, and observe rewards, and evaluation is based on an online measure (regret). The assumptions are: 1) bounded total rewards; 2) linear transition kernel and rewards with constant scale per feature; Assumption 3 states that the transition and reward share the parameter $\theta$.
The contribution of this paper is to show a horizon-free regret bound (regret with subpolynomial dependence on $H$) for MDPs under Assumption 2 (compared with similar results under Assumption 3 by Zhou and Gu 2022). The paper proposes an algorithm (Algorithms 1--3 combined) to achieve this regret. The algorithm seems to be computationally inefficient given that it defines some quantities that depend on $v$ (a real variable with bounded norm). While the contribution of this paper is stated in the paper and it seems fine, I do not consider the paper to be well-written and ready for publication. Since I am not convinced of the correctness of the claim, I lean negative on the paper, but I do not have a strong position. I have several questions in the corresponding section for the sake of clarity.
Strengths
- The paper considers an interesting question: whether linear MDPs are learnable with horizon-free sample complexity. The paper considers the setting where the model parameters are unknown, which is surely interesting to the community.
- Sections 1--2 are well-written, except for the related work section, where many papers are cited without a specific connection, but overall, nice.
Weaknesses
- The algorithm is computationally inefficient, so the algorithm here is more conceptual than practical.
- The structure of the paper can be improved. It seems the structure follows Zhou & Gu (2022), but presenting the "technical challenge" for 3.5 pages without first presenting the algorithm (the algorithm is introduced after that) does not make much sense. At least, these 3.5 pages do not help the reader become convinced of the results any more than a 1-page summary would.
- There are many typos, which make it hard for the reader to be convinced of the results.
Questions
My largest question is about Algorithm 1. If I understand correctly, these algorithms require optimization over a continuous space. First, one of the sets is not defined in this paper; similar quantities are defined twice, so I assume these are equivalent. Lines 6--9 in Alg 1 loop over this set, which defines quantities as functions over it, and the optimization is also performed over this space.
- One of the defined quantities is not used in line 10 of Alg 1 and after?
On Assumption 2: Is assumption (2d) standard? It seems that the number of unknown parameters can be arbitrarily large, and it is highly nontrivial to have a regret bound that does not depend on the size of this parameter.
On the similarity with the (contextual) linear bandit problem: in the linear bandit problem, the feature of each arm is disclosed (in the contextual version it depends on the context), and the reward is linear in the feature with an unknown parameter. In the MDP of this paper, the transition has uncertainty, and the uncertainty of the reward with respect to the transition kernel is much more involved.
The following are rather minor opinions:
- VOFUL: Is it by any means related to OFUL? If I understand correctly, OFUL (e.g., Abbasi-Yadkori et al. 2011) is an online bandit algorithm, whereas Alg 3 is for a uniform confidence bound (also referred to as a self-normalized bound or martingale bound).
- Typos. Many periods are missing, for example:
  - p2: "See Section 3 for more details. Due to space limitation, we postpone the full proof of Theorem 1 to Appendix B"
  - p6: "That is, we need to use all the samples along the trajectory..."
  - p7: "Fix some eps>0"
- Lines 5, 11, 13 in Alg 1 are comments; special characters could be used to indicate that.
Details of Ethics Concerns
The paper is algorithmic and no ethics reviews needed.
Thanks for your detailed review! Please find our discussion on the computational complexity in the common response. For other questions, please find our response below.
About paper structure: We have adjusted the structure of this paper and presented the algorithm before introducing the techniques.
About the typos: We have fixed them accordingly in the revision. Thank you for the effort spent checking the correctness of this work.
About Algorithm 1:
About the undefined set:
The set in question is continuous, and the similar quantity is its discretization.
You are correct that the two notations denote the same set.
We have unified the notation and updated the paper accordingly. Thanks for pointing this out!
"Is it not used in line 10 of Alg 1 and after?"
We do not use it in the main algorithm. The corresponding quantity in the output of Algorithm 3 is used in the statement and proof of Lemma 10, which states that our estimate is a close approximation of the true quantity.
Assumption 2: Is assumption (2d) standard?
Assumption (2d) can be found in Section 8.1 of [Agarwal et al., 2019]. The commonly used assumption in [Jin et al., 2020; He et al., 2022] implies assumption (2d). It is possible to obtain a regret bound independent of the number of parameters in $\mu$ due to the linear structure, since we are only required to learn $\mu^{\top}v$ for certain vectors $v$.
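To illustrate why the potentially huge number of parameters in $\mu$ need not enter the regret (a sketch in our notation, not a statement from the paper): for any fixed value function $v$,
$$P_{s,a}v \;=\; \sum_{s'}P(s'\mid s,a)\,v(s') \;=\; \phi(s,a)^{\top}\mu^{\top}v,$$
so decision-making only requires the $d$-dimensional projections $\mu^{\top}v$ for the value functions $v$ the algorithm actually uses, rather than the full parameter $\mu$ (for infinite state spaces the sum becomes an integral).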
Additional references:
[Agarwal et al., 2019] Reinforcement learning: Theory and algorithms
[Jin et al., 2020] Provably Efficient Reinforcement Learning with Linear Function Approximation
[He et al., 2022] Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes
Similarity with contextual linear bandit problem:
We believe the linear MDP has the same intrinsic hardness as the contextual linear bandit problem. The transitions are involved, but it suffices to learn $\mu^{\top}v$ for certain vectors $v$ to make decisions.
About VOFUL:
VOFUL is an algorithm with a variance-aware regret bound for linear bandits. The high-level idea is to construct a confidence region for the unknown parameter and then choose an arm optimistically. Both VOFUL and OFUL rely on optimism, but the algorithm structures are quite different.
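For intuition on the shared optimism principle, here is a minimal sketch (ours, not the paper's algorithm or VOFUL itself) of optimistic arm selection over a generic confidence region, with a hypothetical finite discretization `theta_grid` of that region:

```python
import numpy as np

def optimistic_arm(arms, theta_grid):
    """Pick the arm with the largest optimistic value over a confidence region.

    arms: list of d-dimensional feature vectors (np.ndarray).
    theta_grid: hypothetical finite discretization of the confidence region
        for the unknown parameter theta*.
    """
    best_arm, best_value = None, -np.inf
    for phi in arms:
        # Optimistic value: the largest predicted reward over all parameters
        # still consistent with the observed data.
        value = max(float(phi @ theta) for theta in theta_grid)
        if value > best_value:
            best_arm, best_value = phi, value
    return best_arm

# Toy usage with made-up features and a made-up confidence region.
arms = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
theta_grid = [np.array([0.2, 0.9]), np.array([0.4, 0.3])]
print(optimistic_arm(arms, theta_grid))
```

As the response notes, the optimism step is shared with OFUL; the difference lies in how the confidence region itself is constructed.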
Thank you for your time and efforts in reviewing our work. We have provided detailed clarification to address the issues raised in your comments. If our response has addressed your concerns, we would be grateful if you could re-evaluate our work.
If you have any additional questions or comments, we would be happy to have further discussions.
Thanks,
The authors
Overall, I will be keeping my score, mainly due to the structure of the paper, but please put less weight on my review. If the other reviewers consider this paper good enough, I do not object.
- Structure of the paper
I think the paper looks better. Still, the informal proof sketch of 3.5 pages is not a good idea. I know these sketches are better than nothing, and in many cases they help the reader -- we cannot understand a paper without good intuition, and typically the key points of a 25-page proof can be condensed. Meanwhile, such sketches guarantee nothing in the mathematical sense. Making them into key lemmas enables you to highlight the main results without compromising rigor.
- Assumption 2: Is assumption (2d) standard?
- Similarity with contextual linear bandit problem:
I understand, thanks. At least your comment is very clear.
- VOFUL (minor)
I mean, Algorithm 3 builds some confidence ball, but there is no "selection" component. Optimism in the Face of Uncertainty (OFUL) implies choosing the argmax of some function over a confidence ball, so Algorithm 3 does not quite match its name. I do not think this affects the decision, so please just ignore this.
Thank you for your suggestions! We will further polish our paper structure in the final version, including adding key lemmas in the proof sketch section.
The paper provides the first horizon-free bound for linear MDPs. The corresponding algorithm is also novel, since it estimates the value function and the corresponding uncertainty directly instead of estimating the transition probabilities.
Strengths
- This is the first horizon-free bound for the linear MDP problem.
- The approach is novel and interesting.
Weaknesses
- It would be better to have some numerical illustrations to see whether the given algorithm has better numerical performance.
Questions
- I wonder whether this algorithm is numerically efficient.
Thanks for the review! Please find our discussion on the computational complexity in the common response.
I appreciate the authors' response, and I will keep my rating.
The authors propose a novel algorithm for reinforcement learning (RL) in linear Markov Decision Processes (MDPs). The key feature of the presented algorithm is its horizon-free regret bound, meaning that the regret of the algorithm scales only polylogarithmically with the problem horizon.
Strengths
- The first algorithm for linear MDPs that attains nearly horizon-free regret bounds;
- Very detailed discussion of the novelties of the proposed approach;
Weaknesses
- The algorithm is computationally infeasible: it requires optimizing over the confidence region of all linear MDP models, which is infeasible in MDPs with an infinite number of states;
- The proof of Lemma 11 is unclear to me; see the Questions section;
Questions
Here I present questions about the proofs and additionally indicate misprints that I found. I would be willing to increase my score if all the clarifications regarding the proofs are given.
Questions regarding Lemma 4:
- Why does the first claimed inequality hold?
- Are there any references that show why (16) holds?
Lemma 11:
- Possible misprint: after inequality (a), a square is missing in the variance term, and the second term also seems incorrect;
- The definition of $\bar v$ is not clearly written: is it $\bar v(s) = \max_{a} P_{s,a} v$?
- It is not clear why (c) holds, because it is not easy to see why the presented quantity is the maximizer, or perhaps I did not understand the definition;
- Why does (d) hold? What is the constant $c$?
Misprints and undefined or confusing notation.
- Eq. (8): $V^*$ appears without the subscript $h$;
- Equations (9) and (13): missing square on $\bar \sigma^k_h$ in the second factor after the Cauchy-Schwarz inequality;
- After equation (9) it is claimed that $\sigma^k_h$ is a variance, whereas before it was a square root of a variance; this is very confusing;
- After eq. (10): at the beginning $l$ appears with a star and afterwards without;
- After the definition: "w.r.t." should be written with periods, not commas;
- Proof of Lemma 5: one quantity is not defined;
- After equation (28): Var is used for the variance whereas it was previously denoted by $\mathbb{V}$, and after it there is a confusing claim;
- After equation (29): missing brackets in the inequality;
- After equation (30): a subscript is missing in the definition, and it is not clear why the stated relation holds;
- Equation (43): the tilde is not over the intended symbol but over the whole expression;
- Last paragraph of page 21: undefined norm type;
- Lemma 12: the norm type is missing in the statement (which norm is it, the one used later?).
Thanks for your detailed feedback! We have updated our paper accordingly. Please find our response to your questions below.
About Lemma 4:
We revised the proof of Lemma 4 in the revision. We are sorry that in the previous version we gave a mistaken definition. Indeed, we need to add a regularization factor to make sure the chosen point is the geometric center of the set.
Here we present the responses to your two questions regarding Lemma 4. Because of the symmetric structure of the feature set, we can make a normalizing assumption without loss of generality. Now we address your questions.
Why the first inequality holds:
We have added a detailed proof in the appendix on page 13 (marked in red). At a high level, denote by $f(x)$ the $(d-1)$-dimensional volume of the slice of the feature set at level $x$ (here $x$ ranges over the first dimension of the feature). Note that the slice is non-empty, convex, and closed. Letting $\phi$ be an arbitrary element of the slice, we obtain an inclusion between scaled slices, which implies that the $(d-1)$-dimensional volume of one slice is at least a constant fraction of that of another for the relevant range of levels.
About (16):
After correcting the definition, we have
$$\begin{aligned}
z &= \frac{\int_{\Phi}\phi^{\top}\psi \, du(\phi)}{\int_{\Phi}du(\phi)} \\
& = \lim_{\epsilon\to 0}\sum_{i=0}^{\left\lceil l/\epsilon\right\rceil}\frac{\int_{\phi\in \Phi,\ \phi^{\top}\psi \in [i\epsilon,(i+1)\epsilon)} \phi^{\top}\psi \, du(\phi) }{\int_{\Phi}du(\phi)} \\
& = \lim_{\epsilon\to 0}\sum_{i=0}^{\left\lceil l/\epsilon\right\rceil}\frac{\int_{\phi\in \Phi,\ \phi^{\top}\psi \in [i\epsilon,(i+1)\epsilon)} i\epsilon \, du(\phi) }{\int_{\Phi}du(\phi)} \\
& = \lim_{\epsilon\to 0}\sum_{i=0}^{\left\lceil l/\epsilon\right\rceil}f(i\epsilon)\, i\epsilon^2 \\
& = \int_{0}^{l} f(x)\,x\,dx.
\end{aligned}$$
By rearranging, and noting the above, we have that
$$\int_{0\leq x \leq z}f(x)(z-x)\,dx = \int_{z\leq x\leq l} f(x)(x-z)\,dx,$$
which gives (16).
About Lemma 11:
About Equation (a): we make a rearrangement; please see the corrected display in the revision.
About the definition of $\bar v$: it should be $\bar v(s) = \max_{a} P_{s,a} v$.
About Equation (c): with the definition of $\bar v$ above, the claimed maximization follows.
About Equation (d): the term $c$ is a typo and should be replaced; sorry for this mistake.
Typos:
About $\mathrm{Var}$ vs. $\mathbb{V}$: we use $\mathbb{V}$ to denote the variance throughout the paper. We have fixed the typos in the revision.
About the undefined quantity in the proof of Lemma 5: it was a typo. We have fixed this inequality and deleted the unused term.
About the notation in Eq. (28): we have replaced it with the correct one.
About Eq. (29): we have added the missing brackets.
About Eq. (30): we have added the missing subscript in the analysis after Eq. (30) and corrected the equation in question.
About Eq. (43) and the undefined norm on page 21: these have been corrected in the revision.
About the norm type in Lemma 12: we have specified the norm in the statement.
Thank you for your detailed response!
Now the proofs of Lemma 4 and Lemma 11 are much more clear for me and I am happy to increase my score.
We thank the reviewers for the valuable feedback. We have carefully fixed the typos in the revision. Below we present our response to the major concerns. We also respond to each reviewer individually regarding their additional concerns.
About computational complexity:
We acknowledge that the computational cost is a major limitation of this work. One major bottleneck is that we need to invoke a variance-aware linear bandit algorithm to learn the reward function, and it remains an open problem to obtain a computationally efficient variance-aware regret bound for the linear bandit problem.
We also want to note that the first horizon-free works for tabular MDPs [Wang et al. 2020] and linear mixture MDPs [Zhang et al. 2021b] were not computationally efficient either, yet they greatly inspired follow-up work on computationally efficient algorithms with horizon-free regret bounds. Therefore, we believe our work already serves as an important step toward a deeper theoretical understanding of the popular linear MDP model.
Significance of this paper:
We want to emphasize that the existing techniques for optimal bounds in linear MDPs [Hu et al., 2023; He et al., 2022] do not imply that the horizon-free case is trivial. All existing algorithms with horizon-free bounds try to fully recover the transition model. As a result, their bounds scale with the size of the transition model. The best result for tabular MDPs is [Zhang et al. 2021a], and the best one for linear mixture MDPs is [Zhou and Gu, 2022].
In linear MDPs, roughly speaking, the existing algorithms try to learn an independent linear bandit problem for the value function at each step $h\in[H]$. In the time-inhomogeneous case, each step is independent, and a polynomial dependency on $H$ is unavoidable. This approach can obtain optimal regret [Agarwal et al. 2022].
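As a back-of-the-envelope illustration of the last point (heuristic, not taken from the paper): treating each step as its own $d$-dimensional linear bandit over $K$ episodes contributes its own $\sqrt{K}$-type term, and summing over the steps introduces a factor of $H$,
$$\sum_{h=1}^{H}\tilde{O}\big(\mathrm{poly}(d)\sqrt{K}\big) \;=\; \tilde{O}\big(\mathrm{poly}(d)\,H\sqrt{K}\big),$$
which is why sharing samples across steps is needed to remove the polynomial dependence on $H$.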
In the time-homogeneous case, although the transition model is shared across all steps $h$, the value functions are still different for different $h$. **The key technical challenge is to relate samples collected for different steps $h$ without incurring a regret that depends on the size of the transition model.** To remove all polynomial factors of $H$, it is necessary to share samples among different steps $h$ (in contrast, existing papers on linear MDPs focus on the inhomogeneous case, so there is no need to share samples). As stated in Section 3, although we have the desired exploration bonus for each linear bandit separately following [Zhou and Gu, 2022], there is no standard way to bound the sum of the bonuses in Equation (1) due to the inhomogeneous information matrices. In comparison, in horizon-free algorithms for tabular MDPs and linear mixture MDPs, the samples are shared by estimating the transition model.
The paper solves a very natural question: can we provide a regret bound for linear MDPs that is independent of the time horizon? It provides an algorithm which achieves such horizon-independence. The main criticism of this work is the high computational complexity of the proposed method, but this is something that can be investigated and improved in follow up work.
Why not a higher score
Given that the main result here is theoretical without, at the moment, significant practical implications, I think poster is more appropriate than spotlight or oral.
Why not a lower score
The paper makes a solid contribution to the ML literature.
Accept (poster)