Misspecified $Q$-Learning with Sparse Linear Function Approximation: Tight Bounds on Approximation Error
Abstract
Reviews and Discussion
In this work, the authors study RL where the optimal $Q$-function is sparse-linearly representable up to error $\epsilon$. Their main contributions are i) showing the existence of a hard instance where achieving suboptimality less than $H\epsilon/C$ requires sample complexity exponential in $C$, and ii) establishing a theoretical method achieving suboptimality of $O(H\epsilon)$ with polynomially many samples.
Strengths
- An interesting work on RL learnability under limited knowledge of the environment.
- The theoretical results are strong in the sense that almost-matching upper and lower bounds are presented.
Weaknesses
- The sample complexity grows exponentially with the sparsity $k$, which is unavoidable under the present set of assumptions but undesirable in practice.
- The construction of the hard instance depends on an exponentially large state space. This is not uncommon in theoretical analyses, but it is worth discussing whether similar results hold with a smaller state space.
Some mathematical definitions are confusing:
- The definition of $\Delta(\cdot)$ seems to be missing?
- Why are the parameters restricted to the sphere? (I suspect the authors are confusing the sphere with the ball.)
- How is $s_0$ sampled? Is it fixed?
Questions
Does the proposed method exploit the sparsity other than through the cardinality of the epsilon-net?
Thank you for your careful review! We address your questions below:
Practicability: In this paper, we focus on the setting where the sparsity $k$ is much smaller than the feature dimension $d$. In particular, when the sparsity $k$ is a constant, our algorithm has a polynomial dependency on $d$ and $H$. This is because we consider a much smaller epsilon-net that exploits the sparsity structure, instead of an epsilon-net for the whole feature space. For instance, when the optimal Q-function is a $d$-dimensional linear function but can be approximated by a sparse linear function with constant sparsity, the sample complexity and running time of our algorithm would be a low-degree polynomial in $d$ (e.g., quadratic) and could be practical for real-life applications.
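To make the size comparison concrete, here is a minimal Python sketch (our own illustration, not the paper's actual construction) contrasting the cardinality of an epsilon-net restricted to $k$-sparse parameters with a net over the full $d$-dimensional unit ball; the $\binom{d}{k}(3/\epsilon)^k$ bound is the standard covering-number estimate, and the explicit grid net is deliberately crude:

```python
import math
from itertools import combinations, product

def sparse_net_cardinality(d, k, eps):
    # Choose the k nonzero coordinates (C(d, k) ways), then cover the
    # k-dimensional unit ball with an eps-net of size at most (3/eps)^k.
    return math.comb(d, k) * (3.0 / eps) ** k

def dense_net_cardinality(d, eps):
    # eps-net for the full d-dimensional unit ball: (3/eps)^d.
    return (3.0 / eps) ** d

def build_sparse_grid_net(d, k, eps):
    """Explicit (crude) net: an axis-aligned grid on every k-subset of coordinates."""
    grid = [i * eps for i in range(-int(1 / eps), int(1 / eps) + 1)]
    net = []
    for support in combinations(range(d), k):
        for values in product(grid, repeat=k):
            theta = [0.0] * d
            for idx, v in zip(support, values):
                theta[idx] = v
            net.append(theta)
    return net

if __name__ == "__main__":
    d, k, eps = 100, 2, 0.1
    print(sparse_net_cardinality(d, k, eps))   # ~ d^k / eps^k: polynomial in d
    print(dense_net_cardinality(d, eps))       # (3/eps)^d: astronomically large
    print(len(build_sparse_grid_net(5, 1, 0.5)))
```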
Exponentially large state space: When the state space is small, we can directly use algorithms for the tabular RL setting and obtain arbitrarily small suboptimality with polynomial sample complexity, using, for example, the algorithm of Jin et al. (2018). Therefore, we cannot construct a hard instance with a small state space.
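For reference, here is a schematic sketch of the tabular algorithm alluded to above (the UCB-Hoeffding variant of Q-learning from Jin et al., 2018); the `env` interface and the constant `c` are our own assumptions for illustration, not part of the paper:

```python
import math
import numpy as np

def ucb_q_learning(env, S, A, H, K, delta=0.05, c=1.0):
    """Schematic sketch of Q-learning with UCB-Hoeffding bonuses (Jin et al., 2018).

    Assumed interface: env.reset() -> state index, env.step(h, s, a) -> (reward, next_state).
    S, A are the numbers of states/actions, H the horizon, K the number of episodes.
    """
    Q = np.full((H, S, A), float(H))      # optimistic initialization
    V = np.zeros((H + 1, S))
    N = np.zeros((H, S, A), dtype=int)
    iota = math.log(S * A * H * K / delta)

    for _ in range(K):
        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))   # act greedily w.r.t. optimistic Q
            r, s_next = env.step(h, s, a)
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)     # learning rate from Jin et al.
            bonus = c * math.sqrt(H ** 3 * iota / t)
            target = r + V[h + 1, s_next] + bonus
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
            V[h, s] = min(H, Q[h, s].max())
            s = s_next
    return Q
```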
$\Delta(\cdot)$: For a set $X$, $\Delta(X)$ denotes the set of all distributions over $X$. We have added this definition to the preliminary section in the revised version. Thank you for pointing it out.
Sphere vs. Ball: In Assumption 1, we stated that each optimal parameter lies on the $d$-dimensional unit sphere. Our results can be extended to the case where the optimal parameters lie in the unit ball by creating a grid over the radius.
$s_0$: Yes, it is fixed. We assumed the initial state is deterministic at line 229.
Question: In the upper bound result, we agree that we mainly exploit the sparsity through the cardinality of the set of non-zero indices and through the cardinality of the epsilon-net. Therefore, the results subsume the setting where the misspecified Q-function of each layer lies in a finite-sized function class as a special case. Assuming there are $N$ functions at each layer, we can construct an $N$-dimensional feature vector whose $i$-th coordinate is the value of the $i$-th function. Then, this setting is captured by a $1$-sparse parameter.
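A small illustrative sketch of this reduction (the function class, helper, and toy values below are hypothetical, not taken from the paper):

```python
import numpy as np

def finite_class_features(function_class, s, a):
    """Hypothetical helper: embed a finite function class {f_1, ..., f_N} as an
    N-dimensional feature vector phi(s, a) = (f_1(s, a), ..., f_N(s, a)).

    If the true Q-function is (approximately) f_i for some i, then
    Q(s, a) ~ <theta, phi(s, a)> with theta = e_i, a 1-sparse parameter.
    """
    return np.array([f(s, a) for f in function_class])

# Toy usage with made-up functions of (s, a):
fs = [lambda s, a: s + a, lambda s, a: s * a, lambda s, a: max(s, a)]
phi = finite_class_features(fs, s=2.0, a=3.0)   # -> [5., 6., 3.]
theta = np.array([0.0, 1.0, 0.0])               # 1-sparse: selects f_2
print(float(theta @ phi))                        # equals fs[1](2.0, 3.0)
```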
References: Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, Michael I. Jordan. Is Q-learning Provably Efficient? Advances in Neural Information Processing Systems 31 (NeurIPS 2018)
Thanks for the clarification. I will keep my score.
This paper considers the problem of online reinforcement learning with linear function approximation. It is well known that in the presence of epsilon-misspecification (in the sense that $Q^*$ is $\epsilon$-approximately linear), the best one can hope for is an error guarantee that scales with $\sqrt{d}\cdot\epsilon$, where $d$ is the feature dimension, even in the bandit setting where $H=1$. The main result in this paper is to show that if we additionally assume that the optimal parameter is sparse, it is possible to improve upon this guarantee. In particular:
- The authors show that it is possible to achieve an error guarantee that scales as $H\epsilon$ (with no dependence on the dimension), albeit with sample complexity and runtime scaling as roughly $(d/\epsilon)^{k}$.
- The authors show that the error guarantee above cannot be improved further, in the sense that achieving $H\epsilon/T$ for any $T$ requires $\exp(T)$ sample complexity.
Strengths
This paper makes a reasonable contribution to the theory of reinforcement learning. Guarantees in the linearly (approximately) realizable setting the authors consider are non-trivial and often somewhat surprising, so I think this paper makes an interesting contribution by filling in another piece of the landscape.
Weaknesses
The main drawbacks preventing me from giving a higher score are as follows:
- On the upper bound side, the significance of the algorithmic techniques is somewhat narrow. Concretely, the authors consider a setting where we have a separate parameter for each step $h$. If we instead assume that there is a shared parameter for all steps $h$, then I believe that simply enumerating over an epsilon-net for the parameter space and trying each policy one-by-one would be sufficient to achieve the approximation and sample complexity guarantees the authors get. So, the only reason why a non-trivial algorithm is required at all is that there is no parameter sharing across the steps $h$. Indeed, without parameter sharing, an epsilon-net would have size at least $2^H$ and lead to $2^H$ sample complexity, so the authors' contribution can be viewed as removing this exponential dependence on $H$ for this setting.
- The problem setting in the paper is fairly niche, and will probably only be interesting/compelling to researchers working on core RL theory. Broader impact is unclear.
Questions
Overall, I do think the results are interesting and not necessarily obvious a priori---they certainly piqued my curiosity. Some questions I was inspired to think about based on the results are as follows:
- Can we get similar guarantees beyond linear function approximation? My thought would be that if we assume that $Q_h$ is $\epsilon$-approximately realizable by a class $\mathcal{F}_h$, then perhaps we can get $H\epsilon$ error while paying roughly $\max_h \{\mathrm{CoveringNumber}(\mathcal{F}_h, \epsilon)\}$ in the sample complexity using a generalization of the algorithm. If so, this would suggest that the linear structure is actually not that important.
- Have the authors thought about whether there exists an algorithm that achieves the $H\epsilon/T$ error vs. $\exp(T)$ sample complexity tradeoff in the lower bound?
- Is there any hope of achieving just $d^{k}$ sample complexity instead of $(d/\epsilon)^{k}$ sample complexity?
Thank you for your careful review and positive evaluation! Below, we respond to all your questions:
Weakness 1: If the parameters are shared across levels, then the problem setting applies only to a very limited number of scenarios. Because of the sparsity assumption, such a setting has only $k$ non-zero parameters in total. On the other hand, our setting can be applied to a much larger range of problems, since we allow $kH$ non-zero parameters.
Weakness 2: We believe core RL theory is an important research topic :)
Question 1: Yes. The function class setting can be thought of as a special case of our linear setting. When there are $N$ functions at each layer, we can construct an $N$-dimensional feature vector whose $i$-th coordinate is the value of the $i$-th function. Then, this setting is captured by a $1$-sparse parameter.
Question 2: If the transition is deterministic, then we could modify our Algorithm 1 to achieve an $O(H\epsilon/T)$ vs. $\exp(T)$ trade-off. Specifically, we first separate the $H$ levels into $H/T$ blocks, each consisting of $T$ levels. Within each block, we traverse through all possible action sequences, with $|\mathcal{A}|^{T}$ trajectories in total. Then, we eliminate candidates within each block based on the smallest Bellman error along these trajectories. This process ensures that we incur at most $O(\epsilon)$ suboptimality within each block, leading to $O(H\epsilon/T)$ overall suboptimality. The total sample complexity scales with $|\mathcal{A}|^{T}$.
If the transition is stochastic, we might be able to use the importance-sampling idea of Kearns et al. (1999) in our setting. Such a generalization could be nontrivial, and we leave it as future work.
However, restricting the sample complexity to be polynomial in this setting forces $T$ to be a logarithmic factor, resulting in only a logarithmic improvement in suboptimality. As this improvement is relatively minor and risks complicating the presentation, we chose not to include these results in the paper.
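For intuition, here is a back-of-the-envelope sketch of the blocking trade-off described above (our own illustration; it assumes deterministic transitions, that $H$ is divisible by $T$, and that each block contributes $O(\epsilon)$ suboptimality after exhaustive enumeration):

```python
def block_tradeoff(H, T, num_actions, eps):
    """Count trajectories and bound suboptimality for the blocking scheme.

    Assumptions (ours, for illustration): H is divisible by T, transitions
    are deterministic, and each block of T levels contributes at most eps
    suboptimality once all |A|^T action sequences in the block are tried.
    """
    num_blocks = H // T
    trajectories_per_block = num_actions ** T        # exhaustive enumeration
    total_trajectories = num_blocks * trajectories_per_block
    suboptimality_bound = num_blocks * eps           # = (H / T) * eps
    return suboptimality_bound, total_trajectories

for T in (1, 2, 4, 8):
    subopt, cost = block_tradeoff(H=16, T=T, num_actions=2, eps=0.1)
    print(f"T={T}: suboptimality ~ {subopt:.2f}, trajectories ~ {cost}")
```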
Question 3: Previous works (Weisz et al., 2022; Wang et al., 2021) show that even when the optimal Q-function is well-specified, any RL algorithm must have an exponential dependency on $d$ or $H$. Note that this is equivalent to the case where the sparsity $k = d$ and the approximation error $\epsilon = 0$ in our setting. Therefore, in our misspecified setting, which is strictly harder, an exponential dependency on $k$ is unavoidable, unless we accept an exponential dependency on $H$. However, we agree that a better dependency on $\epsilon$, such as $d^{k}$ rather than $(d/\epsilon)^{k}$, might be possible. We have added such improvements as future work.
References:
Michael Kearns, Yishay Mansour, Andrew Ng. Approximate Planning in Large POMDPs via Reusable Trajectories. Advances in Neural Information Processing Systems 12 (NIPS 1999)
Gellert Weisz, Csaba Szepesvari, and Andras Gyorgy. TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions. 2022.
Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap. 2021
Thank you for answering my questions! I will maintain my positive score.
In Q-learning with linear function approximation, it is assumed that the optimal Q-value function can be written as a linear function of the features of each state-action pair, i.e., $Q^*(s,a) = \langle \theta^*, \phi(s,a)\rangle$. In the misspecified setting, the assumption is relaxed so that the optimal Q-value function is only $\epsilon$-close to some linear function of the features, i.e., $|Q^*(s,a) - \langle \theta^*, \phi(s,a)\rangle| \le \epsilon$ for all $(s,a)$. Unfortunately, it has been shown that exponentially many samples are required in the misspecified setting even when the MDP has only $H=1$ step, which reduces to the linear bandit setting. Yet, a prior work shows that this hardness result can be circumvented in the linear bandit setting provided that the optimal secret vector is further assumed to be $k$-sparse for some constant $k$. In such cases, the sample complexity for linear bandits has been improved to be polynomial in the dimension $d$.
In this work, it is shown that results of a similar flavor hold for general multi-step Q-learning as well. In particular, the authors give an algorithm that uses polynomially many samples and returns an $O(H\epsilon)$-optimal policy, where $H$ is the number of steps of the MDP. Interestingly, this error has been shown to be optimal up to polylogarithmic factors.
The main novelty of the proposed algorithm, compared to the prior work, is in how it chooses the policy to execute. In particular, the policy is chosen to maximize the so-called “empirical roll-in distribution” at all levels. After that, similar to prior work, the algorithm proceeds to compute the average Bellman error and eliminate parameters that lead to a large Bellman error (though the algorithm is a bit more conservative in the sense that it always eliminates at most one candidate parameter per level in every iteration).
Strengths
The error guarantee of the proposed algorithm is shown to be optimal, though it has a surprising linear scaling with the number of steps $H$. The lower bound argument is quite interesting. The authors first demonstrate an $\Omega(H\epsilon)$ suboptimality gap assuming that the RL algorithm is given no samples, using hard instances based on binary trees. They further show that the instance can be modified so that the reward signal is exponentially sparse. Hence, the algorithm must search the space in a brute-force manner to extract useful information.
Weaknesses
The algorithm is structurally similar to the algorithm from Jiang et al., 2017. Since the algorithm eliminates one parameter per level in every iteration, it is almost doing a brute-force search over the entire parameter space. As a result, the sample complexity and runtime of the algorithm both scale exponentially with the sparsity parameter $k$ of the problem.
Questions
Do the authors have a guess as to whether the exponential dependency on $k$ in the sample complexity is necessary?
Thank you for your careful review!
First, we want to emphasize that our algorithm is not brute-force. By Assumption 1.1, each horizon $h$ has a different optimal parameter. Therefore, a brute-force algorithm would have a sample complexity with exponential dependency on $H$, unlike our algorithm. Moreover, in the setting where the sparsity $k$ is a constant, which is the main focus of this paper, our algorithm has a polynomial dependency on $d$ and $H$, since in our algorithm we consider a much smaller net that exploits the sparsity structure. For instance, when the optimal Q-function is a $d$-dimensional linear function but can be approximated by a sparse linear function with constant sparsity, the sample complexity and running time of our algorithm would be a low-degree polynomial in $d$ (e.g., quadratic).
The exponential dependency on $k$ is necessary. Previous works (Weisz et al., 2022; Wang et al., 2021) show that even when the optimal Q-function is well-specified, any RL algorithm must have an exponential dependency on $d$ or $H$. Note that this is equivalent to the case where the sparsity $k = d$ and the approximation error $\epsilon = 0$ in our setting. Therefore, in our misspecified setting, which is strictly harder, an exponential dependency on $k$ is unavoidable, unless we can accept an exponential dependency on $H$.
References:
Gellert Weisz, Csaba Szepesvari, and Andras Gyorgy. TensorPlan and the Few Actions Lower Bound for Planning in MDPs under Linear Realizability of Optimal Value Functions. 2022.
Yuanhao Wang, Ruosong Wang, and Sham M. Kakade. An Exponential Lower Bound for Linearly-Realizable MDPs with Constant Suboptimality Gap. 2021
The authors propose an elimination-based algorithm for misspecified RL with sparse linear function approximation. They first show that traditional approaches such as OLIVE work suboptimally in this environment, and that a naive extension of the previous bandit work [Dong & Yang, 2023] fails. After that, the authors establish a lower bound by lifting hardness from the MDP-without-samples setting to RL with samples, using the INDQ problem as an intermediate step. This lower bound matches the upper bound of their algorithm, showing that the algorithm achieves a near-optimal rate.
Strengths
- Clarity & Soundness: Their writing was clear on these points.
- They clearly stated their novelty and the significant differences from the misspecified bandit algorithm proposed by Dong & Yang.
- They demonstrate how they construct the lower bound instance not only at a high level, but with an explicit explanation throughout Section 3 and a nice figure (and also the main difference from Du et al., 2020).
- The algorithm itself is also quite simple - it follows a traditional elimination-style method, and thanks to the detailed explanation, I could easily get a sketch of how the proof goes. The technique also uses only a simple Hoeffding's inequality.
- Significance & Novelty: Even though their results are clearly written and easy to understand, many of them are quite fresh and counter-intuitive. The fact that one cannot simply extend bandit algorithms to this case was quite interesting indeed.
Weaknesses
- Computationally intractable. If I understood correctly, they use an $\epsilon$-net on the candidates for the optimal parameters, starting from all possible $k$-sparse vectors. After that, in each iteration, the algorithm searches for an argmax over all candidates. Also, the algorithm they eventually propose is only $O(H\epsilon)$-optimal.
- Even though they tried to present their arguments transparently, there are some parts that I cannot understand. Please check the 'Questions' below.
Questions
- I don't understand the phrase 'MDP without sample.' Without any trajectory, how could a learning agent learn anything? Do you mean that 'without learning' the error is bounded by $O(H\epsilon)$?
- Also, it is interesting that this algorithm eliminates the 'most promising candidate' first. In bandits, elimination removes all the suboptimal candidates whose upper confidence bound falls below the largest lower confidence bound (eliminate an arm $i$ for which $\max_j \mathrm{LCB}_t(j) > \mathrm{UCB}_t(i)$). Why does this difference arise?
- So what are the optimal sizes of the two parameters? I guess they should also be on the scale of $\epsilon$.
3-1) Also, just to clarify, what does $\epsilon$-optimal mean here; does it mean $V^* - V^{\pi} \le \epsilon$? If so, according to your final result (Lemma 4.4), to obtain error $\epsilon$ one should set the algorithm's parameters as a function of $\epsilon$. This means they are now proportional to $\epsilon$ as well, and the overall sample complexity should also include a dependence on $1/\epsilon$, I believe.
3-2) In that sense, line 438 is very confusing. Why is the per-iteration sample count written that way, with that factor in the denominator? You repeat the sampling across the levels in each iteration; shouldn't it be the per-level batch size multiplied by $H$?
Thank you for your careful review and the positive score! Below, we respond to all your questions:
Computationally intractable: In this paper, we focus on the setting where the sparsity $k$ is much smaller than the feature dimension $d$. In particular, when the sparsity $k$ is a constant, our algorithm has polynomial dependency on $d$ and $H$. This is because we consider a much smaller epsilon-net that exploits the sparsity structure, instead of an epsilon-net for the whole feature space. For instance, when the optimal Q-function is a $d$-dimensional linear function but can be approximated by a sparse linear function with constant sparsity, the sample complexity and running time of our algorithm would be a low-degree polynomial in $d$ (e.g., quadratic).
“MDP without sample”: This setting serves as a warmup for the more complicated problem instance construction in the next section. Moreover, since our hard instance is 1-dimensional, the feature is equal to the optimal Q-value up to an error of $\epsilon$, so it is a very simple setting where an approximate version of the optimal Q-function is given to the algorithm. Our lower bound indicates that, even in this very simple setting, a suboptimality of $\Omega(H\epsilon)$ is the best one can hope for. Therefore, the problem we are studying here is:
- if we are given a perturbed optimal $Q$-function, without further learning, how does the perturbation translate to the suboptimality of the resulting policy in the minimax sense?
Eliminating the most promising candidate: In UCB-based bandit algorithms, arms are eliminated based on the confidence bound of each arm's reward. In our algorithm, we maintain a set of all remaining parameters, select the parameter with the largest value at each iteration, and eliminate it if the Bellman error (Def. 4.1, measuring consistency between parameters of two consecutive horizons) is large. The two algorithms follow completely different frameworks: one maintains a set of remaining arms, and the other maintains a set of possible parameters.
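A schematic sketch of this elimination loop (not the paper's exact Algorithm 1; the candidate sets, value estimates, Bellman-error estimator, and threshold are assumed interfaces for illustration):

```python
def eliminate_most_promising(candidates, estimate_bellman_error, threshold):
    """Schematic sketch of the elimination idea described above.

    `candidates[h]` is a list of (theta, value) pairs for level h, where
    `value` is the estimated value of acting greedily w.r.t. theta;
    `estimate_bellman_error(h, theta)` returns an empirical average Bellman
    error for theta at level h, estimated from freshly collected trajectories.
    """
    H = len(candidates)
    while True:
        eliminated_any = False
        for h in range(H):
            if len(candidates[h]) <= 1:
                continue
            # Pick the currently most promising (largest-value) parameter.
            idx = max(range(len(candidates[h])), key=lambda i: candidates[h][i][1])
            theta, _ = candidates[h][idx]
            # Eliminate it only if it is inconsistent across consecutive levels.
            if estimate_bellman_error(h, theta) > threshold:
                candidates[h].pop(idx)   # at most one elimination per level
                eliminated_any = True
        if not eliminated_any:
            break                        # no surviving candidate is inconsistent
    return candidates
```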
Optimal sizes of the two parameters: The optimal values depend on the user's priorities. Larger values lead to a larger number of iterations (Lemma 4.1) and a larger data requirement (line 448), but they also yield a smaller suboptimality with high probability (lines 459-461). Therefore, there is no universally optimal value. As a rule of thumb, if one wants the final suboptimality to scale with the misspecification error $\epsilon$, then both parameters should be chosen on a corresponding scale in $\epsilon$.
$\epsilon$-optimal: We define $\epsilon$ as the misspecification error (Assumption 1.1). If we instead treat $\epsilon$ as the suboptimality (i.e., $V^* - V^{\pi} \le \epsilon$) and use a separate notation for the misspecification error, then the sample complexity would indeed need to scale polynomially with $1/\epsilon$ as well.
However, as indicated by our lower bounds (Theorems 3.1 and 3.2), we are unable to achieve suboptimality less than $O(H\epsilon)$ (here $\epsilon$ is the misspecification error) unless we take a sample size that is exponential in the improvement factor $T$ (to reach $H\epsilon/T$). We have added a footnote to clarify this.
Line 438: We apologize for the typo. In each iteration of the algorithm, we draw a new batch of trajectories for the estimation at each of the $H$ levels, so the number of samples per iteration is the per-level batch size times $H$.
This paper presents an elimination-based algorithm and a polynomial sample complexity bound for reinforcement learning with sparse linear function approximation. The sample complexity is polynomial in the feature dimension $d$ and planning horizon $H$ when the sparsity level is constant. A lower bound matching up to logarithmic factors is also provided to complement this result. The problem is important and the technical results are solid. All reviewers are supportive of this work; we thus recommend acceptance.
Additional Comments from Reviewer Discussion
All reviewers support the acceptance of this paper.
Accept (Poster)