Kernel-Based Function Approximation for Average Reward Reinforcement Learning: An Optimist No-Regret Algorithm
We propose an optimistic kernel-based RL algorithm for the infinite horizon average reward setting and prove no-regret performance guarantees.
Abstract
Reviews and Discussion
This paper studies learning in MDPs with the long-run average reward objective, assuming that the MDP satisfies some kernel assumptions. A UCB-type of learning algorithm is proposed and is proved to achieve sublinear regret when the eigenvalues of the kernel operators decay at the rate of a p-th degree polynomial with p > 1. The paper also proves a confidence interval bound for kernel ridge regression, which is of independent interest.
Strengths
The result in this paper seems to be strong: it proves a regret bound with a pretty weak MDP assumption (weaker than weakly-communicating), and an RKHS kernel assumption on the structure of the transition kernel, which seems to significantly weaken the tabular and linear assumptions in prior work.
Weaknesses
The use of Bachmann-Landau notation is confusing. Specifically, in Assumption 3, it is unclear which variable is scaling in the o(1) notation. Similarly, in Theorem 2, multiple variables appear in the big-O notation, again making it unclear which variable is scaling. It seems to me that the use of big-O notation is unnecessary, given that most steps in the proofs are non-asymptotic. It would be clearer to state the theorems in the non-asymptotic form, and then state a corollary when certain variables scale.
Details for bounding the maximal information gain $\gamma(\rho; t)$ are missing. The authors mention that this maximum information gain is $O(d \log(t))$ in lines 211-216 in the special case of $d$-dimensional linear kernels, but do not provide references or proofs.
Questions
M is not defined in the equation in line 291. Do you mean the equation holds for any natural number M?
Limitations
A discussion of limitations is included.
Thank you for reviewing our paper. We appreciate your positive feedback on the strength of the results. Here, we address your comments in detail and hope this will enhance your evaluation of the paper.
The use of Bachmann-Landau notation:
As mentioned by the reviewer, all of our proofs are non-asymptotic, and our use of Bachmann-Landau notation, which is common practice in the literature, serves to simplify the presentation. Theorem 1 does not use this notation. We will revise the statement of Assumption 3 to make the scaling variable explicit and to state that the constant involved is independent of it. We will also clarify the statement of Theorem 2 by introducing an absolute constant independent of all parameters. We are ultimately interested in the scaling with $T$ after the remaining parameters are selected optimally with respect to $T$, as given in Remark 2, which we will further clarify.
References for bounds on information gain
Thank you for pointing out the missing references for maximum information gain. The bound for the linear case is given in [22]. Further discussions and tight bounds for kernels with exponential and polynomial eigenvalue decay are provided in [35]. We will ensure these references are properly included.
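For reference, the known bounds of this type, written in the unregularized $\gamma_t$ notation commonly used in [22, 35] (the paper's $\rho$-regularized variant behaves analogously), are:

```latex
% Known bounds on the maximum information gain (notation of [22, 35]):
\gamma_t = O\!\big(d \log t\big) \quad \text{(linear kernel in } \mathbb{R}^d\text{)}, \qquad
\gamma_t = O\!\big((\log t)^{d+1}\big) \quad \text{(SE kernel)}, \qquad
\gamma_t = \tilde{O}\!\big(t^{\frac{d}{2\nu + d}}\big) \quad \text{(Mat\'ern-}\nu\text{ kernel)}.
```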
M is not defined in the equation in line 291. Do you mean the equation holds for any natural number M?
Yes, the statement holds for any natural number $M$. We will clarify this point in the revision.
For further details on the role of this parameter in the analysis, please see our response to Reviewer KBUd. Additionally, a simplified presentation of the confidence interval is given in Remark 1, where $M$ is removed.
We would be happy to provide further clarifications on any of these points.
Thanks for your response. I have one more question: I am a bit confused about whether the condition you just stated is the definition of Assumption 3, for two reasons:
First, $o(1)$ typically means a sequence converges to zero, while the definition you just gave only requires the sequence to be bounded. Did you intend to write $O(1)$ instead of $o(1)$ in Assumption 3?
Second, in the sentence after Assumption 3, you said that Assumption 3 is implied by $p$-polynomial eigendecay, which I do not quite follow. By Definition 1 of [35], $p$-polynomial eigendecay means $\lambda_m \le C m^{-p}$ for some constant $C > 0$, so the quantity in Assumption 3 grows polynomially with $t$ and is not bounded as stated.
That is correct, and thank you for pointing it out. We overlooked this rather straightforward calculation. Here, we reexamine the entire calculation for clarity.
For Remark 1, we select $M$ with respect to $t$ such that the third term in the confidence width becomes only a constant. We have updated Assumption 3 as follows:
Assumption 3. For kernel $k'$, we assume that the maximum information gain $\gamma(\rho; t)$ is bounded by a polynomial in $t$ whose exponent and leading constant are both independent of $t$.
This is a mild assumption. For instance, for the $p$-polynomial eigendecay kernels defined in Definition 1, which cover a large class of common kernels including SE, Matérn, and NT kernels, Assumption 3 is satisfied with an exponent determined by $p$ and a constant determined by the constant in Definition 1.
Under this adjustment, to prove Remark 1, we choose $M$ as an appropriate function of $t$. With this choice, the third term in the confidence width is only a constant. In the second term, the quantity appearing under the square root (in place of the maximum information gain) is bounded, by Assumption 3, up to a kernel-specific constant independent of all other parameters.
We appreciate your feedback and would be grateful if you could please let us know if this adjustment and explanation address your question.
Thanks. I am convinced that this new version of Assumption 3 is implied by $p$-polynomial eigendecay, and the expression of $\beta$ in Remark 1 remains unchanged. However, since Theorem 2 also relies on Assumption 3, I am a bit concerned whether this change in Assumption 3 affects the correctness of Theorem 2. Could you please clarify to what extent Theorem 2 relies on Assumption 3? In particular, is there a more general version of Theorem 2 that does not rely on Assumption 3 and instead has the maximum information gain explicitly in its regret bound?
Also, I notice that on line 527, the expression of $\beta$ used in the proof of Theorem 2 does not exactly match the expression of $\beta$ in Remark 1 (the two expressions differ in their leading terms). Could you please clarify?
Thank you for your response.
Theorem 2 relies on Assumption 3 only through Remark 1. As discussed above, the adjustment to Assumption 3 does not affect the presentation of $\beta$ in Remark 1, nor does it affect Theorem 2.
In Theorem 2, the first factor in the second term inside the big-O notation represents $\beta$. Specifically, it can be replaced by its explicit expression, offering a variation of the theorem in which the maximum information gain appears directly in the regret bound. This term matches the expression of $\beta$ in Line 527, as pointed out by the reviewer.
The norm parameter in Theorem 1 denotes the upper bound on the RKHS norm of the target function. Within the proof of Theorem 2, the target function is the expected value of the proxy value function under the transition kernel. We substitute this norm bound accordingly: the proxy value function is bounded (due to the projection step), and, according to the RKHS norm bound in Lemma 3 of [42], the RKHS norm of the target is controlled by this bound.
This is explained in Lines 530 and 531. Following your comment, we will add more details and reiterate this point in the main text.
Thank you for your engagement in the review process, which has significantly contributed to improving the paper. We appreciate your feedback and would be grateful if you could let us know if our response has clarified your question.
I appreciate the authors' thorough response to my concerns. Based on the rebuttal, I will raise my score to 6.
Thank you again for your thoughtful comments. We appreciate your contribution to the review process.
The authors consider the setting of reinforcement learning in infinite horizon average reward MDPs. In particular, they consider a kernel-based function approximation to represent value functions. Most of the prior work involving kernel regression has been in the context of bandits, where the state cardinality is 1. In reinforcement learning, the complexity increases, since the state space is no longer degenerate and actions influence transitions to future states, so the setting in which kernel approximation is employed is more complicated. Some limited prior work in RL uses kernel functions whose eigenvalues decay exponentially, which the current work relaxes to polynomial decay, thus encapsulating a wider class of approximation architectures. The value functions are computed using kernel regression along with a confidence bound; another contribution of the authors is characterizing the confidence bounds in this context. The value functions are set to zero at the end of a time window and are recursively back-calculated using samples generated up to the previous window. The next set of samples is generated using these new estimates of the value functions. The final regret bounds are characterized as a function of the polynomial decay rate $p$, but are sublinear for all $p > 1$.
Strengths
- A major strength of using kernel functions is the expressibility of the function class with respect to the representation of state-action and state value functions. Most literature in the context of RL relies on linear function approximation (which is a special case of kernel functions) due to ease of analysis. Kernel approximations, on the other hand, capture function classes such as the neural tangent kernel and the Matérn family. Moreover, this work relaxes the exponential decay of eigenvalues utilized in previous work to polynomial decay, which is a significant improvement.
- The final bounds are given in a general form that depends on the polynomial decay rate of the eigenvalues. The authors also remark on how the regret bounds for the linear function approximation case can be derived from this general form. The authors also present the first sublinear regret bounds in the context of polynomially decaying eigenvalues of the kernel function.
- Their construction of confidence bounds may be of independent interest for the design of optimistic algorithms in RL using kernel function approximation.
Weaknesses
- The function approximation representation of the state-action value function at time $t$ involves inversion of a matrix of dimension $t \times t$. This operation is performed at every iteration of the algorithm. As $t$ grows, this computation might be expensive. It is unclear whether there is a shortcut to circumvent this step in the algorithm, which might otherwise make it intractable over large time horizons.
- The value functions are reset to 0 repeatedly after a finite time window. The samples generated up to that time instant manifest only in the confidence bounds. It is not entirely clear what resetting the value function to 0 accomplishes, since it seems data-inefficient.
Questions
- The notion of maximum information gain is unclear. If the information gained is maximized, shouldn't it intuitively result in better regret bounds? However, the larger the value of $\gamma(\rho; t)$, the greater the magnitude of the regret. It would help to characterize what $\gamma(\rho; t)$ represents in the context of RL.
- It is unclear how the resetting and backward calculation of the value function over each window in Algorithm 1 helps with the analysis.
- Won't the projection operator in Equation 5 negate the role of the confidence bounds? It makes sense for the value functions not to exceed the window length, since the rewards are bounded; however, the information from the previous samples seems to be represented solely through the confidence-bound term. Would the performance be better if the projection were onto a larger set? Since there seems to be a tradeoff with respect to the window length in the final bounds, is there an intuitive explanation as to why such a tradeoff exists? That is, what aspect of the analysis gets better as the window grows larger, and what gets worse?
Limitations
The limitations of this work have been explicitly addressed.
Thank you for your detailed review of our paper. We appreciate your positive feedback regarding the generality of our setting and results, as well as the potential independent interest in the confidence bounds. We will address your comments and questions in detail, hoping this will enhance your evaluation of the paper.
Computational complexity:
Kernel-based models provide powerful and versatile nonlinear function approximations and lend themselves to theoretical analysis. It is, however, well-known that kernel ridge regression has a relatively high computational complexity, primarily due to the matrix inversion step. This challenge is not unique to our work and is common across the kernel-based supervised learning, bandit, and RL literature. Fortunately, sparse approximation methods such as Sparse Variational Gaussian Processes (SVGP) or the Nyström method significantly reduce the computational complexity (to as low as linear in some cases), while maintaining the kernel-based confidence intervals and, consequently, the eventual rates; e.g., see Vakili et al. ICML'22 and references therein. These results are, however, generally applicable and not specific to our problem. We thus preferred to avoid further complicating the notation and presentation of an already notation-heavy RL setting, in order to focus on our main results and the contributions specific to this paper.
``Improved Convergence Rates for Sparse Approximation Methods in Kernel-Based Learning'', Vakili, Scarlett, Shiu, Bernacchia, ICML 2022.
Additionally, as a minor point, the matrix inversion step is required only once per window, rather than at every step, which further improves the computational complexity. We will add these discussions to the revision.
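For concreteness, the following is a minimal sketch (not taken from the paper) of how a Nyström-type subset-of-regressors approximation replaces the $t \times t$ inversion in kernel ridge regression with an $m \times m$ solve; the RBF kernel, the synthetic data, and all function names are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dist / (2 * lengthscale ** 2))

def nystrom_krr_predict(X, y, X_test, m=64, rho=0.1, seed=0):
    """Kernel ridge regression with a Nystrom (subset-of-regressors) approximation.

    Instead of inverting the t x t matrix (K + rho I), we solve an m x m system
    built from m randomly selected inducing points, which is the standard way to
    reduce the cubic cost of exact kernel ridge regression.
    """
    rng = np.random.default_rng(seed)
    t = X.shape[0]
    idx = rng.choice(t, size=min(m, t), replace=False)
    Z = X[idx]                       # inducing points
    K_tz = rbf_kernel(X, Z)          # t x m
    K_zz = rbf_kernel(Z, Z)          # m x m
    K_sz = rbf_kernel(X_test, Z)     # n_test x m
    # Subset-of-regressors predictive mean: k_{*z} (K_zt K_tz + rho K_zz)^{-1} K_zt y
    A = K_tz.T @ K_tz + rho * K_zz + 1e-8 * np.eye(len(idx))
    alpha = np.linalg.solve(A, K_tz.T @ y)
    return K_sz @ alpha

# Illustrative usage on synthetic data.
rng = np.random.default_rng(1)
X = rng.random((500, 2))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(500)
X_test = rng.random((5, 2))
print(nystrom_krr_predict(X, y, X_test))
```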
- Maximum information gain $\gamma(\rho; t)$:
This is a kernel-specific quantity that captures the complexity of kernel-based learning. It is a common measure in the literature on kernel-based bandits [22, 23, 24, 36, 27] and RL [12, 25, 26]. As defined in Section 2.5, the value of $\gamma(\rho; t)$ depends only on the kernel and the free parameter $\rho$. The algorithm itself does not aim to maximize information gain; rather, it is an algorithm-independent and kernel-specific complexity term that appears in the regret bound. Intuitively, $\gamma(\rho; t)$ captures the complexity of the function class. Under a linear assumption, $\gamma(\rho; t) = O(d \log t)$. For kernels with exponentially decaying eigenvalues it grows only polylogarithmically in $t$, and for kernels with polynomially decaying eigenvalues, such as the Matérn family and Neural Tangent kernels, the bound on $\gamma(\rho; t)$ is given in Section 2.5. For proofs of these bounds, see [22, 35].
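For reference, a standard form of this quantity in the kernel bandit literature (e.g., [22]) is the following log-determinant expression; the exact normalization and the role of $\rho$ in Section 2.5 of the paper may differ slightly.

```latex
% Maximum information gain over t observations with regularization parameter \rho,
% in the standard log-determinant form of the kernel bandit literature:
\gamma(\rho; t) \;=\; \sup_{\{z_1, \dots, z_t\} \subset \mathcal{Z}}
\; \frac{1}{2} \log \det\!\left( I_t + \rho^{-1} K_t \right),
\qquad [K_t]_{i,j} = k'(z_i, z_j).
```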
- Resetting and backward calculation of the value function over a window of fixed size:
Here, we address this question along with the related observations from the "Weaknesses" section and the subsequent discussion, from the next question, on the tradeoff in choosing the window size. We understand the reviewer's comments to be highlighting two specific aspects of our algorithm and analysis.
First aspect: For kernel ridge regression at a given step, we only use the observations collected up to the end of the previous window, rather than all samples gathered so far. This might seem data-inefficient:
We address this in our analysis. After Theorem 2, which presents the performance of the algorithm, it is noted: "Algorithm 1 updates the observation set only once per window, requiring us to characterize and bound the effect of this delay in the proof. A straightforward application of the elliptical potential lemma results in loose bounds that do not guarantee no-regret. We establish a tighter bound on this term, contributing to the improved regret bounds." In particular, we show that this delay in updating the observation set does not change the eventual rates. The proof, based on Lemma 4, is detailed in lines 555-565 in the appendix. The key idea in the proof is to bound the ratio between a delayed uncertainty estimate and a non-delayed one, followed by some arithmetic manipulation.
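For context, the "straightforward application" referred to above uses the standard elliptical potential (information-gain) bound on the sum of posterior variances; a common form of it, stated here under the assumption $k'(z, z) \le 1$ and not taken verbatim from the paper, is:

```latex
% Standard elliptical potential bound (assuming k'(z,z) <= 1 and regularization \rho):
\sum_{t=1}^{T} \sigma_{t-1}^{2}(z_t) \;\le\; \frac{2}{\log\left(1 + \rho^{-1}\right)}\, \gamma(\rho; T),
% where \sigma_{t-1}^2 denotes the kernel ridge regression posterior variance after t-1 samples.
% With per-window (delayed) updates of the observation set, naively applying this bound
% loosens it, which is why the tighter delayed-update argument in Lemma 4 is needed.
```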
Second aspect: Regarding the window's role (i.e., setting the value function to zero at the end of the window, backtracking to compute the proxy value function over the window, and unrolling the policy in a forward direction over the window), there is an apparent tradeoff in choosing the window size:
This tradeoff balances the strength of the value function against the strength of the noise. A longer window is preferred to capture the long-term performance of the policy, while a larger window increases the observation noise affecting the kernel ridge regression. We recall that both the target function and the observation noise are bounded in proportion to the window length. This tradeoff is explicitly seen in the regret bounds. The optimal size of the window, as specified in line 335, results from the optimal interplay between these two factors, as captured in the regret bounds.
We appreciate this discussion and believe that the inclusion of these points will enhance the paper. Please let us know if any further clarifications are needed regarding these aspects.
- The projection operator in Equation 5:
The analysis requires a confidence interval with diminishing width. In other words, we condition the proof on the validity of the confidence intervals, which hold with probability at least $1 - \delta$; hence the phrase "... with probability at least $1 - \delta$ ..." in the statement of Theorem 2. We also show that the regret grows with the sum of the widths of the confidence intervals over time.
Projecting onto the a priori known range of the value function maintains the validity of the confidence intervals. It also does not affect the growth rate of the sum of the widths, given that the uncertainties eventually diminish so that the confidence interval width falls within this range. The projection does, however, significantly simplify the proof by allowing us to use a uniform upper bound on the noise over time, rather than dealing with noise terms that could be large when only a few samples are available.
In summary, although removing the projection does not seem to affect the eventual rates, it can complicate the proof. We also note that projecting the confidence intervals onto an a priori known interval for the target function is a commonly used technique across the RL literature [9, 10, 12, 23, 25, 29] and is not specific to our work.
We appreciate your detailed comments and constructive review. We would be happy to provide any further clarifications or engage in further discussions.
Thank you for your response to my comments. I have read them and am keeping my score.
The paper proposes a kernel-based optimistic algorithm for the average reward setting and corresponding regret analysis. As described, the kernel-based setting is a more general extension of linear structure to an infinite-dimensional linear model in the feature space of a positive definite kernel.
Strengths
This seems to be the first treatment of average reward MDPs in the kernel-based setting. The paper seems to be well-written. Overall, this could be an interesting addition to the literature on average reward RL.
Weaknesses
The technical portion of the main text can be improved if the paper discusses the key steps/challenges of the proof. E.g., Theorem 1 is based on results of prior work [37]; without moving to the appendix, it is impossible to extract the technical novelty in this paper.
Questions
1) Can the results be improved to $\sqrt{T}$ under uniform mixing conditions (which is a stronger assumption)?
Limitations
N/A
Thank you for reviewing our paper. We appreciate your positive feedback.
Following your comment, we will enhance the main text with more detailed technical content. We have provided a proof sketch in the paragraph following Remark 1, which we will expand to further detail the proof of Theorem 1.
Unlike typical settings such as (supervised) kernel ridge regression or kernel-based bandits [22, 23, 24, 36, 37], in the RL setting the confidence intervals are applied to proxy value functions, which vary over time due to the Markovian nature of the temporal dynamics, with each one derived recursively from the previous one. Hence, the confidence interval must be applicable to all functions in a function class. This requirement is captured in the statement of the theorem, which holds uniformly over the function class. To achieve such confidence intervals, we use a novel technique that leverages the Mercer representation of the kernel and decomposes the prediction error into error terms corresponding to each Mercer eigenfunction. We then partition these terms into the first $M$ elements, corresponding to the eigenfunctions with the largest eigenvalues, and the rest. For each of the first $M$ eigenfunctions, we obtain high probability bounds using existing confidence intervals from [24]. Summing over these $M$ terms, and using a bound based on the uncertainty estimates, we obtain the high probability bound corresponding to the second term in the confidence width $\beta$; we bound the remaining elements based on the eigendecay, corresponding to the third term in $\beta$.
In summary, the key technical novelties in deriving the concentration bound involve leveraging the Mercer decomposition of the kernel and the error term, partitioning the decomposition into the first $M$ elements and the rest, bounding each of the first $M$ elements using existing kernel-based confidence intervals, bounding the rest based on the eigendecay, and then summing over all elements to derive the confidence interval.
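A schematic version of this decomposition, written with generic notation that does not necessarily match the paper's, is:

```latex
% Mercer expansion of the kernel: k'(z, z') = \sum_{m \ge 1} \lambda_m \phi_m(z) \phi_m(z').
% The prediction error for a target f is split along the eigenfunctions:
f(z) - \hat{f}_t(z)
  \;=\; \underbrace{\sum_{m \le M} \big( f_m(z) - \hat{f}_{t,m}(z) \big)}_{\text{first } M \text{ eigenfunctions: existing confidence bounds from [24]}}
  \;+\; \underbrace{\sum_{m > M} \big( f_m(z) - \hat{f}_{t,m}(z) \big)}_{\text{tail: controlled by the eigendecay}} .
% Summing the M per-eigenfunction bounds yields the second term of the confidence
% width, and the tail yields the third term.
```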
We emphasize that this is a substantial result that can be applied across various RL problems.
1) Can the results be improved to $\sqrt{T}$ under uniform mixing conditions (which is a stronger assumption)?
As noted in the introduction, [20] achieved a regret bound of order $\sqrt{T}$ under the linear bias function assumption, with a dependence on the smallest eigenvalue of the policy-weighted covariance matrix. While assuming a strictly positive smallest eigenvalue (independent of $T$) for the covariance matrix is reasonable in a finite-dimensional setting, it becomes unrealistic in the kernel setting, where the feature space can be infinite-dimensional. This presents significant challenges in adapting the existing results to the kernel setting. It is not clear whether tighter results can be achieved under uniform mixing conditions, necessitating further research.
Please let us know if any further clarifications on these points are required.
Thanks for your detailed response. I am broadly satisfied by the rebuttal.
This paper proposes a kernel-based optimistic algorithm for average reward RL in the infinite horizon setting. The authors provide new no-regret performance guarantees under kernel-based modeling assumptions, extending previous work to the more complex RL context. The reviewers highlight both strengths and weaknesses of the paper. They praise its technical soundness and contribution, recognize its novelty in the average reward MDP context, and emphasize the importance of the kernel-based approach. However, some concerns are raised regarding computational complexity, the effectiveness of the value function reset, and the role of the projection operator in the analysis. The authors address all these issues.
Overall, the paper presents a solid contribution to the field of average reward RL by extending kernel-based methods and achieving sublinear regret bounds. The reviews acknowledge the novelty and theoretical strength of the work while raising valid concerns about computational complexity and specific aspects of the analysis. The authors demonstrate a comprehensive understanding of the issues and provide convincing responses. This paper warrants acceptance, provided the authors address the reviewers' concerns and further clarify the details of their analysis.