A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation
Reviews and Discussion
This paper considers reinforcement learning with low switching cost. The authors design a new algorithm named MQL-UCB for RL with general function approximation. The key algorithmic design includes a general deterministic policy-switching strategy that achieves low switching cost, a monotonic value function structure with carefully controlled function class complexity, and a variance-weighted regression scheme that exploits historical trajectories with high data efficiency. MQL-UCB achieves minimax optimal regret and a near optimal switching cost.
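Schematically, and in my own notation rather than the paper's, the variance-weighted regression at the core of this family of algorithms solves, at each stage h,

```latex
% Variance-weighted least-squares regression (schematic sketch; notation assumed, not taken from the paper)
\hat{f}_h \in \operatorname*{argmin}_{f \in \mathcal{F}_h}
  \sum_{k} \frac{1}{\bar{\sigma}_{h,k}^{2}}
  \Bigl( f(s_h^k, a_h^k) - r_h^k - \hat{V}_{h+1}(s_{h+1}^k) \Bigr)^{2},
```

where \bar{\sigma}_{h,k} is an upper bound on the estimated variance of the regression target in episode k.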
Strengths
- The problem of reinforcement learning with low switching cost and general function approximation is interesting.
- The paper is solid. The proof looks correct to me.
- The bounds are strong. The switching cost bound is optimal according to the lower bound.
- The presentation is clear in general.
Weaknesses
- Some of the assumptions look very strong. The completeness of all functions, the second-order completeness, and the existence of a bonus oracle are not standard to my knowledge. It seems that the weighted regression is possible only if these assumptions hold.
- The paper [1] also studies low switching reinforcement learning with general function approximation. It seems that the setting in [1] is slightly more general (please correct me if there is a misunderstanding) and their algorithm is cleaner.
[1] Xiong et al. A general framework for sequential decision-making under adaptivity constraints.
- The algorithm itself is very complicated and whether it can be implemented in practice is unclear. Since low switching RL is a very practical problem, some experiments (even under the simplified linear MDP setting) showing the performance of the algorithm would be helpful. I am curious how this algorithm performs compared to the standard approach: LSVI-UCB with a doubling trick under linear MDPs.
Questions
See the weakness above.
Limitations
Yes.
Response to Reviewer ZRKX
Thank you for your insightful comments and suggestions! We answer your questions as follows.
Q1 Some of the assumptions look very strong. The completeness of all functions, the second-order completeness, and the existence of a bonus oracle are not standard to my knowledge. It seems that the weighted regression is possible only if these assumptions hold.
A1 The completeness assumption on the second moment, introduced by [1], is crucial for achieving tighter regret bounds in reinforcement learning (RL) with general function approximation. It allows the algorithm to leverage the variance of the value function at the next state, which is essential for obtaining minimax-optimal regret bounds in various RL settings, ranging from tabular Markov Decision Processes (MDPs), as shown by [2], to more complex scenarios such as linear mixture MDPs, as demonstrated by [3], and linear MDPs, as discussed by [4]. Given that the only previous work [1] to achieve the optimal regret bound also relies on this assumption, we believe it does not lessen the importance of our contribution. Additionally, while GOLF [9] only requires the standard completeness assumption, both [1] and our work compute a series of optimistic value functions for a tractable planning phase. This requires including the optimistic/pessimistic value functions in the function classes, making the standard completeness assumption alone insufficient from an algorithmic perspective.
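For concreteness, one common way to formalize the two completeness conditions (the notation below is schematic and may differ from the exact statement in the paper) is:

```latex
% Schematic completeness and second-moment completeness: for every value function V : S -> [0, H] and every stage h,
\mathcal{T}_h V \in \mathcal{F}_h
\quad\text{and}\quad
\mathcal{T}^{(2)}_h V \in \mathcal{F}_h,
\quad\text{where}\quad
[\mathcal{T}_h V](s,a) := r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\bigl[V(s')\bigr],
\qquad
[\mathcal{T}^{(2)}_h V](s,a) := \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\bigl[V(s')^2\bigr].
```

The second condition is what allows the variance of the next-state value function to be estimated within the function class, which is exactly what the weighted regression needs.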
As for the bonus oracle, when the function class is convex, [6, 7] have shown that the supremum in the bonus oracle can be evaluated efficiently and accurately with a binary-search-based algorithm. Empirically, [8] approximated the uncertainty by computing the standard deviation of an ensemble of networks and showed experimental results supporting the effectiveness of such an uncertainty-weighting technique in offline RL.
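As a schematic illustration (not the exact definition used in the paper), the bonus oracle returns an upper bound on the width of the data-dependent confidence set at a given state-action pair:

```latex
% Schematic bonus oracle (notation assumed): width of the confidence set induced by the weighted data
b_h(s,a) \;\approx\; \sup_{f_1, f_2 \in \mathcal{F}_h,\ \|f_1 - f_2\|_{\mathcal{D}_h} \le \beta}
  \bigl| f_1(s,a) - f_2(s,a) \bigr|,
```

where \|\cdot\|_{\mathcal{D}_h} denotes a (weighted) empirical norm over the collected data and \beta is a confidence radius. When the function class is convex, this supremum is the quantity that [6, 7] evaluate via binary search.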
Q2 The paper [5] also studies low switching reinforcement learning with general function approximation. It seems that the setting in [5] is slightly more general (please correct me if there is a misunderstanding) and their algorithm is cleaner.
A2 Both their setting and ours encompass MDPs with bounded eluder dimension as special cases. Upon closer examination, [5] is slightly more general because the dimension in our Definition 2.4 is equivalent to (3.2) in [5] after taking the supremum of each term. Compared to [5], our approach is oracle-efficient, assuming an oracle is available to compute the bonus for UCB-based exploration. As discussed in Section 1, [5] addresses low-switching RL with general function approximation and achieves a switching cost of the same order. However, the algorithm in [5] features an intractable planning phase and is not statistically optimal. Additionally, checking the policy-switching condition in [5] requires recomputing the cumulative loss functions at every episode, which becomes inefficient when the number of episodes is large.
Q3 The algorithm itself is very complicated and whether it can be implemented in practice is unclear. Since low switching RL is a very practical problem, some experiments (even under the simplified linear MDP setting) showing the performance of the algorithm would be helpful. I am curious how this algorithm performs compared to the standard approach: LSVI-UCB with a doubling trick under linear MDPs.
A3 Since our algorithm provides a general framework for solving MDPs with general function classes, the computational complexity of our method depends heavily on the efficiency of the regression subroutine and the bonus oracle implementation. In fact, our proposed algorithm reduces to LSVI-UCB++ in the linear MDP case, which is a computationally efficient algorithm that shares the same computational complexity as LSVI-UCB with a doubling trick. Additionally, we would like to point out that it is a desirable property for an RL algorithm designed for general function classes to reduce to a more efficient algorithm when specialized to simpler settings.
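To illustrate the reduction, in the linear MDP case the weighted regression step becomes the weighted ridge regression used by LSVI-UCB++ (schematic notation, with \phi the feature map and \lambda a regularization parameter):

```latex
% Weighted ridge regression in the linear special case (schematic)
\Sigma_h := \lambda I + \sum_{k} \bar{\sigma}_{h,k}^{-2}\, \phi(s_h^k, a_h^k)\, \phi(s_h^k, a_h^k)^{\top},
\qquad
\hat{w}_h := \Sigma_h^{-1} \sum_{k} \bar{\sigma}_{h,k}^{-2}\, \phi(s_h^k, a_h^k)\, \bigl( r_h^k + \hat{V}_{h+1}(s_{h+1}^k) \bigr),
```

and the bonus oracle reduces to the elliptical bonus \hat{\beta}\,\|\phi(s,a)\|_{\Sigma_h^{-1}}, so the per-episode computation is of the same order as that of LSVI-UCB with a doubling trick.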
[1] A. Agarwal, Y. Jin, T. Zhang. VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation. Conference on Learning Theory, 2023.
[2] M. G. Azar, I. Osband, R. Munos. Minimax Regret Bounds for Reinforcement Learning. International Conference on Machine Learning, 2017.
[3] D. Zhou, Q. Gu, C. Szepesvari. Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes. Conference on Learning Theory, 2021.
[4] J. He, H. Zhao, D. Zhou, Q. Gu. Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision Processes. International Conference on Machine Learning, 2023.
[5] N. Xiong, Z. Yang, Z. Wang. A General Framework for Sequential Decision-Making under Adaptivity Constraints.
[6] D. Kong, R. Salakhutdinov, R. Wang, L. F. Yang. Online Sub-Sampling for Reinforcement Learning with General Function Approximation. arXiv preprint arXiv:2106.07203, 2021.
[7] Q. Di, H. Zhao, J. He, Q. Gu. Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning. ICLR, 2024.
[8] C. Ye, R. Yang, Q. Gu, T. Zhang. Corruption-Robust Offline Reinforcement Learning with General Function Approximation. Advances in Neural Information Processing Systems, 2024.
[9] C. Jin, Q. Liu, S. Miryoosefi. Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms. Advances in Neural Information Processing Systems, 2021.
Thanks for your detailed response. I will keep my score.
Thank you for your reply. If you have any additional questions or concerns, please let us know. We are happy to address them. Otherwise, we would greatly appreciate it if you could consider adjusting your rating.
This paper studies reinforcement learning under general function approximation. It aims to design an algorithm that achieves optimal regret while maintaining low switching cost.
In the algorithm, they first compute the empirical value function using weighted ERM, where the weights are chosen according to the variance of the estimator. Next, they use optimistic and pessimistic estimates of the Q-function. The algorithm updates its policy only when the existing dataset does not provide enough coverage.
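In schematic form (my notation, not the paper's), the optimistic and pessimistic Q-estimates are built from the regression solution \hat{f}_h and the bonus b_h as

```latex
% Schematic optimistic / pessimistic Q-estimates (notation assumed)
\overline{Q}_h(s,a) = \min\bigl\{ \hat{f}_h(s,a) + b_h(s,a),\; H \bigr\},
\qquad
\underline{Q}_h(s,a) = \max\bigl\{ \hat{f}_h(s,a) - b_h(s,a),\; 0 \bigr\}.
```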
This paper extends the technique of constructing monotonic value functions (originally from He et al. (2022)) to the general function approximation setting.
This paper achieves the optimal rate (matching the lower bound) in both the regret and the number of switches.
Strengths
This paper studies MDPs with general function approximation, which is more general than the common linear MDP and tabular MDP settings.
Assumptions made in this paper include completeness and finite eluder dimension, which are common in the related literature as well.
From the regret and switching cost perspective, this paper matches the optimal rate for both terms. This paper seems to be the first to achieve this goal.
Compared to Agarwal et al. (2022), the algorithm and analysis are similar to theirs, but this paper has the following advantages: (1) The policies output by this paper are Markovian, while Agarwal et al. (2022) output non-Markovian policies. (2) The switching cost is better than that of Agarwal et al. (2022) by a factor of d_elu.
Weaknesses
The assumptions made on completeness and eluder dimension are somewhat stronger than the usual assumptions in the related literature. Specifically, the completeness assumption requires closedness under the Bellman operator for any value function, and the eluder dimension definition requires an additional parameter. This completeness assumption is difficult to verify unless strong assumptions are placed on the dynamics, e.g., linear structure or linear mixture MDPs.
Compared to the related work Agarwal et al. (2022), the techniques do not show significant improvement. The algorithmic idea is very similar to that of Agarwal et al. (2022). Specifically, both the variance estimation technique and the monotonicity of value functions are already used in Agarwal et al. (2022).
Questions
I have the following questions:
- Regarding the completeness assumption, I am curious whether there is a lower bound showing that the normal completeness assumption is not enough?
- In the algorithm, calculating the bonus requires solving an optimization problem over the function class, which is not computationally tractable due to the non-convexity of the function class. Is there any way to bypass this?
Limitations
Yes. The authors addressed all the limitations listed in the guidelines.
Response to Reviewer Zbvr
Thank you for your positive feedback! We address your questions point-by-point.
Q1 The assumptions made on completeness and eluder dimension are somewhat stronger than the usual assumptions in the related literature. Is there a lower bound showing that the normal completeness assumption is not enough?
A1 To our knowledge, there is no lower bound proving that the normal completeness assumption is insufficient for achieving a minimax-optimal regret bound. However, all existing works that achieve minimax-optimal regret for linear MDPs or MDPs with general function approximation require estimating the variance of the next-state expected return, which necessitates a slightly stronger completeness assumption on the second-order moment. Additionally, while GOLF [1] only requires the standard completeness assumption, both [2] and our work compute a series of optimistic value functions for a tractable planning phase. This requires including the optimistic/pessimistic value functions in the function classes, making the standard completeness assumption alone insufficient from an algorithmic perspective.
Q2 In the algorithm, calculating the bonus requires solving an optimization problem over the function class, which is not computationally tractable due to the non-convexity of the function class. Is there any way to bypass this?
A2 Thanks for your insightful question! We do not require the function class to be convex in general. However, when the function class is convex, [2, 3] have shown that the supremum in the bonus oracle can be evaluated efficiently and accurately with a binary-search-based algorithm. Empirically, [4] approximated the uncertainty by computing the standard deviation of an ensemble of networks and showed experimental results supporting the effectiveness of such an uncertainty-weighting technique in offline RL.
[1] C. Jin, Q. Liu, S. Miryoosefi. Bellman Eluder Dimension: New Rich Classes of RL Problems, and Sample-Efficient Algorithms. Advances in Neural Information Processing Systems, 2021.
[2] D. Kong, R. Salakhutdinov, R. Wang, L. F. Yang. Online Sub-Sampling for Reinforcement Learning with General Function Approximation. arXiv preprint arXiv:2106.07203, 2021.
[3] Q. Di, H. Zhao, J. He, Q. Gu. Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning. ICLR, 2024.
[4] C. Ye, R. Yang, Q. Gu, T. Zhang. Corruption-Robust Offline Reinforcement Learning with General Function Approximation. Advances in Neural Information Processing Systems, 2024.
Thank you very much for your response. I do not have further questions.
Thank you for your support!
The paper studies reinforcement learning with general function approximation and proposes a near-optimal algorithm with a small number of policy switches.
Strengths
The paper is very well-written, and the main results of the paper are of high quality. The authors improved prior works on RL with general function approximation to obtain an algorithm that achieves both near-optimal regret and the lowest possible switching cost when the number of episodes is large. Besides, the proposed algorithm is intuitive and clean, and the proofs are well-written and easy to follow.
Weaknesses
It would be helpful to provide a bit more detail, for readers (like me) who are not very familiar with the literature on RL with policy switching cost, on how the switch counts in Table 1 are obtained. For example, [1] doesn't seem to optimize the number of policy switches (it only tries to optimize the regret); since the paper directly improves upon [1], it would be helpful to provide a more detailed discussion on why [1]'s algorithm needs its stated number of switches and why the authors' algorithm can strictly improve the number of switches without worsening the regret.
[1] Agarwal, Alekh, Yujia Jin, and Tong Zhang. "VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation." The Thirty-Sixth Annual Conference on Learning Theory. PMLR, 2023.
Questions
You have proposed an algorithm that achieves optimal regret and optimal switching cost simultaneously under the condition that the number of episodes is large. Also, the lower bound from Theorem B.1 suggests that a certain switching cost is unavoidable when one wants to achieve sublinear regret. Obviously, the current results will not make sense if the number of episodes is relatively small compared to the episode length. I am thus wondering: do you have any insights (what can be done and what cannot be done) in this different regime?
Limitations
There is no obvious limitation in this paper.
Response to Reviewer 6wYy
Thank you so much for your strong support! We address your questions as follows.
Q1 It would be helpful to provide a bit more detail, for readers who are not very familiar with the literature on RL with policy switching cost, on how the switch counts in Table 1 are obtained. For example, [1] doesn't seem to optimize the number of policy switches (it only tries to optimize the regret).
A1 Thanks for your suggestion! In [1], to compute the bonus function, the authors slightly generalize the method proposed by [2] (see Algorithm 2 in [1]). According to Theorem 1 in [2], the utilization of this online subsampling subroutine automatically leads to the switching cost guarantee reported in the table. We will add an explanation of the results shown in the table in the revision. Specifically, [2] employs online subsampling techniques, which maintain a small 'core' subset of the historical data, and the policy is switched only when this subset is updated. In contrast, our algorithm applies a novel uncertainty-based policy-switching strategy, which directly controls the cumulative sensitivity of the historical data points.
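As a concrete but purely illustrative example of such a criterion: in the linear special case, uncertainty-based switching rules typically reduce to a determinant-doubling test on the (weighted) covariance matrix. A minimal sketch, where the doubling threshold is an assumed choice rather than the paper's exact constant, is:

```python
import numpy as np

def should_switch(cov_current: np.ndarray, cov_last_switch: np.ndarray) -> bool:
    """Determinant-doubling test (illustrative sketch for the linear special case).

    Switch the policy once the log-determinant of the current weighted covariance
    matrix exceeds the one recorded at the last switch by log(2), i.e. once the
    accumulated information has roughly doubled. The factor 2 is illustrative.
    """
    _, logdet_current = np.linalg.slogdet(cov_current)
    _, logdet_last = np.linalg.slogdet(cov_last_switch)
    return logdet_current - logdet_last >= np.log(2.0)
```

In the general function approximation setting, the determinant is replaced by the cumulative sensitivity of the historical data points, but the spirit of the criterion is the same.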
Q2 You have proposed an algorithm that achieves optimal regret and optimal switching cost simultaneously under the condition that the number of episodes is large. Also, the lower bound from Theorem B.1 suggests that a certain switching cost is unavoidable when one wants to achieve sublinear regret. Obviously, the current results will not make sense if the number of episodes is relatively small compared to the length of the episode. I am thus wondering: do you have any insights (what can be done and what cannot be done) in this different regime?
A2 Thanks so much for your insightful question!
- For the regret bound, when restricted to linear MDPs, our regret bound still suffers from a lower-order term that is independent of the number of episodes, which becomes the leading term when the number of episodes is relatively small. If we aim to achieve better results in this regime, it is crucial to improve this term in the current regret upper bound. Essentially, this term results from the inaccuracy of the variance estimators. It is still an open problem whether we can achieve optimal regret for linear MDPs even when the number of episodes is relatively small. To tackle this issue, we may consider (1) more accurate estimation of the variance term (e.g., [3] for linear mixture MDPs) or (2) variance-aware confidence sets which do not require additional knowledge of the value of the variance [4, 5, 6].
- For the switching cost, our lower bound also holds when the number of episodes is small. For example, in the small-episode regime, our lower bound can be interpreted as saying that to achieve sublinear regret, the switching cost must exceed a certain threshold. Since such a high switching cost is not achievable, our lower bound result implicitly implies that sublinear worst-case regret is not possible in this regime.
[1] A. Agarwal, Y. Jin, T. Zhang. VOQL: Towards Optimal Regret in Model-free RL with Nonlinear Function Approximation. Conference on Learning Theory, 2023.
[2] D. Kong, R. Salakhutdinov, R. Wang, L. F. Yang. Online Sub-Sampling for Reinforcement Learning with General Function Approximation. arXiv preprint arXiv:2106.07203, 2021.
[3] D. Zhou, Q. Gu. Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs. Advances in Neural Information Processing Systems, 2022.
[4] Z. Zhang, J. Yang, X. Ji, S. S. Du. Improved Variance-Aware Confidence Sets for Linear Bandits and Linear Mixture MDP. Advances in Neural Information Processing Systems, 2021.
[5] H. Zhao, J. He, D. Zhou, T. Zhang, Q. Gu. Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency. Conference on Learning Theory, 2023.
[6] Z. Zhang, J. D. Lee, Y. Chen, S. S. Du. Horizon-Free Regret for Linear Markov Decision Processes. ICLR, 2024.
Thank you for the detailed explanations.
I recommend acceptance for this paper. There is a general consensus among the reviewers that the contribution is technically sound and it advances the theoretical state-of-the-art in online reinforcement learning. The authors clarified in the rebuttal the assumptions made and their relationship with prior work. Similarly, the authors addressed the concerns related to the "practicality" of the algorithm. While theoretical algorithms need to rely on "black-box" optimization routines, the overall structure of the algorithm seems rather simple and its different elements are well justified in terms of their theoretical impact.
As the authors work on a camera-ready version of the paper, I strongly encourage them to incorporate part of this discussion into the paper, notably:
- Elaborate more on the assumptions, their connection with existing literature, and which parts of the theory they enable.
- Expand the comparison with related works, in particular the similarities and the novelty of the algorithmic and technical components.