Horizon-free Reinforcement Learning in Adversarial Linear Mixture MDPs
Abstract
Reviews and Discussion
This paper studies horizon-free RL in adversarial linear mixture MDPs with full-information feedback. It proposes an algorithm that employs a variance-aware weighted least squares estimator for the transition kernel and an occupancy measure-based method for the online search of a stochastic policy. The authors show that the algorithm achieves a regret with only polylogarithmic dependence on the planning horizon $H$. Further, the paper provides a lower bound showing an inevitable polylogarithmic dependence on the number of states $S$.
Strengths
- The paper is the first work that studies near-optimal horizon-free RL algorithms under adversarial reward and linear function approximation. This progress deserves to be known to the community.
- The connection between the value function derived from occupancy-measure-guided policy updating and the one derived from backward iteration (Lemma 6.1) is new as far as I know, and may inspire other studies on RL problems.
- The paper is clearly written and well-organized. The proofs appear technically sound, though I did not check them in detail.
Weaknesses
- The novelty of this paper may be limited. Most of the analysis follows from that of horizon-free reinforcement learning for linear mixture MDPs with stochastic rewards (Zhou and Gu, 2022).
- The occupancy measure-based algorithm is not computationally efficient, as the running time has polynomial dependence on the state number $S$ and action number $A$.
Questions
N/A
Thank you for your supportive feedback. We answer your questions as follows:
Q1: The novelty of this paper may be limited. Most of the analysis follows from that of horizon-free reinforcement learning for linear mixture MDPs with stochastic rewards [1].
A1: We agree that our analysis framework follows [1]. However, compared to [1] and other closely related work [2][3], our paper introduces several new techniques. First, compared to [2][3], we use occupancy measures, rather than policy optimization, to update the policy, making our work the first to apply occupancy measures to this setting. Second, compared to [1], we adopt stochastic policies, which inevitably introduces a new policy noise term. Bounding this term directly with the Azuma-Hoeffding inequality introduces an unwanted polynomial dependency on $H$. To tackle this issue, we introduce a notion of conditional variance into the analysis of Algorithm 3 (HOME) to bound the policy noise recursively and remove the polynomial dependency on $H$. This is also a new technique that does not appear in the previous analysis [1]. Third, we also propose a novel lower bound showing that horizon-free regret can only be achieved under a finite-state assumption.
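To illustrate why the conditional variance matters, consider a generic martingale difference sequence $\{X_k\}_{k=1}^K$ with $|X_k| \le c$ (placeholder notation, not the paper's exact terms). The Azuma-Hoeffding inequality only gives, with probability at least $1-\delta$,
$$\Big|\sum_{k=1}^K X_k\Big| \;\le\; c\sqrt{2K\log(2/\delta)},$$
whose leading term scales with the worst-case magnitude $c$, whereas a Freedman-type (variance-aware) bound of the form
$$\Big|\sum_{k=1}^K X_k\Big| \;=\; O\Big(\sqrt{\textstyle\sum_{k=1}^K \mathbb{E}\big[X_k^2 \mid \mathcal{F}_{k-1}\big]\,\log(1/\delta)} \;+\; c\log(1/\delta)\Big)$$
moves $c$ into a lower-order term. Bounding the conditional variances recursively (via the law of total variance) is what allows the policy noise to be controlled without paying a polynomial factor of $H$.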
Q2: The occupancy measure-based algorithm is not computationally efficient as the running time has polynomial dependence on the state number S and action number A.
A2: We agree with the reviewer, as we have discussed in Remark E.2. The need to compute all entries of the occupancy measure is a limitation of our algorithm. However, as we have shown in Section 6, using the occupancy measure is the key technique for removing the $H$ dependency in the online mirror descent regret; a schematic form of the update is given below. We will explore more efficient algorithms in future work.
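For concreteness, the update we have in mind takes the standard occupancy-measure mirror descent form (schematic notation; the exact constraint set and transition estimates used in the paper may differ):
$$q^{k+1} \in \operatorname*{arg\,min}_{q \in \Delta(\widehat{P})} \; \eta\,\langle q, -r^k \rangle + D_{\mathrm{KL}}\big(q \,\|\, q^k\big),$$
where $\Delta(\widehat{P})$ denotes the set of occupancy measures consistent with the (estimated) transition dynamics, $\eta$ is a step size, and $r^k$ is the reward revealed in episode $k$. The key point is that the per-episode linear loss satisfies $|\langle q, r^k \rangle| \le 1$ under the normalized-reward assumption, so the mirror descent regret does not pick up an extra factor of $H$.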
References:
[1] Zhou, D., & Gu, Q. (2022). Computationally efficient horizon-free reinforcement learning for linear mixture mdps. arXiv preprint arXiv:2205.11507.
[2] He, J., Zhou, D., & Gu, Q. (2022, May). Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics (pp. 4259-4280). PMLR.
[3] Cai, Q., Yang, Z., Jin, C., & Wang, Z. (2020, November). Provably efficient exploration in policy optimization. In International Conference on Machine Learning (pp. 1283-1294). PMLR.
This paper introduces the first algorithm enjoying horizon-free bounds for adversarial linear mixture MDPs. The algorithm is based on a careful combination of policy update steps, performed with mirror descent in the space of occupancy measures, and an optimistic policy evaluation phase carried out using weighted ridge regression estimators.
An interesting finding is also a separation between the adversarial and non-adversarial case. Indeed, the authors managed to prove an asymptotic lower bound which shows that either a polylogarithmic dependence on the number of states $S$ or a polynomial dependence on the horizon $H$ must be paid in the regret bound, while an $S$-independent horizon-free regret upper bound can be obtained in the non-adversarial case.
Strengths
I think that the algorithm is original and well explained in the main text.
The result, which entails a polylogarithmic dependence on the number of states $S$, would not be very satisfactory in the function approximation setting, but the authors nicely show that either this dependence or a polynomial dependence on the horizon $H$ must be suffered.
I also enjoyed the lower bound construction which considers a tabular deterministic MDP and reduces it to an expert problem.
The proofs look correct to me.
Weaknesses
There are a few clarity omissions and missing definitions in the submission. Hereafter, I list a few of them:
- I think it should be clearer that homogeneous transition dynamics are also required for obtaining horizon-free bounds. Therefore, the Bellman optimality equation on page 3 should not have $h$ in the subscript of the operator $\mathbb{P}$.
- The value function appearing in the regret decomposition in equation (6.1) is never formally defined in the paper, so it is difficult to understand what it denotes. If I understood correctly from the Appendix, each mirror descent iterate induces, via its marginals, a transition kernel and a policy, and this value function denotes the initial value of that policy in the MDP endowed with the corresponding reward function and transition dynamics. Can the authors confirm that this is correct and, if yes, add it to their revision?
- The definition of Regret on page 3 is a bit unclear. Indeed, saying that $V^*$ is the optimal state value function could make the reader believe that the regret being controlled has the maximum inside the sum. However, the regret controlled in the paper has a fixed comparator policy, which might not be optimal for any of the reward functions revealed at each round.
Questions
I think that it is unclear that the quantity defined in Appendix C.2 is decreasing. After inspecting the proofs, I think that what the authors need is that, for any fixed $k$, it is decreasing with respect to $h$. Is this correct?
Thank you for your positive and constructive feedback. We answer your questions as follows:
Q1: I think it should be clearer that homogeneous transition dynamics are also required for obtaining horizon-free bounds. Therefore, the Bellman optimality equation on page 3 should not have $h$ in the subscript of the operator $\mathbb{P}$.
A1: Thank you for your suggestion; we have fixed it in the revision.
Q2: The value function is never formally defined in the paper.
A2: We are sorry for the missing formal definition. We have added an equation to formally define this value function in the revision.
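For the reviewer's convenience, the construction is the standard one from the occupancy-measure literature (schematic notation; the exact definition in the revision may differ, e.g., in how the homogeneous transitions are handled): given an occupancy measure $q^k = \{q^k_h(s,a,s')\}$, the induced policy and transition kernel are
$$\pi^k_h(a \mid s) = \frac{\sum_{s'} q^k_h(s,a,s')}{\sum_{a', s'} q^k_h(s,a',s')}, \qquad \widehat{P}^k_h(s' \mid s,a) = \frac{q^k_h(s,a,s')}{\sum_{s''} q^k_h(s,a,s'')},$$
and the value function in question is the initial value of the induced policy in the MDP with the induced transitions and the revealed reward $r^k$.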
Q3: The definition of Regret on page 3 is a bit unclear.
A3: Thank you for your suggestion and we have revised it to make it clearer.
Q4: I think that it is unclear that the quantity defined in Appendix C.2 is decreasing. After inspecting the proofs, I think that what the authors need is that, for any fixed $k$, it is decreasing with respect to $h$. Is this correct?
A4: Yes, we indeed need that, for any fixed $k$, the quantity is decreasing with respect to $h$. We apologize for the confusion and have made this clear in the revision.
Dear Authors,
Thanks for correcting the issues I mentioned in my review.
I am keeping my positive score and will support the acceptance of your manuscript in the discussion phase.
Have a nice weekend,
Best, Reviewer
We're glad that our response has effectively addressed your concerns and suggestions.
This paper addresses the question of whether the favorable polylogarithmic regret seen in reinforcement learning (RL) with respect to the planning horizon can also be extended to adversarial RL scenarios. The authors introduce the first horizon-independent policy search algorithm, designed to cope with challenges arising from exploration and adversarial reward selection over episodes. The algorithm utilizes a variance-uncertainty-aware weighted least square estimator for the transition kernel and an occupancy measure-based approach for online stochastic policy search.
Strengths
Given my limited expertise in the adversarial RL domain, my evaluation focuses exclusively on the technical soundness and clarity of the paper. The manuscript exhibits a commendable standard of articulation. The framing of the problem, underlying assumptions, and derived outcomes are adequately elucidated. Notably, the inclusion of a proof sketch in Section 6 enhances the paper's comprehensibility, serving as a valuable reference point for those seeking deeper insight into the paper's theoretical foundations.
Weaknesses
The paper makes relatively strong assumptions: linear, finite-state MDPs and full-information feedback. The only novel aspect here is that the paper tackles adversarial reward functions rather than fixed or stochastic rewards. But even so, I think that the full-information feedback assumption greatly alleviates the difficulty of adversarial rewards. To me, the hardness result is more interesting: an unbounded state space will incur regret polynomial in $H$. Is this result novel in the literature?
I am a bit confused about the assumptions on the reward. Firstly, can the rewards be negative? If so, would Assumption 3.1 still make sense? Furthermore, is Assumption 3.1 equivalent to the bounded-rewards assumption, i.e., if rewards are bounded in $[-R, R]$, can we always scale everything by $1/(RH)$ to satisfy Assumption 3.1?
Questions
Please see the questions in the weaknesses section above. I am happy to increase my score if there are any misunderstandings.
Thank you for your supportive feedback. We will answer your questions as follows.
Q1: The paper makes relatively strong assumptions: linear, finite-state MDPs and full-information feedback. The only novel aspect here is that the paper tackles adversarial reward functions rather than fixed or stochastic rewards.
A1: We agree with the reviewer that the assumptions we make (linearity, full-information feedback) are relatively strong. However, even under these relatively strong but standard assumptions [1][2][3], achieving horizon-free regret was still an open problem before our work. Notably, our result is the first horizon-free algorithm in this setting. We agree that the finite state space is an extra assumption compared to previous works. However, it is a necessary assumption for achieving horizon-free regret with adversarial rewards, as we show in the hardness result in Theorem 5.3.
Q2: To me, the hardness result is more interesting: an unbounded state space will incur regret polynomial in $H$. Is this result novel in the literature?
A2: Thank you for your interest! To the best of our knowledge, this result is new and we are the first to show that an infinitely large state space is incompatible with horizon-free regret under adversarial rewards. Therefore, we believe this result is novel and a noteworthy contribution.
Q3: Can the rewards be negative?
A3: Yes, technically, the reward can be negative. To accommodate this, we can rescale the reward by a constant factor and then shift it by a constant so that it lies in the interval required by Assumption 3.1; see the concrete normalization in A4 below.
Q4: Furthermore, is assumption 3.1 equivalent to the bounded rewards assumption, i.e., if rewards are bounded in [−R,R], we can always scale everything by 1/RH to satisfy assumption 3.1.
A4: Yes, scaling the bounded rewards in $[-R, R]$ by a constant factor such as $1/(RH)$ can make them satisfy our assumption. We would like to bring to the reviewer's attention that the dependency on $H$ in the regret bound of RL algorithms has two sources: 1) the length of the planning horizon, and 2) the scale of the rewards. To offset the dependency on $H$ contributed by the scale of the reward, we scale down the reward so that the total reward in each episode is bounded by one, as suggested by [4].
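As one concrete normalization (illustrative; other affine maps work as well): if the raw reward satisfies $r \in [-R, R]$ at every step, then
$$\tilde r \;=\; \frac{r + R}{2RH} \;\in\; \Big[0, \tfrac{1}{H}\Big],$$
so the per-step rewards are nonnegative and the total reward in each episode lies in $[0, 1]$, which is the type of normalization suggested by [4].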
References:
[1] Zhou, D., & Gu, Q. (2022). Computationally efficient horizon-free reinforcement learning for linear mixture mdps. arXiv preprint arXiv:2205.11507.
[2] Cai, Q., Yang, Z., Jin, C., & Wang, Z. (2020, November). Provably efficient exploration in policy optimization. In International Conference on Machine Learning (pp. 1283-1294). PMLR.
[3] He, J., Zhou, D., & Gu, Q. (2022, May). Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics (pp. 4259-4280). PMLR.
[4] Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pp. 3395–3398. PMLR, 2018.
Dear authors,
Thank you for your response. I would like to keep my original score and recommend acceptance of the paper.
Warmest regards,
Reviewer.
Thank you for your support! We're glad that our response has addressed all your questions.
This paper studies the online learning problem of horizon-free RL in linear mixture Markov Decision Processes (MDPs). To the best of my knowledge, this is the first paper that achieves theoretical guarantees with adversarial losses, that is, when the loss function can change arbitrarily from episode to episode. To achieve this result, the authors propose two main techniques: (1) a variance-uncertainty-aware weighted least squares estimator and (2) an occupancy measure-based approach for constructing policies. The first technique is widely used for linear mixture MDPs, while the second one is mainly used for adversarial losses. Combining these two techniques to establish valid regret guarantees is quite challenging. More importantly, the final regret bound is of near-optimal order.
Strengths
- The idea of combining the two techniques is very interesting. It would be great to see such a combination applied in other (more general) linear MDP settings.
- Though I only skimmed the proofs of several lemmas, the results seem to be rigorously proved and mathematically correct.
Weaknesses
This paper does not have any specific weaknesses.
Questions
1. Is it possible to design policy optimization algorithms for this problem setting?
2. Is it possible to avoid the usage of occupancy measures (which are not quite efficient in the real world)?
Thank you for your positive feedback. We answer your questions as follows:
Q1: Is it possible to design policy optimization algorithms for this problem setting?
A1: We would like to clarify that policy optimization has previously been applied to adversarial linear mixture MDPs [1][2]. However, none of these approaches can guarantee a horizon-free regret upper bound. The reason for this limitation is discussed in Section 6. Specifically, the regret of online mirror descent scales with the upper bound of the subgradient (see the bound below). When employing (multiplicative weights update/EXP3-type) policy optimization, the rewards accumulate over the horizon, so the scale of the Q-functions and of the corresponding subgradient grows with $H$, which inevitably introduces a polynomial dependency on $H$. Therefore, we resort to occupancy measures to tackle this challenge.
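To make the scaling explicit, recall the standard online mirror descent / Hedge guarantee (stated schematically, up to constants): with $A$ actions, $K$ rounds, losses whose magnitude is bounded by $G$, and a tuned learning rate,
$$\mathrm{Regret}_{\mathrm{OMD}}(K) = O\big(G\sqrt{K \log A}\big).$$
Under policy optimization the losses are the (estimated) Q-functions, whose scale grows with the horizon as rewards accumulate, so $G$ becomes polynomial in $H$; under the occupancy-measure formulation the loss in episode $k$ is the linear function $\langle \cdot, r^k \rangle$, which is bounded by the total episode reward, so $G = O(1)$ under our normalization.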
Q2: Is it possible to avoid the usage of occupancy measures (which are not quite efficient in the real world)?
A2: To our knowledge, existing approaches for tackling adversarial rewards are based on either policy optimization or occupancy measures. As we found it difficult to derive a horizon-free guarantee via policy optimization, we show that a feasible alternative is the occupancy measure-based approach. We will explore approaches other than occupancy measures in future work.
References:
[1] Cai, Q., Yang, Z., Jin, C., & Wang, Z. (2020, November). Provably efficient exploration in policy optimization. In International Conference on Machine Learning (pp. 1283-1294). PMLR.
[2] He, J., Zhou, D., & Gu, Q. (2022, May). Near-optimal policy optimization algorithms for learning adversarial linear mixture mdps. In International Conference on Artificial Intelligence and Statistics (pp. 4259-4280). PMLR.
This paper studies low-regret algorithms for linear mixture MDPs under adversarial losses. It proposes the first algorithm that achieves horizon-free regret of near-optimal order, a solid theoretical contribution. This paper also presents a lower-bound construction showing that an infinitely large state space is incompatible with horizon-free regret under adversarial rewards, an interesting trade-off not previously known for the stochastic reward setting.
Overall, this is a very solid theoretical RL paper.
Why not a higher score
Given the strong theoretical focus and the very specific setting that is being studied here, it might not be suitable for oral/spotlight presentation.
Why not a lower score
NA
Accept (poster)