Robust Policy Optimization with Evolutionary Techniques
An algorithm to adapt deep RL models to more complex versions of the environment they were trained on; this algorithm is theoretically and experimentally shown to converge to an optimal policy.
Abstract
Reviews and Discussion
This paper proposes "Evolutionary Robust Policy Optimization (ERPO)", which aims to adapt policies to an altered environment using fewer training steps while obtaining higher rewards and requiring lower overall computation time.
Strengths
see above
Weaknesses
This paper seems to be an incomplete work: it fails to illustrate its motivation and main technical contribution, and misses many important baselines (e.g., from the meta-learning literature) in its experiments. Hence, I think it is difficult to evaluate the novelty, effectiveness, and importance of its current version, which needs to be improved significantly.
Some comments:
- I wonder why evolutionary game theory is a proper solution to robust policy optimization.
- The title is too broad to reflect the main contribution of this work.
- Please add more baselines and compare ERPO with them in extensive environments in order to evaluate ERPO's soundness.
- Incorrect citation format.
Questions
see above
I wonder why evolutionary game theory is a proper solution to robust policy optimization.
Game-theoretic solutions have long been proposed in the learning domain, especially in the field of multi-agent learning ([1], [2], [3], [10]). Replicator dynamics, a key feature of evolutionary game theory that also features in our paper, has been used to analyze (reinforcement) learning ([4], [5], [7], [8], [9]). Additionally, we note that it has been used for collision avoidance and for adapting to changing environments [6]. The novelty of our work is to use this approach for policy adaptation under large distribution shifts, an area in which we identified gaps in the current literature. We believe that the theoretical guarantee of convergence to an optimal policy and the comparison to seven baselines across five environments solidify the case for evolutionary game theory as a solution to robust policy optimization.
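For reference, the standard single-population replicator dynamic that this line of work builds on (our paper uses a modified, state-wise variant of this update) is

$$\dot{x}_i = x_i \left[ f_i(x) - \bar{f}(x) \right], \qquad \bar{f}(x) = \sum_j x_j f_j(x),$$

where $x_i$ is the share of the population playing strategy $i$ and $f_i(x)$ is its fitness: strategies with above-average fitness grow in share, while those below average shrink.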
This paper seems to be an incomplete work: it fails to illustrate its motivation and main technical contribution
Motivation: As specified in Section 1, the main motivation is the lack of existing research on effectively adapting currently optimal policies to perturbed environments. Current deep-RL and planning methods overfit to the environment they are trained on and are not able to adapt effectively to large perturbations. Even domain randomization and other robust learning methods only work for smaller environmental changes, and cannot make much use of the learned policy once it is shown to be ineffective in a changed version of the environment.
Contributions: The main technical contribution, as specified in the Contributions part of Section 1, is an algorithm that we show, theoretically and experimentally, converges to an optimal policy. We also show experimentally that our algorithm outperforms many state-of-the-art algorithms by orders of magnitude in terms of wall-clock time, number of time-steps, and average reward.
...and misses many important baselines (e.g., from the meta-learning literature) in its experiments. Hence, I think it is difficult to evaluate the novelty, effectiveness, and importance of its current version, which needs to be improved significantly. Please add more baselines and compare ERPO with them in extensive environments in order to evaluate ERPO's soundness.
We note that we compare with the following seven baselines:
Algorithms trained from scratch over the new environment: (1) PPO, (2) A2C, (3) DQN;
Algorithms trained over the base model of the old environment: (4) PPO-B, (5) A2C-B, (6) DQN-B;
Domain randomization algorithms: (7) PPO-DR.
We demonstrate this across five environments, representing nearly all the available gym environments that use discrete state-action spaces. In our literature review, we did not encounter forms of meta-learning suitable for such applications. We are open to learning which specific baselines you would like us to additionally consider.
incorrect citation format
Thank you for pointing this out. We will make the change in the new draft.
References:
[1] Lanctot, Marc, et al. "A unified game-theoretic approach to multiagent reinforcement learning." Advances in Neural Information Processing Systems 30 (2017).
[2] Bloembergen, Daan, et al. "Evolutionary dynamics of multi-agent learning: A survey." Journal of Artificial Intelligence Research 53 (2015): 659-697.
[3] Jan 't Hoen, Pieter, and Karl Tuyls. "Analyzing multi-agent reinforcement learning using evolutionary dynamics." European Conference on Machine Learning. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004.
[4] Börgers, Tilman, and Rajiv Sarin. "Learning through reinforcement and replicator dynamics." Journal of Economic Theory 77.1 (1997): 1-14.
[5] Galstyan, Aram. "Continuous strategy replicator dynamics for multi-agent Q-learning." Autonomous Agents and Multi-Agent Systems 26 (2013): 37-53.
[6] Hennes, Daniel, Daniel Claes, and Karl Tuyls. "Evolutionary advantage of reciprocity in collision avoidance." Proc. of the AAMAS 2013 Workshop on Autonomous Robots and Multirobot Systems (ARMS 2013). 2013.
[7] Panozzo, Fabio, Nicola Gatti, and Marcello Restelli. "Evolutionary dynamics of Q-learning over the sequence form." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 28, No. 1. 2014.
[8] Ruijgrok, M., and Th. W. Ruijgrok. "Replicator dynamics with mutations for games with a continuous strategy space." arXiv preprint nlin/0505032 (2005).
[9] Tuyls, Karl, et al. "Extended replicator dynamics as a key to reinforcement learning in multi-agent systems." Machine Learning: ECML 2003, 14th European Conference on Machine Learning, Cavtat-Dubrovnik, Croatia, September 22-26, 2003, Proceedings. Springer Berlin Heidelberg, 2003.
[10] Tuyls, Karl, and Simon Parsons. "What evolutionary game theory tells us about multiagent learning." Artificial Intelligence 171.7 (2007): 406-416.
The manuscript presents a novel method for transfer learning that combines ideas from evolutionary game theory and reinforcement learning. The method operates only with discrete state-action spaces. Overall, the results showcase that the proposed method, ERPO, consistently outperforms several baselines.
Strengths
- The paper is generally well-written, easy to follow and the ideas are conveyed effectively.
- Strong empirical results on several tasks
- Theoretical convergence guarantees
Weaknesses
- No limitations or weaknesses are described in the text
- No hints on how the method can be extended to continuous state-action spaces are given
- The authors mention "Simulation models are generally simplistic and fail to consider environmental variables ... so they cannot be directly deployed in such applications". I have the following issues with the statement:
- Simulation models are not generally simplistic. Even the less realistic simulators contain quite sophisticated procedures and models behind them. Changing "simplistic" to "non-realistic" or just referring to the well-known Sim2Real gap is enough.
- Most real-world applications are continuous in nature and are quite difficult to discretize or solve with discrete state-action spaces. The manuscript proposes a method that can only operate with discrete state-action spaces. Thus, this sentence is not the best "motivation" for the proposed method.
- The above comment leads to my main "complaint" about the presented work: since most realistic robotic/autonomous systems applications are continuous in nature, how is the proposed method solving the issue it claims to solve? Moreover, none of the experiments have anything to do with robotic applications.
Typos/Minor comments
- Page 2, first paragraph: "approaches Rajeswaran et al. (2016)train" -> there is a space missing
- Page 3, Section 3, first paragraph: "theory (EGT) Smith (1982),Sandholm (2009)" -> there is a space missing
- Page 3, last sentence: "As we make state-wise updates, we modify the replicator equation to be We modify this replicator equation as follows:"
Questions
- What are the main limitations of ERPO?
- How can we extend ERPO to the continuous case?
- How can ERPO be useful in realistic applications of autonomous systems/robotics?
We thank the reviewer for their comments and address them as follows:
Weaknesses:
No limitations or weaknesses are described in the text
No hints on how the method can be extended to continuous state-action spaces are given
Questions:
What are the main limitations of ERPO?
How can we extend ERPO to the continuous case?
We will add a limitations and future work section in the updated draft, which we also summarize here:
Our setup is limited to discrete state-action spaces. We are working on an extension to continuous spaces. This will be carried out with function approximation using radial basis functions, which also update the policies of states within a certain distance of the state being updated. Additionally, because we normalize the probability distribution over the actions of a given state, a continuous model would instead work with a probability density function that can be updated using a Dirac delta function. Our setup is also limited to single-agent models (unless extended with independent learning). We are working on extensions that combine other game-theoretic solution concepts for cooperative multi-agent learning. We also note that ERPO works hand-in-hand with other methods of learning.
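As a rough illustration of the radial-basis-function idea above (a minimal sketch under our own assumptions about anchor states and kernel choice, not a finished design), the update applied at a visited continuous state could be spread to nearby anchor states as follows:

```python
import numpy as np

def rbf_weighted_update(policy, anchor_states, state, action, fitness_ratio, bandwidth=0.5):
    """Spread an ERPO-style update for (state, action) to nearby anchor states.

    policy        : array of shape (n_anchors, n_actions); rows sum to 1
    anchor_states : array of shape (n_anchors, state_dim)
    state         : visited continuous state, shape (state_dim,)
    fitness_ratio : estimated E[f(tau_{(s,a)})] / E[f(tau_{(s)})] for this pair
    """
    # Gaussian RBF similarity between the visited state and each anchor.
    dists = np.linalg.norm(anchor_states - state, axis=1)
    weights = np.exp(-(dists / bandwidth) ** 2)

    # A weight of 0 leaves an anchor untouched; a weight of 1 applies the full ratio.
    policy[:, action] *= 1.0 + weights * (fitness_ratio - 1.0)

    # Renormalize each anchor's action distribution.
    policy /= policy.sum(axis=1, keepdims=True)
    return policy
```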
The authors mention "Simulation models are generally simplistic and fail to consider environmental variables ... so they cannot be directly deployed in such applications". I have the following issues with the statement:
Simulation models are not generally simplistic. Even the less realistic simulators contain quite sophisticated procedures and models behind them. Changing "simplistic" to "non-realistic" or just referring to the well-known Sim2Real gap is enough.
We agree with the reviewer and this will be corrected in the latest draft.
Most real-world applications are continuous in nature and are quite difficult to discretize or solve with discrete state-action spaces. The manuscript proposes a method that can only operate with discrete state-action spaces. Thus, this sentence is not the best "motivation" for the proposed method.
The above comment leads to my main "complaint" about the presented work: since most realistic robotic/autonomous systems applications are continuous in nature, how is the proposed method solving the issue it claims to solve? Moreover, none of the experiments have anything to do with robotic applications.
Questions:
How can ERPO be useful in realistic applications of autonomous systems/robotics?
This is indeed a valid concern, and we hope that the results established in this paper for discrete state/action spaces can extend to realistic environments. However, even within the space of discrete state/action environments, some environments are used in realistic applications. A prominent example is the multi-robot warehouse environment that is commonly used as a benchmark for multi-agent path finding (and used in actual package fulfillment centers). We have carried out experiments on single-agent (and multi-agent) versions of this environment, but did not showcase them in the paper as the environment is designed for multi-agent tasks. We will include some of our results to provide an example of a realistic environment. For complex environments with large state-action spaces, deep learning/function-approximation approaches are suitable tools to supplement ERPO, which we will mention in the paper. We will rephrase our introduction and motivation to include generalized reach-avoid/path-planning problems that might undergo distribution shifts.
Typos/Minor comments
> Page 2, first paragraph: "approaches Rajeswaran et al. (2016)train" -> there is a space missing
> Page 3, Section 3, first paragraph: "theory (EGT) Smith (1982),Sandholm (2009)" -> there is a space missing
> Page 3, last sentence: "As we make state-wise updates, we modify the replicator equation to be We modify this replicator equation as follows:"
Thank you for pointing out the typos; we will correct them.
Thank you for the comments/replies. I have no further comments.
The paper addresses the problem of adapting reinforcement learning (RL) policies to significant changes in the environment dynamics in robust RL. Many existing methods, like domain randomization and robust policy optimization, fail when test environments differ substantially from training. The authors propose an evolutionary robust policy optimization (ERPO) approach to adapt policies without full retraining. Assuming access to the optimal black-box policy on the original environment, ERPO explores the new environment using an $\epsilon$-soft version of this policy. It incrementally improves the policy by weighting state-action pairs from fitter trajectories more highly, inspired by evolutionary game theory. Experiments show superior results against methods only trained on the old environment (referred to as base models) or only trained on the new environment. They also compare their algorithm to domain randomization methods such as PPO-DR.
Strengths
The paper's setup tackles an important real-world challenge: adapting black-box RL policies without full retraining. The simplicity and intuitiveness of ERPO are also strengths. Updating actions based on relative expected rewards is an interesting idea, and this evolutionary approach avoids needing gradients for the new environment. However, there are some major concerns about the authors' implementation of this idea, the theory, the writing quality, and the experiments. Specific issues are detailed in the next section; I recommend rejection.
Weaknesses
The main concern is about the soundness of the algorithm and theoretical claims. Even if we accept the sparse reward assumption, the conclusion of the authors that "the value of a state can be approximated by the average return across all trajectories containing the state" is an intuitive statement and needs concrete evidence. Even if we accept this argument, the proof of Theorem 1 is still incorrect. In particular, the proof mixes up the behavior policy and the learned policy and assumes they have the same sampling distribution, which is clearly not the case. Moreover, I think the current proof, even after fixing all the previous issues, would not work unless the update is defined as $\pi^{i+1}(a|s) = \pi^i(a|s) \times \frac{\mathbb{E}[f(\tau_{(s,a)})]}{\mathbb{E}[f(\tau_{(s)})]}$, i.e., with $\mathbb{E}[f(\tau_{(s)})]$ in the denominator.
Besides the previous concerns, I think the empirical comparison to other methods is unfair. In particular, the proposed method essentially uses the information from the old environment (through the optimal policy) and the data from the new environment. A more fair comparison would initialize the policy of any method that trains on the new environment as the previous policy (e.g., using a cross-entropy loss). The fact that PPO eventually gets to the optimal solution (even without the correct initialization) suggests that initializing its policy with $\pi_{old}$ will result in comparable or even better results than ERPO.
In summary, concerns include the following:
- Unjustified approximations in analysis
- Logical gaps in the convergence proof
- Unfair comparative evaluations against methods not exploiting old policy information
Questions
Please refer to the previous section.
The main concern is about the soundness of the algorithm and theoretical claims. Even if we accept the sparse reward assumption, the conclusion of the authors that "the value of a state can be approximated by the average return across all trajectories containing the state" is an intuitive statement and needs concrete evidence.
We follow the definition of Sutton et al. [1]:

$$v_{\pi}(s) = \mathbb{E}[G_t \mid S_t = s] \tag{1}$$

where $G_t$ is the return from time $t$, when state $s$ occurs. According to our sparse-reward assumption, every reward $r^{k+1}$ is $0$ for the transition at step $k$ unless it reaches the goal/terminal state. Therefore:

$$G_t = \sum_{k=t}^{T-1} \gamma^k r^{k+1} = \sum_{k=t}^{T-2} \gamma^k r^{k+1} + \gamma^{T-1} r^{T} \approx \gamma^{T-1} r^{T} \tag{2}$$

which indicates that $G_t \approx \gamma^{T-1} r^{T}$ for any $t$. As we have defined:

$$G(\tau) = \sum_{k=0}^{T-1} \gamma^k r^{k+1} \tag{3}$$

From (1), (2), and (3) we can say that $v_{\pi}(s) \approx \mathbb{E}[G(\tau_{(s)})]$, where $\tau_{(s)}$ denotes a trajectory containing state $s$.
The proof of Theorem 1 is still incorrect. In particular, the proof mixes up the behavior policy and the learned policy and assumes they have the same sampling distribution, which is clearly not the case.
We remark that using a behavior policy for exploration/training while updating the policy we wish to learn is a technique used in many off-policy methods such as DQN (or any algorithm trained with replay buffers). The convergence of such algorithms does not rely on the sampling distributions of the two policies being equivalent. Further, we note that the behavior policy at iteration $i$ acts as an $\epsilon$-soft version of the learned policy $\pi^i$, assigning weight $\epsilon$ to the old policy and $1-\epsilon$ to $\pi^i$. Our algorithm therefore continues to train while decrementing $\epsilon$ at each iteration until it is sufficiently small (i.e., $\epsilon \approx 0$), by which point the behavior policy is an $\epsilon$-soft version of $\pi^i$. For a sufficiently small $\epsilon$, an $\epsilon$-soft version of $\pi^i$ converges to optimal if $\pi^i$ converges to optimal.
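For concreteness, a rough tabular sketch of this mixing scheme (our own illustration; the variable names and the decay schedule are assumptions, not the exact implementation in the paper):

```python
import numpy as np

def behavior_policy(pi_old, pi_learned, epsilon):
    """Exploration policy: weight epsilon on the old (pre-shift) policy and
    1 - epsilon on the currently learned policy. The mixture of two
    row-stochastic matrices is itself row-stochastic."""
    return epsilon * pi_old + (1.0 - epsilon) * pi_learned

# Toy illustration with random tabular policies (5 states, 3 actions).
rng = np.random.default_rng(0)
pi_old = rng.dirichlet(np.ones(3), size=5)
pi_learned = rng.dirichlet(np.ones(3), size=5)

epsilon, decay = 1.0, 0.9
for iteration in range(50):
    b = behavior_policy(pi_old, pi_learned, epsilon)
    # ... roll out trajectories with b and apply the ERPO update to pi_learned ...
    epsilon *= decay  # decrement epsilon until it is sufficiently small
```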
Moreover, I think the current proof, even after fixing all the previous issues, would not work unless the update is defined as

$$\pi^{i+1}(a|s) = \pi^i(a|s) \times \frac{\mathbb{E}[f(\tau_{(s,a)})]}{\mathbb{E}[f(\tau_{(s)})]}$$
We agree that this was an oversight on our part, an honest confusion while fleshing out the details. Our experiments were indeed performed using the update rule above rather than the erroneous one. We will correct this error wherever it appears in our paper.
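For concreteness, here is a minimal tabular sketch of this corrected update (our own illustrative code, assuming trajectory fitness is the discounted return; it is not the paper's implementation):

```python
from collections import defaultdict
import numpy as np

def erpo_style_update(policy, trajectories, gamma=0.99):
    """Scale pi(a|s) by E[f(tau_{(s,a)})] / E[f(tau_{(s)})] and renormalize.

    policy       : dict mapping state -> np.ndarray of action probabilities
    trajectories : list of (states, actions, rewards) tuples from one iteration
    """
    fitness_sa = defaultdict(list)  # fitness of trajectories containing (s, a)
    fitness_s = defaultdict(list)   # fitness of trajectories containing s

    for states, actions, rewards in trajectories:
        ret = sum(gamma**k * r for k, r in enumerate(rewards))  # trajectory fitness
        for s, a in set(zip(states, actions)):
            fitness_sa[(s, a)].append(ret)
        for s in set(states):
            fitness_s[s].append(ret)

    for (s, a), vals in fitness_sa.items():
        denom = np.mean(fitness_s[s])
        if denom > 0:                      # guard against zero average fitness
            policy[s][a] *= np.mean(vals) / denom
    for s in fitness_s:
        policy[s] = policy[s] / policy[s].sum()  # renormalize to a distribution
    return policy
```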
Besides the previous concerns, I think the empirical comparison to other methods is unfair. In particular, the proposed method essentially uses the information from the old environment (through the optimal policy) and the data from the new environment. A more fair comparison would initialize the policy of any method that trains on the new environment as the previous policy (e.g., using a cross-entropy loss). The fact that PPO eventually gets to the optimal solution (even without the correct initialization) suggests that initializing its policy with $\pi_{old}$ will result in comparable or even better results than ERPO.
Please refer to the Baselines paragraph in Section 4 (Experiments), where we mention that we have in fact also trained all our algorithms with the old information (base model with the old optimal policy), denoted PPO-B, A2C-B, and DQN-B respectively. Tables (a)-(f) show that these models do outperform PPO, DQN, and A2C in some instances, such as the CliffWalking, Taxi, and DistributionShift environments, as you suggested. However, while they perform comparably at times, they do not outperform ERPO in any instance except Level 2 of FrozenLake.
Perhaps our framing of these experiments and observations was not explicit enough; we will highlight these comparisons more clearly and re-word the corresponding descriptions in the revised draft.