Curriculum Reinforcement Learning via Morphology-Environment Co-Evolution
Abstract
Reviews and Discussion
This paper presents a curriculum reinforcement learning approach, MECE, which optimizes an RL agent's morphology and environment through co-evolution. The authors train two policies to automatically modify the morphology and change the environment, creating a curriculum for training the control policy. Experimental results demonstrate that MECE significantly improves generalization capability compared to existing methods and achieves faster learning. The authors emphasize the importance of the interplay between morphology and environment in brain-body co-optimization.
Strengths
- The paper is well-structured, with a relatively clear introduction.
- The paper includes comprehensive experiments on rigid robot co-design tasks, demonstrating the superiority of the proposed algorithm. The ablation studies effectively isolate each component's contribution and provide valuable insights into the algorithm's effectiveness.
Weaknesses
The significance of the paper's contributions is a bit unclear. It is not the first to propose a co-evolution method to co-design brain, body, and environment. The proposed method should be compared with stronger baselines. Also, can this system be extended to the real world, and how?
Questions
- How general is the proposed approach, beyond the tasks and environments considered in the experiments?
- Is the proposed MECE method computationally efficient?
- Have you encountered any scalability issues when applying MECE to more complex tasks or environments?
- It is not clear to me how environments are produced and how the agents perform in your environments (Figure 4); do you have a video?
- It seems that MECE's performance is not much better than Transform2Act; can you provide more results on different tasks?
Thank you for your comments and suggestions! We address your main concerns as below:
The significance of the paper's contributions is a bit unclear.
- To the best of our knowledge, MECE is the first paper to study the co-evolution between an agent's morphology and its training environments, and to use this co-evolution to create an efficient curriculum for training an embodied agent with better generalization. Would you kindly point us to papers that have previously examined this problem?
- Our baseline methods include the SOTA algorithm for training an agent to adapt to diverse environments (POET), the SOTA morphology optimization method utilizing RL to evolve the agent's morphology in a fixed environment (Transform2Act), and a widely studied method that models an agent's morphology as a GNN (NGE).
How general is the proposed approach, beyond the tasks and environments considered in the experiments?
- In MECE, the simulator is MuJoCo, which is widely used in RL.
- We did not design special constraints for environments because the learned environment policy automatically avoids generating unlearnable environments. Furthermore, tasks are not accompanied by any form of extrinsic reward signal. Hence, MECE has the potential to be applied to a wide range of tasks and circumstances, extending beyond the confines of our experimental settings.
Is the proposed MECE method computationally efficient?
- MECE training is more computationally efficient than all the compared baselines. For all the test environments, MECE took around 44 hours to train on a standard server with 8 CPU cores and an NVIDIA RTX 3090 Ti GPU, while Transform2Act requires around 63 hours on the same server.
Have you encountered any scalability issues when applying MECE to more complex tasks or environments?
- The 3d-locomotion environment is a complex environment for RL navigation tasks. If we do not set any constraints on the environment policy, then in earlier training stages the immature environment policy may change the environment on a large scale, which can make the training environments overly hard or overly simple. Hence, we limit the scale of the changes the environment policy can apply to the environment in each step.
It is not clear to me how environments are produced and how the agents perform in your environment (Figure 4), do you have a video?
- We introduce how the environment policy generates and controls the environments in Appendix A.
- We will release the source code with videos.
It seems that MECE's performance is not much better than Transform2Act, can you provide more results on different tasks?
- We respectfully disagree with the assertion that this is "not much superior to Transform2Act." In Figure 2, MECE outperforms the modified Transform2Act (trained on diverse environments) by around 20% in terms of rewards and outperforms Transform2Act-Original by approximately 60%.
Dear Reviewer,
We haven't heard from you in the rebuttal phase. Since we are approaching the last day of the reviewer-author discussion, it would be really nice of you to confirm if the concerns were successfully addressed by our reply. We are looking forward to your feedback and we kindly expect that you can raise the score if all your main concerns are resolved.
Thanks!
Best regards,
Authors
Thank you for the response, I have no further questions.
Thank you for reading our responses and providing valuable feedback! If the rebuttal has addressed your concerns, we would really appreciate it if you would consider raising your score.
Your constructive input remains invaluable to us, and we appreciate your dedication to enhancing the quality of our manuscript. Thank you for your time and consideration.
Dear reviewer,
Would you mind validating whether our rebuttal addresses your concerns:
- We explained the generalization and contribution of MECE.
- We explained that MECE is not computationally expensive.
Please feel free to ask any additional questions after reading our rebuttal. During this author-reviewer discussion period, we hope to address any remaining concerns.
Thank you,
The authors
This paper addresses the problem of joint optimization of the policy and the morphology of a learning agent. The authors’ motivation is described in the claim written in the introduction: “a good morphology should improve the agents’ adaptiveness and versatility, i.e., learning faster and making more progress in different environments.” To realize it, the authors propose a novel framework where the morphology and the training environment are jointly evolved. In the proposed MECE scheme, three policies are introduced: one for the control of the agent’s actions, one for the evolution of the morphology, and one for the evolution of the training environment. Inside this scheme, the authors define reward functions for the training of the morphology policy and for the training of the environment policy. The authors have performed comparisons with several baseline approaches on three control tasks, and ablation studies have been conducted to confirm the effectiveness of each algorithmic component.
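To make the scheme in this summary concrete, below is a minimal sketch of such a three-policy co-evolution loop. Every name and interface in it (`rollout`, `evaluate`, `.modify`, `.update`, etc.) is an illustrative assumption rather than the paper's actual code, and the reward proxies are only schematic stand-ins for the paper's Eq. (1) and Eq. (2).

```python
# Minimal sketch of a three-policy co-evolution loop as described in the
# summary above. All names and interfaces are illustrative assumptions;
# the reward proxies are schematic stand-ins, not the paper's exact Eq. (1)/(2).

def mece_sketch(control_policy, morph_policy, env_policy,
                morphology, environment,
                rollout, evaluate, num_iterations=100):
    """rollout(policy, morph, env) -> training data; evaluate(...) -> scalar return."""
    prev_morph_gain = 0.0
    for _ in range(num_iterations):
        # Train the control policy on the current morphology/environment pair.
        control_policy.update(rollout(control_policy, morphology, environment))

        # Morphology step: reward the morphology policy with the control
        # policy's progress induced by the morphology change (cf. Eq. (1)).
        before = evaluate(control_policy, morphology, environment)
        morphology = morph_policy.modify(morphology)
        control_policy.update(rollout(control_policy, morphology, environment))
        morph_gain = evaluate(control_policy, morphology, environment) - before
        morph_policy.update(reward=morph_gain)

        # Environment step: reward the environment policy with the progress of
        # the morphology policy, here crudely proxied by how much the latest
        # morphology gain improved over the previous one (cf. Eq. (2)).
        env_policy.update(reward=morph_gain - prev_morph_gain)
        prev_morph_gain = morph_gain
        environment = env_policy.modify(environment)

    return control_policy, morphology
```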
Strengths
- A novel framework for morphology optimization aiming at obtaining a morphology under which a control policy can be quickly adapted to an unseen environment.
- Promising empirical results compared to baseline approaches, but the experimental procedure is questionable (see below).
Weaknesses
- As far as I understand from what is written in the introduction, the motivation of the morphology optimization in this paper is to obtain a morphology with which the agent can adapt its policy quickly to unseen tasks. It is also written in the second question of the experiments. However, it seems that the reported results in the figures are average performances of the agent obtained at each training time step on randomly selected environments. Therefore, the performance evaluated in this paper is that of domain randomization, which is different from the motivation. The efficiency of adapting the policy under the obtained morphology is not evaluated. My understanding might be wrong, as the evaluation procedure was not clearly stated. Please clarify this point.
- It would be better if the design choices of the proposed approach were elaborated more. In particular, it is not clear how the reward functions (1) and (2) reflect the authors’ hypotheses “a good morphology should improve the agent’s adaptiveness and versatility, i.e., learning faster and making more progress in different environments” and “a good environment should accelerate the evolution of the agent and help it find better morphology sooner”. It is also not clear why the authors want to train policies for morphology evolution and environment evolution instead of just optimizing the probability distributions over these spaces, despite the fact that these policies are not used afterwards and only the obtained morphology is used in the test phase.
- The clarity of the explanations could be improved. First, the notation inconsistencies make it confusing. For example, r^m vs r_m, r^E vs r_e, and E and Env. If they are the same, please use the same notation. Algorithm 2 was also not very clear. How could pi_m be updated by using D when the transition history doesn’t necessarily have reward information r_m? The same applies for pi_e.
Questions
Please clarify the points given in the weakness section.
Thank you for your comments and suggestions! We will correct the typos following your suggestions. We address your main concerns as below:
The setting of experiments is different from the motivation.
- Our evaluation is consistent with the motivation and the 2nd question at the beginning of the experiments: "Does our method create agents more adaptive and generalizable to diverse new environments?". As stated in the 2nd line of Section 5.2, we evaluated MECE and its learned morphology in 12 diverse and unseen environments (random variants of 2d-locomotion, 3d-locomotion, or Gap-crosser) under 6 random seeds with the morphology fixed. The experimental results show that the morphology evolved by MECE can generalize better to the tasks in diverse unseen environments.
- MECE does not require finetuning for the unseen environments at test time, unlike the Transform2Act and NGE methods, which still require this additional step. Furthermore, we present the performance of the baselines without finetuning in the table below (a minimal sketch of this zero-shot protocol follows the table).
| Method | Performance (mean ± std) |
|---|---|
| MECE | 4252.26 ± 393.45 |
| Transform2Act | 3644.28 ± 518.77 |
| Transform2Act (without finetuning) | 1990.54 ± 424.03 |
| NGE | 1874.65 ± 816.51 |
| NGE (without finetuning) | 942.30 ± 566.22 |
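For clarity, here is a minimal sketch of the zero-shot protocol described above (12 unseen environment variants, 6 random seeds, morphology fixed, no fine-tuning); `sample_unseen_env` and `evaluate_return` are assumed, illustrative helpers rather than functions from the released code.

```python
import statistics

def zero_shot_eval(control_policy, fixed_morphology,
                   sample_unseen_env, evaluate_return,
                   num_envs=12, num_seeds=6):
    """Evaluate a trained control policy and a fixed evolved morphology on
    unseen environment variants without any fine-tuning."""
    returns = []
    for env_id in range(num_envs):
        for seed in range(num_seeds):
            env = sample_unseen_env(env_id, seed)  # random variant of the task
            returns.append(evaluate_return(control_policy, fixed_morphology, env))
    # Report the mean and standard deviation of the collected returns.
    return statistics.mean(returns), statistics.stdev(returns)
```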
How do the reward functions (1) and (2) contribute to our statement on "a good morphology" and "a good environment"? + Why do we need the morphology policy and the environment policy?
- We explain the motivation of designing the two reward functions in the 4th and 5th paragraphs of Section 3, in particular, the text above Eq. (1) and Eq. (2). They are in line with the statements that "a good morphology should make the agent learn faster and make more progress in different environments" (i.e., greater progress of the control policy in Eq. (1)) and "a good environment should accelerate the evolution of the agent and help it find better morphology sooner" (i.e., greater progress of the morphology policy in Eq. (2)). A schematic reading of these two rewards is sketched after this list.
- The training and co-evolution of the morphology policy and the environment policy is the key to the morphology optimization and to gaining generalization to diverse unseen environments. They jointly create a special adaptive curriculum for the training phase, through which the agent evolves its morphology and generalization capability. In particular, the environment policy creates a curriculum of environments for optimizing the morphology and the control policy, while the morphology policy creates a curriculum of morphologies for the environment changes and the control policy's learning. They improve the efficiency of exploration through (1) interactions with the MDP using different morphologies and environments; (2) the structural constraints and correlations captured by the GNN; (3) more directed evolution than random mutation.
- In our thorough ablation studies I, II, and IV, we have extensively evaluated the importance and impact of the morphology policy, the environment policy, and their reward designs in morphology optimization and generalization to new environments.
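In schematic form (an illustrative reading only, not the paper's exact Eq. (1) and Eq. (2); here $J$ denotes the control policy's return, $m$ the morphology, and $E$ the environment):

$$ r_m \;\approx\; J(\pi \mid m_{\text{new}}, E) - J(\pi \mid m_{\text{old}}, E), \qquad r_e \;\approx\; r_m^{(E_{\text{new}})} - r_m^{(E_{\text{old}})}, $$

i.e., the morphology policy is rewarded when its edit improves the control policy's performance in the current environments, and the environment policy is rewarded when its change enables larger subsequent improvements from morphology updates.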
The clarity of the explanations could be improved.
- Thank you for this suggestion. We have updated all unclear notations and expressions following your suggestions.
Thank you for your clarification. However, I am even more puzzled by the response to the first question. If I understand correctly what you say, the objective of the joint optimization of the morphology and the policy is to have a high zero-shot performance, rather than to obtain a morphology with which a policy can be quickly learned for an unseen task. That is, the objective is the same as that of domain randomization, curriculum learning, etc., without morphology optimization, i.e., a fixed morphology. I believe that such baseline approaches, including (the original) enhanced POET, must be included and the advantage of the proposed approach over these baselines must be empirically shown.
Dear reviewer,
Would you mind validating whether our rebuttal addresses your concerns:
- The consistency of our experiments with the motivation, along with additional results.
- The design of the reward functions for the morphology policy and the environment policy.
Please feel free to ask any additional questions after reading our rebuttal. During this author-reviewer discussion period, we hope to address any remaining concerns.
Thank you,
The authors
Thank you for your response. We would like to further address your concern as follows:
- The evaluation tasks presented in the paper are characterized by a higher level of difficulty and generality compared to either morphology optimization alone or domain-specific generalization. Specifically, our objective is to acquire both a general morphology and a general policy that can be directly applied to unseen environments and tasks without the need for further fine-tuning. To the best of our knowledge, the majority of existing work, including the baselines we have evaluated, does NOT directly tackle this problem and was NOT specifically developed to address this particular issue. To provide equitable comparisons, it is necessary to make some modifications to these baselines. The comparisons have already incorporated enhanced POET, as indicated in Figure 2 and the table provided in our rebuttal.
- Our current discussion is about "what is a good morphology or a policy for the test/inference phase". In our introduction section (and many places), we are discussing good morphologies that exhibit higher learning efficiency during the training phase, rather than the test phase (since the adaptation cost is 0 for the morphology evolved by MECE at test time). The MECE framework constructs a mechanism that encourages the morphology policy to discover morphologies that can achieve higher learning progress in diverse environments, utilizing the reward design in Eq. (1).
Please let us know if we misunderstood your comment regarding our response to your first question.
This paper presents an approach to co-optimize both the morphology and environments of robots. The morphology and controller of the robot are updated, while the environment is progressively changed. The employed co-evolutionary approach results in environments that progressively get more complex, providing a good learning signal for the agent. The approach is compared to ablated versions, which demonstrate that the co-evolution of morphology and environment is beneficial, in addition to comparisons with modifications of methods such as POET, which typically only optimize the robot’s controller but not its morphology.
Strengths
- Interesting approach that could make robots more robust to varying environments
- Good ablation baseline comparisons
Weaknesses
- Environment modifications seem limited (e.g. only environment roughness in the case of the 2D environment)
- Comparisons to other methods are a bit ad hoc, e.g. as the authors note, POET was not developed to deal with changing morphologies. In addition to randomly sampling environments here, I would suggest a slightly more advanced baseline that samples environments of increasing complexity
Minor comment:
"CMA-ES (Luck et al., 2019) optimizes robot design via a learned value function.” -> their method is not called CMA-ES. CMA-ES is used an evolution strategy for optimisation
Questions
- "When the control complexity is low, evolutionary strategies have been successfully applied to find diverse morphologies in expressive soft robot design space” -> how does the control complexity in this paper compare to the one by Cheney et al.? One could say the soft robots in Cheney et al. (2013) are more complex than the robots co-evolved in this paper.
- How expensive is the approach of co-evolving the three different policies? And how does the computational complexity compare to the other baseline approaches?
- It would be good to see some pictures of the evolved environments
- What would happen if you start 3d-locomotion and gap-crosser with the same initial robot as in 2d-locomotion? There already seems to be a lot of bias given with the initial design.
Thank you for the suggestions! We have revised the draft by following your comments. We address your questions as below:
The control complexity compared to Cheney et al. One could say the soft robots in Cheney et al. (2013) are more complex than the robots co-evolved in this paper.
- Whether a soft-bodied robot is more complex than a rigid-bodied robot is still an open problem with different opinions from different studies. In this paper, we only focus on the optimization in the space of rigid-bodied robots. We will make it clear in the new version.
How expensive is MECE co-evolving the three different policies?
- Training MECE is not costly compared to baselines. For example, MECE took around 44 hours to train on a standard server with 8 CPU cores and an NVIDIA RTX 3090 Ti GPU, while Transform2Act took around 63 hours on the same server.
Comparison with POET.
- Yes, we agree that POET is not designed for changing morphologies. But this baseline was requested by a previous reviewer. To ensure a fair comparison, we have made several modifications (detailed in point (3) of the ''Baseline'' paragraph in Section 5.1).
"I would suggest a slightly more advanced baselines that samples environments of increasing complexity"
- Thank you for the suggestion! We already included such a baseline in our comparison (e.g., in Figure 2): Enhanced POET starts with a simple environment and then gradually creates and adds new environments with increasing diversity and complexity.
- As mentioned in the paper, MECE stands as the pioneering approach in discussing co-evolution between the agent's morphology and the training environment. Finding baselines that align properly with this setting might be challenging.
What happens if we start with the same initial robot as in 2d-locomotion in other environments?
- The initial agent in gap-crosser is the same as that in 2d-locomotion. We tried to initialize the same agent in 3d-locomotion, but it did not work. This is because 3d-locomotion has one more dimension (the XYZ space), and it is hard to ensure the final performance of the evolving morphologies without limitations on the initial design.
"It would be good to see some pictures of the evolved environments."
- We will release the code with videos.
Thank you for the clarifications. I'm still in doubt about "There already seems to be a lot of bias given with the initial design." One could say that the final designs in your paper are only (slight) modifications of the designs already given at the start. You say "We tried to initialize the same agent in 3d-locomotion but it didn't work. This is because 3d-locomotion has one more dimension" but shouldn't the algorithm be able to evolve such a creature? It seems rather limiting if most of the design has to be given by the human experimenter.
I saw the POET baseline, but this is not exactly the same as "I would suggest a slightly more advanced baseline that samples environments of increasing complexity"? I had a much simpler approach than POET in mind, something more similar to adaptive domain randomisation (e.g. used here: https://arxiv.org/abs/1910.07113).
Thank you for your response! We respectfully express our disagreement with your remarks as follows:
- The initial morphologies we used for all environments/tasks follow the common settings in the community. For instance, Transform2Act employs the same initial morphology as ours, while NGE/TAME employ random sampling and mutation within the domain of quadruped robots.
- The evolved morphology of MECE is not a "slight" modification of the initial one. The evolved morphology has not only a complex skeletal structure but also possesses intricate joint attributes (refer to the 2nd paragraph on page 4, and Appendix A.1). In contrast to the baselines, the morphology generated by MECE has reduced skeletal complexity while demonstrating superior performance. This observation aligns with the principles governing natural evolution.
- The contribution of MECE is not limited to adapting to domain randomization. Instead, MECE aims to build a more efficient mechanism that learns to generate a new agent with a novel morphology that is able to adapt to diverse environments. In the future, we will extend the MECE framework to encompass real-world tasks/environments.
Dear reviewer,
Would you mind validating whether our rebuttal addresses your concerns:
- We have clarified the questions about the experiments in MECE.
- MECE is not computationally expensive.
- We will release the source code with videos.
Please feel free to ask any additional questions after reading our rebuttal. During this author-reviewer discussion period, we hope to address any remaining concerns.
Thank you,
The authors
The paper proposes a novel algorithm for the adaptation of the morphology and controllers of a virtual agent, through a curriculum learning process that can "tune" the complexity of the environment in which the agent operates.
This is a novel idea, which the reviewers acknowledge, and which I personally find very interesting and fascinating.
Based on the reviewers' feedback and discussion, the current manuscript has a few serious issues:
- The main concern of the reviewers is that the experimental evaluation is not sufficiently rigorous and extensive. While the results shown at the moment are promising and seem to support the value of the proposed algorithm, I do agree with the reviewers that a more thorough evaluation would certainly strengthen the paper.
- It is not clear from the manuscript to which classes of problems this algorithm could be applied. This was explicitly asked by reviewer cQrp, without an answer during the rebuttal. Answering this question would help in clarifying the significance of the contribution.
- I find the related work section unsatisfying, since a pretty extensive literature exists on the topic of morphology co-evolution, and many of these works could be considered feasible baselines. Instead, the authors seem to be focusing on only a few very recent papers.
In addition, I would also encourage the authors to clearly state what is the goal that they are trying to achieve with this algorithm (e.g., "be directly deployed in unseen diverse environments without the need for further fine tuning") since, based on the reviewers' feedback, and my own reading of the paper, it is not obvious (e.g., in the conclusions you talk about several different metrics).
Overall, the work is novel, interesting, and promising. However, it is currently lacking in terms of experimental evaluation and related work.
Why not a higher score
The paper is clearly borderline. The reason for using the "lower score" is two-fold: 1) I find some of the answers in the rebuttal to be weak; 2) it is unclear to me what exact application the authors have in mind; while compelling, the algorithm is feasible in simulated worlds where it is possible to tune the complexity of the environment. Moreover, it seems that many different metrics are used interchangeably when motivating the algorithm (data efficiency, overall performance, generalization, etc.), which is very odd.
Why not a lower score
N/A
Reject