PaperHub

Eureka: Human-Level Reward Design via Coding Large Language Models

ICLR 2024 (Poster) · 4 reviewers
Submitted: 2023-09-24 · Updated: 2024-03-28

Overall rating: 6.3/10 (individual ratings: 8, 5, 6, 6; min 5, max 8, standard deviation 1.1)
Average confidence: 4.0

Keywords: Large Language Models, Reinforcement Learning, Dexterous Manipulation, Reward Learning, Robotics

Reviews and Discussion

Official Review
Rating: 8

This paper proposes a novel pipeline that harnesses an LLM to design rewards for solving diverse tasks with reinforcement learning, without any task-specific template. The pipeline uses the environment code and reward feedback as input to the LLM and lets the LLM continuously update the reward function. Comprehensive experiments demonstrate the generalization ability of the LLM for reward design and the effectiveness of the LLM-designed rewards (comparable to human-designed ones).

Strengths

  1. This paper proposes a general pipeline for reward design in RL, which is indeed a long-standing problem in RL research; this kind of pipeline may save humans a lot of time in shaping rewards.
  2. The pipeline requires no task-specific template for the LLM, which shows great generalization to different tasks.
  3. This paper demonstrates that combining the LLM with evolutionary search is an effective way of using LLMs, and it has potential for other areas.

Weaknesses

Time cost and compute resource cost: Since the LLM samples tens of reward candidates in each iteration, and all of these reward functions need to be trained and evaluated to obtain feedback, multiple experiments must be run simultaneously for each reward sample, so it seems that a lot of compute resources are needed. For each sampled reward, how many environments do you create for training? How do you decide when to terminate each running experiment? Is it possible that an effective reward sample is missed because not enough environments are created or training does not run long enough? What is the total time for finding a good reward, and on what kind of device?

Questions

  1. Can this method be used for tasks where the objects are diverse?
  2. If the task requires images as input, it is hard to run many environments simultaneously; can this method still work?
  3. If only a success reward is given as the initial reward, can the LLM find a more optimal reward?

Ethics Concerns

No ethics concerns.

Comment

Dear reviewer JeMn,

Thank you for your thoughtful comments and suggestions! Here, we respond to the specific questions and comments the reviewer raises. Please let us know if you have lingering questions and whether we can provide any additional clarifications during the discussion period to improve your rating of our paper.


Question/Comment 1: What is the computational cost of Eureka?

We have added the computation resources as well as run time of Eureka in Appendix D.4. In short, a single run of Eureka, with the default hyperparameters in our experiments, can be executed on 4 32GB memory GPUs (e.g. NVIDIA V100) and take less than one day of wall clock time even for the most complex environments (e.g. bimanual manipulation). Here, we also address some related questions from the reviewer:

(a) For each sampled reward, how many environments do you create to train?

Response: The number of environments created for training is determined by the task-specific hyperparameters set in the original benchmarks. For example, the number of environments is 2048 for Humanoid. We find the default number to work well for all our rewards, demonstrating the robustness of our Eureka approach.

(b) How do you decide when to terminate each running experiment?

Response: We do not perform early termination for RL runs. Fixing an environment, each RL run will execute for a fixed number of episodes according to the default task-specific hyperparameters provided in the benchmarks. That said, we believe that it is possible to use early termination to filter out unpromising reward candidates. This is an effective strategy for reducing the overall computational cost, and we leave it to future work for investigating this as computational efficiency is orthogonal and complementary to our core contribution of automated reward design.

(c) Is it possible that RL training is not long enough so it will miss an effective reward sample?

Response: This is certainly possible, but in practice, we find the benchmark default training hyperparameters are sufficient for discovering good reward functions across our tasks. With additional compute budget, we can also increase the training time to reduce this kind of false negative.

(d) What will be the total time of finding a good reward based on what kind of device?

Response: All our experiments took less than one day of wall clock time, and each individual experiment can be done on 4 V100 GPUs.


Question/Comment 2: Can this method be used for tasks with diverse objects?

Response 2: Yes, Eureka is a general reward design method that is not specific to any particular object type. As an example, we have demonstrated that Eureka can design effective reward functions for the novel task of pen re-orientation and spinning. The same procedure also discovers effective rewards for re-orienting a cube on a different robot hand (i.e., AllegroHand). In addition to existing evidence in the paper, we note that prior works on in-hand object re-orientation often use the same reward function for diverse objects [1,2], and it is reasonable to assume that Eureka rewards, which work better than one such prior reward, can be applied across objects as well.


Question/Comment 3: Can this method still work with tasks that require images as input?

Response 3: Yes, we can infer state variables relevant to reward definition from vision. State estimation has been effectively applied in many prior works for vision-based robotic learning in real world scenarios [3,4]. While state estimation techniques may not always be accurate in complicated scenes, we believe that this is a challenge facing all real-world policy learning methods, and research progress on these fronts can readily benefit Eureka’s application to real-world tasks. On the other hand, if it is only the final policy that needs to act from visual inputs, we can use Eureka to first design the most effective reward function in simulation and then distill the state-based policy to a vision-based policy; this strategy has been effectively applied in many prior Sim2Real approaches [5].


Question/Comment 4: Can the success/fitness function serve as initialization for Eureka reward search?

Response 4: Yes, this is indeed possible. We have conducted a similar experiment in which we use a human-supplied reward function to initialize Eureka (Section 4.4). Note that supplying the fitness function, which is also the optimization objective for Eureka reward search, is likely to lead to reward functions that can better optimize it. For example, Eureka can choose to use the fitness function to create a success bonus in its reward generations.
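For illustration, a generated reward that folds the fitness function in as a success bonus might look like the following sketch (the function and variable names are illustrative, not an actual Eureka output):

```python
import torch

def reward_with_success_bonus(obj_pos, target_pos, success, bonus_weight=10.0):
    # Hypothetical sketch: a dense distance-based shaping term plus a sparse
    # success bonus taken directly from the task fitness/success function.
    dist_reward = torch.exp(-5.0 * torch.norm(obj_pos - target_pos, dim=-1))
    return dist_reward + bonus_weight * success.float()
```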

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.

Comment

[1] Chen, Tao, Jie Xu, and Pulkit Agrawal. "A system for general in-hand object re-orientation." Conference on Robot Learning. PMLR, 2022.

[2] Qi, Haozhi, et al. "In-hand object rotation via rapid motor adaptation." Conference on Robot Learning. PMLR, 2023.

[3] Gu, Shixiang, et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates." 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017.

[4] Büchler, Dieter, et al. "Learning to play table tennis from scratch using muscular robots." IEEE Transactions on Robotics 38.6 (2022): 3850-3860.

[5] Lee, Joonho, et al. "Learning quadrupedal locomotion over challenging terrain." Science robotics 5.47 (2020)

Comment

Dear reviewer JeMn,

As the open discussion period draws to a close in a few days, we wanted to check back to see whether you have any remaining concerns. We have provided global responses for all reviewers to highlight new results we obtained, including new holistic evaluation of Eureka as well as promising Sim2Real results for real-world robotic control. In addition, we also believe that we have sufficiently responded to your earlier queries on various aspects of this work, and we provide a short summary here for your convenience:

  1. We have provided the computational cost of our algorithm in Appendix D.4 and answered various questions regarding the experimental details of our approach.

  2. We have answered your inquiry about using Eureka on tasks with diverse objects and visual inputs. For the latter, we have added a new Sim2Real result (Appendix F) that points to promising signs for learning a vision-based policy by distilling a policy learned in simulation.

  3. We have described how to use Eureka when the success function is provided as initialization. Section 4.4 in our paper studies a similar setting in which Eureka uses human-written reward functions as initialization for the search and discovers better reward functions.

Please let us know if/how we can address any remaining concerns, and we are grateful for any additional feedback and suggestions!

Best,

Authors

Comment

Thank you for your response; most of my concerns have been addressed, but I still have two questions:

  1. For generalization to diverse objects, I mainly mean generalizing across different objects, as in the "in-hand object re-orientation" example you provided. Although the example you show can generalize to different object instances with one reward, in-hand re-orientation actually has a small exploration space that is much easier to explore compared to dexterous grasping; in that case, it should be much harder to write a reward that can learn a general policy. For example, in [1] the single PPO policy cannot learn generalized grasping well. Thus, I am curious whether your method is capable of writing a reward that improves on this.
  2. For training with images as observations, I actually mean taking raw images as input for training, since in that case it is hard to open many environments due to GPU memory constraints. Or, if the number of training environments is reduced, will the performance be influenced a lot?
Comment

Dear Reviewer JeMn,

Thank you for your response, and we are glad to hear that most of your concerns have been addressed! Here, we respond to your remaining two questions.

Question: Writing reward functions for generalized grasping may be challenging for Eureka.

Response: We agree with the reviewer that generalized grasping is indeed a difficult task, but we believe that Eureka is capable of writing reward functions for generalized grasping. Recent literature has made significant progress in using RL-based approaches for the generalized grasping task; a state-of-art approach [1] has found that an object curriculum combined with a manually-engineered reward function, including object reaching, object touching, object lifting, and moving to target pose reward components, can effectively learn generalized grasping policies. These individual reward components and their combinations have been generated by Eureka in our Dexterity benchmark suite, so we believe that Eureka is indeed capable of generating reward functions that incorporate these components and more given its open-ended search nature. That said, generalized grasping, as shown by [1], typically involves more than end-to-end RL to solve (e.g., an object curriculum), but given that we have demonstrated that Eureka works well with curriculum-based learning in our pen spinning task (Section 4.3), we believe that a curriculum-based approach with Eureka can also tackle the challenge of generalized dexterous grasping. This is an exciting future research direction to pursue, and we thank the reviewer for making this interesting connection!


Question: Will image-based input reduce environment parallelism due to GPU memory constraints and impact training performance?

Response: Yes, the original Dexterity paper has ablated this problem and finds that training with visual inputs reduces the number of parallel simulations from 2048 to 256 on a single RTX 3090 GPU and hurts RL training performance (Figure 8 in [3]). Broadly speaking, the main challenges for direct vision-based training in simulation are the slower simulation speed and the instability of vision-based reinforcement learning. Both issues are generic to simulation-based visual RL and orthogonal to Eureka’s core contributions, and we believe that progress on these fronts will directly enhance Eureka’s applicability as well. For example, a recent benchmark [4] has been able to increase vision-based simulation speed by ten-fold without sacrificing the number of parallel environments. Extending Eureka to directly interface vision-based learning, for instance, by studying what types of reward components are more suitable for vision-based learning, is an exciting future research direction. Finally, we do note that given the current limitations of vision-based simulations, state-of-art dexterous manipulation approaches, such as [1,2], typically involve first training a state-based “teacher” policy and then distilling the policy to a vision-based “student” policy. This is an established and effective pipeline that Eureka can readily support by designing reward functions for learning the state-based teacher policy. We will include this discussion in our new discussion section in Appendix H.

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.

Best,

Authors


[1] Chen, Tao, Jie Xu, and Pulkit Agrawal. "A system for general in-hand object re-orientation." Conference on Robot Learning. PMLR, 2022.

[2] Xu, Yinzhen, et al. "Unidexgrasp: Universal robotic dexterous grasping via learning diverse proposal generation and goal-conditioned policy." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.

[3] Chen, Yuanpei, et al. "Towards human-level bimanual dexterous manipulation with reinforcement learning." Advances in Neural Information Processing Systems 35 (2022): 5150-5163.

[4] Gu, Jiayuan, et al. "Maniskill2: A unified benchmark for generalizable manipulation skills." arXiv preprint arXiv:2302.04659 (2023).

Comment

From my perspective, tasks in these environments usually require a human to spend a lot of time and effort to shape a good reward, and when designing the reward, the human has full access to every part of the environment code and can even discuss with other people. Thus, it is indeed useful that Eureka can design rewards that achieve such results. Compared to the baseline, Eureka also does not require task-specific prompts and can be used for different tasks, which already improves a lot on "generality". Even for a human to design a reward, the human needs to learn some task-specific knowledge for different environments. And since more rigorous experiments have been provided, I would like to raise my score.

Comment

Dear Reviewer JeMn,

Thank you for your improved assessment of our work! As the response period draws to a close, we wanted to thank you again for your time and effort in improving our paper!

Best wishes,

Authors

Official Review
Rating: 5

The paper introduces EUREKA, an LLM-powered method for reward design for reinforcement learning tasks. The proposed method utilizes the zero-shot generation and in-context improvement capabilities of LLMs, enabling evolutionary optimization over reward code without task-specific prompting or pre-defined reward templates. The paper showcases EUREKA's performance against human-engineered rewards in a diverse range of robot tasks. Furthermore, the algorithm is able to integrate with human feedback for further reward improvement, enhancing the quality and safety of generated rewards without the need for model updating.

Strengths

While the idea of this paper is rather simple, it yields a surprisingly good performance, which reflects a well-structured system. Being able to bring an easy idea to such a complete and well-considered system is commendable.

Moreover, this work brings insight to the reward design community by removing the dependency on collecting expert demonstration data. The study suggests that Large Language Models (LLMs) can serve as a cheap alternative to human expert demonstrations for acquiring domain-specific task knowledge.

The paper's presentation is clear; the authors make the content easily comprehensible. Their responsible practice of providing all relevant prompts and code offers an added advantage.

Weaknesses

  1. Unrealistic assumption of access to the environment source code:

The reward code generation in this paper critically depends on having access to the source code of the MDP specification as context for the initial reward proposal. The authors have presented this as a benefit, allowing the LLM to exploit code structure to understand task environments. However, it makes an unrealistic assumption, as most reinforcement learning setups only require access to a black-box simulation.

A significant limitation of this approach is that it may be infeasible for real-world robotic tasks where the transition function may either be unavailable or in different analytical forms. Given the heavy dependence on source code for task environment understanding, this method could be essentially restricted to simulated RL environments only.

  2. Strong assumption on the fitness function F(.):

Another weak point is the strong assumption on the fitness function F(.). The evolutionary search for the LLM generated reward function requires a fitness function capable of assessing the quality of each proposed reward function. In this work, the fitness function F(.) is implicitly assumed to have access to the ground truth reward function to evaluate the induced policies of the proposed reward functions. This limitation implies that the method is applicable only to tasks that come with known ground-truth reward functions. This weakens the method's practical applicability, restricting it mainly to idealized or known environments, hindering its usefulness for real-world, less predictable reinforcement learning tasks.

Questions

While the pen spinning demonstration is impressive, it remains uncertain what is driving the task's success. Is it due to an effective reward design, which is the paper's main focus, or is it a byproduct of unspecified engineering efforts? Section 4.3 is not very clear and leaves room for several pertinent questions:

  1. The paper does not detail how the pre-trained policy was obtained. The statement, "Specifically, we first instruct EUREKA to generate a reward for ... Then, using this pre-trained policy," leaves the readers wondering what exactly "this" refers to.

The application of curriculum learning here appears to break the policy training into two main stages - an initial 'pre-training' stage and a 'fine-tuning' stage with the pre-trained policy. If this interpretation is accurate, clarity around the following is crucial:

  1. Is the training process – both pre-training and fine-tuning stages – guided by the reward functions derived using the LLM-powered reward design method proposed in this paper? Are these reward functions identical?
  • If they are, there needs to be a detailed explanation of the distinctiveness of 'pre-training' and 'fine-tuning' when it comes to the optimization of policies under the same reward function. Why a dual-stage policy optimization can notably enhance policy performance remains under-explained. Additionally, if this is the case, it appears that the success of the task may be due to the dual-stage policy optimization rather than the reward design, casting doubt on the central argument of the paper.

  • If they're not, clarity is needed on how these varying reward functions are generated, and what differences exist between them. Furthermore, how does this dual-stage reward function model vary from the primary algorithm proposed? How to ascertain the number of reward functions sufficient for a given task?

Comment

Dear reviewer XD65,

Thank you for your thoughtful comments and suggestions! Here, we respond to the specific questions and comments the reviewer raises. Please let us know if you have lingering questions and whether we can provide any additional clarifications during the discussion period to improve your rating of our paper.


Question/Comment 1: Unrealistic assumption of access to the environment source codes; reinforcement learning setups only require access to a black-box simulation.

Response 1: We would like to clarify that the problem setting of our work is not reinforcement learning, but rather reward design. This is traditionally done by humans who write and refine reward function code in a trial-and-error fashion in order to induce the desired behavior; Eureka automates this at scale. Its outer loop is a code generation problem, and it is standard to assume access to parts of the source code in order to generate/complete other parts of the code [1]. Furthermore, as we have clarified in our updated manuscript, Eureka's "environment as context" procedure requires only the environment's state and action variables to be exposed. This information can be supplied, for example, via an API without requiring the full environment code; see Example 1 in Appendix E, where we showed that a documentation-style description of the environment variables is sufficient for Eureka to generate effective reward functions. Eureka's inner loop does use RL to evaluate its reward candidates, and this inner loop indeed accesses the environment, equipped with the Eureka-generated reward function, only as a black-box simulation. We hope that this explanation helps clarify the misunderstanding regarding Eureka's use of environment as context and the problem setting it tackles.
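To make this concrete, a documentation-style environment context of the kind mentioned above might look like the following hypothetical example (not the text from Appendix E); only the state and action variables are exposed:

```python
# Hypothetical documentation-style environment context (illustrative only):
#
# Observation space (13 dims):
#   obs[0:3]   object_pos     - 3D position of the manipulated object
#   obs[3:7]   object_rot     - object orientation quaternion (x, y, z, w)
#   obs[7:10]  goal_pos       - 3D position of the goal
#   obs[10:13] fingertip_pos  - 3D position of the robot fingertip
#
# Action space (7 dims):
#   act[0:7]   joint position targets for the 7-DoF arm
```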


Question/Comment 2: Strong Assumptions on fitness function F(); This limitation implies that the method is applicable only to tasks that come with known ground-truth reward functions.

Response 2: We would like to clarify that the ground-truth reward functions are simply sparse task success criteria, which are easy to define in practice. On all our 29 benchmark tasks, they can be written in one line of code (see Appendix B), which is much shorter and simpler than the human-written shaped reward functions that we compare Eureka to. Therefore, we believe that “tasks that come with known ground-truth reward functions” are not difficult to satisfy in practice, and Eureka readily applies to all such tasks with programmatic success conditions. This is a large space of tasks, and we have extensively evaluated a broad spectrum of robotic tasks in our experiments. Therefore, we believe that our approach’s applicability is already broad.
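As a hedged illustration (not copied from Appendix B), such a one-line sparse success criterion for, say, a drawer-opening task could be as simple as:

```python
import torch

def drawer_open_success(drawer_joint_pos, threshold=0.3):
    # Hypothetical one-line sparse fitness function: 1 if the drawer joint has
    # been pulled past the threshold, 0 otherwise (per parallel environment).
    return (drawer_joint_pos > threshold).float()
```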

Furthermore, we do even show results where no fitness function is analytically expressed. In Section 4.4, we have studied cases where the desired behaviors are difficult to express via a mathematical function, such as running in a stable gait. In these cases, we have introduced Eureka from Human Feedback, which allows humans to express the “fitness” feedback in textual form, bypassing an analytical fitness function. We have shown that Eureka is indeed compatible with such a notion of fitness and can effectively generate reward functions that better align with human intention. In our limitation section in the updated manuscript (Section 5), we have also discussed the possibility of using a vision-language model (VLM) to construct automated textual feedback to improve the reward functions, providing a scalable alternative to human-based fitness feedback.

Please let us know whether our response has sufficiently addressed your concern about the fitness function; we are happy to provide more clarification as we believe that this is an important point that we’d like the reviewer to reach an agreement with.


Question/Comment 3: Clarification on how the pre-trained policy was obtained.

Response 3: The pre-training stage is identical to the original Shadow Hand environment in the Isaac Gym benchmark suite we study, except that we swapped out the cube object to the pen object. The application of Eureka for this stage is identical to the benchmark tasks, in which Eureka iteratively discovers better reward functions. The best Eureka reward is then used to train a policy to solve the task; the resulting converged policy from this training is the “pre-trained” policy.


Question/Comment 4: Is the training process – both pre-training and fine-tuning stages – guided by the reward functions derived using the LLM-powered reward design method proposed in this paper? Are these reward functions identical?

Response 4: Yes, during both stages, the policies are trained using the best Eureka-discovered reward function from the pre-training stage. These reward functions are hence identical.

Comment

Question/Comment 5: A detailed explanation of the distinctiveness of ‘pre-training’ and ‘fine-tuning’ should be provided.

Response 5: Thank you for this suggestion; we have updated the manuscript to include more details (Section 4.3 and Appendix D.1). At a high level, the distinction between the two stages lies in the distribution of pen target poses. In both stages, the task is to reorient a pen to a target 3D pose, which is provided as a goal to the policy. When the current target pose is deemed achieved (i.e., the pose difference is lower than a certain threshold), a new target pose is provided. In the pre-training stage, a random pose from the SO(3) 3D orientation group is sampled uniformly. During the fine-tuning stage, the target poses are a predetermined sequence of waypoints that specify the pen spinning patterns to be achieved. When the policy reaches the current waypoint, the target pose switches to the next waypoint in the sequence. If the policy can consecutively reach all the waypoints, then it has accomplished one cycle of the pen spinning pattern. The waypoints relay continuously, allowing the policy to spin the pen for as many cycles as it is capable of until the episode length is reached.
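A minimal sketch of the waypoint-relay logic described above, using illustrative tensor names rather than the authors' implementation:

```python
import torch

def advance_waypoints(pen_rot, waypoint_idx, waypoints, angle_threshold=0.2):
    # Hypothetical sketch of the fine-tuning curriculum: when the pen's current
    # orientation is close enough to the current target waypoint, switch to the
    # next waypoint, wrapping around so the spinning pattern relays continuously.
    target_rot = waypoints[waypoint_idx]          # (num_envs, 4) quaternions
    # Crude orientation error from the quaternion dot product (illustrative only).
    rot_err = 1.0 - torch.abs(torch.sum(pen_rot * target_rot, dim=-1))
    reached = rot_err < angle_threshold
    waypoint_idx = torch.where(reached, (waypoint_idx + 1) % len(waypoints), waypoint_idx)
    return waypoints[waypoint_idx], waypoint_idx
```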

Question/Comment 6: It appears that the success of the pen spinning task may be due to the dual-stage policy optimization rather than the reward design, casting doubts about the central argument of the paper.

Response 6: Our central argument of the paper is not that a good reward function in itself is sufficient to solve all difficult tasks. Rather, we believe that many difficult tasks (such as pen spinning) can be best solved via a curriculum-based solution, but even with a curriculum, reward design can bottleneck policy learning. To empirically validate this point, we have conducted an ablation in which we fix the dual-stage curriculum as in the paper but replace the Eureka reward with a simple reward function that rewards the policy proportionally to the negative distance between the pen's current pose and the target pose. The resulting policy cannot solve the pre-training stage competently, often dropping the pen with unnatural and jerky hand motion. Therefore, we have demonstrated that the success of the pen spinning task is due to neither dual-stage optimization nor the Eureka reward alone, but rather their combination.
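For reference, the simple baseline reward in this ablation amounts to something like the following sketch (names are illustrative):

```python
import torch

def naive_distance_reward(pen_pos, target_pos, pen_rot, target_rot):
    # Hypothetical sketch of the ablation baseline: reward proportional to the
    # negative distance between the pen's current pose and the target pose.
    pos_dist = torch.norm(pen_pos - target_pos, dim=-1)
    rot_dist = torch.norm(pen_rot - target_rot, dim=-1)  # crude orientation distance
    return -(pos_dist + rot_dist)
```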

On the higher-order bits, our experiment shows that Eureka can be combined with other generally applicable paradigms like curriculum learning, affirming its flexible, general-purpose nature complementary to other approaches. This experiment is a demonstration of this flexibility. Finally, we would like to reiterate that all other experiments in the paper do not use a curriculum, so the central claim about the effectiveness of automated reward design at scale is well supported in our paper.

Finally, we thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.


[1] Roziere, Baptiste, et al. "Code llama: Open foundation models for code." arXiv preprint arXiv:2308.12950 (2023).

[2] Akkaya, Ilge, et al. "Solving rubik's cube with a robot hand." arXiv preprint arXiv:1910.07113 (2019).

[3] Berner, Christopher, et al. "Dota 2 with large scale deep reinforcement learning." arXiv preprint arXiv:1912.06680 (2019).

[4] Lee, Joonho, et al. "Learning quadrupedal locomotion over challenging terrain." Science robotics 5.47 (2020)

Comment

Dear reviewer XD65,

As the open discussion period draws to a close in a few days, we wanted to check back to see whether you have any remaining concerns. We have provided global responses for all reviewers to highlight new results we obtained, including new holistic evaluation of Eureka as well as promising Sim2Real results for real-world robotic control. In addition, we also believe that we have sufficiently responded to your earlier queries on various aspects of this work, and we provide a short summary here for your convenience:

  1. We have clarified that our algorithm does not require access to environment source code or the environment dynamics; instead, it requires only knowledge about the environment observation and action variables, which are minimal for writing reward functions in general. Furthermore, we have added an experiment on Mujoco Humanoid in Appendix E, where we have shown that documentation-style description of the environment variables is sufficient for Eureka to generate effective reward functions.

  2. We have clarified that the task fitness functions are simply one-line success functions that are easy to write and accessible in a large number of robotic tasks. Furthermore, we have demonstrated in Section 4.4 how Eureka can use human feedback to learn tasks that do not easily admit analytic task fitness functions.

  3. We have added detailed explanations for various components of our pen spinning experiments. In particular, we have demonstrated that Eureka reward design is necessary and complements the two-stage curriculum as a new baseline that replaces the Eureka reward with a manual engineered reward cannot solve the task.

Please let us know if/how we can address any remaining concerns, and we are grateful for any additional feedback and suggestions!

Best,

Authors

Comment

Dear authors,

Thank you for the detailed response! They have clarified some of my doubts. I'll keep my score for now, as the assumptions in your version of "known environments" and "known sparse rewards" are too strong for work on reward design.

Comment

Dear Reviewer XD65,

Thank you for your response! We are glad to hear that most of your concerns have been addressed. Here, we would like to clarify the assumptions on “known environments” and “known sparse rewards”.


First, our assumption on “known environments” is very mild – Eureka simply requires knowing what the environment state and action variables are (Section 3.1); this is the minimal knowledge required for reward design as reward functions are functions of environment state and action variables. In practice, we find a simple way to access this is by using the environment’s raw observation code. This kind of code is readily available for practical reward design scenarios, such as Sim2Real (as Reviewer JeMn also points out in their comment), where the designer constructs the environment source code and has access to the observation code. In Appendix E, we have also shown that when the source code is not available, a documentation-style description of the environment variables also suffices. Therefore, our "environment as context" procedure is easy to perform in practice.


Second, “known sparse rewards”, or task fitness function, is not a new assumption imposed by our method or introduced in our work; rather, it is a part of the original Reward Design problem definition introduced by [1] and widely adopted in subsequent works in reward design [2,3,4,5], including ours (Definition 2.1 in our paper). [5] specifically considers tasks where the fitness function takes the form of “a goal-based reward function where [the reward is non-zero] only on goal states”, which shows that sparse reward as a task fitness function is a valid formulation that has been accepted by the community. Our assumption on known sparse rewards as task fitness functions is no stronger than those in prior literature. As we have shown in our paper (Appendix B) and responses, they are typically one line of code (which Eureka does not have white-box access to) that is easy to define, and the community already knows how to write them for a large space of robotic tasks. For example, in the MetaWorld [6] benchmark, the sparse reward functions for 50 distinct manipulation tasks are all indicator functions (Table 12 in [6]) that share almost identical forms.

Furthermore, in Section 4.4 of our paper, we have considered cases when the task fitness function is hard to specify analytically. There, we have shown that human textual feedback can be used as a fitness function to steer reward generation for behavior that is hard to express mathematically. This is a novel reward design capability enabled by our LLM-based approach. Finally, in our new limitation section in Appendix H, we have also considered the possibility of using vision-language models (VLMs) to automate vision-based fitness feedback without human intervention. In summary, our version of task fitness function is consistent and supported by prior works, and we in fact have taken steps to make it more flexible by enabling a non-programmatic notion of task fitness.

Please let us know whether your concerns over the assumptions have been addressed; we are happy to provide additional clarifications during the remaining response period to improve your evaluation of our paper!

Best,

Authors


[1] Singh, Satinder, et al. “Where does reward come from." Proceedings of the Annual Conference of the Cognitive Science Society (CogSci), 2009.

[2] Niekum, Scott, et al. "Genetic programming for reward function search." IEEE Transactions on Autonomous Mental Development 2.2 (2010): 83-90.

[3] Hadfield-Menell, Dylan, et al. "Inverse reward design." Advances in neural information processing systems 30 (2017).

[4] Faust, Aleksandra, Anthony Francis, and Dar Mehta. "Evolving rewards to automate reinforcement learning." arXiv preprint arXiv:1905.07628 (2019).

[5] Devidze, Rati, et al., “Explicable Reward Design for Reinforcement Learning Agents.” Advances in neural information processing systems 34 (2021).

[6] Yu, Tianhe, et al. "Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning." Conference on robot learning. PMLR, 2020.

Comment

Dear Reviewer XD65,

As the response period draws to a close, we wanted to thank you again for your time and effort in improving our paper! We have also responded to your last remaining concerns in the above message and hope that they are now adequately addressed.

Best wishes,

Authors

Comment

I am actually curious about the doubts regarding the assumptions. From my perspective, for example, when a human decides to tackle a new task following an RL-based sim2real pipeline [1], the human might first build an environment in simulation; in that case, the human also needs to write a reward function. So it seems reasonable to me to assume knowledge of the environment and sparse rewards.

[1] Chen, Yuanpei, et al. "Sequential Dexterity: Chaining Dexterous Policies for Long-Horizon Manipulation." 7th Annual Conference on Robot Learning. 2023.

Official Review
Rating: 6

The main idea of the paper is to have an LLM provide the reward function for RL problems. To achieve this, the LLM gets to see the code of the environment along with a plain-text description of the task. The task for the LLM is to provide sample code for reward functions, which are in turn improved using an evolutionary search. The experiments section of the paper goes through a number of environments, where performance is shown to be better than with human-designed reward functions. For the examples provided, Eureka does not use task-specific prompts.

Strengths

I love the idea of using an LLM to provide initial versions of the reward functions, and to then improve them using evolutionary search. Moreover, the evaluation shows that the approach can deal with challenging environments, leading to good solutions or to solutions for problems that have not been solved before. The work is also well motivated and could potentially lead to interesting advances in RL itself; it would be quite interesting to see this published and available for further research.

The work already contains inputs/outputs of the approach in the appendix, and code and prompts are expected to be released.

Weaknesses

While the paper does a great job of selling the idea, there is a frustrating lack of technical detail in the main part of the paper. One example to illustrate this problem: the subsection on evolutionary search provides no detail on the exact inputs and outputs, or on the specific method being used. This is a core aspect of the proposed approach and would require more detail to be understandable. I understand some parts of this appear in the appendix or will be clear from the code release, but the main part should make the core parts of the approach clearer.

In the Eureka algorithm description, it is a bit unclear to me what the reward reflection part does; for example, the algorithm specifies that the reward functions and their evaluations are provided as inputs, but the text also talks about the choices of RL method that are important for this part. There is only little information in the text that tells me how it works.

Similarly, I like the idea of "environment as context", but it would be good to know what is considered here as the environment (what does it mean), for example to what level of detail the environment needs to be described / down to what level you have to go. The appendix describes that, due to limited context length, only the observation context will be used, but for simulators other than Isaac, what information do you expect you would need to provide for this to work?

This could be connected with a missing discussion of limitations of the approach, for example do you expect this approach to be useful when you do not have full access to the code of the environment but maybe just some API, or if the environment is the real world.

Maybe more philosophically, I am also not quite sure about the classification of the generation as "zero-shot", as it is unclear what the LLM has actually seen during training; it would be interesting to see further investigations of this (not necessarily in this work) and of the capabilities of transferring domain knowledge - the release of benchmarks after the cut-off doesn't necessarily mean there was no environment like this before.

Most of the above weaknesses impact the presentation of the work; while formal presentation of the work is good overall, Figure 2 and Figure 3 could be improved for contrast and font sizes.

Questions

My main questions are around the limitations mentioned above and the types of problems this is expected to work well for (and where you see it not working). To what extent does the approach make use of information available at training time (e.g., about the simulator or environments)? More directly about the approach, I would find it interesting to hear about approximate run times from beginning to end for some of the examples.

Comment

Dear reviewer hi7d,

Thank you for your thoughtful comments and suggestions! Here, we respond to the specific questions and comments the reviewer raises. Please let us know if you have lingering questions and whether we can provide any additional clarifications during the discussion period to improve your rating of our paper.


Question/Comment 1: The main part of the paper lacks technical details. For example, in the subsection on evolutionary search, what is the exact input, outputs, or about the specific method being used?

Response 1: We have revised our manuscript to clarify the technical approach in Section 3 and added pointers to the Appendix for specific prompts and examples when relevant. Please let us know whether there are additional places you’d like us to add details.

In the context of an evolutionary search iteration, the inputs are (a) the best reward function from the previous iteration, (b) its reward reflection (Section 3.3), and (c) the prompt instruction for how to mutate the provided reward function given its reward reflection feedback. These inputs form the prompt to the LLM, which then generates K i.i.d. outputs, corresponding to K reward samples.
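A minimal sketch of this loop under the description above; `llm`, `train_rl`, and `reflect` are placeholder callables, not the authors' API:

```python
def eureka_search(llm, env_context, task_prompt, mutation_instruction,
                  train_rl, reflect, num_iterations=5, K=16):
    # Hypothetical sketch of the evolutionary search described above.
    #   llm(prompt)    -> reward-function code (string)
    #   train_rl(code) -> (task fitness score, per-component training statistics)
    #   reflect(stats) -> textual reward reflection of the reward components
    best_code, best_reflection, best_score = "", "", float("-inf")
    for _ in range(num_iterations):
        # First iteration: zero-shot generation from the environment context and
        # task description. Later iterations: mutate the previous best reward
        # given its reward reflection.
        parts = [env_context, task_prompt]
        if best_code:
            parts += [best_code, best_reflection, mutation_instruction]
        prompt = "\n\n".join(parts)
        candidates = [llm(prompt) for _ in range(K)]   # K i.i.d. reward samples
        for code in candidates:
            score, stats = train_rl(code)              # inner-loop RL evaluation
            if score > best_score:
                best_code, best_score = code, score
                best_reflection = reflect(stats)
    return best_code
```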


Question/Comment 2: Clarifications on reward reflection. For example, the algorithm specifies the reward functions and their evaluations are provided as inputs, but the text also talks about the choices of RL method that are important for this part.

Response 2: Thanks for this suggestion! We have updated our manuscript to point to concrete reward reflection examples in App. G.1. The reviewer is correct in that reward reflection is a textual evaluation of reward function (e.g., summarizing the training dynamics of its components). The subsequent discussion about the choices of RL method is meant to provide intuitive justification as to why reward reflection is helpful. We have improved the writing to more cleanly separate the procedure itself from its motivations. Please let us know if there is any remaining clarification we should include.
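A hypothetical `reflect` helper in the spirit of the description above (and matching the placeholder in the previous sketch) could simply render the logged per-component values as text:

```python
def reflect(stats):
    # Hypothetical sketch: `stats` maps each reward component (and the task
    # fitness) to a list of values logged at policy checkpoints; the reward
    # reflection is just a textual summary of these training dynamics.
    lines = []
    for name, values in stats.items():
        curve = ", ".join(f"{v:.2f}" for v in values)
        lines.append(f"{name}: [{curve}]")
    return "\n".join(lines)
```

For example, `reflect({"task_fitness": [0.0, 0.2, 0.6], "dist_reward": [0.1, 0.5, 0.8]})` would produce two lines summarizing how each term evolved over training checkpoints.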


Question/Comment 3: For “environment as context”, what level of detail does the environment need to be described. For simulators different from Isaac, what information do you expect you need to provide for this to work.

Response 3: “Environment as context” requires only the state and action variables in the environment to be exposed in the source code; in practice, as described in Appendix D, we have an automatic script to extract just the observation function from all raw environment source code. Note that our prompts and the observation source code do not reveal the identity of the simulator, so Eureka is already general, and we expect the same level of information, namely the names of the state and action variables in the environment, to be needed for other simulators. To empirically validate this claim, we have also added an experiment testing Eureka on the Mujoco Humanoid environment; its observation code is included in Appendix E. As seen, the Mujoco variant exposes variables in a comment block whereas the Isaac Gym variant exposes them as class attributes, but since both contain sufficient state and action information, Eureka is able to competently generate effective reward functions in both cases, showcasing its generality.
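A sketch of what such an extraction script might look like, assuming the environment class defines its observation logic in a method such as `compute_observations` (the method name is an assumption, not a documented interface):

```python
import ast

def extract_observation_code(env_source: str, method_name: str = "compute_observations"):
    # Hypothetical sketch: parse the environment source file and return only the
    # source of the observation method, so the LLM context exposes the state and
    # action variables without the rest of the simulator code.
    tree = ast.parse(env_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == method_name:
            return ast.get_source_segment(env_source, node)
    return None
```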


Question/Comment 4: A limitation section should be included. Do you expect the approach to be useful when we do not have full access to the code of the environment, or if the environment is the real world.

Response 4: Thanks for this suggestion! We have included a limitation section in our updated manuscript that details Eureka’s current limitations and several future work directions to overcome them.

We do expect the approach to be useful when we do not have full access to the code of the environment. As discussed in the previous question, Eureka only needs to know the state and action variables of the environment in its context to generate reward functions; this information could be automatically extracted (as in our current approach) or accessed via an API (as the reviewer suggests) without access to the full environment code.

If the environment is real world, as we expanded in our new limitation section, we believe that Sim2Real approaches are particularly promising. This approach allows Eureka to readily combine with mature Sim2Real techniques to enable real-world transfer of policies learned in simulation. Another approach is to use state-of-art state estimation techniques to construct a symbolic representation of the real-world environment to allow Eureka to define reward functions. State-estimation based solutions have been attempted by many other works targeted at learning and deploying in the real world [1,2].

Comment

Question/Comment 5: Not sure whether the classification of the generation as “zero-shot” is accurate. The release of benchmarks after the cut-off doesn’t necessarily mean there was no environment like this before.

Response 5: We would like to clarify that “zero-shot” in our context means that the generation is performed without any examples in the prompt [3]. We believe that this is a standard usage of the term in the LLM literature and is meant to distinguish from the common “few-shot prompting” where several input-output examples are provided to guide the LLM generation. In other words, whether a generation is “zero-shot” is determined by the lack of examples in the prompt and not the composition of the training dataset. We have updated our manuscript to clarify the intended meaning in our context.

With regard to whether the backbone LLM, GPT-4, has seen similar environments in the training data, we agree with the reviewer that this possibility cannot be ruled out. However, we have tried our best to select only benchmarks after the cut-off date to ensure that at least the identical environments are unseen by GPT-4. Furthermore, to the best of our knowledge, most of the bi-manual Dexterity tasks are novel and no prior work has studied those particular tasks. Therefore, we do believe that Eureka has demonstrated capabilities of transferring general reward design knowledge across domains. This is corroborated by our reward correlation analysis in Section 4.3, which finds Eureka to discover novel reward functions compared to the original human-written ones.

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.


[1] Smith, Laura, Ilya Kostrikov, and Sergey Levine. "A walk in the park: Learning to walk in 20 minutes with model-free reinforcement learning." arXiv preprint arXiv:2208.07860 (2022).

[2] Handa, Ankur, et al. "Dextreme: Transfer of agile in-hand manipulation from simulation to reality." 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023.

[3] Kojima, Takeshi, et al. "Large language models are zero-shot reasoners." Advances in neural information processing systems 35 (2022): 22199-22213.

Comment

Dear reviewer hi7d,

As the open discussion period draws to a close in a few days, we wanted to check back to see whether you have any remaining concerns. We have provided global responses for all reviewers to highlight new results we obtained, including new holistic evaluation of Eureka as well as promising Sim2Real results for real-world robotic control. In addition, we also believe that we have sufficiently responded to your earlier queries on various aspects of this work, and we provide a short summary here for your convenience:

  1. We have extensively improved our technical writing (Section 3) to clearly explain important algorithmic components; we have also added the computation resources for our algorithm in Appendix D.4.

  2. We have included a limitation section in Appendix H.

  3. We have added a sim2real experiment in Appendix F to demonstrate Eureka’s applicability for real-world robotic control.

Please let us know if/how we can address any remaining concerns, and we are grateful for any additional feedback and suggestions.

Best,

Authors

Comment

Dear authors, Thank you for clarifying my questions. I am satisfied that the additions address my concerns.

Comment

Dear Reviewer hi7d,

We are glad to hear that your concerns regarding our paper have been addressed. As the response period draws to a close, we wanted to thank you again for your time and effort in improving our paper!

Best wishes,

Authors

Official Review
Rating: 6

The paper presents a novel approach to leveraging LLMs to plan/control complex low-level manipulation tasks in the form of an LLM-powered reward function generator. Its generation algorithm performs iterative improvements over an initially designed reward function, without task-specific prompting or few-shot examples, by automatically computing a fine-grained textual fitness of a policy over a sequence of executions on a target RL environment and using the LLM to improve the code. The presented approach is evaluated in varied RL environments, and experimental results show it outperforms RL-expert humans on the majority of tasks. Moreover, the authors provide a case study of how the proposed approach enables higher-complexity tasks where manual reward design is challenging.

Strengths

The submitted manuscript is very well written and presents a novel and interesting approach to automatically generate reward functions for simulated RL environments, which seemingly could be applied to different scenarios.

It presents a clever approach to leveraging recent LLMs' zero-shot code generation ability to both understand a simulation environment and to iteratively improve generated reward functions that would be hard to manually author and tune.

Moreover, the described evolutionary search and reward reflection components of the approach, while not groundbreaking, provide interesting insights into the problem and into interacting better with LLMs for code generation.

Weaknesses

One of the main weaknesses of the submitted paper is the lack of a Limitations section/discussion, or such discussion throughout the text. While the authors claim the generality of Eureka, the proposed approach has only been evaluated on a single base simulator (Isaac Gym) and with a fixed RL algorithm. In other words, the claim seems to be overstated.

Another weakness is the experimental part: while the submitted text showcases different (and relevant) comparisons with human results, the human rewards are zero-shot and not tuned over many RL trials to further improve performance. Therefore, I believe the comparison may be unfair. If the human rewards in this baseline were tuned (e.g., searching the weights of different reward terms) and RL were trained for many trials (the same cost as the evolutionary search in Eureka), some claims may not hold.

A specific discussion I missed was about how the proposed approach handles the difference between optimizing for the internals of the simulation vs. its sensing/acting interface. The former should be avoided in any general approach. The authors claim to use an "automatic script to extract just the observation portion of the environment source code", but this does not necessarily guarantee that no internals leak, or that the abstraction level of such observation code does not itself leak details.

Moreover, as the proposed approach depends on feeding the environment code to the LLM, beyond just claiming to use "the observation portion of the environment", I believe a more in-depth discussion is needed on how Eureka could be adapted to a) more complex environments, which may be too large for the model context windows; and b) scenarios of interaction with the real world (actual robot control).

Particularly for a), this is a critically important discussion. E.g., What would be the impact on the pen spinning demo with more detailed material characteristics and physics (friction, inertia, actuator latencies, etc.)?

The authors also claim that only a "handful" of LLM outputs is enough. However, 16 is hardly a handful (<5). Intuitively, the number of samples needed to obtain one fully executable policy will grow in proportion to simulator and environment complexity. However, again, there is no discussion of such limitations of the approach.

Questions

In view of the claim of generality, how to qualify/quantify the impact of different simulation platforms (and level of physics detail) or RL algorithms? Please also comment on the scenario of interfacing with the real world.

Comment

Dear reviewer DCBe,

Thank you for your thoughtful comments and suggestions! Here, we respond to the specific questions and comments the reviewer raises. Please let us know if you have lingering questions and whether we can provide any additional clarifications during the discussion period to improve your assessment and rating of our paper.

Question/Comment 1: Human rewards are zero-shot and not tuned for many RL trials to further improve the performance. The comparison to human rewards may be unfair.

Response 1:

First, there are good reasons to believe that the human reward functions are representative of expert-level human reward functions and are not “zero-shot”. Given that they are written by RL researchers who designed the benchmarks, we hypothesize that they are likely carefully tuned to ensure that they are optimizable by their choice of reinforcement learning algorithms – so that the benchmarks are solvable; we discuss this aspect in Section 4.1. In particular, the authors of the bi-manual Dexterity benchmark wrote the following excerpt in Appendix 2.1 of their paper, confirming judicious and iterative reward shaping, as per common practice:

“Designing a reward function is very important for an RL task… because the scenarios of each task are different, the hyperparameters of the reward function will inevitably be different. We have tried our best to avoid manual reward shaping for each task provided that RL can be successfully trained.”

This excerpt confirms that their reward functions are not zero-shot. Furthermore, their codebase also has commented-out reward snippets (e.g., https://github.com/PKU-MARL/DexterousHands/bidexhands/tasks/shadow_hand_block_stack.py#L1456, https://github.com/PKU-MARL/DexterousHands/bidexhands/tasks/shadow_hand_kettle.py#L1488), which are likely prior attempted reward functions/components that did not work as well.

Given this and the fact that we cannot control for how the original human reward engineering took place, we believe that our experiments are a valuable and meaningful comparison of Eureka’s reward generation capability against expert humans.

Second, we note that the purpose of comparing against the official human reward functions is to be able to ground the quality of Eureka reward functions against some pre-existing reward functions that we know to be competent. It is certainly possible to come up with better reward functions to compare against, but we instead believe that the interesting scientific finding is rather the fact that Eureka is capable of improving reward functions over multiple iterations without manual prescription. It is this improvement capability that enables Eureka reward functions to eventually surpass human reward functions for various tasks despite the inferior quality of Eureka’s initial reward generations (Iteration 1 in Figure 1).


Question/Comment 2: A baseline that searches the weights for different reward terms in the human rewards should be considered.

Response 2: Thank you for this suggestion. We have implemented this baseline for Humanoid in the Isaac suite. More specifically, we choose 4 components from the respective human reward functions and grid-search for the best weight combination, with 3 possible values for each weight: 0.1x, 1x, and 10x the original value. Therefore, we search over 81 tuned human reward functions (80 new ones), making the comparison fair to Eureka in terms of the total reward candidates searched. The results are below:

Task     | Eureka | Human | Human (Tuned)
Humanoid | 8.24   | 6.28  | 7.47

As shown, while Human (Tuned) does improve, the relative ordering between Eureka and Human does not change. The relatively small gap between Human and Human (Tuned) also suggests that the official human reward function is reasonably well-tuned and competent. Finally, we note that this is an unfairly advantaged baseline to Eureka because Eureka is meant to design a reward function from scratch (hence a reward design algorithm) whereas this baseline tunes an existing reward function. Even then, given the same reward sample budget, Eureka’s best reward generated from scratch outperforms the best tuned human reward function.
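For reference, the tuned-human baseline described above amounts to a simple multiplier grid search; a sketch with a placeholder `evaluate` call standing in for RL training plus fitness evaluation:

```python
import itertools

def tune_human_reward(base_weights, evaluate, multipliers=(0.1, 1.0, 10.0)):
    # Hypothetical sketch of the Human (Tuned) baseline: try every multiplier
    # combination over the chosen reward components (3^4 = 81 combinations for
    # 4 components) and keep the best-scoring weight setting. `evaluate` is a
    # placeholder that trains an RL policy with the given weights and returns
    # the task fitness score.
    names = list(base_weights)
    best_score, best_weights = float("-inf"), dict(base_weights)
    for combo in itertools.product(multipliers, repeat=len(names)):
        weights = {n: base_weights[n] * m for n, m in zip(names, combo)}
        score = evaluate(weights)
        if score > best_score:
            best_score, best_weights = score, weights
    return best_weights, best_score
```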


Question/Comment 3: Discussion on whether the proposed approach handles the difference between optimizing for the internals of the simulation vs its sensing/acting interface.

Response 3: Thank you for this suggestion. We have confirmed that our observation code does not leak details about the simulation internals; an example of extracted observation code is in Appendix D (Example 1). In general, we have also revised our manuscript to include a more concrete description of our “environment as context” procedure to clarify that observation code needs to expose only usable state variables in the environment for our proposed approach to work well.

Comment

Question/Comment 4: A more in-depth discussion is needed on how Eureka could be adapted to more complex environments, which may be too large for the model context windows.

Response 4: Thank you for this suggestion. We note that Eureka has already been demonstrated on very complex environments such as all tasks in the bi-manual Dexterity benchmark; these environments have state dimensions larger than 400, with observation source code that spans more than 200 lines each. This source code is indeed quite long relative to the model context length; this is partially why we use an automatic script to extract just the observation portion of the environment source code, which ensures that all environments, even the very complex ones, fit within the context length. Given that Eureka really just needs to know the environment's state and action spaces, we believe that this information is of reasonable token length even for very large environments. As state-of-the-art LLMs have increasingly long context lengths, we do not foresee the context length becoming a bottleneck. That said, we do acknowledge the inference cost associated with LLMs and believe that more efficient LLM inference methods should be considered in the future.
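
As an illustration of this extraction step, a minimal sketch is given below; it assumes the Isaac Gym convention of a `compute_observations` method and is not the exact script used in our pipeline.

```python
# Hypothetical sketch: keep only the observation-related portion of an
# environment source file so that even very large environments fit in the
# LLM context window. Assumes the Isaac Gym convention of a
# compute_observations() method; the real extraction script may differ.
import ast
import textwrap

def extract_observation_code(env_source: str) -> str:
    """Return the source of compute_observations(), or "" if not found."""
    tree = ast.parse(env_source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "compute_observations":
            segment = ast.get_source_segment(env_source, node, padded=True)
            return textwrap.dedent(segment)
    return ""  # fall back to a hand-written summary of observation variables
```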


Question/Comment 5: A more in-depth discussion is needed on how Eureka could be adapted to real-world robotic control.

Response 5: Thank you for this suggestion. We have detailed in our new limitation section (Appendix H) on how to extend Eureka for real-world robotic control. Specifically, one approach is to combine Eureka and Sim2Real algorithms to first learn a policy in simulation using Eureka reward and then transfer it to the real world. Another approach is to use state estimation techniques to directly extract a symbolic state space for the real-world environment, for which reward functions can be defined. There is extensive literature in both categories [3,4,5,6], and we believe that real-world robotic control with Eureka is a very promising and interesting future direction. That said, demonstrating it is outside the scope of this work as many challenges there, such as state estimation, sample efficiency, and closing Sim2Real gap are general to running RL in the real-world and not particular to Eureka.


Question/Comment 6: What would be the impact on the pen spinning results with more detailed material characteristics and physics (friction, inertia, actuator latencies).

Response 6: We have included additional details of our pen spinning experiment in Section 4.3 and Appendix D.1. In short, our pen spinning environment builds on the original Isaac Gym ShadowHand environment, which is carefully designed to mirror the OpenAI Shadow Hand Rubik’s cube environment that has enabled real-world transfer (Section 6.4 in [1]). Besides substituting the pen object (an included but unused asset in the official Isaac Gym repository) for the original cube object, we did not change any physics parameters to make the task easier or less realistic. Therefore, we believe that our pen spinning environment is a good demonstration of Eureka’s capability to scale up to very difficult tasks.


Question/Comment 7: 16 is not a handful. Intuitively, the number of samples to obtain one fully executable policy will grow in proportion to simulator and environment complexity. There is no study on the number of samples required to obtain one fully executable policy.

Response 7: Thank you for this suggestion. We have removed the confusing wording from the updated manuscript and included a discussion on the number of samples. Our rationale for using 16 samples for all environments is to demonstrate that a single set of hyperparameters can be effective across a wide spectrum of task complexity. Our task suite ranges from simple ones such as CartPole and Ant to high-dimensional environments such as the bi-manual Dexterity tasks, which have 400+ state dimensions and significantly more complex observation code. Therefore, while it is certainly true that the number of samples required will grow in proportion to environment complexity, we have already demonstrated that a reasonable number of 16 is sufficient even for very difficult tasks. We note that in the code generation literature, it is common to repeatedly generate many more samples, such as 100 samples in the seminal Codex [2] work, in order to obtain a good code outcome. Therefore, we believe our choice of 16 samples is moderate and reasonable in practice, and experimenting with fewer samples may not be necessary.

Comment

Question/Comment 8: The generality of Eureka may be overstated because it has been evaluated on a single simulator and with a fixed RL algorithm.

Response 8: We have tested Eureka on the Mujoco Humanoid environment. The comparison against the official human-written reward function is below and reproduced in Appendix E. Despite a vastly different observation space and code syntax, Eureka remains effective and generates reward functions that outperform the official human reward function. This is perhaps not surprising: since information about the simulator and the RL algorithm is abstracted away from Eureka’s reward generation pipeline, there is no strong prior reason to expect that Eureka will not work on a new simulator or RL algorithm. Given that the observation code for the Mujoco and Isaac Gym Humanoid environments is written very differently, this experiment demonstrates Eureka’s generality with regard to code syntax and physics simulation.

| Task | Eureka | Human |
| --- | --- | --- |
| Mujoco Humanoid | 7.68 | 5.92 |

Overall, evaluating more simulators or algorithms is of course desirable, but we believe that doing so is complementary and orthogonal to our paper’s main contributions. In our paper, we claim generality only over the diversity and complexity of robots and tasks, and we believe this is the appropriate axis along which to evaluate automated reward design algorithms such as Eureka. Information about the task (i.e., the task description) and the robot (in the form of its observation code) is directly exposed to Eureka, so generality over these two axes is naturally desirable and should be rigorously examined. In our paper, we have indeed demonstrated that a single set of prompts and search hyperparameters is sufficient for 29 tasks that span 10 distinct robots. Given that the observation code for the 29 tasks varies significantly in terms of length, syntactic style, and complexity, we believe our existing experiments do affirm Eureka’s generality. Furthermore, given that Isaac Gym is a widely used benchmark in the community and has enabled many real-world transfer results [3,4], we believe that the results we obtain are promising for real-world policy deployment in the future.

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.


[1] Makoviychuk, Viktor, et al. "Isaac gym: High performance gpu-based physics simulation for robot learning." arXiv preprint arXiv:2108.10470 (2021).

[2] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021).

[3] Rudin, Nikita, et al. "Learning to walk in minutes using massively parallel deep reinforcement learning." Conference on Robot Learning. PMLR, 2022.

[4] Handa, Ankur, et al. "Dextreme: Transfer of agile in-hand manipulation from simulation to reality." 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023.

[5] Gu, Shixiang, et al. "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates." 2017 IEEE international conference on robotics and automation (ICRA). IEEE, 2017.

[6] Büchler, Dieter, et al. "Learning to play table tennis from scratch using muscular robots." IEEE Transactions on Robotics 38.6 (2022): 3850-3860.

Comment

Dear reviewer DCBe,

As the open discussion period draws to a close in a few days, we wanted to check back to see whether you have any remaining concerns. We have provided global responses for all reviewers to highlight new results we obtained, including new holistic evaluation of Eureka as well as promising Sim2Real results for real-world robotic control. In addition, we also believe that we have sufficiently responded to your earlier queries on various aspects of this work, and we provide a short summary here for your convenience:

  1. We have addressed your concern on evaluation in Isaac Gym, validity of simulation focused evaluation, and feasibility of real-world robotic control by including an evaluation on a Mujoco environment (Appendix E) and a real-world Sim2Real result (Appendix F).

  2. We have clarified the high-quality nature of the human reward functions and showed that tuning the human reward functions does not match Eureka despite the fact that Eureka generates reward functions from scratch.

  3. We have clarified that Eureka requires only knowledge about the environment observation and action variables and does not use knowledge of the simulation internals; our new Mujoco experiment (Appendix E) provides a compelling example where we show a documentation-style description of the environment variables is sufficient for Eureka to generate effective reward functions

Please let us know if/how we can address any remaining concerns, and we are grateful for any additional feedback and suggestions!

Best,

Authors

Comment

A quick question before I make the final decision. What is the error bar of the Human (Tuned)? Can Eureka in fact produce reward functions that are statistically significantly better than Humans (Tuned)?

Comment

Dear reviewer DCBe,

The error bars for the Eureka vs. Human (Tuned) comparison on the Humanoid task are shown below; the standard deviation is computed over 5 independent RL runs. The Eureka reward function’s improvement over the Human (Tuned) reward function is statistically significant according to an independent-sample t-test with p = 0.0002. Note that in Figure 12, we established a statistically significant improvement over Human using a Mann-Whitney U-test because there we have samples from different tasks that may not come from the same distribution. Running the U-test on the comparison here also gives a statistically significant improvement with p = 0.007.

| Task | Eureka | Human (Tuned) |
| --- | --- | --- |
| Humanoid | 8.24 ± 0.05 | 7.47 ± 0.12 |
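
For reference, both tests can be reproduced with SciPy along the lines of the sketch below; the per-run scores shown are hypothetical placeholders standing in for the five evaluation runs.

```python
# Hypothetical sketch of the two significance tests reported above; the
# per-run scores are placeholder values standing in for the 5 RL evaluation runs.
import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

eureka_runs = np.array([8.30, 8.21, 8.25, 8.19, 8.25])  # placeholders
tuned_runs = np.array([7.60, 7.35, 7.50, 7.42, 7.48])   # placeholders

t_stat, p_ttest = ttest_ind(eureka_runs, tuned_runs)  # independent-sample t-test
u_stat, p_utest = mannwhitneyu(eureka_runs, tuned_runs, alternative="two-sided")
print(f"t-test p = {p_ttest:.4f}, Mann-Whitney U-test p = {p_utest:.4f}")
```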

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.

Best,

Authors

Comment

I am quite hesitant about making a decision because I've noticed in Figure 4 that the gap between EUREKA and human is 55/45, which I believe indicates a statistical difference. However, considering that the performance of human tuning later increases, I suspect that the gap between EUREKA and human (tuned) might not be significant across the 20 tasks, implying that the human reward still has potential. Of course, I am not currently asking the authors to conduct additional experiments, but I hope the authors can provide some extra explanations to address whether my concerns are reasonable. In any case, my stance is neutral at the moment, and I will follow the consensus reached through discussion with the other reviewers regarding the decision.

Comment

Dear Reviewer DCBe,

Thank you for your response! Your concerns are certainly reasonable, and here we provide additional explanations to address your stated concerns over comparison to Human (Tuned). To this end, we highlight a few fundamental advantages Eureka has over the approach of tuning the weights within an existing reward function (Reward Tuning for short in this response), such as Human (Tuned).

First, Reward Tuning is only applicable when a base reward function exists in the first place. In contrast, Eureka is capable of generating reward functions from scratch. This distinction is of practical importance. In many practical scenarios, such as Sim2Real, the designer needs to create a simulation environment from scratch including its reward function. In these cases, Eureka can be applied whereas Reward Tuning cannot. In Appendix F, we have also demonstrated a preliminary Eureka Sim2Real result on a quadruped robot, readily demonstrating Eureka's applicability to use cases that demand reward generation from scratch. Therefore, regardless of how Human (Tuned) compares to Eureka on our benchmark tasks (which we find to favor Eureka in our preliminary Humanoid experiment), we believe that Eureka has novel zero-shot generation capability that cannot be replaced with Reward Tuning.

Second, even when considering Eureka vs. Reward Tuning purely in their abilities to edit existing reward functions, there are good reasons to prefer Eureka. Reward Tuning’s effectiveness is fundamentally limited by the quality of the base reward function. That is, if the initial reward does not include the correct components or have poorly designed components, then tuning the weights of its incorrect or incomplete set of components is unlikely to induce the desired behavior. For example, if the base reward function for the Humanoid running task contains only terms related to standing upright, then Reward Tuning will still not be sufficient for inducing the running behavior. In contrast, Eureka’s reward editing is open-ended, producing new reward functions that can be structurally and functionally very different from the base reward function; in this particular example, it is likely that Eureka will recognize that the current reward function is missing components that encourage running and propose new ones that include them. See Figure 3 for examples of how Eureka can edit reward functions far beyond changing the weights of existing components. This contrast is also supported by our Humanoid comparison between Eureka and Human (Tuned). There, despite having the same total number of reward queries, Human (Tuned) statistically significantly underperforms Eureka from scratch, indicating that the base, original human reward function cannot be improved much beyond given its functional form, whereas Eureka can freely propose new reward functions that may be better suited for the task in a fundamentally different way.

In conclusion, while it is true that Reward Tuning can improve existing reward functions (though the extent of the improvement is in large part dependent on the base reward function) and narrow the performance gap to Eureka's reward functions, we believe that Eureka brings new capabilities in both reward generation and editing that are important contributions to the reward design literature, independent of the comparison on any particular task.

Please let us know whether your concerns over the comparison to Reward Tuning have been addressed; we are happy to provide additional clarifications during the remaining response period to improve your evaluation of our paper.

Best,

Authors

Comment

I am grateful for the public opinion and the efforts of the author. Although there may still be some issues with this paper, there is no doubt that it is more refined than the original version. By the way, it seems that excessive promotion of articles on Twitter is not necessarily a good thing.

Comment

Dear Reviewer DCBe,

Thank you for your improved assessment of our work! As the response period draws to a close, we wanted to thank you again for your time and effort in improving our paper!

Best wishes,

Authors

Comment

Dear all reviewers,

We thank you for your time and effort in reviewing our work and providing detailed suggestions in improving our work! We have edited our manuscript (all edits are highlighted in orange color) and obtained several new experimental results. We feel that these new additions have greatly strengthened our paper:

  1. A new limitation section (Appendix H) that discusses Eureka’s current limitations and several future work directions to further extend its applicability.

  2. Evaluating Eureka on Mujoco Humanoid environment (Appendix E). This demonstrates Eureka’s generality over different physics simulations and code syntax.

  3. Improved writing of Section 3 that better explains several technical details of our approach.

  4. An ablation for the pen spinning experiment that demonstrates the importance of Eureka reward design; in addition, technical details about the pen spinning environments are added in Section 4.3 and Appendix D.1.

  5. Computational resources for our experiments in Appendix D.3.

Below, we have also responded to each individual reviewer’s questions and concerns. We thank all reviewers again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.

Best,

Authors

Public Comment

Dear readers,

I am posting this message in good will, to make possible issues with this submission apparent, and to give the authors the possibility to clarify some doubts that myself and other members of the community have about this paper.

I will leave aside the unsupported claims about generality present in the paper, which have been mostly addressed by the other reviewers. Direct access to the code defining the dynamics of the environment is most likely the strongest assumption one can have about a reinforcement learning problem, but this is not the main object of this message.

In particular, my doubts are about the experimental setting, results and aggregation, and can be summarized as follows:

  1. In Section 4.2, it is stated that "For each reward, we run 5 independent PPO training runs and report the average of the maximum task metric values achieved by policy checkpoints as the reward’s performance". If I understand correctly, this implies considering the maximum performance achieved by any iteration of PPO in any of the runs! A plethora of influential recent work in reinforcement learning has shown that relying on such maximization-based methodologies for measuring performance does not lead to meaningful comparisons, and that it should not be present in rigorous empirical evaluations [1-3].
  2. In a similar vein, I do not see error bars in any plot comparing performance of different approaches. Given the amount of possible variance implied by an approach based on random search, just reporting a success rate such as "55.30" is meaningless if no confidence interval or measure of variation is reported.
  3. To iterate again on error bars and aggregation of performance over a set of tasks, the work in [1] established rigorous and reasonable methodologies to compare aggregated performance, suggesting the IQM as a robust metric. This is not considered, or mentioned, in the paper.
  4. There is no discussion on sample efficiency whatsoever, nor any training curve in the paper. If I understand correctly, Eureka uses orders of magnitude more samples compared to the baseline. How can a reader be sure that part of the increased performance would not be achieved by just training for longer, if they cannot look at the training curves?

To justify the claims of good empirical performance of Eureka, and to not associate them exclusively to flawed and not rigorous experimental protocols, properly aggregated performance (e.g., via the IQM) with proper error bars, and without any maximization over seeds or checkpoints is required.

The fact that a research work uses the latest AI product or that it went viral on Twitter is not an excuse to forget about careful experimentation. Scientific work to be published at a machine learning conference like ICLR has to comply with the highest standards of scientific rigour, even if a tweet does not have to.

Let me know if I misunderstood any mentioned part of the paper, or if you can run rigorous evaluation that would justify the claims in the paper.

Best,

[1] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

[2] Henderson, Peter, et al. "Deep reinforcement learning that matters." Proceedings of the AAAI conference on artificial intelligence. Vol. 32. No. 1. 2018.

[3] Patterson, Andrew, et al. "Empirical Design in Reinforcement Learning." arXiv preprint arXiv:2304.01315 (2023).

Comment

Dear reader,

Thank you for reading our paper and providing thoughtful comments! We are grateful for your suggestions and have added several experimental results in our updated manuscript addressing your concerns over our experimental protocols. Here, we respond to the specific questions the reader raises and believe that there are several misunderstandings regarding our paper that we would like to address. Please let us know if you have lingering questions and we are happy to answer any additional inquiries.


Question/Comment: Section 4.2 implies considering the maximum performance achieved by any iteration of PPO in any of the runs.

Response: This is incorrect, and we believe that there is a misunderstanding regarding our evaluation protocol. Our evaluation protocol is as follows. We first obtain a reward function using each approach (ours and the baselines). Then, for each of these reward functions, we perform 5 PPO runs. For each PPO run, we consider the best checkpoint out of 10 checkpoints taken at fixed training intervals, and evaluate the performance of each checkpoint by averaging the performance over 1000+ test rollouts that randomize over initial environment states. Then, we average the performance over the best checkpoints from all the PPO runs.

Importantly, this same procedure is applied to all final reward functions, including the ones from our baselines. In particular, the maximum is taken over the same number of checkpoints for each approach, so there is no bias towards our approach. Thus, this protocol is a fair procedure that does not advantage our method. Furthermore, we believe that this is a reasonable method for reporting a reward function’s performance as the performance of the checkpoint is evaluated using a large number of rollouts, and an average is taken over all individual RL runs to mitigate outliers.
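
To make this protocol concrete, here is a minimal pseudocode-style sketch; `train_ppo` and `evaluate_checkpoint` are hypothetical helpers (not functions from our codebase) that stand in for RL training and rollout-based evaluation.

```python
# Hypothetical sketch of the evaluation protocol described above.
# train_ppo() and evaluate_checkpoint() are placeholder callables, not part
# of the actual codebase: train_ppo returns a list of policy checkpoints,
# and evaluate_checkpoint averages the task metric over many test rollouts.
import numpy as np

def evaluate_reward_function(reward_fn, train_ppo, evaluate_checkpoint,
                             n_runs=5, n_checkpoints=10, n_rollouts=1000):
    """Average, over independent PPO runs, of each run's best checkpoint score."""
    per_run_best = []
    for seed in range(n_runs):
        checkpoints = train_ppo(reward_fn, seed=seed, n_checkpoints=n_checkpoints)
        scores = [evaluate_checkpoint(c, n_rollouts=n_rollouts) for c in checkpoints]
        per_run_best.append(max(scores))  # best checkpoint within this run
    return float(np.mean(per_run_best))  # mean over the independent runs
```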

We acknowledge that these details may not have been conveyed in sufficient detail in our paper, and we have updated our manuscript to leave no ambiguity. Furthermore, we have added several additional evaluation metrics, as the reader suggested, to complement our core evaluations in the paper, offering a holistic picture into Eureka’s efficacy in reward design.


Question/Comment: Additional evaluation metrics, such as inter-quantile mean (IQM), should be considered.

Response: We agree and thank you for this suggestion. We have computed Mean, Median, IQM, and probability-of-improvement metrics with 95% stratified bootstrap confidence intervals in Appendix F, Figures 11 and 12, using the official code from Agarwal et al., 2021. These new results demonstrate that Eureka is consistently effective and outperforms baselines across different evaluation metrics, supporting that it is indeed capable of generating well-behaved reward functions across tasks.
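
Concretely, these aggregate metrics and their 95% stratified bootstrap confidence intervals can be computed with the `rliable` library released alongside Agarwal et al. (2021); the sketch below illustrates the intended usage, with randomly generated placeholder score matrices of shape (num_runs, num_tasks).

```python
# Hypothetical sketch of computing Mean / Median / IQM with 95% stratified
# bootstrap CIs using the rliable library from Agarwal et al., 2021.
# The score matrices below are random placeholders of shape (num_runs, num_tasks).
import numpy as np
from rliable import library as rly
from rliable import metrics

rng = np.random.default_rng(0)
eureka_scores = rng.uniform(0.0, 1.0, size=(5, 20))  # placeholder scores
human_scores = rng.uniform(0.0, 1.0, size=(5, 20))   # placeholder scores

aggregate_fn = lambda x: np.array([
    metrics.aggregate_mean(x),
    metrics.aggregate_median(x),
    metrics.aggregate_iqm(x),
])
point_estimates, interval_estimates = rly.get_interval_estimates(
    {"Eureka": eureka_scores, "Human": human_scores}, aggregate_fn, reps=50000)

# Probability that Eureka improves over Human on a randomly chosen task.
prob_improve, prob_improve_cis = rly.get_interval_estimates(
    {"Eureka,Human": (eureka_scores, human_scores)},
    metrics.probability_of_improvement, reps=2000)
```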


Question/Comment: Error bars should be included to contextualize the reported average success rate in Figure 4.

Response: We agree and thank you for this suggestion. In our updated manuscript, the main result of Figure 4 as well as Figure 10 in Appendix have now included error bars, which compute standard error over tasks and distinct Eureka and baseline runs. As shown, Eureka’s reward functions are effective across tasks and are equally robust to variances within RL runs as human-written reward functions. We thank the reader for suggesting these additions as we believe they do strengthen our claim.


Question/Comment: Does Eureka use orders of magnitudes of additional samples compared to the baseline? RL Training curves should be included.

Response: For the final training runs, Eureka uses the exact same number of environment samples as the baseline final reward functions (e.g., L2R and Human) for training. As stated in Section 4.2 of our paper, “for each task, all final reward functions are optimized using the same RL algorithm with the same set of hyperparameters”; this fixed set of hyperparameters includes the predetermined number of environment samples used for RL training. Given that the hyperparameters are chosen by the original benchmark designers to ensure the Human reward function is effective, we believe that our evaluation setting in fact disadvantages Eureka’s reward functions as Eureka does not tune the RL algorithm (PPO) to work well with its reward functions.

We have included the aggregate RL training curves on the 20 Dexterity tasks in Figure 10, Appendix F. As shown, Eureka reward functions exhibit better sample efficiency compared to the baselines, but given that the RL algorithm is not tuned to Eureka rewards, the learning curves do exhibit slightly more variance throughout training.

Comment

Question/Comment: Direct access to the code defining the dynamics of the environment is a strong assumption.

Response: As we have clarified in our updated manuscript and our responses to the reviewers, our algorithm does not require access to the code defining the transition dynamics of the environment. Rather, it simply requires knowing what the observation and action variables are in the environment, which is the minimal assumption for writing a reward function in general. We show that, conveniently, this information can simply be the raw observation code, which is accessible to a reward designer. In case the raw observation code is not available, we have shown in Appendix E that this information can also be supplied in a comment format that can be returned via an API. Given that our problem setting is code generation/completion, assuming access to parts of the code to generate the other parts is a very standard assumption. Finally, it is also worth noting that most RL-for-robotics approaches involve some sim-to-real training, which in turn involves tuning the simulator components to enable better policy transfer. This process implicitly assumes access to the environment code, and until the recent emergence of LLMs, there has not been a good way to make use of this access to automate parts of simulation design (e.g., reward design). Furthermore, Eureka’s assumption of knowing the environment variables is strictly weaker than that of prior work L2R, which required an additional manually written textual representation of the environment, tailored towards the space of possible tasks, on top of the raw source code.

We thank the reader again for participating in the open review process; we are grateful for your suggestions that have helped further improve our paper. Please let us know if/how we can address any lingering questions, and we are grateful for any additional feedback and suggestions.

Best, Authors

Public Comment

Thanks for the clarifications and the additional information. I appreciate the answer but I still have, unfortunately, doubts about claims and evaluations.

[Maximum in evaluation] Thanks for clarifying your evaluation protocol. Despite being less severe than what I had inferred before, I believe the protocol you have now described also does not satisfy the standards of well-executed modern evaluation of reinforcement learning algorithms. In particular, the maximum over a training curve should not be used at all in this type of evaluation, and instead previous work has established that more reasonable metrics such as the average return of the final policy obtained by the algorithm are preferable. Appendix A.4 of [1] contains a detailed motivation of why this invalidates comparisons, leading to biased estimates of the quantities of interest that can favour one approach or the other even when the evaluation protocol is the same for all algorithms.

[Stochasticity and error bars] I appreciate the addition of error bars to Figure 4, but I still have the following concerns:

  • If I understand correctly, multiple repetitions of PPO are run, but with a single reward function in output from Algorithm 1. Given that Algorithm 1 is the main object of study of the paper, and it is heavily based on random search, I believe averaging over multiple runs of the algorithm itself is necessary for rigorous evaluation. Otherwise, a reader would not have the ability to say whether EUREKA is a potentially useful algorithm or there was bias in its evaluation caused by the selection of the output reward functions from it.
  • Even the current results, which are based on the incorrect max-based evaluation, would invalidate some claims in the paper. In particular, if you look at Figure 4 and Figure 11, there appears to be no statistically significant difference in performance between EUREKA and human-designed rewards, thus invalidating any claim that "EUREKA generates reward functions that outperform expert human-engineered rewards.", as stated in different parts of the paper, including the abstract. It is not unlikely that EUREKA might be significantly worse when evaluated with proper average-based protocols, and over multiple reward generation runs, and this makes standard evaluation even more necessary to justify the claims currently present in the paper.
  • Error bars should be in all performance plots, including Figure 5 and Figure 7. I do not see a valid reason not to include them.

[Number of samples] Thank you for adding a training curve. Although it uses the same number of interactions for the final training run, by the start of that training run EUREKA has already used multiple training runs to select the best reward function. This means it has used many more samples, together with many more computations, compared to the baseline. While there is no clear way to make the baseline use more computations (even though using larger networks might be one), for the comparison to be fair, the baseline PPO runs should (at least) have access to a budget of interactions that is equal to the total number of interactions EUREKA had access to.

Hope this clarifies my concerns about the evaluation.

Best,

[1] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

Comment

Dear reader,

Thank you for your response and providing insightful comments! Here, we respond to the specific questions the reader raises and believe that there are several misunderstandings regarding our paper that we would like to address.


Comment: the maximum over a training curve should not be used in the evaluation of reinforcement learning algorithms.

Response: We disagree. In fact, Appendix A.4 in [1] supports our experimental methodology. Given an algorithm A, they consider running it N times and taking the maximum, resulting in algorithm A^N. Analogously, in our setting, A^N denotes the algorithm obtained by taking the max over N=10 checkpoints of a single PPO run.

Now, as they then point out, this method is effectively a single sample from algorithm A^N. Quoting from [1], the correct experimental strategy is that "Since A^N is itself stochastic, proper experimental methodology requires multiple runs of A^N." This is exactly our experimental methodology. More precisely, we run A^N 5 times to estimate its mean and standard error.

The second critique in Appendix A.4 of [1] is that "results produced under the maximum-during-training protocol are in general incomparable with end-performance reported results". As we described in our previous response, we do not report any end-performance results, only maximum-during-training results, so our evaluation does not suffer from this issue.

Another critique in Appendix A.4 of [1] is that the “maximum of a set of random variables is a biased estimate of their true maximum”. However, we are not trying to estimate the true maximum; we are simply trying to evaluate the algorithm A^N.

Furthermore, our arguments and evaluation protocol are supported and used by [2] (listed below), a well-cited work concurrent with [1], which studies the appropriate choice of point estimate for policy learning performance in detail and concludes that maximum-within-training can be preferable to final-within-training. Quoting from [2], “policy performance can change significantly from epoch to epoch; This makes it difficult to select the best trained model. In our study, we evaluate each policy checkpoint online in the environment in simulation, and report the best policy success rate per training run” in Section 2, and again, “policy checkpoints can vary substantially in performance during training, even when performance appears to converge” in Appendix G.

Finally, we would like to note that the sensitivity and bias inherent to any point estimate, max or final performance alike, is precisely why [1] advocates for a holistic evaluation protocol consisting of a set of aggregation metrics and confidence intervals on top of the raw point estimates. These are useful suggestions from the reader that we have already incorporated and we thank the reader for making this connection and pushing us in this direction.


Comment: Looking at Figure 4 and 11, there appears to be no statistically significant difference in performance between EUREKA and human-designed rewards

Response: We disagree with this statement on two counts:

First, and most importantly, we disagree that significance tests should be considered in the first place. [1] in fact argues against significance-test-driven method comparison, stating that evaluation should embrace “statistical thinking but avoid statistical significance tests (e.g., p-value < 0.05) because of their dichotomous nature (significant vs. not significant) and common misinterpretations such as 1) lack of statistically significant results does not demonstrate the absence of effect”. We agree with this statement in its entirety, and we have already included an extensive set of evaluations, many of them suggested by the reader and by [1], that together construct a holistic performance profile suggesting the improvement is very likely to exist.

Second, even if we were to consider statistical significance, Eureka in fact produces reward functions that are statistically significantly better than human-designed reward functions. In particular, as suggested by [1], we consider the probability of improvement as a robust comparison metric. We present these results in Figure 12; here, the probability that the Eureka reward outperforms the human reward on a randomly selected task is above 50% with statistical significance as determined by a Mann-Whitney U-Test (also suggested by [1]) at p-value threshold of 0.05 (p<<0.05); the lower end of the 95% CI is well above 55%.

In conclusion, in light of the added metrics that provide detailed performance profiles of Eureka and the baselines, we believe that our original claims are only strengthened, and we thank the reader for making these helpful suggestions.

Comment

Comment: Multiple repetitions of PPO are run, but with a single reward function in output from Algorithm 1.

Response: You are correct that we only evaluate one final Eureka reward for each task; we do so for computational reasons, since synthesizing multiple reward functions for every single task is prohibitively expensive. Importantly, however, we consistently show improvements for most tasks, as discussed throughout our responses. Thus, in aggregate, our results demonstrate that the reward functions synthesized using Eureka outperform reward functions written by humans. That said, Algorithm 1 does output many intermediate reward functions, and their qualities can be approximated by Figure 5.

Finally, we have also added error bars to Figure 5. We are currently re-running the ablations in Figure 7 to obtain error bars, but our preliminary checkpoints do not suggest that the conclusions will change.


Comment: The baseline PPO runs with human-written reward functions should (at least) have access to a budget of interactions that is equal to the total number of interactions EUREKA had access to.

Response: We respectfully disagree. Eureka is solving the reward design problem, not the reinforcement learning problem. Giving the baseline reward functions more training samples is not a meaningful form of comparison as our goal is to compare the quality of reward functions not the computational process through which the reward functions are derived. Giving some reward functions more samples than others clearly does not allow for a fair comparison of the quality of reward functions.

When constructing reward functions, human reward engineers essentially assume unlimited access to environment interactions until they have found a satisfactory shaped reward function. For this reason, we believe our comparison is fair.

More importantly, this consideration is orthogonal to the core reason why human-written reward functions are and should be included in reward design works. The value of including the “Human” baseline is to provide a useful and meaningful benchmark for the quality of a shaped reward function, which is otherwise difficult to judge regardless of how a policy trained with the reward scores on the task metric (e.g., success rate). For example, a shaped reward function that enables an RL policy to achieve a success rate of 85% on a task is not meaningful without knowing how hard it is in general for a reward function to achieve an 85% success rate; it could be that using the success criterion (“Sparse”) itself as the reward function in fact achieves a higher success rate! By providing “Human” and “Sparse” as benchmark numbers (which prior work L2R does not do), our evaluation enables judging the usefulness of reward design algorithms in realistic reward design scenarios, where the norm is to have a human manually design rewards via trial-and-error. This is a useful form of comparison for any reward design algorithm, and changing what “Human” performance is based on the reward design algorithm itself defeats the purpose.

In summary, the point of our evaluation is to show that Eureka can automatically synthesize shaped reward functions that outperform human designed reward functions – i.e., when used in conjunction with a standard RL algorithm (in our case, PPO) and fixed hyperparameters, the Eureka reward function results in better performance than the human reward function. To this end, as detailed in our paper and in our responses, we have presented a comprehensive set of results, including numerous aggregation metrics, training curves, and their associated confidence intervals to provide a holistic performance profile evaluation that supports our claim.

Finally, we thank the reader again for participating in the open review process; we are grateful for your suggestions that have helped further improve our paper. Please let us know if/how we can address any lingering questions, and we are grateful for any additional feedback and suggestions.

Best,

Authors


[1] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

[2] Mandlekar, Ajay, et al. "What matters in learning from offline human demonstrations for robot manipulation." Conference on Robot Learning, 2021.

Comment

Just want to double-check about the one final reward for each task: do the authors just run Eureka once for each task (with different seeds or the same seeds for different tasks?) and then use the final reward to run the subsequent 5 runs with different seeds?

Comment

Dear Reviewer JeMn,

We run Eureka on each task a single time to produce a final Eureka reward; however, note that a single overall run of Eureka involves “5 independent runs” (i.e., 5 parallel instances of Algorithm 1 in Section 3.2). Then, we evaluate this final reward by running 5 independent RL runs; these evaluation results are shown in Figures 4 & 11. This enables a valid comparison to our benchmark reward functions, as there is only one reward function in each benchmark category (“Human” and “Sparse”), so the interpretation of the aggregate performance metrics over tasks, as well as the resulting confidence intervals, is consistent across methods.

The intermediate best Eureka reward functions from each of the 5 sub-runs are comparable in performance, and we find their average quality to exceed the Human reward. This is shown in Figure 5, which tracks the progressive improvement of Eureka reward functions; there, the error bars are now computed over both tasks and the intermediate Eureka reward functions from all runs.

We do not have control over the LLM generation seeds for different tasks but they are all generated under the same temperature parameter.

We thank the reviewer again for their time and effort helping us improve our paper! Please let us know if we can provide additional clarifications to improve our score.

Best,

Authors

Comment

To make sure I understand correctly, the results in Figure 4 are the averaged results of five PPO runs of the same final reward, and the results in Figure 11 are the averaged results of 1(?) PPO runs of the five best rewards of each sub-run?

Comment

Dear Reviewer JeMn,

The results in both Figure 4 and Figures 11-12 are the averaged results of five PPO runs of the same final reward per task. Figure 5 instead reports the average result of 1 PPO run for the best reward from each sub-run at each iteration (each sub-run consists of 5 iterations).

Please let us know if we can provide any further clarification!

Best,

Authors

Comment

Dear all reviewers and AC,

We again thank you for your time and effort in reviewing our work and providing detailed suggestions in improving our work! We have again updated our manuscript (all edits are highlighted in orange color) and added several new evaluation results based on the anonymous reader’s feedback. In summary, our new results include:

  1. A holistic evaluation of Eureka reward functions in terms of mean, median, interquartile mean (IQM), and probability of improvement as well as their associated 95% stratified bootstrap confidence intervals in Figure 11 and 12, Appendix F; these are evaluation metrics suggested by [1] in order to obtain a holistic performance profile for RL algorithms (in this case, reward design algorithms). Eureka reward functions on aggregate remain the best ones under all evaluation settings, offering strong evidence to Eureka’s effective reward design capabilities.

  2. Aggregate RL training curves over all 20 Dexterity tasks in Figure 10, Appendix F, where we demonstrate that Eureka reward functions exhibit improved sample efficiency compared to baselines.

  3. Error bars in Figure 4 that indicate that Eureka’s reward functions are effective across tasks and are equally robust to variances within RL runs as human-written reward functions.

  4. Promising preliminary results of an ongoing experiment to adapt Eureka for sim-to-real transfer of quadruped locomotion policies (Appendix F). We show that quadruped forward locomotion policies learned using Eureka reward functions in Isaac Gym can transfer to the real-world. The policy execution in sim and real are included in the supplementary videos, and they closely resemble each other. This result validates that Isaac Gym physics is realistic and points to promising signs that Eureka can be adapted to enable real-world robotic control.

Finally, we’d like to summarize our contributions for the reviewers and AC in light of all our new results during the rebuttal phase:

  1. We present the first LLM-powered reward design algorithm that can self-improve and generalize to an extensive suite of robots and tasks without task-specific engineering.

  2. We rigorously evaluate our method using a holistic suite of metrics and notably compare against human-written reward functions on a suite of open-sourced benchmarks. This design choice allows us to benchmark the quality of Eureka reward functions against the standard, de-facto solution to reward design: human reward engineering.

  3. We have demonstrated that Eureka can be combined with other generally applicable paradigms like curriculum learning to solve novel challenging tasks such as dynamic pen spinning for the first time (Section 4.3).

  4. We have shown Eureka’s flexibility in incorporating human feedback and reward initialization to generate more aligned reward functions (Section 4.4).

We believe that our experiments are comprehensive, and our paper overall makes important contributions to the challenging, long-standing problem of reward design in reinforcement learning. Below, we have also responded to each individual reviewer’s questions, and we are eager to receive the reviewers’ reply! Please let us know if there are any lingering questions and we are available and happy to address any further inquiries you may have.

Best wishes,

Authors


[1] Agarwal, Rishabh, et al. "Deep reinforcement learning at the edge of the statistical precipice." Advances in neural information processing systems 34 (2021): 29304-29320.

AC Meta-Review

(a) Scientific Claims and Findings: The paper introduces "Eureka," a method that leverages Large Language Models (LLMs) for reward function design in reinforcement learning (RL). Eureka uses a CodeLM to generate reward code, then feeds that to an RL algorithm, and uses the evaluation metrics to refine the reward function. While the idea is straightforward, the paper demonstrates results on 29 open-source RL environments, including first results on a simulated Shadow Hand spinning a pen.

(b) Strengths

(+) Simple and general approach (Reviewer DCBe, Reviewer hi7d): Eureka presents a simple, yet novel method for automatically generating reward functions across different RL scenarios. Its ability to leverage LLMs' zero-shot code generation for understanding and improving reward functions in various environments is significant.

(+) Extensive evaluation across many environments (Reviewer DCBe, Reviewer XD65): The paper effectively evaluates the approach in diverse RL environments, showing Eureka’s ability to outperform human-designed reward functions. Particularly impressive are the shadow hand results, where manual reward design is challenging.

(c) Weaknesses

(-) Limited evaluation scope (Reviewer DCBe): While evaluation is extensive, it focuses on a single simulator (Isaac Gym) with a fixed RL algorithm, potentially overstating the approach's generality. A broader evaluation across different simulators and algorithms could strengthen the findings.

(-) Lack of analysis and discussion on limitations (Reviewer DCBe, Reviewer hi7d): There's little analysis on when this might not work, and how Eureka adapts to complex environments that might exceed LLMs' context windows.

(-) Unrealistic assumptions about environment access (Reviewer XD65): The method's reliance on accessing the environment's source code might not be feasible in real-world applications where only black-box access to simulations is available.

(-) Strong assumptions on fitness function (Reviewer XD65): The approach assumes access to ground-truth reward functions for evaluating proposed rewards, which might limit its applicability to idealized or well-understood environments only.

Why not a higher score

The method itself is pretty straightforward, and the main contribution of the paper is showing that it works. It could have received a stronger score if it provided more insight into what problems LLMs are good at optimizing and where they are not. Evaluation across different environments and RL algorithms, including environments and algorithms that are completely new and unlikely to have appeared on the internet, would have been insightful.

Why not a lower score

The results are pretty impressive and useful for the community to be exposed to. It's likely to spur future research that adds more refined analysis.

Final Decision

Accept (poster)