PaperHub

Rating: 6.7 / 10 (Poster; 3 reviewers; individual ratings 6, 8, 6; min 6, max 8, std 0.9)
Average confidence: 4.0
ICLR 2024

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

OpenReview · PDF
Submitted: 2023-09-21 · Updated: 2024-03-14
TL;DR

We propose a method that enables Language Model guided RL for long-horizon robotics tasks by appropriately integrating vision-based motion planning.


Keywords
Long-horizon robot learning, reinforcement learning, LLMs

Reviews and Discussion

Official Review (Rating: 6)

The authors propose Plan-Seq-Learn (PSL) to address long-horizon robotics tasks from scratch with a modular approach, using motion planning to bridge the gap between abstract language and low-level control learned by RL. The authors experiment with 20+ single and multi-stage robotics tasks from four benchmarks and report success rates of over 80% from raw visual input, outperforming previous approaches.

Strengths

Originality: The authors propose PSL, which 1) breaks the task into sub-sequences (Plan), 2) uses vision and motion planning to translate sub-sequences into initialization regions (Seq), and 3) trains local control policies using RL (Learn).

Quality: Experiments show that the proposed method outperforms previous methods in simulation.

Clarity: The paper is basically well-organized and clearly written.

Significance: As an LLM-based approach, the authors have made some progress.

Weaknesses

It is a paper about robotics. However, experiments are based on simulations only.

It is about long horizon robotics tasks. However, the largest number of stages is 5 in the experiments.

From the perspective of long-horizon robotics tasks, it is not clear how the method may proceed forward. See details below.

Questions

"Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks" This is very arguable. There are evidences that it is not the case. And if it is so, there is no need to write the paper.

See, e.g., On the Planning Abilities of Large Language Models -- A Critical Investigation, 2023; Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks, 2023.

"Language models can leverage internet scale knowledge to break down long-horizon tasks (Ahn et al., 2022; Huang et al., 2022a) into achievable sub-goals" How valid is such claim? Why is it so? What if the tasks are not available or not frequent in Internet texts?

How can we guarantee the decomposition of tasks always work? What if it does not work? How can we guarantee the optimality of the decomposition of tasks? What if it is not optimal? The current work uses simulations to validate the proposed method. There will be sim2real gap. How to bridge such gap?

How to improve the current work? If there is something wrong in the task decomposition stage, it is hard or impossible to make improvements, and a pre-trained or fine-tuned LM may be called for. It is beyond the current work. However, the point is, it is not clear how the proposed method deals with such issues.

There are tradeoffs between end-to-end and hierarchical approaches. The paper focuses on the advantages of hierarchical approaches and the disadvantages of end-to-end approaches. It would be desirable to discuss both sides.

"This simplifies the training setup and allowing the agent to account for future decisions as well as inaccuracies in the Sequencing Module." For some mistakes at the higher level, a lower level RL can not deal with.

Table 2, Multistage (Long-horizon) results. 5 stages are not quite long-horizon, and the success rate may be as low as .67 ± .22

Ethics Concerns

NA

Comment

We thank the reviewer for providing a detailed review and for appreciating the strength of our experimental results and quality of writing.

“It is a paper about robotics. However, experiments are based on simulations only.”

In this work, we focus on studying the question: how can we train policies to solve long-horizon robotics tasks? Our main contributions are an algorithm for guiding RL agents for learning low-level control using LLMs via motion planning and a series of insights for practically training such policies. To study this question effectively, we perform extensive empirical evaluations on 25 tasks across 4 evaluation domains, validating the strength of our method on established benchmark tasks for comparison. We leave extensions to the real world for future work and briefly discuss two possible directions for doing so: 1) sim2real transfer by training local policies in simulation and chaining them using motion planning and LLMs at test time, and 2) directly running PSL in the real world, as it is far more efficient to train than E2E methods.

“It is about long horizon robotics tasks. However, the largest number of stages is 5 in the experiments.”

We emphasize that even solving tasks with up to 5 stages is quite difficult for end-to-end based methods; they make no progress at all in most cases beyond 1-2 stages. Furthermore, we only evaluate tasks with up to 5 stages, as a majority of the benchmarks for robotic control have only up to that many stages, not because our method cannot be applied beyond 5 stages. As shown in the experiments section, prior methods such as E2E [1], RAPS [2], MoPA-RL [3], TAMP [4] and SayCan [5] do not reliably solve the benchmark tasks with up to 5 stages. In contrast, for our method, whether we have 1, 5 or even 10 stages (we include these new results at the end of this response), we can still solve the task because our modular, hierarchical method effectively decomposes the task and simplifies the learning problem significantly.

“"Large Language Models (LLMs) are highly capable of performing planning for long-horizon robotics tasks" This is very arguable. There are evidences that it is not the case. And if it is so, there is no need to write the paper.”

We have toned down this point in the updated version of the paper by instead stating that LLMs have been shown to be capable of high-level planning. Our changes are shown here in italics.

While LLMs perform poorly on general purpose planning as noted in the work cited by the reviewer [6], in our work the LLM is not required to perform general purpose or fine-grained planning. Instead, in PSL, the LLM is only outputting a very coarse high-level plan - where to go and how to leave the region - which is simple and does not require significantly complex reasoning ability. For such tasks, we find that the semantic, internet-scale knowledge in LLMs is sufficient to produce high-quality plans. Empirically, on the tasks we consider, the LLM achieves 100% planning performance.

We additionally note that our method is not necessarily tied to using an LLM as a task planner. We can also use classical task planners such as STRIPS [7] and use an LLM to simply translate the natural language prompt into a format for task planning as done in LLM + P [8]. In this way, we inherit the guarantees and benefits of classical task planners while guiding the RL agent to efficiently solve the task from a natural language task description.

“"Language models can leverage internet scale knowledge to break down long-horizon tasks (Ahn et al., 2022; Huang et al., 2022a) into achievable sub-goals" How valid is such claim? Why is it so? What if the tasks are not available or not frequent in Internet texts?”

We have re-written this statement to be appropriately qualified: “Prior work (...) has shown that when appropriately prompted, language models are capable of leveraging internet scale knowledge…” There is a large body of recent work [5, 9, 10, 11, 12] in this area that empirically illustrates such capabilities for long-horizon robotics tasks. We have updated the main paper with this change. Our changes are shown here in italics.

However, we emphasize the existence (or lack thereof) of general purpose planning capabilities of LLMs is orthogonal to the claims of our paper. Our focus, with respect to the Plan Module, is on a simple, coarse planning interface for the LLM with demonstrably high performance across a wide range of robotics tasks.
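For concreteness, here is a minimal sketch of the coarse planning interface described above, in which the LLM returns a list of (object/region, stage termination condition) pairs matching the prompt examples later in this thread. The helper names and prompt wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a coarse LLM planning interface (illustrative assumptions only).
import ast
from typing import List, Tuple

def build_plan_prompt(task_description: str, termination_conditions: List[str]) -> str:
    """Ask the LLM only for a coarse plan: where to go and how to leave each region."""
    return (
        f"Stage termination conditions: ({', '.join(termination_conditions)}).\n"
        f"Task description: {task_description}\n"
        "Give me a simple plan to solve the task using only the stage termination conditions. "
        "Formatting of output: a list in which each element looks like: "
        "(<object/region>, <stage termination condition>). Don't output anything else."
    )

def parse_plan(llm_output: str) -> List[Tuple[str, str]]:
    """Parse the LLM output, e.g. '[("milk", "grasp"), ("bin 2", "place")]'."""
    return [(str(region), str(condition)) for region, condition in ast.literal_eval(llm_output.strip())]
```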

Comment

“How can we guarantee the decomposition of tasks always work? What if it does not work? How can we guarantee the optimality of the decomposition of tasks? What if it is not optimal? The current work uses simulations to validate the proposed method. There will be sim2real gap. How to bridge such gap?”

To clarify, in this work we are not claiming optimality, or that the LLM given task decomposition will always be the right one. Since we are using neural network language models, vision models and RL policies there are no guarantees in general for any component of the pipeline. However, empirically we find that this decomposition works surprisingly well in practice even if there are no optimality guarantees. If we desire some guarantees on the high-level plan, we could integrate in LLM+P [8] as discussed in the above responses.

If the LLM outputs the incorrect plan, we hypothesize that the agent will default to performing as well as (or perhaps slightly worse than) E2E [1]. To evaluate this, we ran an experiment to evaluate the performance of PSL when provided the incorrect high-level plan. Please see the response to Reviewer KErm in which we describe the experiment in detail.

Addressing the sim2real gap is a challenging problem, but it is beyond the scope of this work. However, we note that there is a growing body of work on performing sim2real for robotic navigation [13], locomotion [14] and manipulation [15, 16] which could be applied to our setup with our local policies.

“How to improve the current work? If there is something wrong in the task decomposition stage, it is hard or impossible to make improvements, and a pre-trained or fine-tuned LM may be called for. It is beyond the current work. However, the point is, it is not clear how the proposed method deals with such issues.”

The reviewer makes a valid point that PSL may fail in ways that may be challenging for the learner to recover from, such as incorrect high-level plans, sequencing module executions that go to the wrong region, or stage termination condition estimation failures.

In the Limitations Section (Sec. B.3), we acknowledge that as defined in the paper, if the Plan Module or Sequence Module fail catastrophically (incorrect plan or moving to the wrong region in space), there is currently no concrete mechanism for the Learning Module to adapt.

However, we ran an experiment in which we train the agent using PSL with an incorrect high-level plan on two-stage tasks (MW-Assembly, MW-Bin Picking, MW-Hammer) and find that in some cases, the agent can still learn to solve the task, achieving performance close to E2E [1]. Intuitively, this is possible because in PSL, the high-level plan is not expressed as a hard constraint, but rather as a series of regions for the agent to visit and a set of exit conditions for those regions. In the end, however, only the task reward is used to train the RL policy so if the plan is wrong, the Learn Module must learn to solve the entire task end-to-end from sub-optimal initial states. We have also updated the paper with this result.

Plot Link: https://drive.google.com/file/d/17DJCQAJBASfrl3f3bMRKvahPkK19cPVd/view?usp=sharing

As scope for future work, we note in the Limitations Section (Sec. B.3) that the Plan and Sequence modules could be finetuned using RL as well. One way to resolve high-level plan failures would also be to re-prompt the LLM to form a new plan if the agent fails to learn to solve the task within a predefined number of episodes. We include simple proof of concept examples of such re-planning below:

Example #1:

Prompt:

Stage termination conditions: (grasp, place).

Task description: The milk goes into bin 2 and the cereal box in bin 3. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>).

Don't output anything else. Let's think step by step.

[(milk, grasp), (bin2, place), (bin3, place), (cereal_box, grasp)]

This plan ([(milk, grasp), (bin2, place), (bin3, place), (cereal_box, grasp)]) failed: agent success rate after 10K episodes: 0. Replan but make sure to still solve the overall task. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>). Don't output anything else. Let's think step by step.

Plan: [("milk", "grasp"), ("bin 2", "place"), ("cereal box", "grasp"), ("bin 3", "place")]

Comment

Example #2:

Prompt:

Stage termination conditions: (grasp, place).

Task description: The milk goes into bin 2 and the cereal box in bin 3. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>).

Don't output anything else. Let's think step by step.

[("milk", "grasp"), ("bin 2", "place"), ("cereal box", "grasp"), ("bin 3", "place")]

This plan ([(milk, grasp), (bin2, place), (cereal_box, grasp), (bin3, place)] failed: agent success rate after 10K episodes: 0. Replan but make sure to still solve the overall task. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>). Don't output anything else. Let's think step by step.

Plan: [("cereal box", "grasp"), ("bin 3", "place"), ("milk", "grasp"), ("bin 2", "place")]

We leave this extension to be explored in more detail in future work.
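A minimal sketch of the failure-driven re-planning loop suggested above; query_llm and train_psl are hypothetical placeholders for the LLM call and a PSL training run, and the episode budget mirrors the 10K-episode figure in the prompts.

```python
# Illustrative re-planning loop (sketch only; query_llm and train_psl are hypothetical placeholders).
import ast

def plan_with_replanning(base_prompt, query_llm, train_psl, max_retries=3, episode_budget=10_000):
    prompt, plan, success_rate = base_prompt, None, 0.0
    for _ in range(max_retries):
        plan = ast.literal_eval(query_llm(prompt))  # e.g. [("milk", "grasp"), ("bin 2", "place"), ...]
        success_rate = train_psl(plan, budget=episode_budget)
        if success_rate > 0:
            break  # keep the first plan the agent can make progress on
        # Feed the failure back to the LLM and ask it to re-plan, as in Examples #1 and #2 above.
        prompt = (
            f"{base_prompt}\nThis plan ({plan}) failed: agent success rate after "
            f"{episode_budget // 1000}K episodes: 0. Replan but make sure to still solve the overall task."
        )
    return plan, success_rate
```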

“There are tradeoffs between end-to-end and hierarchical approaches. The paper focuses on the advantages of hierarchical approaches and the disadvantages of end-to-end approaches. It would be desirable to discuss both sides.”

We agree with the reviewer, and we note in the paper that one key strength of end-to-end approaches is their ability to learn complex control policies over high-dimensional action spaces. We will include further discussion of the tradeoffs between hierarchical and end-to-end approaches in the paper: hierarchical approaches impose a specific structure on the agent which may not be applicable to all tasks. One advantage of end-to-end learning is that, in theory, it can learn a better policy representation than the modular structure; however, in practice, it requires vast amounts of data. We have updated the paper to include a more nuanced discussion of these tradeoffs.

"This simplifies the training setup and allowing the agent to account for future decisions as well as inaccuracies in the Sequencing Module." For some mistakes at the higher level, a lower level RL can not deal with.

The reviewer is correct; for this reason, our LLM planning system does not enforce a hard constraint on the RL agent. Instead, it only takes the RL agent to a region of interest and expresses a condition for leaving the region. This way, in the worst-case scenario in which the LLM plan is wrong, the RL agent defaults to performing similarly to a purely end-to-end agent, as we show above.

“Table 2, Multistage (Long-horizon) results. 5 stages are not quite long-horizon, and the success rate may be as low as .67 ± .22”

One reason our method performs relatively poorly on K-MS-4 and K-MS-5 (67% success rate) is that it trains visuomotor policies from joint space control. This is well known in the community to be a challenging problem ([17, 18]). Instead, we modify the action space to use end-effector control (which is the action space that the RAPS [2] baseline uses) and re-train PSL and E2E [1] on the kitchen tasks. We build two new tasks, K-MS-7 and K-MS-10, which require the agent to interact with additional objects in the kitchen scene: the hinge cabinet and the remaining three oven burners. While E2E [1] still fails to make progress on the longer horizon tasks, PSL is now capable of solving up to 10-stage tasks in the kitchen environment with 100% performance.

Method   K-MS-3       K-MS-5       K-MS-7       K-MS-10
E2E      0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
RAPS     .89 ± 0.1    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
TAMP     1.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
SayCan   1.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0    0.0 ± 0.0
PSL      1.0 ± 0.0    1.0 ± 0.0    1.0 ± 0.0    1.0 ± 0.0

We have updated the paper to use these results and included additional discussion regarding the details of the tasks.

Comment

[1] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.

[2] M. Dalal, D. Pathak, R. Salakhutdinov. "Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives." NeurIPS, 2021.

[3] J. Yamada, Y. Lee, G. Salhotra, K. Pertsch, M. Pflueger, G. S. Sukhatme, J. J. Lim, P. Englert. "Motion Planner Augmented Reinforcement Learning for Obstructed Environments." Conference on Robot Learning, 2020.

[4] C. R. Garrett, T. Lozano-Pérez, L. P. Kaelbling. "Stripstream: Integrating symbolic planners and blackbox samplers." arXiv preprint arXiv:1802.08705, 2018.

[5] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. Jauregui Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, A. Zeng. "Do As I Can and Not As I Say: Grounding Language in Robotic Affordances." Conference on Robot Learning, 2022.

[6] K. Valmeekam, S. Sreedharan, M. Marquez, A. Olmo, S. Kambhampati. "On the planning abilities of large language models (a critical investigation with a proposed benchmark)." arXiv preprint arXiv:2302.06706, 2023.

[7] R. E. Fikes, N. J. Nilsson. "STRIPS: A new approach to the application of theorem proving to problem solving." Artificial Intelligence, vol. 2, no. 3-4, pp. 189–208, 1971, Elsevier.

[8] B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, P. Stone. "LLM+ P: Empowering large language models with optimal planning proficiency." arXiv preprint arXiv:2304.11477, 2023.

[9] W. Huang, P. Abbeel, D. Pathak, I. Mordatch. "Language models as zero-shot planners: Extracting actionable knowledge for embodied agents." International Conference on Machine Learning, pages 9118–9147, PMLR, 2022.

[10] I. Singh, V. Blukis, A. Mousavian, A. Goyal, D. Xu, J. Tremblay, D. Fox, J. Thomason, A. Garg. "Progprompt: Generating situated robot task plans using large language models." 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 11523–11530, IEEE, 2023.

[11] J. Wu, R. Antonova, A. Kan, M. Lepert, A. Zeng, S. Song, J. Bohg, S. Rusinkiewicz, T. Funkhouser. "TidyBot: Personalized Robot Assistance with Large Language Models." Autonomous Robots, 2023.

[12] C. H. Song, J. Wu, C. Washington, B. M. Sadler, W.-L. Chao, Y. Su. "LLM-Planner: Few-shot Grounded Planning for Embodied Agents with Large Language Models." Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2998–3009, 2023.

[13] J. Krantz, T. Gervet, K. Yadav, A. Wang, C. Paxton, R. Mottaghi, D. Batra, J. Malik, S. Lee, D. S. Chaplot, "Navigating to Objects Specified by Images." arXiv preprint arXiv:2304.01192, 2023.

[14] A. Agarwal, A. Kumar, J. Malik, and D. Pathak. Legged locomotion in challenging terrains using egocentric vision. Conference on Robot Learning, 2022.

[15] M. Andrychowicz, B. Baker, M. Chociej, R. Jozefowicz, B. McGrew, J. Pachocki, A. Petron, M. Plappert, G. Powell, A. Ray, et al. "Learning dexterous in-hand manipulation." The International Journal of Robotics Research, 2020.

[16] A. Handa, A. Allshire, V. Makoviychuk, A. Petrenko, R. Singh, J. Liu, D. Makoviichuk, K. Van Wyk, A. Zhurkevich, B. Sundaralingam, et al. "Dextreme: Transfer of agile in-hand manipulation from simulation to reality." 2023 IEEE International Conference on Robotics and Automation (ICRA), pages 5977–5984, IEEE, 2023.

[17] R. Martín-Martín, M. A. Lee, R. Gardner, S. Savarese, J. Bohg, A. Garg. "Variable impedance control in end-effector space: An action space for reinforcement learning in contact-rich tasks." 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1010–1017, IEEE, 2019.

[18] M. Dalal, A. Mandlekar, C. Garrett, A. Handa, R. Salakhutdinov, D. Fox. "Imitating Task and Motion Planning with Visuomotor Transformers." Conference on Robot Learning, 2023.

Comment

Dear Reviewer,

We would like to follow up on our rebuttal as there is only one day remaining of the discussion period. If there are any outstanding concerns that you would like us to address, please let us know. Thank you and we look forward to your response.

Comment

Thanks to the authors for the careful rebuttal and updates. However, there are still concerns about whether the paper follows a promising approach.

One example: from the rebuttal, "If the LLM outputs the incorrect plan, we hypothesize that the agent will default to performing as well as (or perhaps slightly worse than) E2E [1]. To evaluate this, we ran an experiment to evaluate the performance of PSL when provided the incorrect high-level plan. Please see the response to Reviewer KErm in which we describe the experiment in detail."

If so, what is the meaning of a high-level plan? It appears that the authors want to use an LLM even though the LLM may not generate good high-level plans.

I will keep the score.

Comment

We thank the reviewer for their reply.

“If so, what is the meaning of a high-level plan”

The LLM is used to decompose the task into achievable sub-goals in a zero-shot manner. In our work, we structure the LLM’s high-level plan as a series of target regions and exit conditions for leaving those regions. This simplifies the learning problem for the RL agent as the RL agent does not need to learn any semantic information about the task, just how to interact with the environment. We would like to emphasize that the high-level plan from the LLM is crucial: PSL significantly improves over the end-to-end baseline because the plan guidance reduces the complexity of the learning problem.

In the experiment that the reviewer mentioned, we hardcode an incorrect high-level plan (for the two stage tasks - we simply invert the correct plan) because the LLM gives the correct plan in all of our tasks. We then train PSL using this incorrect plan. In this case, we find that in two out of the three tasks the agent can still learn to solve the task, achieving performance similar to the end-to-end baseline. Intuitively, this is possible because in PSL, the high-level plan is not expressed as a hard constraint, but rather as a series of regions for the agent to visit and a set of exit conditions for leaving those regions. In the end, however, only the task reward is used to train the RL policy so if the plan is wrong, the Learn Module can learn to solve the entire task end-to-end from sub-optimal initial states.

Plot: https://drive.google.com/file/d/17DJCQAJBASfrl3f3bMRKvahPkK19cPVd/view?usp=sharing

Overall, these results suggest that our planning scheme can leverage the strengths of high-level planning (increased learning efficiency and the ability to solve long-horizon tasks) while minimizing its weaknesses (failure to learn when plans are incorrect): when the plan is wrong, we default to performing similarly to end-to-end learning. When the plan is correct (which is the case for all the tasks we evaluate), learning speed is significantly improved (as is our capability of solving long-horizon tasks) - this is why the LLM is a crucial component of our method.

“LLM may not generate good high-level plans”

To clarify, in all of the tasks that we evaluate (25+), the LLM achieves 100% planning performance: the high-level plan is always correct. As we note in our previous rebuttal response: in PSL, the LLM is only outputting a very coarse high-level plan - where to go and how to leave the region - which is simple and does not require significantly complex reasoning ability. For such tasks, we find that the semantic, internet-scale knowledge in LLMs is sufficient to produce high-quality plans. The bottleneck is performing effective low-level control given the planner’s guidance.

Comment

Thanks to the authors for the explanations, updates, and additional experiments.

After detailed discussions with the AC and other reviewers, although my concerns basically remain, given that the work is SOTA, would be a decent contribution to ICLR 2024, and may inspire further study, I raise my score to 6: marginally above the acceptance threshold.

Official Review (Rating: 8)

In past works, people utilized LLMs' internet-scale knowledge to give robots sufficient information when planning for long-horizon tasks. However, the authors believe it is important for a robotic system to also be capable of online improvement over at least the low-level control policies; otherwise, lacking a library of pre-trained skills in every other scenario, robots aren't able to learn very well. To this end, the paper proposes a framework, PLAN-SEQ-LEARN, that utilizes both the LLM's ability to guide the agent's planning and RL's ability for online improvement. The experiments show that not only did PSL's performance surpass SOTA visual-based RL methods through the help of the LLM, but it also performed better than SayCan due to its ability to improve with online learning.

Strengths

Motivation and intuition

  • The motivation that classical approaches to long-horizon robotics can struggle with contact-rich interactions is convincing.
  • Uses an LLM for high-level planning to guide an RL policy that solves robotic tasks online without pre-determined skills.

Novelty

  • The idea of utilizing RL to learn low-level skills under the framework of LLM planning is intuitive and convincing.

Technical contribution

  • Integrates LLM task planning, motion planning, and RL techniques.
  • Avoids cascading failures by learning online using RL algorithms.

Clarity

  • The overall writing is clear. The authors utilize figures well to illustrate the ideas. Figure 2 clearly shows the whole idea of PSL.
  • This paper provides a clear and detailed description of how to integrate the task planning module, motion planning module, and RL learning module.

Related work

  • Gives plenty of related works with short but clear descriptions.

Experimental results

  • The overall performance on single-stage and multistage benchmark tasks is good.

Weaknesses

Clarity

  • Although details of how the LLM is used are clearly written in Appendix D, I feel like the authors could illustrate the details in the main paper and also provide a better explanation of how stage termination and training details are implemented. Since how the LLM is involved in this work seems to be one of the contributions of this paper, I do feel that making this part intuitive is a must.

Method

  • Trade-off: Planning without a library of pre-defined skills is mentioned as a strength in the paper, but this comes at the cost of relearning the whole process compared to other methods.
  • Also the paper seems to overlook the fact that the learning might fail. Did not see how the method handles this situation.
  • How would PSL react to the situation where the agent fails to reach the termination condition?
  • What would happen if there are more than enough terms for the LLM to choose from, for example, unlearnable skill terms that may confuse the LLM in choosing?

Related work

  • Although the paper cites 'Inner Monologue' and 'Bootstrap Your Own Skills (BOSS)', they are not used for comparison or experiments, even though these methods share many similarities. Therefore, it's a bit of a missed opportunity.

Experimental conclusions

  • In section 4.3, the author noted that "For E2E and RAPS, we provide the learner access to a single global fixed view observation from O^global for simplicity and speed of execution, as we did not find meaningful performance improvement in these baselines by incorporating additional camera views.". However, this results in an unfair comparison because PSL has taken O^local as an additional input, and may cause some questionable issues. If performances are similar, I believe that adding O^local for E2E and RAPS would result in a more convincing conclusion that PSL performs better.

Questions

As stated above.

Comment

We thank the reviewer for their detailed review and for recognizing the clear motivation for PSL, novelty of our method, clarity of writing and strength of our experimental results.

“I feel like the authors could illustrate the details in the main paper and also provide a better explanation of how stage termination and training details are implemented.”

We have updated the discussion in Section 3.3 of the main paper to include additional details regarding the stage termination and LLM planning implementation details. We have also included further details in Section 3.5 of the main paper. Additionally, we emphasize we will release the code for PSL upon acceptance - enabling the community to replicate our results. The code will unambiguously specify the requested implementation details.

“Although the paper cites 'Inner Monologue' and 'Bootstrap Your Own Skills (BOSS)', they are not used for comparison or experiments, even though these methods share many similarities.”

For our experiments, the high-level planning success rate is 100% - the bottleneck is performing effective low-level control. To that end, prompting techniques such as Inner Monologue [1] would not affect the performance. Inner Monologue would achieve the same results as SayCan [2]. Furthermore, Inner Monologue could be readily incorporated into PSL to improve planning performance when necessary; we leave this extension to future work.

With regards to BOSS [3], as we note in the paper, this is concurrent work with our own. It was released on arXiv on October 16, 2023 - after the ICLR submission deadline. That notwithstanding, there are several reasons why comparisons to BOSS are currently infeasible: 1) The code for BOSS is not released and re-implementing the method is non-trivial as it requires training policies online using IQL in the loop with an LLM. Once the code is released, we will attempt to perform a fair comparison if possible. 2) Their method operates with a different assumption set than ours: existence of a pre-trained skill library, while we evaluate training from scratch to learn unseen low-level skills. 3) BOSS specifically uses a language-labeled demonstration dataset to pre-train skills - no such dataset exists for most of the tasks we evaluate. Furthermore, the environment code for BOSS is not publicly available either - they use a “modified version of the ALFRED [4] simulator” which is not released to our knowledge. Finally, we would like to note that the contribution of BOSS is orthogonal to our own: our method focuses on learning to efficiently solve a single task while BOSS aims to expand a pre-existing repertoire of skills. In principle, PSL can be combined with BOSS to efficiently learn and incorporate a new skill into an existing library, particularly when starting from an empty skill set.

“Trade-off: Planning without a library of pre-defined skills is mentioned as a strength in the paper, but this comes at the cost of relearning the whole process compared to other methods.”

In our experiments, we show that learning policies from scratch can outperform methods that use pre-trained/defined skills such as SayCan [2] and RAPS [5] by over 2x in terms of raw success rate. Online learning enables the agent to adapt its low-level control to the task it is solving while avoiding cascading failures. However, we acknowledge that pre-trained skill libraries come with many practical benefits and in principle PSL can also take advantage of and fine-tune pre-defined skills as we discussed above. We leave this extension to future work. Ultimately, we agree with the reviewer that it is desirable to combine online learning with pre-defined skills, however, in this work our aim was to study and improve the learning process in isolation from pre-defined skills.

Comment

“Also the paper seems to overlook the fact that the learning might fail. Did not see how the method handles this situation.”

The reviewer makes a valid point that the learner may not always be able to solve the task. This may happen due to general RL issues such as inherent task difficulty or lack of effective reward shaping, or PSL-specific failures such as incorrect high-level plans, sequencing module executions that go to the wrong region, or stage termination condition estimation failures.

In the Limitations Section (Sec. B.3), we acknowledge that as defined in the paper, if the Plan Module or Sequence Module fail catastrophically (incorrect plan or moving to the wrong region in space), there is currently no concrete mechanism for the Learning Module to adapt.

However, we ran an experiment in which we train the agent using PSL with an incorrect high-level plan on two-stage tasks (MW-Assembly, MW-Bin Picking, MW-Hammer) and find that in some cases, the agent can still learn to solve the task, achieving performance close to E2E [6]. Intuitively, this is possible because in PSL, the high-level plan is not expressed as a hard constraint, but rather as a series of regions for the agent to visit and a set of exit conditions for those regions. In the end, however, only the task reward is used to train the RL policy so if the plan is wrong, the Learn Module must learn to solve the entire task end-to-end from sub-optimal initial states. We have also updated the paper with this result.

Plot Link: https://drive.google.com/file/d/17DJCQAJBASfrl3f3bMRKvahPkK19cPVd/view?usp=sharing

As scope for future work, we note in the Limitations Section (Sec. B.3) that the Plan and Sequence modules could be fine-tuned using RL as well. One way to resolve high-level plan failures would also be to re-prompt the LLM to form a new plan if the agent fails to learn to solve the task within a predefined number of episodes. We include simple proof of concept examples of such re-planning below:

Example #1:

Prompt:

Stage termination conditions: (grasp, place).

Task description: The milk goes into bin 2 and the cereal box in bin 3. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>).

Don't output anything else. Let's think step by step.

[(milk, grasp), (bin2, place), (bin3, place), (cereal_box, grasp)]

This plan ([(milk, grasp), (bin2, place), (bin3, place), (cereal_box, grasp)]) failed: agent success rate after 10K episodes: 0. Replan but make sure to still solve the overall task. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>). Don't output anything else. Let's think step by step.

Plan: [("milk", "grasp"), ("bin 2", "place"), ("cereal box", "grasp"), ("bin 3", "place")]

Example #2:

Prompt:

Stage termination conditions: (grasp, place).

Task description: The milk goes into bin 2 and the cereal box in bin 3. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>).

Don't output anything else. Let's think step by step.

[("milk", "grasp"), ("bin 2", "place"), ("cereal box", "grasp"), ("bin 3", "place")]

This plan ([(milk, grasp), (bin2, place), (cereal_box, grasp), (bin3, place)] failed: agent success rate after 10K episodes: 0. Replan but make sure to still solve the overall task. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>). Don't output anything else. Let's think step by step.

Plan: [("cereal box", "grasp"), ("bin 3", "place"), ("milk", "grasp"), ("bin 2", "place")]

We leave this extension to be explored in more detail in future work.

Comment

“How would PSL react to the situation where the agent fails to reach the termination condition?”

The agent will keep attempting to solve the task until it reaches the maximum number of steps for the task. Based on the LLM plan, we compute the maximum number of steps as 25 * (number of stages predicted by the LLM). The agent constantly tries to achieve the next component of the sub-task and only moves on to the next planned sub-task when the termination condition succeeds. As we show in Section 5.2 of the main paper, removing the termination conditions results in diminished learning performance (by 30%) because the policy learns dithering behaviors.
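For clarity, a minimal sketch of the stage budget and termination-gated stage advancement described above; env, motion_plan_to, policy, and is_stage_complete are hypothetical placeholders, not the authors' code.

```python
# Illustrative PSL episode loop (sketch only; helper names are hypothetical).
STEPS_PER_STAGE = 25  # max episode length = 25 * number of LLM-predicted stages

def run_psl_episode(env, llm_plan, motion_plan_to, policy, is_stage_complete):
    obs = env.reset()
    max_steps = STEPS_PER_STAGE * len(llm_plan)
    stage, steps = 0, 0
    while steps < max_steps and stage < len(llm_plan):
        region, termination_condition = llm_plan[stage]
        obs = motion_plan_to(env, region)  # Sequencing Module: move to the region of interest
        # Learning Module: the local RL policy acts until the stage termination
        # condition is met or the episode step budget runs out.
        while steps < max_steps and not is_stage_complete(obs, termination_condition):
            obs, reward, done, info = env.step(policy(obs))
            steps += 1
        if is_stage_complete(obs, termination_condition):
            stage += 1  # advance only once the termination condition succeeds
    return stage  # number of stages completed within the budget
```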

“What would happen if there are more than enough terms for the LLM to choose from, for example, unlearnable skill terms that may confuse the LLM in choosing?”

The LLM chooses locations in the scene (which are all present in the prompt and estimated with vision) and stage termination conditions (a super-set of what is necessary is given in the prompt). We find that the LLM is capable of selecting the minimal set when prompted to use only the necessary subset of skills. We include an example below.

Prompt:

Stage termination conditions: (grasp, place, pull, push, turn, slide, flip, burn).

Task description: The milk goes into bin 2 and the cereal box in bin 3. Give me a simple plan to solve the task using only the stage termination conditions. Make sure the plan follows the formatting specified below and make sure to take into account object geometry. Formatting of output: a list in which each element looks like: (<object/region>, <stage termination condition>). Don't output anything else. Don’t include any stage termination conditions that are not necessary to solve the task.

Plan: [("milk", "grasp"), ("bin 2", "place"), ("cereal box", "grasp"), ("bin 3", "place")]

“I believe that adding O^local for E2E and RAPS would result in a more convincing conclusion that PSL performs better”

We perform this experiment across four tasks (RS-Lift, RS-Door, RS-Can, RS-NutRound) and include the results in the link below. In general there is little to no performance improvement for RAPS [5] or E2E [6] across the board. The additional local view marginally improves sample efficiency but it does not resolve the fundamental exploration problem for these tasks. Regardless, we will update our results to use O^{local} and O^{global} across the board for the final version of the submission.

Plot Link: https://drive.google.com/file/d/1oV2x7hkRGaQ9Qb2-Q1p9doOjx_DztMhH/view?usp=sharing

[1] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al. "Inner monologue: Embodied reasoning through planning with language models." Conference on Robot Learning, 2022.

[2] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, C. Fu, K. Gopalakrishnan, K. Hausman, A. Herzog, D. Ho, J. Hsu, J. Ibarz, B. Ichter, A. Irpan, E. Jang, R. Jauregui Ruano, K. Jeffrey, S. Jesmonth, N. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, K.-H. Lee, S. Levine, Y. Lu, L. Luu, C. Parada, P. Pastor, J. Quiambao, K. Rao, J. Rettinghouse, D. Reyes, P. Sermanet, N. Sievers, C. Tan, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, S. Xu, M. Yan, A. Zeng. "Do As I Can and Not As I Say: Grounding Language in Robotic Affordances." Conference on Robot Learning, 2022.

[3] J. Zhang, J. Zhang, K. Pertsch, Z. Liu, X. Ren, M. Chang, S.-H. Sun, J. J. Lim. "Bootstrap your own skills: Learning to solve new tasks with large language model guidance." Conference on Robot Learning, 2023.

[4] M. Shridhar, J. Thomason, D. Gordon, Y. Bisk, W. Han, R. Mottaghi, L. Zettlemoyer, D. Fox, "Alfred: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020.

[5] M. Dalal, D. Pathak, R. Salakhutdinov. "Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives." NeurIPS, 2021.

[6] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.

Comment

Dear Reviewer,

We would like to follow up on our rebuttal as there is only one day remaining of the discussion period. If there are any outstanding concerns that you would like us to address, please let us know. Thank you and we look forward to your response.

Comment

I have carefully read the reviews submitted by other reviewers, and the rebuttal and the revised paper provided by the authors. I appreciate the efforts put into answering my questions and improving this submission. In that regard, I am raising my score to 8.

Comment

We appreciate the reviewer carefully reading our rebuttal responses and the revised paper and we thank the reviewer for raising their score.

Official Review (Rating: 6)

The paper proposes a new method/framework called Plan-Seq-Learn (PSL) for solving long-horizon robotics tasks. The key idea is to decompose long robotic manipulation tasks and then tackle each part with an appropriate method. Specifically, they combine an LLM for highly abstract task planning, an off-the-shelf visual pose estimator and motion planner (AIT*) for sequencing the sub-tasks, and RL for the sub-tasks themselves. This allows PSL to leverage the advantages of each module. Extensive experiments show PSL can efficiently solve 20+ long-horizon robotics tasks, outperforming prior methods.

Strengths

  1. The approach sensibly leverages currently popular methods for robot learning - an LLM for high-level planning, a classical motion planner for efficient collision-free path planning, and RL for the contact-rich manipulation stage.
  2. The general framework is novel in combining these techniques, although each part is not entirely new, and the paper clearly explains how to use their advantages in solving long-horizon tasks.
  3. Extensive experiments show reasonable/good results regarding their claims and methods.

Weaknesses

  1. The long-horizon tasks seem to be divided into only 'grasp' and 'place' (from the paper and appendix); it is unclear whether there are more sub-tasks/skills into which the LLM divides tasks. From the webpage, I find other tasks besides the pick-and-place series, so I wonder how these are implemented.

Questions

  1. As the whole task is decomposed into stage 1, sequencing, stage 2, ..., stage n. Does it need to redesign the reward function of the RL process? Moreover, how does the sparse and dense reward influence the learning process?
Comment

We thank the reviewer for recognizing the novelty of our modular framework and our extensive experiments as well as for appreciating the strengths of our results on a wide range of tasks and domains.

“The long-horizon tasks seem to be divided into only 'grasp' and 'place' (from the paper and appendix); it is unclear whether there are more sub-tasks/skills into which the LLM divides tasks. From the webpage, I find other tasks besides the pick-and-place series, so I wonder how these are implemented.”

To clarify, PSL is not limited to only using ‘grasp’ and ‘place’ termination conditions. In general, it can take advantage of any stage termination condition when performing LLM planning. The RL agent can then learn the corresponding local control policies. We simply require the following: a function that takes in the current state or observation(s) of the environment and evaluates a binary success criterion, as well as a natural language descriptor of the condition for prompting the LLM (e.g. ‘grasp’ or ‘place’). Then the LLM can subdivide the task based on these conditions. We have updated the paper to make this point clear.

Furthermore, in our experimental results, we do not only use ‘grasp’ and ‘place’ conditions. We also use conditions such as ‘push’ (OS-Push), ‘open’ (RS-Door, K-Microwave, K-Slide) or ‘turn’ (K-Burner). These conditions can be readily estimated using vision: they are all dependent on pose estimates. We have updated the paper to include detailed descriptions of how to estimate pushing, opening and turning. Finally, we refer the reviewer to our reply to Reviewer KErm, in which we experimentally show that our method is not dependent on the user specifying exactly the termination conditions that are necessary for the task. We find that providing a superset of termination conditions will also work; the LLM will only output sub-tasks that are necessary for solving the given task.
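A minimal sketch of this interface: each stage termination condition pairs a natural-language descriptor (used only in the LLM prompt) with a binary check on the current state or observation. The dataclass and the example thresholds below are assumptions for illustration, not the authors' implementation.

```python
# Illustrative stage termination condition interface (sketch only; thresholds are made up).
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class StageTerminationCondition:
    descriptor: str                                 # natural language name for the LLM prompt, e.g. "grasp"
    is_satisfied: Callable[[Dict[str, Any]], bool]  # binary success check on the state/observation

# Example conditions estimated from poses (values are illustrative).
grasp = StageTerminationCondition(
    "grasp",
    lambda obs: obs["gripper_closed"] and obs["object_height"] > obs["table_height"] + 0.02,
)
open_door = StageTerminationCondition(
    "open",
    lambda obs: obs["hinge_angle"] > 0.3,  # pose-based check for an articulated object
)
```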

“Does it need to redesign the reward function of the RL process?”

As we note in Section 4.3 of the main paper, we do not modify the reward function of the environment for any task. Instead, we use the Plan and Sequence modules to move the RL agent to relevant regions of space and specify conditions for exiting those regions (stage termination conditions). The RL agent then learns local interaction based on the overall task reward.

“Moreover, how does the sparse and dense reward influence the learning process?”

Our experiments include results on dense (Robosuite [1], Metaworld [2], Obstructed Suite [3]) and sparse (Kitchen [4]) reward tasks. We find that PSL performs well in both settings. One reason why PSL is capable of effectively solving sparse reward tasks is that it addresses one major component of the exploration problem in sparse settings: finding the object with which it needs to interact. By initializing the RL agent close to the region of interest, we greatly increase the likelihood that random exploration leads to coincidental successes which can be used to bootstrap learning.

[1] Y. Zhu, J. Wong, A. Mandlekar, and R. Martin-Martin. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.

[2] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, S. Levine. "Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning." Conference on Robot Learning, pages 1094–1100, PMLR, 2020.

[3] J. Yamada, Y. Lee, G. Salhotra, K. Pertsch, M. Pflueger, G. S. Sukhatme, J. J. Lim, P. Englert. "Motion Planner Augmented Reinforcement Learning for Obstructed Environments." Conference on Robot Learning, 2020.

[4] J. Fu, A. Kumar, O. Nachum, G. Tucker, S. Levine. "D4RL: Datasets for Deep Data-Driven Reinforcement Learning." arXiv preprint arXiv:2004.07219, 2020.

Comment

Dear Reviewer,

We would like to follow up on our rebuttal as there is only one day remaining of the discussion period. If there are any outstanding concerns that you would like us to address, please let us know. Thank you and we look forward to your response.

Comment

We thank the reviewers for their valuable feedback and insightful comments. A summary of our work: we enable LLM guided RL agents to solve long-horizon robotics tasks from raw visual input by tracking high-level language plans using motion planners. We are glad the reviewers appreciated our extensive experiments and analyses as well as our experimental results. PSL achieves state-of-the-art results on 25 long-horizon tasks across 4 diverse domains, outperforming a variety of baselines that use classical planning, end-to-end RL and language planning. We emphasize that, to our knowledge, when training visuomotor RL policies, prior methods in the literature have been unable to solve the Robosuite Nut Assembly or Robosuite Pickplace tasks.

We give comprehensive responses to each review directly. In this section, we highlight the key points raised by the reviewers and summarize the outcomes of various new experiments they recommended:

  • Extending PSL from 5 to 10 stage tasks (oGjq): By modifying the action space of the Kitchen environment to use end-effector control (as done in the baseline RAPS [1]) instead of joint space as done in our original experiments, we demonstrate that it is possible for PSL to learn to solve tasks with up to 10 stages with 100% success rate.
  • Susceptibility to the quality of the high-level plan (KErm, oGjq): We evaluate PSL’s performance when provided an incorrect high-level plan. On 2 out of 3 Metaworld tasks, PSL is still able to learn to solve the task, achieving performance comparable to that of the E2E [2] baseline.
  • Baselines with local+global observations perform the same (KErm): On a set of 4 tasks, we run the baselines E2E [2] and RAPS [1] using O^{local} as well as O^{global} and find that performance is similar across each task. We will re-run all these baselines with the combined observations across all of our tasks for the final version of the paper.

We have updated the main paper with the suggested changes and highlighted the changed parts in red text.

[1] M. Dalal, D. Pathak, R. Salakhutdinov. "Accelerating Robotic Reinforcement Learning via Parameterized Action Primitives." NeurIPS, 2021.

[2] D. Yarats, R. Fergus, A. Lazaric, and L. Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. arXiv preprint arXiv:2107.09645, 2021.

AC Meta-Review

The paper proposes Plan-Seq-Learn (PSL) to solve long-horizon robotics tasks. The main idea is to decompose long robotic manipulation tasks and then tackle each part with a suitable method: an LLM is used for highly abstract task planning, a visual pose estimator and motion planner are used for sequencing sub-tasks, and RL is used for each sub-task. Extensive experiments show PSL can efficiently solve 25 long-horizon robotics tasks, substantially outperforming prior methods.

Overall, this is a new and convincing idea to solve long-horizon robotics tasks. Most of the questions and issues raised by the reviewers were addressed by the authors' responses. This paper is worth publication.

Why Not a Higher Score

Real robotics experiments are absent, as experiments are only in simulation. Besides, similar approaches have been explored in the game domain, as far as I know.

Why Not a Lower Score

Most of the questions and issues raised by the reviewers were addressed by the authors' responses. The reviewers reached a consensus of acceptance.

Final Decision

Accept (poster)