PaperHub
5.5 / 10
Rejected · 4 reviewers
Min 3 · Max 8 · Std 1.8
Scores: 3, 8, 5, 6
Confidence: 4.0
ICLR 2024

GenBot: Generative Simulation Empowers Automated Robotic Skill Learning at Scale

OpenReview · PDF
Submitted: 2023-09-23 · Updated: 2024-02-11

Abstract

Keywords
Robot Learning; Generative Model; Robotic Simulation

Reviews and Discussion

Review
Rating: 3

The paper studies the problem of learning diverse robotic skills through automatic task and reward generation. Specifically, the proposed method utilizes LLM to produce the task setups and identify the skill sequence for solving the task. For RL skills, LLM is also prompted to generate the reward function. Experiments are performed in simulated tasks to validate the idea.

Strengths

The idea of automatically generating new tasks for acquiring diverse robotic skills is novel and interesting;

The pipeline is straightforward and clear;

The paper is well-written and easy to read.

Weaknesses

The proposed method incorporates random sampling of task objects and the robot agent during the seeding stage, taking into account the requisite skills for the desired tasks. This consideration of required skills, or the lack thereof, may potentially enhance learning efficiency;

Employing LLMs to generate task proposals can result in incomplete task information. For instance, in the context of a task such as "bowl heating," the LLM may generate a description that overlooks crucial details, like setting the timer;

The generation of LLM-based reward functions heavily hinges on in-context prompts, which are derived from human comprehension of the task. This approach may necessitate significant human input and potentially restrict its applicability to novel tasks and domains.

The method lacks a formal mathematical formula or algorithmic description.

The experiments are insufficient to validate the idea thoroughly.

Questions

The low-level RL skills use object states as observations, which seems not able to fully utilize the advantages of diverse visual appearance introduced by task generation, could authors provide further explanation about this?

Additionally, could the paper provide insights into how these acquired skills can be effectively reused to accomplish new goals when faced with a novel task during testing? Reporting experimental results in novel task settings through skill reusing would further strengthen the paper;

The proposed pipeline seems to be cascaded, where the errors or infeasibility produced at the task generation stage could lead to future difficulties in skill learning, have the authors ever considered any strategies to improve the interplay between task generation and skill acquisition?

Comment

Dear reviewer,

Thank you for your constructive feedback and questions! We really appreciate your effort in reviewing our work, and it is definitely helpful for improving the quality of our paper.

We are actively working on the rebuttal and adding experiments to address your concerns. Meanwhile, we would like to ask for some further clarifications so that we can address your concerns more clearly. We are not sure if we fully understand the first point you raised in the weakness section:

The proposed method incorporates random sampling of task objects and the robot agent during the seeding stage, taking into account the requisite skills for the desired tasks. This consideration of required skills, or the lack thereof, may potentially enhance learning efficiency;

The object sampling is for seeding the LLM to generate semantically meaningful tasks, considering the affordance of the object. Could you clarify a bit what you mean by this sentence, and what specific weakness you are concerned with? Any further explanation would be much appreciated!

Thank you! Authors

Comment

Thanks for the authors' reply. A drawback of the suggested framework involves employing a random sampling approach for object selection and task generation. Is it feasible to integrate feedback from the skill-learning phase to influence object sampling during the task setup phase? Could such a design enhance the efficiency of skill learning? For instance, the framework might need to prioritize frequent sampling of challenging objects and tasks to ensure the acquired policy becomes proficient. Kindly correct any misinterpretations on my part.

Comment

Thank you for your clarification! We will try to address your concern and keep you updated!

Comment

The proposed pipeline seems to be cascaded, where the errors or infeasibility produced at the task generation stage could lead to future difficulties in skill learning, have the authors ever considered any strategies to improve the interplay between task generation and skill acquisition?

Thank you for this suggestion! We have added a number of experiments analyzing the validity and success rate of each stage, as described above. We hope these quantitative numbers help demonstrate the robustness of our system. Another point we would like to clarify is that, in the downstream policy distillation stage, we could incorporate additional feedback, e.g., from the latest GPT-4V model, to help filter out unsuccessful skills and distill using only successful demonstrations. Another interesting idea is to use feedback from skill learning (success/failure, duration taken, etc.) to iteratively improve the scene configuration and fix potential issues in scene generation, as you suggested earlier. Such iterative feedback is also used in contemporary work using LLMs for reward generation [1]. We leave such improvements to future work.

[1] Ma, Yecheng Jason, et al. "Eureka: Human-Level Reward Design via Coding Large Language Models." arXiv preprint arXiv:2310.12931 (2023).

Thank you again for your constructive feedback and for suggesting these potential improvements! They are really helpful for improving the quality of our submission. We hope our response, clarifications, and revised manuscript address your major concerns.

Comment
  • Skill learning success rate: We added a quantitative analysis on the skill learning success rate, i.e., given a set of generated tasks (with correct training supervisions), if we run the skill learning pipeline for a fixed number of times, how many of those would be successful in learning the skill. The success in learning a skill is determined by a human evaluator watching the video of the learned policy. We test the success rate of our skill learning pipeline on a subset of 50 generated object manipulation tasks, and achieve an average success rate of 0.75, i.e., if we run the skill learning pipeline for 4 times, 3 of the runs will successfully learn the skill. Details on the individual success rate on each of the 50 tasks can be found in Appendix D.2.

We hope these newly added quantitative results and analysis could help strengthen the quality of our paper. Thank you for your feedback!

6. Questions

The low-level RL skills use object states as observations, which seems not able to fully utilize the advantages of diverse visual appearance introduced by task generation, could authors provide further explanation about this?

This is a very good question! In fact, this is done intentionally, and it is actually one of the key advantages in scaling up demonstration collection in simulation. As we stated above, the main goal of our framework is to automate and scale up the generation of demonstrations associated with diverse tasks and environments, rather than building an ultimate unified policy conditioned on language and vision input. The latter will become possible once we have a scalable way of collecting demonstration data. Therefore, in order to generate the demonstration for each task, having access to these low-level states in simulation is a clear advantage: it reduces the dimension of the observation space in RL training, leading to much more efficient policy learning compared to training a vision-based policy. In fact, this type of "cheating" for obtaining demonstrations is a common technique used in numerous related works [1][2]. On the other hand, the diversity in visual appearance of the generated tasks will be used in the downstream distillation, which trains a unified, vision-conditioned policy from the generated demonstrations via imitation learning, potentially leading to a multi-task generalist policy that operates on vision and language input. The latter, as we discussed above, has been a long-standing goal of robotics research, and we believe our proposed paradigm and framework for automating demonstration generation takes one important step towards this goal.

[1] Chen, Tao, Jie Xu, and Pulkit Agrawal. "A system for general in-hand object re-orientation." Conference on Robot Learning. PMLR, 2022. [2] Lin, Xingyu, et al. "Diffskill: Skill abstraction from differentiable physics for deformable object manipulations with tools." arXiv preprint arXiv:2203.17275 (2022).
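To make the two-stage setup above concrete, the sketch below shows one way demonstrations from state-based expert policies could be distilled into a vision-conditioned student via behavior cloning. The network architecture, action dimension, and dataset format are illustrative assumptions and not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VisionPolicy(nn.Module):
    """Toy vision-conditioned student policy (placeholder architecture)."""
    def __init__(self, action_dim: int = 7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, action_dim)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(image))

def distill(student: nn.Module, demos, epochs: int = 10, lr: float = 1e-4):
    """Behavior cloning on (image, expert_action) pairs collected in simulation
    by the per-task state-based policies."""
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for image, expert_action in demos:
            loss = nn.functional.mse_loss(student(image), expert_action)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```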

Additionally, could the paper provide insights into how these acquired skills can be effectively reused to accomplish new goals when faced with a novel task during testing? Reporting experimental results in novel task settings through skill reusing would further strengthen the paper;

Thank you for asking this. As we discussed above, the focus of our work is to generate demonstrations, rather than to study training a unified policy in a lifelong/continual learning setting. Therefore, generating sufficiently diverse tasks and efficiently generating demonstrations for any generated task is our main focus. Skill reuse for novel tasks or novel goals is an orthogonal research direction and is beyond the scope of this work. That said, we agree that this is an interesting research direction, and we believe our data generation pipeline could benefit research in related directions. On a related note, a relevant improvement to the current pipeline could be to generate tasks in a more structured curriculum, generating new tasks based on similar previous tasks and re-initializing policies from existing skills learned in similar task settings. This could potentially improve generation efficiency, and we are happy to explore it in the next step.

Comment

4. Formula or algorithm

The method lacks a formal mathematical formula or algorithmic description.

We respectfully contend that GenBot, as a system paper, primarily demonstrates a new paradigm and its initial realization, rather than delving into specific mathematical details. Given that it employs readily replaceable foundation models for its backend modules, we argue that a formal mathematical formulation is not mandatory in its current form.

However, we definitely agree that a more structured algorithmic description, along with a clearer structure for the method description, could greatly help convey our method and its major components. Per your suggestion, we have added a detailed algorithm describing the whole pipeline in Appendix C. In addition, we have extensively revised Section 3.2 into a structured bullet list with a clearer logical flow. We hope the updated manuscript describes our method more clearly!

5. Request for more thorough experiments

The experiments are insufficient to validate the idea thoroughly.

Thank you for bringing up this concern! We acknowledge that our initial submission could definitely benefit from more extensive quantitative experiments. We have significantly expanded the experimental section in our revised paper, including statistical evaluations of generated tasks, scenes, and skills, as well as more failure analyses. We explain in detail below:

  • Diversity: In our original submission, we compared with several popular human-designed manipulation task suites in terms of semantic diversity in language space, using both Self-BLEU and embedding similarity in Table 1. We have now added a diversity comparison in image space as well, using both pre-trained ViT and CLIP models to compute the embedding similarity of the rendered images of the scenes (see updated Table 1 and associated explanations). We copy the main results below; a sketch of how such a diversity metric can be computed is given after this list. Compared to other human-designed task suites, GenBot achieves a lower image-space embedding similarity measured using both ViT and CLIP. We believe our current comparison in both language and image space, with different metrics and different embedding models, sufficiently demonstrates the diversity of our generated tasks.
| | GenBot | Behavior-100 | RLbench | MetaWorld | Maniskill2 |
| --- | --- | --- | --- | --- | --- |
| Number of Tasks | 106 | 100 | 106 | 50 | 20 |
| Task Description (Self-BLEU) ↓ | 0.284 | 0.299 | 0.317 | 0.322 | 0.674 |
| Task Description (Embedding Similarity) ↓ | 0.165 | 0.210 | 0.200 | 0.263 | 0.194 |
| Scene Image Embedding Similarity (ViT) ↓ | 0.193 | 0.389 | 0.375 | 0.517 | 0.332 |
| Scene Image Embedding Similarity (CLIP) ↓ | 0.762 | 0.833 | 0.864 | 0.867 | 0.828 |
  • Scene validity: In order to verify the validity of the generated scenes, we added a round of human evaluation on whether each generated scene aligns with the language annotations, and whether the scene configurations and retrieved objects are correct for the task (see updated parts in sections 4.2 and 4.3). We generated 155 tasks and performed human evaluations on them (full list in Appendix D). There are 12 failures due to incorrect scene generation, leading to a 92% validity rate for our generated scenes. (See scene validity in section 4.3 for failure case analysis.)
  • Decomposition and reward function validity: In order to further verify the validity of the generated task decomposition and reward functions, we added an additional round of human evaluations on the generated training supervisions, by asking a human expert to carefully inspect whether the generated decompositions and reward functions are reasonable and could in principle lead to successful skill learning. We also manually inspected the generated decompositions and reward functions, and found 6 failure cases in the 155 generated tasks (a 96% validity rate). We revised corresponding paragraphs in section 4.2 and 4.3 with failure analysis, and included full details of all the tasks and additional statistics in Appendix D.
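As referenced above, here is a minimal sketch of how a Self-BLEU diversity score over task descriptions could be computed. It is an illustrative reimplementation using NLTK, not the exact evaluation script used in the paper; lower scores indicate a more diverse task set.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(descriptions):
    """Average BLEU of each task description against all other descriptions."""
    smooth = SmoothingFunction().method1
    tokenized = [d.lower().split() for d in descriptions]
    scores = []
    for i, hypothesis in enumerate(tokenized):
        references = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(references, hypothesis,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# Example with three made-up task descriptions:
print(self_bleu([
    "open the microwave door and place the bowl inside",
    "flush the toilet by pressing the lever",
    "slide the window half-way open",
]))
```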
Comment

2. Potential incomplete task information using LLMs

Employing LLMs to generate task proposals can result in incomplete task information. For instance, in the context of a task such as "bowl heating," the LLM may generate a description that overlooks crucial details, like setting the timer;

We would like to argue that generating tasks that account for these semantically meaningful details, as well as object affordances, is actually a clear advantage of using LLMs for task generation compared to prior approaches that rely on heuristic-based procedural generation. Since LLMs have developed reasonable common sense capabilities, they are able to produce these crucial low-level steps while also providing natural language descriptions of them. In fact, in both Figures 1 and 2, we show that for the heating task, the generated task incorporates the crucial step of setting the timer in both the task decomposition and the language description.

If you are referring to necessary fine-grained details such as the exact angle the knob needs to be turned, we acknowledge that, due to the lack of information about the knob, there is no pre-defined mapping from the turning angle to the exact time set in our current asset data. In these cases, our pipeline self-generates a reasonable scale for the timer and, correspondingly, the amount to turn during each generation run. Such generation also allows the generated data to cover a range of reasonable mapping scales. As long as such scale information is given at deployment time, a distilled policy will be able to condition on this extra information. We would also like to point out that other similar tasks, such as "turning the clock by 1 hour" or "sliding the window half-way", will produce accurate information, since the LLM encodes the common sense knowledge of computing the extent of "half-way" given a window length, and of what angle corresponds to 1 hour. In addition, tasks with such fine-grained details have rarely been presented in previous robotic demonstration datasets (such as Metaworld or RLBench), as these prior works mostly consider changing the coarse spatial relationship between objects in the scene.
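As a small illustration of the kind of common-sense mapping mentioned above (not code from the paper), turning a clock forward by one hour corresponds to a fixed 30° rotation of the hour hand, and "half-way" for a sliding window is half of the joint's travel range:

```python
def hour_hand_rotation(hours: float) -> float:
    # The hour hand sweeps 360 degrees over 12 hours, i.e. 30 degrees per hour.
    return 30.0 * hours

def window_slide_target(joint_lower: float, joint_upper: float,
                        fraction: float = 0.5) -> float:
    # "Half-way" is half of the prismatic joint's travel range.
    return joint_lower + fraction * (joint_upper - joint_lower)

print(hour_hand_rotation(1))          # 30.0 degrees
print(window_slide_target(0.0, 0.6))  # 0.3 (assuming a 0.6 m slide range)
```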

3. Requires human input

The generation of LLM-based reward functions heavily hinges on in-context prompts, which are derived from human comprehension of the task. This approach may necessitate significant human input and potentially restrict its applicability to novel tasks and domains.

Regarding human input: we would like to clarify that our prompts, though seemingly lengthy, are not as laborious as they appear. The prompts are relatively long mainly because there are multiple stages we would like to automate: task generation, scene configuration generation, object configuration generation, and reward generation. For each of these stages, we only provide 2-4 in-context examples to regulate the response of the LLM. In fact, contemporary work using LLMs for policy learning typically uses a comparable or larger number of in-context examples (e.g., Voxposer [1] used 10 examples).

[1] Huang, Wenlong, et al. "Voxposer: Composable 3d value maps for robotic manipulation with language models." arXiv preprint arXiv:2307.05973 (2023).
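To illustrate what a stage with a couple of in-context examples might look like, here is a hypothetical prompt skeleton for the reward-generation stage; the actual prompts, field names, examples, and state accessors used by GenBot may differ.

```python
# Hypothetical prompt skeleton; the real GenBot prompts may differ.
REWARD_PROMPT_TEMPLATE = """You write Python reward functions for robotic sub-tasks.
Use only the provided simulator state accessors.

Example 1:
Sub-task: open the oven door.
Reward:
def compute_reward(state):
    # encourage the door joint to reach its upper limit
    return -abs(state.joint_angle("oven_door") - state.joint_upper("oven_door"))

Example 2:
Sub-task: move the bowl next to the microwave.
Reward:
def compute_reward(state):
    return -state.distance("bowl", "microwave")

Sub-task: {subtask}
Reward:
"""

def build_reward_prompt(subtask: str) -> str:
    return REWARD_PROMPT_TEMPLATE.format(subtask=subtask)

print(build_reward_prompt("turn on the faucet"))
```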

Regarding novel tasks: the in-context prompts are mainly used to regularize the output of the LLM so that its response follows an ideal format. We have multiple seeding strategies to increase the diversity of the generated tasks. Note that, since LLMs encode common sense knowledge for reasoning about object affordances, we can feed the LLM any novel sampled object that is not in the in-context prompts, and the LLM can reason about its affordance, generate a correspondingly meaningful task, and then propose the other objects needed for the task. We believe the diversity of the generated tasks is sufficient to demonstrate that the generalization and reasoning capabilities of the LLM are helpful in generating novel tasks. For example, while the prompt we feed to the LLM only includes tasks involving a single type of articulated object (an oven), the generated tasks include various semantically novel interactions and tasks. We present several examples below:

  • flushing the toilet
  • retrieving goldbar from the safe
  • adjusting time on clock
  • changing lighting direction of a lamp
  • heating soup with microwave
  • adjusting the lever for operating coffee machines
  • extending suitcase handles
  • storing objects in fridges

All of these tasks have novel semantic meanings and interaction types that are not present in the given prompt, demonstrating that this task generation process indeed extracts the common sense knowledge embedded in LLMs for automated task generation, which would otherwise not be possible with heuristic-based procedural generation pipelines. We hope our explanation on this matter helps address your concern.

Comment

1. Potential improvement

The proposed method incorporates random sampling of task objects and the robot agent during the seeding stage, taking into account the requisite skills for the desired tasks. This consideration of required skills, or the lack thereof, may potentially enhance learning efficiency; A drawback of the suggested framework involves employing a random sampling approach for object selection and task generation. Is it feasible to integrate feedback from the skill-learning phase to influence object sampling during the task setup phase? Could such a design enhance the efficiency of skill learning? For instance, the framework might need to prioritize frequent sampling of challenging objects and tasks to ensure the acquired policy becomes proficient. Kindly correct any misinterpretations on my part.

We really appreciate your suggestion on this! This is definitely an interesting direction we could explore. In fact, we are planning to incorporate curriculum task design using LLMs to provide a more structured hierarchy of task difficulties, which could potentially benefit downstream policy learning later.

That said, we respectfully argue that, as we discussed above, the main contribution of this paper is to generate task-specific policies (demonstrations) for a diverse range of tasks, instead of directly building a multi-task foundational policy model at the current stage. After all, a foundational multi-task policy model that can generalize to a range of diverse language-conditioned tasks is by itself a long-standing and ambitious goal of robot learning research, where current SOTA multi-task policies [1][2] usually handle only tens of different tasks when considering realistic manipulation settings beyond pick-and-place while maintaining a satisfactory success rate. In fact, one major limitation for building such a unified policy is the lack of diverse demonstration data, which our work is trying to address.

Therefore, the focus of our work is not on learning a proficient policy to acquire a range of skills continuously, but rather producing demonstration data for this goal. Furthermore, we acknowledge that as an initial implementation of this paradigm, there are numerous areas for future enhancement in this long-term research endeavor. However, we argue that these opportunities for improvement should not be viewed as drawbacks of our framework, but rather as integral steps in future development towards this direction.

Still, we agree that as a data engine, it is practically beneficial to be able to control the difficulty of the generated tasks and allocate resources towards more challenging settings. One simple heuristic for achieving this is to use the success rate of the learned skill as a criterion and generate more diverse tasks initialized from those challenging settings. We would also like to argue that, in order to eventually obtain a unified policy model, factors beyond task difficulty, including the diversity of tasks, objects, scenes, and language descriptions, also carry significant importance for ensuring a sufficiently wide training data distribution so that the policy can generalize well. We believe our current design presents value in optimizing these factors.

We thank you again for your valuable suggestion! We hope our response helps clarify and explain our framework better.

[1] Shridhar, Mohit, Lucas Manuelli, and Dieter Fox. "Perceiver-actor: A multi-task transformer for robotic manipulation." Conference on Robot Learning. PMLR, 2023.

[2] Gervet, Theophile, et al. "Act3d: Infinite resolution action detection transformer for robotic manipulation." arXiv preprint arXiv:2306.17817 (2023).

Comment

Thank you for your constructive criticisms and suggestions! We are grateful that you found our idea novel and interesting, and our pipeline clear. We would also like to thank you for your swift response in clarifying our questions. In order to address your concerns, we have extensively revised our paper and added a number of additional quantitative experiments.

More context and explanation

Before we attempt to address your concerns in detail, we would like to take this chance to further elaborate on the main motivation of our proposed framework, which could address possible confusion in the initial submission and provide clearer context for the individual responses to your comments below. Unlike the tremendous recent progress in research areas such as NLP and vision, robotics has not yet seen a comparable breakthrough in building a generic (or foundation) policy model that operates on language and vision input and executes the necessary actions to perform a variety of tasks. There are multiple challenges that robotics researchers are facing, and among them, the lack of demonstration data across diverse tasks and settings is probably one of the most critical challenges preventing us from scaling up and building a unified policy model.

Our work is a step towards addressing this data problem by scaling up demonstration data collection in simulation. Prior works on designing tasks and environments in simulation usually require laborious human effort, or resort to rule-based procedural generation, which does not consider semantically rich information related to object affordances, scene component distributions, and natural language descriptions of tasks. GenBot attempts to leverage the common sense knowledge and generative capabilities of the latest foundation models to automate such scalable data collection in simulation. Therefore, GenBot is not trying to serve as a challenging robotic benchmark, nor to tackle the lifelong/continual learning problem of robotic policy learning. Rather, it aims to build an automated data generation engine, so that diverse demonstrations associated with different semantically meaningful tasks and various environments can be obtained with minimal human effort. We believe the results we show in the paper (both qualitative results on our website and quantitative evaluations, with several additional experiments we discuss below) help demonstrate the diversity of the generated tasks and thus the effectiveness of our proposed paradigm.

Therefore, our paper's primary contribution, from our perspective, is the introduction of the generative simulation paradigm, merging generative models with physics-based simulation to automate the whole pipeline for scalable and automated robot learning in simulation. Our focus therefore lies in establishing a comprehensive and adaptable framework, prioritizing its foundational development over specific backend models used in the initial implementation. We believe this is a new research direction and opens up possibilities for various future research opportunities and improvements. When designing the framework, we intentionally incorporated various potentially promising choices (e.g. database retrieval versus mesh generation), and the design is agnostic to the backend LLM/VLM/generative model used, therefore it's able to be further improved by replacing the backend modules with the latest progress in related research areas.

We hope our understanding and perspective on the proposed framework helps better elaborate our work. We try to address your individual concerns in detail below.

Comment

Dear Reviewer,

As the rebuttal period draws to a close, we would like to thank you again for your valuable comments and feedback. We hope our response and additional experiments and analysis address your concerns. Please let us know if you have any further comments.

Authors

Review
Rating: 8

GenBot proposes a method for automating the large-scale learning of diverse robotic skills through generative simulation. The approach is based on the propose-generate-learn cycle, where tasks and skills are proposed, related simulation environments are generated, and the agent learns policies. The study leverages foundation models for each of these components, enabling automation and demonstrating the learning of various manipulation tasks.

Strengths

  • This research introduces an automated pipeline capable of generating diverse tasks, which is considered novel.
  • The method for generating tasks is intriguing. It instructs GPT-4 on how objects can be manipulated and the meanings of each joint and link, enabling GPT-4 to learn the affordances of each object and generate tasks accordingly.
  • The subsequent modules are also very interesting and plausible. In the case of scene generation, it generates the right objects for each scenario through an LLM, and the entire pipeline is connected to load the 3D mesh assets, resulting in appropriate scenes for each situation.
  • The results of task decomposition in Figure 3 are very interesting, showing that the proposed method is effective in inducing meaningful skills.
  • GenBot is shown to generate a variety of tasks for skill learning, including object manipulation, locomotion, and soft body manipulation.

Weaknesses

  • The framework requires the assumption that each decomposed shorter-horizon sub-task is solvable by one of the policy categories within this framework. However, this is not a drawback unique to this research, so it may not be considered a weakness.

Questions

  • Are the physical characteristics of objects (e.g., weight) also determined by LLM during generation?
  • Were there cases where the decomposed shorter-horizon sub-task could not be resolved within one of the policy categories in this framework?
  • What is the overall computational cost involved in the framework?
Comment

Thanks a lot for your constructive and encouraging feedback! We are very grateful that you recognize the novelty of our proposed framework. Below we address your concerns.

1. Solving decomposed sub-task

The assumption that the decomposed shorter-horizon sub-task can be solvable by one of the policy categories within this framework is needed. However, this is not a drawback unique to this research, so it may not be considered a weakness.

Thank you for bringing this up! Indeed, LLM-based task decomposition and planning is by itself an active research area, and our system could benefit from the latest progress in this direction.

We have also added a number of additional evaluations and failure case analysis in section 4.3 and Appendix D, regarding the validity of the generated scenes, the stats of task decompositions and success rate of learned skills after decomposition.

2. Questions:

Are the physical characteristics of objects (e.g., weight) also determined by LLM during generation?

In our current pipeline, we use the default rigid contact model and density in the simulation, and we only ask the LLM to generate the size. That said, it is definitely possible to query for additional information such as mass and friction coefficients between different surface materials, which GPT-4 supports well, since the massive text data it has been trained on (e.g., Wikipedia) encodes such information. We will add this in a future version. Thanks for this suggestion!
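As a concrete illustration of the kind of per-object parameters that could be queried from the LLM and written into the task configuration (size today, optionally mass and friction in the future), here is a hypothetical config snippet; the actual schema used by GenBot may differ.

```python
# Hypothetical per-object configuration. GenBot currently asks the LLM only
# for the size and uses simulator defaults for density and contact; the extra
# fields below are the kind of physical properties that could also be queried.
object_config = {
    "name": "ceramic_bowl",
    "size": 0.15,            # characteristic length in metres (LLM-generated)
    "mass": 0.35,            # kg, a plausible LLM-suggested value
    "friction": 0.8,         # plausible LLM-suggested friction coefficient
    "asset_source": "objaverse_retrieval",
}
print(object_config)
```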

Were there cases where the decomposed shorter-horizon sub-task could not be resolved within one of the policy categories in this framework?

We added a round of quantitative analysis on the skill learning success rate, i.e., given a set of generated tasks (with correct training supervisions), if we run the skill learning pipeline for a fixed number of times, how many of those would be successful in learning the skill. The success in learning a skill is determined by a human evaluator watching the video of the learned policy. We test the success rate of our skill learning pipeline on a subset of 50 generated object manipulation tasks, and achieve an average success rate of 0.75, i.e., if we run the skill learning pipeline for 4 times, 3 of the runs will successfully learn the skill. Details on the individual success rate on each of the 50 tasks can be found in Appendix D.2.

What is the overall computational cost involved in the framework?

Task generation is simply done by calling GPT-4's APIs, and generating a single task can typically be done within a few minutes. If the task's subgoals can all be solved via planning, each task can typically be solved within 10 minutes. If certain subgoals require RL to solve, it usually takes around 2-3 hours for each RL-necessary step, and the total duration thus depends on both the number and nature of the subtasks. Taking these into account, a task typically takes 4-5 hours on average to solve. This is done using 8 threads of a CPU running at 2.5 GHz, meaning that each node with a 32-core (64-thread) CPU in a typical cluster could run 8 such jobs in parallel. We have added these details to Appendix D.1.

Thank you again for appreciating the value of our work and your constructive feedback! We hope the revised manuscript and the new experiments now further improve the quality of our paper.

Comment

Dear Reviewer,

As the rebuttal period draws to a close, we would like to thank you again for your valuable comments and feedback. We hope our response and additional experiments and analysis address your concerns. Please let us know if you have any further comments.

Authors

Review
Rating: 5

This paper introduces a method to learn diverse skills in simulation at scale. The method first uses an LLM to propose tasks from a pool of possible robots and objects. This is then used to generate assets and configure a scene. The generated task is decomposed by an LLM into sub-tasks, a solution method is automatically determined (e.g., RL or planning), and finally, the skill is learned in simulation.

The paper is clearly written, easy to follow, and proposes a promising idea. I find the method to be compelling and potentially very impactful, but I feel the experimental validation could be stronger.

Strengths

  • Clear narrative and mostly easy to follow.
  • Promising idea leveraging powerful, large-scale pre-trained model architectures

Weaknesses

  • Could be clearer in some sub-sections given the large number of moving parts
  • Experiments need to be more thorough for each of the components (task diversity, scene validity, training supervision validity, etc) I elaborate on these points further in the Questions section below.

Questions

The presented approach has a very large number of moving parts (involving many different large pre-trained models), which I tried to understand as follows. (This is also to ensure my correct understanding of the method, which the authors can correct if needed and further clarify in the paper).

  1. GPT-4 is used as the main LLM for task proposal. The assets can then be either retrieved or generated.
  2. For asset retrieval, Sentence-Bert is used to embed the description of the asset, which is then matched to the top-k similar embeddings for assets in the Objaverse. Since retrieval based on the language embedding may not be perfect, the asset is verified by captioning an image of it with a VLM, then feeding this along with the desired asset description and task description into GPT-4 to verify its validity. For reliability, two VLMs (Bard and BLIP-2) are used, and both captions must be valid for the asset to be used.
  3. If asset retrieval fails, then the mesh is generated; the method uses Midjourney for text-to-image generation followed by Zero-1-to-3 for image-to-mesh generation.
  4. GPT-4 is then used several further times, to decompose the proposed task into sub-goals; choose a method (RL, motion planning with action primitives, trajectory optimization); generate reward functions via in-context learning; and select the action space for some tasks (delta-translation or target location of the end effector).

I have the following comments:

  • The text in sections 3.1 and 3.2 could perhaps be shorter; I found it a bit difficult to follow with a lot of text including details interspersed throughout.

  • Consider an additional graphic for section 3.2, or at least a structured/bullet list. There are a lot of moving parts, and it took a while for me to understand how they all interact at the different stages.

  • There is some additional work on learning diverse skills in simulated environments (in some cases in addition to real-world) that should be cited.

    Jiang et al, 2022, VIMA: General Robot Manipulation with Multimodal Prompts

    Majumdar et al, 2023, Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?

    Bousmalis et al, 2023, RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

  • Given the significant complexity proposed, I think the experiments should be more thorough and quantitative to do justice to the complexity of the method. I address each of the experimental sections below:

Task diversity: Measuring task diversity using just the language descriptions may be prone to biases (eg. task suites may describe tasks differently, with different levels of verbosity). Providing the same measures in state space (eg. perhaps just the diversity of robot joint motions required to solve the tasks) or image space (eg. the final image showing goal configurations for each task) would be more convincing.

Scene validity: While Figure 4 shows the BLIP-2 scores for asset retrieval (and ablates some of the verification stages), it’s not clear (i) how much the method relies on retrieval versus generation (ie. when retrieval fails), (ii) how viable the generated assets are versus retrieval; and (iii) how important the different moving parts are beyond the specific verification stages (ie. how important is it to have both Bard and BLIP-2? Why Sentence-Bert?)

Training Supervision Validity: This would be more convincing with any quantitative results, even something like the average number of decompositions per proposed task; the average duration to solve each full task versus sub-goals; performance if solving the full task directly via planning (if possible), etc.

Skill Learning Performance: The quantitative results show improvement over an RL-only baseline, but it would be more helpful to show this over many more than 4 tasks; and also report the relative performance of all three methods (ie. separating trajectory optimization and planning over action primitives). Action primitives look pretty high-level: grasping, approaching and releasing a target object. How often is this route selected? And how much of the performance is due to working with an easier planning problem in a much higher-level action space rather than RL?

System: I think final system performance needs to be a quantitative analysis. As it stands, I unfortunately don’t have a good sense for how well the overall method works, in terms of how many different tasks it can solve and to what degree, and the nature of those tasks (eg. what objects, what behaviour/affordance, etc).

All in all, I was intrigued by the ideas proposed in this paper, and believe that such a method can be impactful. I would like to be in a position to accept this for publication, but feel that more quantitative analysis is required before that is possible.

Comment

4.4 Skill Learning Performance

The quantitative results show improvement over an RL-only baseline, but it would be more helpful to show this over many more than 4 tasks; and also report the relative performance of all three methods (ie. separating trajectory optimization and planning over action primitives). Action primitives look pretty high-level: grasping, approaching and releasing a target object. How often is this route selected? And how much of the performance is due to working with an easier planning problem in a much higher-level action space rather than RL?

Thank you for your suggestion. We have replaced the original Table 2 with the new Figure 5 to show quantitative numbers on 12 rigid-body manipulation tasks (instead of 4). For each task, we run each method with 4 different seeds and report the mean and std of the task return (we manually verified that the reward functions generated by GPT-4 are correct for the evaluated tasks). Due to the randomness in the skill learning process (sampling is used in the motion-planning-based action primitives, and RL inherently involves randomness during exploration and training), we also provide a quantitative analysis of the skill learning success rate, i.e., given a generated task (with correct training supervisions), if we run the skill learning pipeline multiple times, how many of those runs would successfully learn the skill. The success in learning a skill is determined by a human evaluator watching the video of the learned policy.

We find that allowing learning algorithm selection is beneficial for achieving higher performance in completing the tasks. When the option of motion-planning-based action primitives is removed and only RL is used, skill learning completely fails for most tasks. We test the success rate of our skill learning pipeline on a subset of 50 generated object manipulation tasks and achieve an average success rate of 0.75, i.e., if we run the skill learning pipeline 4 times, 3 of the runs will successfully learn the skill. Details on the individual success rate of each of the 50 tasks can be found in Appendix D.2. Note that since gradient-based trajectory optimization is used solely for soft-body manipulation, we do not include a comparison with it here. This is a conclusion derived from a series of prior works studying soft-body manipulation settings [1][2][3], all of which suggest that fine-grained shape-matching tasks involving soft bodies are better solved with gradient-based optimization techniques, while RL or planning would usually fail to solve them. This selection preference is injected into GPT-4's response by including the information in the prompt.

[1] Huang, Zhiao, et al. "Plasticinelab: A soft-body manipulation benchmark with differentiable physics." arXiv preprint arXiv:2104.03311 (2021).

[2] Lin, Xingyu, et al. "Diffskill: Skill abstraction from differentiable physics for deformable object manipulations with tools." arXiv preprint arXiv:2203.17275 (2022).

[3] Xian, Zhou, et al. "Fluidlab: A differentiable environment for benchmarking complex fluid manipulation." arXiv preprint arXiv:2303.02346 (2023).
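A minimal sketch of the per-subtask algorithm selection described above might look as follows. In GenBot the choice is made by GPT-4 via the prompt (including the soft-body preference), so this is only an illustration of the decision logic, and all field names are assumptions.

```python
def select_learning_algorithm(subtask: dict) -> str:
    """Illustrative dispatch mirroring the selection rules described above;
    in GenBot the actual choice is produced by GPT-4, not by this code."""
    if subtask.get("involves_soft_body"):
        # Prior soft-body work favours gradient-based trajectory optimization.
        return "trajectory_optimization"
    if subtask.get("contact_rich") or not subtask.get("primitive_parameterizable", True):
        # e.g. locomotion or turning an oven knob: continuous interaction favours RL.
        return "reinforcement_learning"
    # Otherwise use motion-planning-based action primitives (grasp/approach/release).
    return "motion_planning_primitive"

print(select_learning_algorithm({"contact_rich": True}))        # reinforcement_learning
print(select_learning_algorithm({"involves_soft_body": True}))  # trajectory_optimization
```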

4.5 System

I think final system performance needs to be a quantitative analysis. As it stands, I unfortunately don’t have a good sense for how well the overall method works, in terms of how many different tasks it can solve and to what degree, and the nature of those tasks (eg. what objects, what behaviour/affordance, etc).

In case you missed it earlier, we have included a number of different tasks proposed and solved using GenBot on our website. We have also added details of the generated tasks, their statistics, and analyses of failure cases in Appendices D.1, D.2, and D.3. In short, our system reaches a 92% scene validity rate, a 96% training supervision validity rate, and a 75% skill learning success rate, while presenting higher diversity than popular human-designed task sets.

We hope our response addresses your major concerns. We would like to thank you again for the critical and helpful feedback and suggestions; they are really helpful in improving the quality of our work!

Comment

4.3 Training Supervision Validity

This would be more convincing with any quantitative results, even something like the average number of decompositions per proposed task; the average duration to solve each full task versus sub-goals; performance if solving the full task directly via planning (if possible), etc.

Thanks for raising this! We added an additional round of human evaluations on the generated training supervisions, by asking a human expert to carefully inspect whether the generated decompositions and reward functions are reasonable and could in principle lead to successful skill learning. We found 6 failure cases in the 155 generated tasks (a 96% validity rate). We revised the corresponding paragraphs in sections 4.2 and 4.3, and included full details in Appendix D, including the average number of decompositions (Figure 6 in Appendix D). The errors can be categorized into 1) referring to undefined variables, e.g., referring to "standard time" in the task "Correct Clock Time" although it is undefined; and 2) rewards that do not encode the intended behavior. The latter includes incorrect semantic understanding of articulated object state, e.g., the task is to fold the chair, yet the generated reward actually encourages unfolding the chair due to a misunderstanding of the mapping between joint angle values and object state. We also find it hard to generate correct rewards for continuous motions such as "moving the robotic hand back-and-forth" or "knocking on the door". However, our reward generation component can be further improved by incorporating environment feedback into the system, following very recent work on reward generation using LLMs [1], which we leave as future work.
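To make the second failure category concrete, a hypothetical generated reward for "fold the chair" could reward the wrong joint direction. The snippet below is purely illustrative (not an actual reward produced by the system), assuming a chair asset whose backrest joint angle decreases as it folds:

```python
# Illustration of the joint-angle sign error described above (hypothetical code).
# Assume smaller values of the backrest joint angle correspond to "folded".

def compute_reward_incorrect(state):
    # Maximizing the joint angle actually encourages UNFOLDING the chair.
    return state.joint_angle("chair_backrest")

def compute_reward_intended(state):
    # Minimizing the joint angle encourages folding, as intended.
    return -state.joint_angle("chair_backrest")
```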

Solving the long-horizon tasks without sub-goals is not practically feasible, since solving the task requires reaching the intermediate goals specified by the sub-goal states and reward functions. Also, planning alone is not sufficient for solving many tasks, as we explained in section 3.3: reinforcement learning better suits tasks that are contact-rich and involve continuous interaction with other scene components, e.g., legged locomotion, or tasks where the required actions cannot simply be parameterized by discrete end-effector poses, e.g., turning the knob of an oven. Regarding the duration for solving a task: if the task's subgoals can all be solved via planning, each task can typically be solved within 10 minutes. If certain subgoals require RL to solve, it usually takes around 2-3 hours for each RL-necessary step, and the total duration thus depends on both the number and nature of the subtasks. Taking these into account, a task typically takes 4-5 hours on average. This is done using 8 threads of a CPU running at 2.5 GHz, meaning that each node with a 32-core (64-thread) CPU in a typical cluster could run 8 such jobs in parallel. We have added these details to Appendix D.1.

[1] Ma, Yecheng Jason, et al. "Eureka: Human-Level Reward Design via Coding Large Language Models." arXiv preprint arXiv:2310.12931 (2023).

Comment

Regarding Sentence-BERT: we used Sentence-BERT simply because it was one of the most popular SOTA text embedding models for information retrieval [3][4], and we used it out-of-the-box. It can be replaced by other language embedding models, but in practice we found it works well.
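For reference, a minimal out-of-the-box retrieval step with Sentence-BERT could look like the following. The checkpoint name and asset descriptions are placeholders, and the paper's exact retrieval setup may differ.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder checkpoint

asset_descriptions = [            # e.g. text annotations of candidate assets
    "a white ceramic bowl",
    "a stainless steel microwave oven",
    "a wooden office chair",
]
query = "a bowl to be heated in the microwave"

asset_emb = model.encode(asset_descriptions, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_emb, asset_emb)[0]    # cosine similarity to each asset
top_k = scores.topk(k=2)                          # top-k candidates for verification
for score, idx in zip(top_k.values, top_k.indices):
    print(asset_descriptions[int(idx)], float(score))
```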

In order to further justify the validity of the scenes generated by our current system, we also added a round of human evaluation on whether the generated scene aligns with the language annotations, and whether the scene configurations and retrieved objects are correct for the task (see updated parts in sections 4.2 and 4.3). We generated 155 tasks and performed human evaluations on them (full list in Appendix D.1). There are 13 failures due to incorrect scene generation, leading to a 92% validity rate for our generated scenes. The failures can be categorized into 1) limited asset functionality, where GenBot generates tasks that require functionalities the retrieved assets do not support, e.g., pouring contents from a dispenser whose lid cannot be separated from the body, or loading paper into a printer asset that does not have a movable tray; and 2) incorrect semantic understanding of articulated object joint states by GPT-4, e.g., for tasks such as opening a window or eyeglasses, the corresponding joint angle needs to be initialized to a value such that the asset is in the "closed" state. However, the mapping between joint angle values and the "closed" state is unique to each asset, and since GPT-4 has no such information, it might set wrong joint angle values, leading to incorrect initialization of the assets. Since we are upgrading our VLM backend to GPT-4V, we will also incorporate vision feedback to 1) use its visual recognition capability to better obtain the mapping from each joint to its correct semantic function, and 2) perform scene-level verification using GPT-4V, which currently still requires human effort. These remain future work.

[3] Thakur, Nandan, et al. "Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models." arXiv preprint arXiv:2104.08663 (2021).

[4] Rasiah, Vishvaksenan, et al. "SCALE: Scaling up the Complexity for Advanced Language Model Evaluation." arXiv preprint arXiv:2306.09237 (2023).

Comment

4.2 Scene validity

While Figure 4 shows the BLIP-2 scores for asset retrieval (and ablates some of the verification stages), it’s not clear (i) how much the method relies on retrieval versus generation (ie. when retrieval fails), (ii) how viable the generated assets are versus retrieval; and (iii) how important the different moving parts are beyond the specific verification stages (ie. how important is it to have both Bard and BLIP-2? Why Sentence-Bert?)

Before answering your questions in detail, we would like to further clarify a bit the motivation behind having both retrieval and generation for obtaining the object assets. This paper's primary contribution, from our perspective, is the introduction of a generative simulation paradigm, merging generative models with physics-based simulation to automate the whole pipeline for scalable and automated robot learning. Our focus therefore lies in establishing a comprehensive and adaptable framework, prioritizing its foundational development over specific backend models for the initial implementation. Considering the rapid advancements in LLMs and 2D/3D generative models, with SOTA models being updated on a weekly basis, our framework is designed to be able to upgrade seamlessly by using the latest progress in these orthogonal research areas; while GenBot, serving as the first comprehensive proof-of-concept, showcases the feasibility and effectiveness of this novel research direction.

While human-designed 3D asset databases are growing rapidly, with massive-scale databases like Objaverse-XL containing millions of objects, the assets are mostly designed by human artists and would require massive human effort to continue scaling up. On the other hand, automated text-to-3D generative pipelines utilize the massive 2D image resources available online, coupled with inductive biases such as view consistency, and hold better potential for further scaling up and automating the process of obtaining 3D assets. While at the time of submission retrieval from existing databases usually provides higher-quality meshes and faster speed, we believe the text-to-3D pipeline holds greater potential in the future, in terms of both asset fidelity and generation speed. Therefore, we intentionally designed our framework to support both pipelines and to verify that both can work. As a result, in the first implementation and design of GenBot, we tried to unify both pipelines and prioritize retrieval over generation: our initial design resorts to the generative pipeline whenever retrieval fails to pass the verification stage. However, in practice we found that for rigid-body manipulation settings, objects retrieved from Objaverse already work very well and the generative pipeline is rarely needed. Therefore, in practice, we now only use the text-to-3D pipeline for soft-body manipulation tasks, in order to obtain a more consistent and controllable target shape for the soft body under manipulation. Another reason is that the goal shape in soft-body manipulation does not require detailed texture, which is also one limitation of generative models at the time of submission. We have revised section 3.2 to make this clearer. We also included some visual results showing the quality of the generated assets in Table 5 in Appendix E.

We would also like to point to some of the latest work in 3D asset generation (One-2-3-45 [1], Instant3D [2], etc.) released after our initial submission, which dramatically improves generation speed and quality. The models in our current pipeline can easily be swapped for these latest models, which we will incorporate in future development.

[1] https://one-2-3-45.github.io/

[2] https://instant-3d.github.io/

Regarding using both Bard and BLIP-2: the main reason for using both is that at the time of submission GPT-4V was not yet available, and these two models were among the most advanced VLMs available, while each still presents certain limitations. BLIP-2 is trained on a relatively small dataset and does not generalize well to the data distribution present in Objaverse. Bard is a more general model, but it does not provide official APIs, and we used a third-party interface to access it, making its generation temperature beyond our control and hence leading to somewhat inconsistent responses. Also, the vision capability of Bard was still below a satisfactory level at submission time. We empirically observe that using both models to cross-validate makes the verification process more conservative and hence more robust.
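The conservative cross-validation described above boils down to a "both must pass" rule. The sketch below is illustrative only; the captioning calls and the consistency check (done via GPT-4 in the pipeline) are passed in as placeholder callables.

```python
def asset_passes_verification(asset_image, desired_description,
                              caption_with_blip2, caption_with_bard,
                              caption_matches_description):
    """Illustrative "both VLMs must agree" rule. The caption_with_* callables
    and caption_matches_description are placeholders for the VLM captioning
    calls and the GPT-4-based consistency check, respectively."""
    blip2_caption = caption_with_blip2(asset_image)
    bard_caption = caption_with_bard(asset_image)
    return (caption_matches_description(blip2_caption, desired_description)
            and caption_matches_description(bard_caption, desired_description))
```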

Now that GPT-4V has just released its API, we will replace these modules with GPT-4V in our future development, so that we can remove the messy dependencies on various different back-end VLMs, which could further improve the performance and robustness of our system.

Comment

4. More experiments and quantitative results

I think the experiments should be more thorough and quantitative to do justice to the complexity of the method

Indeed, we acknowledge that the experiments in our initial submission were not rigorous enough, potentially affecting the persuasiveness of our paper. We are grateful for your valuable suggestion on this matter! We have significantly expanded the experimental section in our revised paper with numerous new experiments and quantitative analyses, including quantitative evaluations of task diversity (4.1), scene validity (4.2), training supervision validity (4.3), skill learning (4.4), and the overall system (4.5). We discuss them in detail below.

4.1 Task diversity

Measuring task diversity using just the language descriptions may be prone to biases (eg. task suites may describe tasks differently, with different levels of verbosity). Providing the same measures in state space (eg. perhaps just the diversity of robot joint motions required to solve the tasks) or image space (eg. the final image showing goal configurations for each task) would be more convincing.

Thank you for your suggestion. Since measuring diversity in robot joint space could be largely affected by the range of joint motions and the size of the workspace (our generated tasks consider a room-sized workspace, which is naturally at an advantage compared to table-top manipulation settings), and there is no established, systematic method for measuring task diversity this way, we decided to add additional results measuring diversity in image space, following your suggestion. We revised sections 4.2 and 4.3 and included additional diversity comparisons in image space, using both ViT (ImageNet pre-trained) and CLIP models to compute the embedding similarity of the rendered images of the scenes (see updated Table 1 and associated explanations). We copy the main results here:

| | GenBot | Behavior-100 | RLbench | MetaWorld | Maniskill2 |
| --- | --- | --- | --- | --- | --- |
| Number of Tasks | 106 | 100 | 106 | 50 | 20 |
| Task Description (Self-BLEU) ↓ | 0.284 | 0.299 | 0.317 | 0.322 | 0.674 |
| Task Description (Embedding Similarity) ↓ | 0.165 | 0.210 | 0.200 | 0.263 | 0.194 |
| Scene Image Embedding Similarity (ViT) ↓ | 0.193 | 0.389 | 0.375 | 0.517 | 0.332 |
| Scene Image Embedding Similarity (CLIP) ↓ | 0.762 | 0.833 | 0.864 | 0.867 | 0.828 |

Compared to other human-designed task suites, GenBot achieves a lower image-space embedding similarity measured using both ViT and CLIP, in addition to the previously shown better diversity in language space. We hope the updated results are more convincing now.
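For clarity, image-space embedding similarity of the kind reported above can be computed as the mean pairwise cosine similarity between embeddings of the rendered scene images. The sketch below uses the Hugging Face CLIP model as one possible backbone; the exact models, checkpoints, and preprocessing used in the paper may differ.

```python
import itertools
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mean_pairwise_similarity(image_paths):
    """Mean cosine similarity between CLIP embeddings of rendered scene images;
    lower values indicate a more visually diverse task suite."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    sims = [float(emb[i] @ emb[j])
            for i, j in itertools.combinations(range(len(emb)), 2)]
    return sum(sims) / len(sims)

# Usage (paths are placeholders for per-task scene renderings):
# print(mean_pairwise_similarity(["task_001.png", "task_002.png", "task_003.png"]))
```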

In addition, we also provide the list of generated tasks, including task names and descriptions, as a qualitative evaluation of task semantic diversity in Appendix D.1. In case you missed it in the first round of review, we also provide visualizations of the generated tasks on our project website for a better qualitative understanding of the scene and robot motion diversity.

Comment

Thank you for your detailed comments and suggestions! We are encouraged that you found our proposed idea promising and potentially impactful, and really appreciate your suggestions for improving the quality of our experiments. We have added numerous additional experiments and results based on your comments, and we hope to address all your concerns below.

1. Understanding of the pipeline

The presented approach has a very large number of moving parts (involving many different large pre-trained models), which I tried to understand as follows.

Yes, your understanding is correct. We apologize for the potential confusion caused by not explaining these details in the initial submission. We have extensively revised our manuscript (see details below). We hope the updated manuscript is clearer and better explains our full pipeline.

2. Improve writing

The text in sections 3.1 and 3.2 could perhaps be shorter; I found it a bit difficult to follow with a lot of text including details interspersed throughout.

Consider an additional graphic for section 3.2, or at least a structured/bullet list. There are a lot of moving parts, and it took a while for me to understand how they all interact at the different stages.

Thank you for your suggestion! Section 3.1 mainly explains how the tasks are proposed, and 3.2 mainly describes how we generate a corresponding scene based on the proposed task. We acknowledge that our initial presentation might have appeared complex, potentially leading to a perception of excessive complexity with too many parts involved. However, the pipeline is actually simple, without too many stages interacting with each other: to generate the scene, we first obtain the names of the objects (assets) needed, then either retrieve or generate them (with additional verification to improve the validity of the scene), and then place them into the scene with plausible configurations (poses, configurations of links and joints, etc.). Each of these stages is necessary for obtaining a semantically meaningful scene that can lead to successful skill learning.
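The flow just described could be summarized in pseudocode roughly as follows; every callable here is a placeholder injected by the caller and not part of GenBot's actual API.

```python
def generate_scene(task_proposal, llm, retrieve_asset, verify_asset,
                   generate_asset, build_scene):
    """Illustrative summary of the scene-generation flow described above.
    llm, retrieve_asset, verify_asset, generate_asset and build_scene are
    placeholder callables, not GenBot's actual interfaces."""
    asset_queries = llm(f"List the objects needed for: {task_proposal}")
    assets = []
    for query in asset_queries:
        mesh = retrieve_asset(query)                     # retrieval is tried first
        if mesh is None or not verify_asset(mesh, query):
            mesh = generate_asset(query)                 # text-to-3D fallback
        assets.append(mesh)
    layout = llm(f"Propose poses and joint configurations for {asset_queries} "
                 f"in the task: {task_proposal}")
    return build_scene(assets, layout)
```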

We made the following changes per your advice to make these sections more clear and easier to follow:

  • We have extensively revised Section 3.1 and 3.2, prioritizing the most critical details in the pipeline, while eliminating redundant explanations and unimportant details. This helped shorten the sections by half a page.
  • We moved less-important details, such as details of asset retrieval using sentence-bert, using VLMs (Bard and BLIP-2) for verifying the assets, why we use 2 models to cross-validate, and details of the text-to-3D pipeline into the appendix, since these models are not the main contributions of our work and multiple models at play could confuse readers. (In the later part of this response, we also explain why these modules are used and how they can be replaced to be more unified and robust in the future work.)
  • We re-structured Section 3.2 into a structured bullet list: we now first describe the components needed for generating the scene in the first paragraph, and then elaborate on each of these components in the following structured list. To make it clearer, we also re-organized the components so that they now match the 4 stages depicted in Figure 2 (B), and obtaining asset queries and their meshes are now grouped together into the first bullet point.
  • We also added an algorithmic pipeline in Appendix C to better summarize the whole process.

We hope that the revised version of our paper presents a more structured and focused narrative, and will facilitate a better understanding of our method.
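As a companion to the algorithmic pipeline in Appendix C, the pseudocode below sketches the overall propose-generate-learn loop at a high level. All objects and helper functions in it are hypothetical placeholders rather than the actual API of our codebase.

```python
# Illustrative pseudocode of the propose-generate-learn loop. All helpers
# (llm, asset_db, simulator, learn_skill) are hypothetical placeholders
# standing in for the components described in the method section.
def generative_simulation_step(llm, asset_db, simulator, learn_skill):
    # 1. Task proposal: seed the LLM with a sampled object (or example tasks)
    seed = asset_db.sample_seed_object()
    task = llm.propose_task(seed)                     # name, description, required skills

    # 2. Scene generation: query needed assets, retrieve or generate their
    #    meshes, then place them with plausible poses / joint configurations
    asset_queries = llm.propose_assets(task)
    meshes = [asset_db.retrieve_or_generate(q) for q in asset_queries]
    scene_cfg = llm.propose_scene_config(task, asset_queries)
    scene = simulator.build_scene(meshes, scene_cfg)  # includes collision resolution

    # 3. Training supervision: sub-task decomposition and per-sub-task rewards
    subtasks = llm.decompose(task)
    rewards = [llm.generate_reward(st) for st in subtasks]

    # 4. Skill learning: RL / motion planning / trajectory optimization per sub-task
    policies = [learn_skill(scene, st, r) for st, r in zip(subtasks, rewards)]
    return task, scene, policies
```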

3. Missing citations to relevant work

There is some additional work on learning diverse skills in simulated environments (in some cases in addition to real-world) that should be cited:
  • Jiang et al., 2022, VIMA: General Robot Manipulation with Multimodal Prompts
  • Majumdar et al., 2023, Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?
  • Bousmalis et al., 2023, RoboCat: A Self-Improving Foundation Agent for Robotic Manipulation

Thank you for bringing up these relevant works! We have updated the Related Work section with additional citations to the 3 papers you suggested.

Comment

Dear Reviewer,

As the rebuttal period draws to a close, we would like to thank you again for your valuable comments and feedback. We hope our response and the additional experiments and analyses address your concerns. Please let us know if you have any further comments.

Authors

Review
6

The paper introduces "GenBot", a generative robotic agent designed to automatically learn a variety of robotic skills on a large scale via generative simulation.

GenBot utilizes advancements in foundational and generative models. Instead of directly employing or adapting these models to formulate policies or specific actions, the authors suggest a generative approach. This approach employs the models to automatically generate diversified tasks, scenes, and training supervision. The goal is to enhance robotic skill learning with minimal human intervention.

GenBot follows a "propose-generate-learn" cycle. Initially, the agent suggests intriguing tasks and skills. Following that, it generates simulation environments, populating them with relevant objects and assets in the appropriate spatial configurations. After obtaining all the required information for the proposed task, including scene components, GenBot proceeds with the actual skill learning.

The contributions of this paper are as follows.

  • The paper introduces "GenBot", a robotic agent that automates the process of task and environment generation and subsequently learns skills. This framework potentially reduces the need for human intervention in the process of creating simulation tasks.
  • A figure in the paper showcases 25 example tasks generated by GenBot and the corresponding skills it learned, highlighting the diversity and applicability of the system.

Strengths

Overall, this paper demonstrates that the entire pipeline—from creating tasks to learning skills—can potentially be automated by large models. While a lot of details are still missing, I commend the quality of this work, especially considering the engineering efforts involved.

Specifically, the strengths of this paper include:

  • The paper introduces GenBot as an automated pipeline that can be endlessly queried to generate a continuous stream of skills for diverse tasks. This automation is a significant strength as it reduces human intervention and can potentially scale up robotic skill learning.

  • Task diversity is essential for generalizable robotic skill learning. If GenBot can produce a diverse set of tasks and learn corresponding skills, it signifies a robust and versatile system.

Weaknesses

Major

Task diversity

I am concerned regarding the diversity of the generated tasks. With tasks proposed by LLMs and only qualitative examples provided, it's challenging for readers to gauge the true diversity of these tasks. Specifically:

  • How many semantically distinct tasks are generated? By "semantically distinct," I refer to tasks that are fundamentally different. For instance, "opening a cabinet" and "lifting a bucket" are semantically distinct, whereas "walking forward" and "walking backward" are not.
  • What is the range of diversity in scene configurations? Upon reviewing the prompts, it seems that certain elements, like a table, have fixed poses and heights. If this is a recurring theme, then scene configuration diversity appears limited.

Task verification

The construction of tasks in simulation typically requires validation to ensure correct implementation. This involves examining success conditions, initial state distributions, physical parameters, and more. However, the paper lacks a systematic method for this crucial verification, especially given the automation of task creation. Mistakes at any stage could result in flawed tasks. Specifically:

  • What percentage of the tasks can be successfully solved? How does this compare to the total number of generated tasks?
  • Are trivial tasks, such as picking up a block when given the grasp action primitive, filtered out?

Use of LLMs

While the paper demonstrates the potential for automating the entire pipeline, from task creation to skill learning, using large models, the necessity of LLMs is questionable. Could simple heuristics or random placements of objects yield similar results? Given the extensive prompting involved with GenBot, it's unclear if it genuinely produces more diverse tasks with reduced human efforts.

Missing details

Numerous details are absent from the paper. Refer to the "Questions" section for more questions.

Limited quantitative results

The majority of the results are qualitative, which lacks depth for readers. Additionally, the paper's comparison of task diversity to other benchmarks based solely on task descriptions is less than persuasive.

Minor

  • Object Assets: Currently, the paper relies heavily on PartNetMobility and RLBench for task-relevant objects, which may restrict task diversity. Although the paper suggests using Midjourney + Zero123 for additional 3D assets, this pipeline lacks detailed elaboration.

  • Lack of Open-Source Code: As of now, the paper hasn't released its code. Furthermore, the underlying simulation framework, "Genesis," remains private.

Questions

  • Regarding Task Proposal:

    • How does the system handle incompatibilities between the robot and the object? For instance, if the robot is a dog and the object is a cabinet, what would the proposed task be?
    • In scenarios involving non-articulated objects, if the generated tasks aren't specifically tied to the sampled object, how does the sampled object influence the task?
  • Regarding Scene Generation:

    • Could you provide a more detailed explanation of the MidJourney + Zero123 pipeline?
    • What is the precise output format from the LLMs? How is this output imported into a simulator to construct a scene?
    • How does the system manage situations where the scene results in an unsolvable task? For example, if the task is to open a cabinet but the cabinet is positioned out of the robot arm's reach.
    • How are potential collisions in the initial scene configuration addressed?
    • Is the initial state of the scene fixed, or is it sampled from a distribution?
    • How does the system generate physical parameters other than size, such as friction?
  • Regarding Training Supervision Generation:

    • How reliable is the reward generated by the LLM? Are there instances where it may not align with the intended goal?
    • How does the system define the success conditions for a task?
Comment
  • How does the system generate physical parameters other than size, such as friction?

In our current pipeline, we use the default rigid contact model in the simulation and do not change the friction; we only ask the LLM to generate the size. That said, it is definitely possible to query for additional information such as mass and friction coefficients between different surface materials, which GPT-4 supports well since the massive text data it has been trained on (e.g., knowledge from Wikipedia) encodes such information. To demonstrate this, we conducted a quick test asking GPT-4 to return the mass and friction coefficient of several objects in contact with a wooden table surface, and the response is below:

| Object | Estimated Mass (kg) | Estimated Friction Coefficient (μ) |
|---|---|---|
| Microwave | 15 | 0.5 |
| Water Bottle | 1 | 0.4 |
| Lamp | 3 | 0.5 |
| Laptop | 2 | 0.3 |
| Clock | 1 | 0.4 |

We believe these values are reasonable enough to serve as an initial guess, followed by random perturbations to produce diversity. We will also add this in a future version. Thank you for this suggestion!
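As an illustration of how such a query could be folded into the pipeline, below is a hedged sketch using the OpenAI Python client; the prompt wording and the JSON output schema are assumptions for illustration, not what we currently run.

```python
# Sketch (not part of the current pipeline): ask GPT-4 for plausible physical
# parameters and parse them as JSON. Prompt wording and schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_physical_params(object_name, surface="wooden table"):
    prompt = (
        f"Estimate the mass in kg and the friction coefficient of a typical "
        f"{object_name} resting on a {surface}. "
        'Answer with JSON only, e.g. {"mass": 1.0, "friction": 0.4}.'
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return json.loads(resp.choices[0].message.content)

# params = query_physical_params("microwave")
# params["mass"] and params["friction"] could then seed randomized values in simulation.
```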

Regarding Training Supervision Generation:

  • How reliable is the reward generated by the LLM? Are there instances where it may not align with the intended goal?

Please see our response above on evaluating the reward functions and learned skills, as well as the added details in Section 4.3.

  • How does the system define the success conditions for a task?

Currently, we do not have explicit success-condition generation, and we evaluate the final system and learned skills with human evaluation. That said, there is indeed concurrent work on generating success conditions using LLMs [1], similar to reward generation, and we will try to incorporate this into our pipeline. We would like to re-iterate that the main contribution of our work is proposing a potentially promising paradigm and presenting a concrete, comprehensive implementation of it to demonstrate the validity and effectiveness of this new direction. There is definitely plenty of room for improvement by employing rapidly developing research directions such as multi-modal foundation models and 3D generative models. Since GPT-4V's API has just been released, one clear direction is to incorporate its vision feedback to both improve robustness in generating scenes and tasks and to evaluate whether learned skills match the intended goal.

[1] Ha, Huy, Pete Florence, and Shuran Song. "Scaling up and distilling down: Language-guided robot skill acquisition." arXiv preprint arXiv:2307.14535 (2023).

Once again, we thank you for your detailed comments and suggestions for improvements! We hope the revised manuscript now better justifies and elaborates our proposed framework.

Comment

6. Open-source code

As of now, the paper hasn't released its code. Furthermore, the underlying simulation framework, "Genesis," remains private.

Thank you for suggesting this! Genesis is not publicly available yet and its various APIs are subject to change, so it is currently only open to a group of researchers participating in internal testing. The tentative time for releasing Genesis is next February, according to the development team.

We very much agree that open-sourcing the code would aid the reproducibility of our method and benefit the research community. We have therefore re-implemented GenBot using PyBullet as the main backend, and have open-sourced this version of the code here in an anonymous repo. We will release the full pipeline as soon as Genesis is released publicly. We hope this helps address your concern!

7. Missing details

Regarding Task Proposal

  • How does the system handle incompatibilities between the robot and the object? For instance, if the robot is a dog and the object is a cabinet, what would the proposed task be?

We apologize for not explaining this clearly in the initial version. For tasks involving legged robots, we prompt the LLM with example-based seeding only, and populate objects only when the LLM says additional objects are required for the task. We have updated Section 3.1 to make this clear.

  • In scenarios involving non-articulated objects, if the generated tasks aren't specifically tied to the sampled object, how does the sampled object influence the task?

In such cases, the object will be populated into the scene as part of the scene components, but it won't necessarily be interacted with during task learning if the sub-goals or reward functions do not involve it. In example-based seeding, on the other hand, the examples are only used to hint the LLM towards generating different tasks, and the objects in the examples will not be used when creating the actual scene.

Regarding Scene Generation:

  • Could you provide a more detailed explanation of the MidJourney + Zero123 pipeline?

We added more details on this in Appendix E. See 5. Object Assets above for more detailed explanation.

  • What is the precise output format from the LLMs? How is this output imported into a simulator to construct a scene?

The LLM generates a YAML file describing the scene configuration (see the Scene Generation Prompt in Appendix F for examples). This YAML file is then parsed by a YAML parser to load the specified objects and build the scene in the simulator.
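To make the interface concrete, below is a minimal sketch of how such a config could be parsed and loaded in the open-sourced PyBullet version; the YAML fields and asset paths shown are simplified placeholders rather than the exact schema in Appendix F.

```python
# Sketch of loading an LLM-generated scene config into PyBullet. The YAML
# fields and URDF paths are simplified placeholders; see Appendix F for the
# actual schema used in our prompts.
import yaml
import pybullet as p
import pybullet_data

example_yaml = """
objects:
  - name: microwave
    urdf: assets/microwave/mobility.urdf
    position: [0.6, 0.0, 0.4]
    orientation_euler: [0.0, 0.0, 1.57]
    scale: 1.0
  - name: bowl
    urdf: assets/bowl/model.urdf
    position: [0.2, 0.3, 0.45]
    orientation_euler: [0.0, 0.0, 0.0]
    scale: 0.8
"""

def load_scene(yaml_text):
    cfg = yaml.safe_load(yaml_text)
    p.connect(p.DIRECT)
    p.setAdditionalSearchPath(pybullet_data.getDataPath())
    body_ids = {}
    for obj in cfg["objects"]:
        quat = p.getQuaternionFromEuler(obj["orientation_euler"])
        body_ids[obj["name"]] = p.loadURDF(
            obj["urdf"], basePosition=obj["position"],
            baseOrientation=quat, globalScaling=obj["scale"])
    return body_ids

# body_ids = load_scene(example_yaml)
```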

  • How does the system manage situations where the scene results in an unsolvable task? For example, if the task is to open a cabinet but the cabinet is positioned out of the robot arm's reach.

We specifically added support for robot manipulators with a mobile base, so such reachability issues are not a typical failure cause. There are other potential failure modes, on which we added an analysis in Section 4.3 (also see the detailed response above). Note that our pipeline's generated tasks and skills achieve a validity rate of over 90% and a skill learning success rate of over 75%.

  • How are potential collisions in the initial scene configuration addressed?

Firstly, when the LLM generates the initial poses of the objects, it has a basic spatial understanding and will try to place the objects in different locations. We use this as the initialization and check for potential collisions in the initial scene configuration. For any detected collision, we find the collision vertices of the objects in contact and push their centers of mass apart to resolve the collision. Thank you for asking this! We have updated Appendix A to include details on this.
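A minimal sketch of this resolution step, written against PyBullet for concreteness, is shown below; the thresholds and iteration scheme are illustrative and may differ from the exact procedure described in Appendix A.

```python
# Sketch of the collision-resolution step: if two bodies interpenetrate in the
# initial configuration, push one apart along the contact normal. Thresholds
# and iteration counts are illustrative, not the exact values in Appendix A.
import itertools
import pybullet as p

def resolve_initial_collisions(body_ids, max_iters=50, step=0.02):
    for _ in range(max_iters):
        moved = False
        for a, b in itertools.combinations(body_ids, 2):
            # contactDistance < 0 means the two bodies interpenetrate
            points = p.getClosestPoints(a, b, distance=0.0)
            for pt in points:
                if pt[8] < 0:           # pt[8] = contactDistance
                    normal = pt[7]      # pt[7] = contact normal on B, pointing towards A
                    pos_a, orn_a = p.getBasePositionAndOrientation(a)
                    new_pos = [pos_a[i] + step * normal[i] for i in range(3)]
                    p.resetBasePositionAndOrientation(a, new_pos, orn_a)
                    moved = True
        if not moved:
            break
```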

  • Is the initial state of the scene fixed, or is it sampled from a distribution?

Currently, the initial state is given by the LLM, and we resolve any detected collisions. We set the temperature appropriately (see Appendix A) so that GPT-4's responses are stochastic while remaining reasonable; therefore, the generated task and scene configuration differ each time. We could also treat this initial configuration as an initialization and add random disturbances sampled from a Gaussian distribution, which could further enhance scene diversity; we are not yet doing this in the current version.

Comment

5. Object Assets

Currently, the paper relies heavily on PartNetMobility and RLBench for task-relevant objects, which may restrict task diversity. Although the paper suggests using Midjourney + Zero123 for additional 3D assets, this pipeline lacks detailed elaboration.

The objects in PartNetMobility and RLBench are used only for seeding the LLM, providing a ground for generating different relevant tasks. PartNetMobility is the largest available dataset containing dozens of different types of articulated objects, covering a wide range of articulated objects we interact with in daily life. There is also active ongoing research on generating articulated objects [1][2], which we can incorporate into future pipelines. Moreover, the LLM reasons about tasks relevant to the seeding objects and proposes the extra objects needed, which we retrieve from object databases like Objaverse, containing nearly 1 million objects (and which can be replaced by the recently released Objaverse-XL, containing 10 million objects). Such retrieval from massive mesh databases further enhances the diversity of both the objects involved in the tasks and the scene configurations. In addition, we also incorporated example-based seeding, where the generated tasks are not tied to the specific example tasks; the given examples serve only as hints from which the LLM extrapolates to more tasks. In future work, we will also consider image-based seeding, which can leverage the massive amount of image data depicting daily tasks available online and could further improve task diversity.
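For reference, the retrieval step can be sketched with sentence-transformers roughly as follows; the encoder checkpoint and the annotation format are placeholders and may differ from the implementation described in the appendix.

```python
# Sketch of language-based asset retrieval: embed the LLM-proposed asset
# description and the text annotations of candidate Objaverse meshes, then
# pick the closest match. Checkpoint and annotation format are placeholders.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_asset(query, candidate_annotations):
    """candidate_annotations: dict mapping asset uid -> text description."""
    uids = list(candidate_annotations.keys())
    corpus_emb = encoder.encode([candidate_annotations[u] for u in uids],
                                convert_to_tensor=True, normalize_embeddings=True)
    query_emb = encoder.encode(query, convert_to_tensor=True,
                               normalize_embeddings=True)
    scores = util.cos_sim(query_emb, corpus_emb)[0]
    return uids[int(scores.argmax())]

# e.g., retrieve_asset("a ceramic soup bowl", objaverse_annotations)
```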

We would like to take this chance to further elaborate the motivation behind having both retrieval and generation for obtaining the object assets. This paper's primary contribution, from our perspective, is the introduction of a generative paradigm, merging generative models with physics-based simulation to automate the whole pipeline for scalable and automated robot learning. Our focus therefore lies in establishing a comprehensive and adaptable framework, prioritizing its structural development over specific backend models for the initial implementation. Considering the rapid advancements in LLMs and 2D/3D generative models, with SOTA models being updated on a weekly basis, our framework is designed to be able to upgrade seamlessly by using the latest progress in these orthogonal research areas; while GenBot, serving as the first comprehensive proof-of-concept, showcases the feasibility and effectiveness of this novel research direction.

While human-designed databases of 3D assets are growing rapidly, with massive-scale datasets like Objaverse-XL containing millions of objects, these assets are mostly created by human artists and would require massive human effort to continue scaling up. On the other hand, automated text-to-3D generative pipelines utilize the massive 2D image resources available online, coupled with inductive biases such as view consistency, and hold better potential for further scaling up and automating the process of obtaining 3D assets. While, at the time of submission, retrieval from existing databases usually yields higher-quality meshes at faster speed, we believe the text-to-3D pipeline holds greater potential in the future in terms of both asset fidelity and generation speed. Therefore, we intentionally designed our framework to support both pipelines and to verify that both can work.

As a result, in the first implementation and design of GenBot, we tried to unify both pipelines and prioritize retrieval over generation: the initial design resorts to the generative pipeline whenever retrieval fails to pass the verification stage. However, in practice we found that for rigid-body manipulation settings, objects retrieved from Objaverse already work very well and the generative pipeline is rarely needed. Therefore, in practice, we now only use the text-to-3D pipeline for soft-body manipulation tasks, in order to obtain a more consistent and controllable target shape for the soft body under manipulation. Another reason is that the goal shape in soft-body manipulation does not require detailed texture, which was a limitation of generative models at the time of submission. We have revised Section 3.2 to make this clearer, and added more details on the text-to-3D pipeline in Appendix E, with example visuals in Table 5. We would also like to point to some of the latest work in 3D asset generation released after our initial submission (One-2-3-45 [3], Instant3D [4], etc.), which dramatically improves generation speed and quality. The models in our current pipeline can easily be swapped with these latest advances, which we will incorporate in future development.

[1] Lei, Jiahui, et al. "NAP: Neural 3D Articulation Prior." arXiv:2305.16315 (2023).

[2] Chen, Qiuyu, et al. "URDFormer: Constructing interactive Realistic Scenes from Real Images." CoRL 2023.

[3] https://one-2-3-45.github.io/

[4] https://instant-3d.github.io/

Comment

All of the tasks above have novel semantic meanings and interaction types that are not present in the given prompt, demonstrating that this task generation process indeed extracts the common-sense knowledge embedded in LLMs for automated task generation, which would otherwise not be possible with heuristic-based procedural generation pipelines.

[1] Dalal, Murtaza, et al. "Imitating Task and Motion Planning with Visuomotor Transformers." arXiv preprint arXiv:2305.16309 (2023).
[2] McDonald, Michael James, and Dylan Hadfield-Menell. "Guided imitation of task and motion planning." Conference on Robot Learning. PMLR, 2022.
[3] Huang, Wenlong, et al. "Voxposer: Composable 3d value maps for robotic manipulation with language models." arXiv preprint arXiv:2307.05973 (2023).

4. Limited quantitative results

The majority of the results are qualitative, which lacks depth for readers. Additionally, the paper's comparison of task diversity to other benchmarks based solely on task descriptions is less than persuasive.

We acknowledge that, due to the complexity of the system, our initial submission could benefit from more extensive quantitative experiments. We have added numerous experiments, as described above, covering diversity measures, statistical evaluation of the generated tasks, scenes, and skills, as well as failure analyses (see points 1 and 2 above for details). We hope the revised version is more persuasive now.

Comment

3. Use of LLMs

While the paper demonstrates the potential for automating the entire pipeline, from task creation to skill learning, using large models, the necessity of LLMs is questionable. Could simple heuristics or random placements of objects yield similar results? Given the extensive prompting involved with GenBot, it's unclear if it genuinely produces more diverse tasks with reduced human efforts.

This is indeed a valuable question and highly related to the core value of our pipeline! We would like to elaborate on this in detail:

First of all, many of our daily tasks depend heavily on reasoning about object affordances and a semantic understanding of the scene and robot, e.g., heating a bowl of soup requires a microwave in the scene, and the task involves turning the timer knob; a legged robot dog can perform a headstand, while a mobile robot arm cannot; a safe can be used for tasks like retrieving a gold bar or locking the safe. All these semantically meaningful tasks require reasoning about object affordances and meaningful interactions with the objects. Manually designing heuristics for each type of object would be very labor-intensive and cannot scale up easily. The intention of our framework is to unleash such common-sense knowledge from the latest foundation models and transfer it to the field of robotics.

Secondly, one long-standing goal for scaling up robotic data collection is to train a foundational vision- and language-conditioned policy model, which requires the generated data to be coupled with semantically meaningful and rich language descriptions. For example, turning a clock could carry the associated language label "adjusting the time forward"; putting a bowl into the microwave is for the task "heating the soup", rather than simply "putting A into B". Such capability for generating general and natural language descriptions necessitates using the latest progress in LLMs.

Thirdly, the distribution of objects to be populated in the scene, which depends on the nature of the task, is far more complex than simply sampling objects at random. For example, a task of retrieving a "gold bar" from a "safe" usually happens in an office-like setting with a desk and common office objects. Random sampling of objects does not consider the correlation between objects or whether they are appropriate and semantically related to the generated task, which could lead to a gap when transferring a trained policy to common daily task settings.

We hope these factors justify the need for using LLMs to generate meaningful tasks, the corresponding environments, and the associated task descriptions. In addition, since task generation is conditioned on the affordances of the sampled objects, it leads to more diverse task generation covering a wider range of skills and interaction types with objects. In fact, many prior works [1][2] that resort to heuristic-based procedural generation of tasks and solutions can mostly generate tasks related to pick-and-place or object stacking, mainly involving changing spatial relationships between sampled objects, without considering their affordances.

[Regarding the effort involved in prompting] We would like to argue that our prompts, though seemingly lengthy, are not as laborious as they appear to be. The prompts are relatively long mainly because there are multiple stages we automate: task generation, scene configuration generation, object configuration generation, and reward generation. For each of these stages, we only provide 2-4 in-context examples to regulate the LLM's responses. In fact, contemporary work using LLMs for policy learning typically uses a comparable number of in-context examples or more (e.g., Voxposer [3] uses 10 examples).

[Further examples elaborating the necessity of using LLMs] We believe the diversity of the generated tasks is sufficient to demonstrate that the generalization and reasoning capabilities of the LLM help generate novel tasks. For example, while the prompt we feed to the LLM only includes tasks involving a single type of articulated object (an oven), the generated tasks include various semantically novel interactions. We present several examples below:

  • flushing the toilet
  • retrieving goldbar from the safe
  • adjusting time on clock
  • changing lighting direction of a lamp
  • heating soup with microwave
  • adjusting the lever for operating coffee machines
  • extending suitcase handles
  • storing objects in fridges
Comment

2. Task verification

The construction of tasks in simulation typically requires validation to ensure correct implementation. This involves examining success conditions, initial state distributions, physical parameters, and more. However, the paper lacks a systematic method for this crucial verification, especially given the automation of task creation. Mistakes at any stage could result in flawed tasks. Specifically:

What percentage of the tasks can be successfully solved? How does this compare to the total number of generated tasks?

Are trivial tasks, such as picking up a block when given the grasp action primitive, filtered out?

Thank you for your suggestion on this matter! In order to address these, we added the following analysis:

  • In order to verify the validity of the generated scenes, we added a round of human evaluation of the generated scenes, checking whether each scene aligns with the language annotations and whether the scene configurations and retrieved objects are correct for the task (see the updated parts of Sections 4.2 and 4.3). We generated 155 tasks and performed human evaluations on them (full list in Appendix D). There were 12 failures due to incorrect scene generation, giving a 92% validity rate for our generated scenes (see "scene validity" in Section 4.3 for failure case analysis).
  • In order to further verify the validity of the generated task decompositions and reward functions, we added an additional round of human evaluation of the generated training supervisions, asking a human expert to carefully inspect whether the generated decompositions and reward functions are reasonable and could in principle lead to successful skill learning. We found 6 failure cases among the 155 generated tasks (a 96% validity rate). We revised the corresponding paragraphs in Sections 4.2 and 4.3 with failure analyses, and included full details in Appendix D.3.
  • We now also provide a quantitative analysis of the skill learning success rate, i.e., given a set of generated tasks (with correct training supervisions), if we run the skill learning pipeline multiple times, how many runs successfully learn the skill. Success in learning a skill is determined by a human evaluator watching a video of the learned policy. We tested the success rate of our skill learning pipeline on a subset of 50 generated object manipulation tasks and achieved an average success rate of 0.75, i.e., if we run the skill learning pipeline 4 times, on average 3 of the runs will successfully learn the skill. Details on the individual success rate for each of the 50 tasks can be found in Appendix D.2.

We did not add explicit filtering of simple tasks. To justify this, we would like to elaborate a bit further on the main motivation for this automated pipeline: our proposed paradigm, generative simulation, does not intend to generate only hard tasks or to serve as a fixed set of benchmark tasks. Instead, the main goal is to automate the pipeline for generating diverse tasks and their associated demonstration trajectories (i.e., scaling up data collection in simulation), so that they can later be distilled into a multi-task, language-conditioned policy. With this goal, tasks of various difficulty levels all carry importance in ensuring that the distribution of data used to train a downstream policy is wide enough, resulting in better generalization. In addition, since this is a relatively new research direction, there is no well-established, systematic verification process for rigorously verifying all the components, and we believe this is also an opportunity for future research in this direction. The main contribution of GenBot is a comprehensive proof-of-concept showcasing the feasibility and effectiveness of this novel research direction. In this first attempt, we believe our newly added and extensive human evaluation is informative in demonstrating its potential, and will hopefully open up numerous research possibilities. That said, one clear improvement for our near-future work is to replace the current VLMs with the newly released GPT-4V model and investigate how much human evaluation effort can be further reduced with vision feedback from the latest VLM.

Comment

We really appreciate your detailed comments and valuable feedback! We are very grateful that you recognized the value of our automated pipeline in automating scalable robot learning. Your requests regarding the diversity of the tasks generated, missing details and limited quantitative evaluation are very constructive. We have significantly revised our manuscript with more details and numerous additional experiments. We hope to address your concerns in our response below.

1. Task diversity

I am concerned regarding the diversity of the generated tasks. With tasks proposed by LLMs and only qualitative examples provided, it's challenging for readers to gauge the true diversity of these tasks.

Specifically: How many semantically distinct tasks are generated? By "semantically distinct," I refer to tasks that are fundamentally different. For instance, "opening a cabinet" and "lifting a bucket" are semantically distinct, whereas "walking forward" and "walking backward" are not.

What is the range of diversity in scene configurations? Upon reviewing the prompts, it seems that certain elements, like a table, have fixed poses and heights. If this is a recurring theme, then scene configuration diversity appears limited.

Thank you for bringing this up! In our original submission, we compared with several popular human-designed manipulation task suites in terms of semantic diversity in language space, using both Self-BLEU and embedding similarity, in Table 1. We have now added a diversity comparison in image space as well, using both ViT (ImageNet pre-trained) and CLIP models to compute the embedding similarity of the rendered images of the scenes (see updated Table 1 and the associated explanations). We copy the main results here:

| | GenBot | Behavior-100 | RLbench | MetaWorld | Maniskill2 |
|---|---|---|---|---|---|
| Number of Tasks | 106 | 100 | 106 | 50 | 20 |
| Task Description (Self-BLEU) ↓ | 0.284 | 0.299 | 0.317 | 0.322 | 0.674 |
| Task Description (Embedding Similarity) ↓ | 0.165 | 0.210 | 0.200 | 0.263 | 0.194 |
| Scene Image Embedding Similarity (ViT) ↓ | 0.193 | 0.389 | 0.375 | 0.517 | 0.332 |
| Scene Image Embedding Similarity (CLIP) ↓ | 0.762 | 0.833 | 0.864 | 0.867 | 0.828 |

Compared to other human-designed task suites, GenBot achieves a lower image-space embedding similarity measured with both ViT and CLIP. Regarding the example you gave of "walking forward" and "walking backward": their image-space embeddings would indeed be very similar, so we think this additional image-space comparison is helpful in evaluating diversity along another axis. We believe our comparison in both language and image space, with different metrics and different embedding models, more convincingly demonstrates the diversity of our generated tasks.
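For reference, the Self-BLEU score reported in the table can be computed roughly as follows; this is a minimal sketch, and the tokenization and n-gram weights are illustrative rather than the exact settings used for Table 1.

```python
# Sketch of the Self-BLEU diversity measure over task descriptions: each
# description is scored against all others as references, then averaged
# (lower = more diverse). Tokenization and weights are illustrative.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def self_bleu(descriptions, weights=(0.25, 0.25, 0.25, 0.25)):
    smooth = SmoothingFunction().method1
    tokenized = [d.lower().split() for d in descriptions]
    scores = []
    for i, hyp in enumerate(tokenized):
        refs = tokenized[:i] + tokenized[i + 1:]
        scores.append(sentence_bleu(refs, hyp, weights=weights,
                                    smoothing_function=smooth))
    return sum(scores) / len(scores)

# e.g., self_bleu(["open the microwave door", "fold the towel in half", ...])
```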

We also included additional details for a set of 155 generated tasks in Appendix D.1, together with their task names, task descriptions, number of task decompositions, and skill learning success rates. When developing our pipeline, for implementation simplicity, we considered both table-top manipulation and general manipulation settings without a table. For tasks that GPT-4 deems to require a table, we sample from a pool of 5 tables with different geometries, textures, and heights (see our website for detailed visuals). We believe this is already more complex and diverse than existing human-designed table-top manipulation benchmarks (such as RLBench and MetaWorld). That said, it is certainly possible to treat the table the same way as other objects instead of handling it separately; we leave this unification for future work.

Comment

Dear Reviewer,

As the rebuttal period draws to a close, we would like to thank you again for your valuable comments and feedback. We hope our response and the additional experiments and analyses address your concerns. Please let us know if you have any further comments.

Authors

Comment

Thanks to the authors for such a detailed response. However, since the response was not posted until the last day, I will need extra time to go through it. I will consider the information from this response in the upcoming reviewer-AC discussion.

Comment

We thank all the reviewers for their detailed feedback and useful suggestions! We are very encouraged that the reviewers found our proposed paradigm "promising, compelling and potentially very impactful" (X3QY), "novel and interesting" (EQgy), our method "intriguing, effective, plausible" (niTv), with "commendable quality, efforts and significant strength" (pBsz) and "interesting results" (niTv). We address all the concerns of the reviewers in the individual responses.

The main concerns from the reviewers revolve around requests for more experimental results and analysis. We have extensively expanded our experimental section to include more quantitative results and failure analyses, and updated our manuscript with corresponding revisions highlighted in blue. We summarize the major changes below:

  • Presentation improvement (X3QY, EQgy)
    • We have extensively revised Sections 3.1 and 3.2, making them more concise and with a clearer focus on the important details
    • We re-structured Section 3.2 into a bullet list, with an improved logical flow that better matches the figure
    • We added citations to some previously missing related works
    • We added an algorithmic pipeline in Appendix C to better summarize the whole framework
    • We added more details of the text-to-3D pipeline in Appendix E
  • We added additional diversity measurements in image space, using both ViT and CLIP models to compute the diversity of the generated tasks (X3QY, pBsz, EQgy)
  • We added quantitative evaluation of the generated scenes to verify their validity, together with analyses on the failure cases (X3QY, pBsz, EQgy)
  • We added quantitative evaluation to examine the validity of the generated skill training supervision, together with analyses on the failure cases (X3QY, pBsz, EQgy)
  • We added quantitative evaluation to analyze the success rate of the skill learning stage (X3QY, pBsz, niTv, EQgy)
  • We open sourced our code anonymously here: https://github.com/GenerativeSimulation/GenBot (pBsz)

We hope our responses address all reviewers' concerns convincingly. We would like to express our gratitude towards all the reviewers again for their detailed and constructive feedback.

Comment

Dear AC and all reviewers:

Thanks again for all the insightful comments and advice, which helped us improve the paper's quality and clarity.

The discussion phase is about to end soon, and we would like to kindly remind you to share your post-rebuttal responses.

We would love to convince you of the merits of the paper. Please do not hesitate to let us know if there are any additional experiments or clarifications we can offer to make the paper better. We appreciate your comments and advice.

Best,

Authors

AC Meta-Review

The authors propose to prompt existing foundation model APIs to retrieve objects from a database, predict their locations, and generate image backgrounds, which are then loaded into a physics engine to create a simulation environment. Similarly, prompting is used to generate code for reward functions and subgoals. Demonstrations are then generated either using motion planning or a reinforcement learning method. While the paper explores an interesting approach, the empirical evaluation of the learned behaviors is restricted to tasks their approach itself generated and is in general too limited to warrant acceptance.

Why not a higher score

Missing sound evaluation of resulting control performance and an evaluation on existing tasks

Why not a lower score

N/A

Final Decision

Reject