Overall Rating: 6.0/10 (Poster · 3 reviewers; ratings 6 / 6 / 6; min 6, max 6, std 0.0)
Confidence: 4.3
Correctness: 3.0
Contribution: 3.0
Presentation: 3.0
NeurIPS 2024

HumanVLA: Towards Vision-Language Directed Object Rearrangement by Physical Humanoid

Submitted: 2024-04-23 · Updated: 2024-11-06
TL;DR

We introduce HumanVLA, which performs a variety of object rearrangement tasks directed by vision and language via a physical humanoid.

Abstract

Keywords
Human-Scene Interaction; Object Rearrangement; Vision-Language-Action Model; Physical Humanoid

Reviews and Discussion

Review (Rating: 6)

This work aims at building a vision-language-action (VLA) model that can control a humanoid to interact with dynamic objects without ground-truth state information. The proposed method first trains a teacher policy with ground-truth information. It then distills the policy into a VLA model via imitation learning. Additionally, the paper presents a new dataset and evaluates the method against a recent baseline.

Strengths

  • The work is well-motivated and addresses two fundamental challenges: interacting with dynamic objects and the lack of ground-truth state information in real-world scenarios.

  • The method works well based on the demo videos and the quantitative evaluation. Each component of the method is well-motivated.

  • There are sufficient details about the method and the dataset.

  • The dataset can be a valuable resource for future research.

Weaknesses

  • The paper combines many existing methods together and builds a very complex system. It introduces some interesting components such as active rendering, but it is unclear how general the system is for a broader class of synthesis. For instance, can this system be applied to any humanoid-object control? What about one humanoid interacting with multiple objects (e.g., one agent carrying two objects in both hands) or multiple humanoids interacting with the same object (e.g., two agents carrying a sofa)? Can you extend the object category to those that are unseen during training?

  • Related to the first point, there is a lack of discussion on failure and limitations.

  • It would be good to show videos of the baseline results for a clearer comparison.

  • What is the sim2real gap? I assume that the end goal is to control a real-world humanoid to complete the tasks?

Questions

Please address the points in the weaknesses. Additionally, I have the following questions:

  1. Could you please comment on whether your method can be suitable for human 3D body motion synthesis (e.g., SMPL-X like the synthesis in OMOMO)?

  2. Would it be possible to directly train VLA to simplify the training procedure?

  3. Could you apply your method to longer activities that involve a sequence of goals? If not, what could be missing?

Limitations

There is not much discussion on limitations.

Author Response

W1: Broader class of synthesis.

One agent, multiple objects: Our framework does not lose generality in multi-object interaction, which can be achieved through task reward design. However, from a model mechanism perspective, it would be more suitable with two dexterous hands.

Multiple agents, one object: This can be achieved by integrating our pipeline into a multi-agent learning framework.

Since ours is the first work on vision-language-directed humanoids, we leave these broader synthesis applications to future work.

W1: Unseen categories during training.

Originally, we conducted task-level unseen experiments on the test set and reported the resulting generalization performance. Task-level unseen settings include: new compositions of objects in the scene, new placements of objects, and new text instructions regenerated by an LLM describing the new compositions and spatial relations. Our approach generalizes well at this task level.

In addition, we agree it would be extremely challenging to generalize to anything unseen without any similar pattern in the seen data; this is also an ultimate goal of embodied AI research. To further characterize our method, we performed an additional analysis of generalization to unseen data, constructing additional test data: (1) unseen texts generated for training tasks, manually reviewed to be distinct from the training data; (2) unseen objects created by changing visual appearance in training tasks; (3) an unseen object category (cup) with different geometry; and (4) unseen scene layouts created by repositioning large static objects.

                                      Success Rate (%)   Precision (cm)   Execution Time (s)
Unseen text                           65                 50.4             5.4
Unseen object (visual)                50                 72.3             6.2
Unseen object (geometry & category)   20                 118.8            7.9
Unseen scene layout                   35                 88.5             6.8

Results are reported in the table above. We find that our method suffers less from unseen texts and unseen visual appearances, but generalizing to unseen object categories and executing in unseen scenes remain major challenges.

W2: Failure and limitation.

We have discussed some limitations in Appendix A. We will supplement the discussion with the additional failures and limitations pointed out by the reviewers.

W3: Baselines

Thanks for your suggestion. A qualitative case comparing our method with the baseline can be found in Figure 11. We will add comparative demos in the final version but are unfortunately not allowed to submit demos during the rebuttal window.

W4: sim2real gap

The sim2real gap involves both the robot model and the environmental data. While we use a simulated humanoid model, exploring real-world embodiment structures is essential; our method is generally applicable across humanoids with similar mechanical structures. In addition, efforts to scale up real-world data, such as scanning and decomposing real-world scenes, are also crucial.

Q1: Human 3D body motion synthesis

Our work is a form of human 3D body motion synthesis, namely physical motion. The SMPL-X-like synthesis in OMOMO that you mention is kinematic motion. We provide a thorough discussion of the two streams in Section 2, Related Works. Both kinds of work aim to generate plausible motions and human-scene interactions. A key distinction lies in how object states are manipulated: kinematic motion, usually intended for computer graphics, allows direct editing of object states, whereas in physical motion, altering the object state requires indirectly controlling the humanoid to interact with it. For this reason, we consider physical motion and interaction more challenging than kinematic motion, and it would not be fair to directly apply our holistic pipeline to kinematic SMPL-X-based synthesis. However, we believe some of the underlying ideas, such as adversarial training and curriculum training, can be shared with kinematic synthesis.

Q2: Directly train VLA

Directly training a VLA model with current techniques and compute budgets is very hard. First, from a computation perspective, rendering an image is usually 10 times slower than the physics simulation itself. In addition, VLA inference is about 20 times slower than a state-based network. One-stage visual RL therefore requires a large number of parallel environments under a heavy computational load. Our two-stage system, which combines state-based RL (large-scale parallelization on a light task) with behavior cloning (small-scale parallelization on a heavy task), turns out to be efficient. Second, from a learning perspective, the vision-language modality is a coarse, high-dimensional, and composite representation, whereas state-based RL is precise and unambiguous. We anticipate that direct visual RL would be hard to converge. Combining these facts, we believe our framework is the most suitable and affordable within our compute budget.
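To make the compute argument concrete, here is a back-of-envelope sketch in Python that uses only the rough ratios stated above; the absolute costs (e.g., the state-network forward-pass cost) are arbitrary placeholders, not measured numbers.

```python
# All numbers are relative, normalized to one physics step = 1 unit of compute.
physics = 1.0
render = 10.0 * physics          # rendering an image ~10x slower than physics
state_net = 0.1                  # placeholder cost of a state-based policy forward pass
vla_net = 20.0 * state_net       # VLA inference ~20x slower than the state-based network

one_stage_visual_rl = physics + render + vla_net   # every RL step pays for rendering + VLA inference
stage1_state_rl = physics + state_net              # teacher stage: cheap steps, massively parallel
stage2_distill = physics + render + vla_net        # student stage: heavy steps, but far fewer samples than RL

print(f"per-step cost, one-stage visual RL : {one_stage_visual_rl:.1f}")
print(f"per-step cost, stage-1 state RL    : {stage1_state_rl:.1f}")
print(f"ratio                              : {one_stage_visual_rl / stage1_state_rl:.1f}x")
```

Under these assumed ratios, each one-stage visual RL step costs roughly an order of magnitude more than a stage-1 state-based RL step, and RL needs far more environment samples than supervised distillation, which is the core of the two-stage argument.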

Q3: Longer activity

Thanks. Applying our method to longer activities is possible by using sequences of tasks, but efforts to benchmark long-horizon tasks are still required. We have discussed this limitation of our current formulation in Appendix A and leave long-horizon benchmarking to future work.

Comment

I appreciate the responses. They have addressed my questions. I have raised my rating to weak accept. I urge the authors to incorporate the discussions from the responses into the final version and also provide video demos of the different methods.

Review (Rating: 6)

In this paper, the authors address the task of room arrangement with a humanoid agent and propose a teacher-student framework to achieve vision-language guided object rearrangement. The teacher policy utilizes privileged state information, including object states, goal states, and waypoints derived from A* planning. Goal-conditioned reinforcement learning together with AMP is employed to train a human-like policy that guides the humanoid agent in completing the task. This teacher policy is then distilled into a student policy, which relies on high-level language instructions and ego-centric view images instead of ground truth state information. A DAgger-like training paradigm is used, and an active rendering technique is developed to focus the camera on objects, ensuring informative ego-centric view images. The authors construct an object rearrangement dataset as a test benchmark for the proposed framework. Experimental results on this benchmark demonstrate the effectiveness of the framework, although generalization to novel tasks and environments remains challenging.
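To make the teacher-student setup concrete, below is a minimal sketch of a DAgger-style distillation loop in PyTorch. This is not the authors' implementation: the environment API, observation keys, network sizes, and the teacher/student mixing schedule are all assumptions for illustration only.

```python
import torch
import torch.nn as nn

# Hypothetical networks: the teacher consumes privileged state, the student
# consumes a fused vision-language-proprioception feature. Sizes are placeholders.
teacher = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 28))   # frozen, RL-pretrained
student = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 28))  # VLA policy being trained
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

def distill_rollout(env, beta):
    """One DAgger-style rollout: the executed action mixes teacher and student
    control, but every visited state is labeled with the teacher's action."""
    obs = env.reset()                                   # assumed dict: {'privileged': ..., 'student': ...}
    losses = []
    for _ in range(env.max_steps):
        with torch.no_grad():
            teacher_action = teacher(obs["privileged"])
        student_action = student(obs["student"])
        losses.append(((student_action - teacher_action) ** 2).mean())  # behavior-cloning loss
        # beta is annealed from 1 to 0 so control gradually shifts from teacher to student.
        executed = teacher_action if torch.rand(()) < beta else student_action.detach()
        obs, done = env.step(executed)
        if done:
            break
    optimizer.zero_grad()
    torch.stack(losses).mean().backward()
    optimizer.step()
```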

Strengths

  1. The writing is clear and easy to follow.
  2. This work addresses a very promising problem setup - vision-language guided policy learning - and has the potential to be applied to real humanoids.
  3. The constructed dataset and benchmark can facilitate community research in human-scene interaction.

Weaknesses

  1. The major methodologies employed in this work, including AMP + goal-conditioned RL and teacher-student policy distillation, have been widely used in previous works, but it is still good to see this new application to the humanoid room arrangement task.
  2. More investigation into the generalization ability of the current pipeline is desirable; see the questions section for details.

Questions

  • Regarding the teacher policy: could you justify which modules or techniques help in achieving a (slightly) higher success rate, better precision, and shorter execution time than the InterPhys baseline? The proposed teacher policy shares a similar training paradigm with InterPhys, but the major differences between InterPhys and the teacher policy implemented in this work are not clearly justified.
  • Regarding the student policy:
    • Without the waypoint information, how is the agent capable of searching and navigation? Does this mean that the agent learns to recognize objects via the vision-language distillation?
    • The active rendering technique is suggested to encourage the camera to focus on the object, but does this still require access to the privileged object state information?
    • It is good to see the discussion of generalization performance on unseen tasks, and the reported performance largely makes sense. It would provide more insight if the authors investigated how the policy generalizes to unseen scene layouts, unseen visual appearances, unseen language instructions, and unseen object geometries individually.
  • The sphere-like hands can limit the capacity of the agent, and the authors mention that one future direction is to include dexterous manipulation. Could the authors comment on what additional challenges would arise from involving dexterous manipulation skills?

Limitations

Yes, the authors have addressed the limitations.

Author Response

W1: Widely-used methodologies

We agree that methodologies such as AMP, RL, and teacher-student distillation are widely used. However, we also contribute new insights, such as style reward clipping, the carry curriculum, and active rendering, to adapt these techniques to the challenging humanoid-based object rearrangement task.

Q1: Improvements over InterPhys.

We introduce four new techniques compared to InterPhys in the teacher policy, as detailed on Page 5. For box loco-manipulation, the slight improvement mainly comes from style reward clipping (prioritizing task execution) and path planning (navigating complex scenes). For the general rearrangement of diverse objects, our geometry encoding and carry curriculum techniques facilitate multitasking and effectively result in significant improvements.
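The exact clipping rule is defined in the paper; purely as an assumed illustration of why clipping the style term prioritizes task execution, the reward combination might look like the following (the threshold and weight are made-up values, not the paper's):

```python
def combined_reward(task_reward: float, style_reward: float,
                    style_clip: float = 0.5, style_weight: float = 0.3) -> float:
    """Cap the AMP-style naturalness reward so that it can never outweigh
    progress on the rearrangement task itself (values are placeholders)."""
    return task_reward + style_weight * min(style_reward, style_clip)
```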

Q2.1: Searching and navigation.

The student policy learns from teacher-student distillation. Through distillation, it can learn scene layout patterns and plan navigable paths from the training tasks. We also note that the searching and navigation abilities could be boosted by ad hoc module design, such as integrating embodied LLMs, but we leave this to future work.

Q2.2: Privileged object state in active rendering

Privileged object states are only used in HumanVLA training for action supervision. However, the state information is NOT accessed in the inference phase.

Q2.3: Investigation in the unseen.

Originally, we conducted task-level unseen experiments on the test set and reported the resulting generalization performance. Task-level unseen settings include: new compositions of objects in the scene, new placements of objects, and new text instructions regenerated by an LLM describing the new compositions and spatial relations. Our approach generalizes well at this task level.

In addition, we agree it would be extremely challenging to generalize to anything unseen without any similar pattern in the seen data; this is also an ultimate goal of embodied AI research. To further characterize our method, we performed an additional analysis of generalization to unseen data, constructing additional test data: (1) unseen texts generated for training tasks, manually reviewed to be distinct from the training data; (2) unseen objects created by changing visual appearance in training tasks; (3) an unseen object category (cup) with different geometry; and (4) unseen scene layouts created by repositioning large static objects.

                                      Success Rate (%)   Precision (cm)   Execution Time (s)
Unseen text                           65                 50.4             5.4
Unseen object (visual)                50                 72.3             6.2
Unseen object (geometry & category)   20                 118.8            7.9
Unseen scene layout                   35                 88.5             6.8

Results are reported in the table above. We find that our method suffers less from unseen texts and unseen visual appearances, but generalizing to unseen object categories and executing in unseen scenes remain major challenges.

Q3: Dexterous manipulation

The current humanoid model has 28 degrees of freedom (DoFs). However, the number of DoFs may exceed 70 for a humanoid robot with two dexterous hands. The high dimensionality of the action space is a primary challenge. Hand motion data are required to train dexterous actions, so more effort is needed to collect hand data, and even hand-object interaction data. On the algorithmic side, adversarial training suffers from mode collapse and generates less expressive actions, and this phenomenon is exacerbated as the state dimension increases. Although techniques such as separate hand prior modules and tracking-based imitation can alleviate the issue, expressive whole-body dexterous control remains very challenging.

Comment

I appreciate the detailed response from the authors; most of my questions have been addressed. I believe this work makes a first step in exploring the promising vision-language guided humanoid motion control problem, and I tend to accept this paper. Therefore, I keep my original rating of weak accept.

Review (Rating: 6)

This paper proposes HumanVLA, a framework for training humanoid controllers powered by vision and language. First, a teacher control policy is trained to control a simulated humanoid to carry objects to specific positions. Then, this policy is distilled into a vision-language-action model that uses vision to guide the movement of the humanoid and objects. A few techniques such as active rendering, reward balancing, etc., are proposed to improve performance; a dataset containing sequences of object-carry locomotion is also proposed.

Strengths

  • This paper paints a very promising picture for simulated humanoid control and vision-language models. The formulation (vision + language + proprioception) provides an enticing research direction for embodied AI.
  • The proposed solution (teacher-student) is a well-tested formula for humanoid control and performs well on the proposed task. The vision element has not been explored much for humanoid control due to its high dimensionality but is incorporated in the current pipeline.
    • The proposed curriculum learning, reward clipping, etc., while small innovations, contribute to the overall performance of the pipeline.
  • Experiments show that the model achieves a high success rate compared to previous SOTA (InterPhys) in box arrangement when given privileged information. The success rate for the vision-language-action model is also promising.

Weaknesses

  • The major weakness of this work is the qualitative evaluation. Only four video demos are shown, without any information about the text prompt or egocentric rendering. Thus, there are limited ways to know how well the proposed VLA method performs in terms of generalization.
    • Along this line, since the evaluation dataset is provided by the authors, the diversity of the text prompts & tasks is unknown and not demonstrated.
  • The VLA part of the proposed method is relatively weak and understudied. Given only BERT text encodings, it is very hard to imagine that the MLP-based agent could complete complex tasks without any guidance. How does the agent know where the target location is? Does the agent start with a scanning phase where it locates where the target position is? For such a low error in location placement (~20cm) and such a coarse language instruction, the only way the MLP-based agent could succeed is by memorizing the training data.
    • The proposed VLA agent has no memory and no planning capability, and the vision part essentially acts as an object classifier. The language instruction performs a similar role. Also, the active rendering encourages the agent to always look at the object, so there is little way it can interpret the scene layout or plan paths. In order to really prove “generalization,” truly unseen instructions and scenes need to be shown, and its success rate separately reported. At the moment, there are no real indications that the proposed method can generalize to unseen scenarios.
  • L20: "specific object dynamics, such as carrying a box [15] and throwing a ball [47]" - the proposed method also only handles one type of interaction, carrying objects with two end effectors.

Overall, I feel like this work shows great potential in providing a task formulation for embodied AI; however, it is a little overclaimed at this moment in terms of its language and planning capabilities. I recommend scaling down on the formulation (e.g. focus on known scenes and objects) instead of claiming that it can tackle unseen rooms and instructions.

Questions

How does the agent plan its path around a new room if it is encouraged to always look at the object? The environment awareness and target-location awareness of the agent are not properly addressed.

Limitations

Limitations on the language instruction side are not adequately discussed.

Author Response

W1: Qualitative results

Thanks for your suggestions. The text prompts corresponding to the demos align with Fig. 4 and Fig. 12. We provide an extra qualitative result with text and egocentric rendering in the rebuttal PDF. Since demo submission is not available during the rebuttal window, we will add video demos in the final version.

W1.1: Evaluation datasets and diversity

Experiments are conducted on the newly proposed HITR dataset; we kindly note that there is no other public dataset supporting humanoid-based object rearrangement research. To demonstrate our dataset, we illustrate details of the scenes and movable objects in Fig. 6 and Fig. 7. Examples of texts and tasks are available in Fig. 4 and Fig. 12. The diversity of the texts lies in the objects (pot, chair, box, ...), visual attributes (blue, red, ...), and the spatial relationship between object and receptacle (center, left, bottom, ...).

W2: Target location

The HITR dataset ensures that the object is visible in the first-person view at initialization. The initial orientation in HITR is randomly sampled within a 30-degree offset from the object's initial position. Our active rendering technique encourages a localized object view at each step and boosts object localization over the whole execution trajectory. We will add these details.
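As a purely illustrative sketch of the idea (the paper's active rendering may be realized differently, e.g., through a reward term; the per-step turn limit below is an assumption), the egocentric camera yaw can be steered toward the object at each step:

```python
import numpy as np

def active_render_yaw(head_xy, head_yaw, obj_xy, max_turn=np.radians(15)):
    """Rotate the egocentric camera toward the object, limited to a maximum
    per-step turn so the view stays smooth. Inputs are planar world-frame
    coordinates; max_turn is an arbitrary value for this sketch."""
    to_obj = np.asarray(obj_xy) - np.asarray(head_xy)
    target_yaw = np.arctan2(to_obj[1], to_obj[0])
    # Shortest signed angle difference, wrapped to (-pi, pi].
    diff = (target_yaw - head_yaw + np.pi) % (2 * np.pi) - np.pi
    return head_yaw + np.clip(diff, -max_turn, max_turn)
```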

W2: Placement error and coarse instruction

We kindly note that the placement error for the VLA model is 40 cm, not 20 cm (see Line 279). We agree that using coarse instructions instead of precise coordinates is ambiguous for the VLA model, which is why the criteria are relaxed.

W2.1: Memory and planning capability

Thanks. We encode historical actions as the memory component. We agree there is room for improvement through ad hoc module design. Inserting extra modules, such as an embodied LLM, for explicit memorization and planning in the scene would be of great value to the system. We will add this discussion and leave it to future work.
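For concreteness, the idea of history actions serving as memory could be realized roughly as in the sketch below; the encoders, feature dimensions, and history length are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class StudentPolicySketch(nn.Module):
    """Assumed input assembly for the student VLA policy: a text embedding,
    an egocentric image feature, proprioception, and a short buffer of past
    actions acting as implicit memory. All sizes are placeholders."""
    def __init__(self, text_dim=768, img_dim=512, prop_dim=100, act_dim=28, history=4):
        super().__init__()
        in_dim = text_dim + img_dim + prop_dim + history * act_dim
        self.mlp = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, act_dim))

    def forward(self, text_emb, img_feat, proprio, past_actions):
        # past_actions: (batch, history, act_dim), flattened into a memory vector
        x = torch.cat([text_emb, img_feat, proprio, past_actions.flatten(1)], dim=-1)
        return self.mlp(x)
```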

W2.1: Investigations in Unseen

Originally, we conducted task-level unseen experiments on the test set and reported the resulting generalization performance. Task-level unseen settings include: new compositions of objects in the scene, new placements of objects, and new text instructions regenerated by an LLM describing the new compositions and spatial relations. Our approach generalizes well at this task level.

In addition, we agree it would be extremely challenging to generalize to anything unseen without any similar pattern in the seen data; this is also an ultimate goal of embodied AI research. To further characterize our method, we performed an additional analysis of generalization to unseen data, constructing additional test data: (1) unseen texts generated for training tasks, manually reviewed to be distinct from the training data; (2) unseen objects created by changing visual appearance in training tasks; (3) an unseen object category (cup) with different geometry; and (4) unseen scene layouts created by repositioning large static objects.

                                      Success Rate (%)   Precision (cm)   Execution Time (s)
Unseen text                           65                 50.4             5.4
Unseen object (visual)                50                 72.3             6.2
Unseen object (geometry & category)   20                 118.8            7.9
Unseen scene layout                   35                 88.5             6.8

Results are reported in the table above. We find that our method suffers less from unseen texts and unseen visual appearances, but generalizing to unseen object categories and executing in unseen scenes remain major challenges.

W3: L20 "specific object dynamics"

We will revise the text. However, we kindly note that our work can rearrange diverse objects, a level of object diversity not supported by previous works.

W4: Scaling down on the formulation

Thanks for your suggestions. We will tone down our claims and disclose more limitations and failures in the final version.

Q1: Plan its path around a new room

The path planning capability is learned from the training room layouts. Although our active rendering encourages an object-oriented view, we deploy a camera with a large (90-degree) FoV, so background information is still available.

Comment

A1: We ensure that the object's initial location is visible during the initialization process. In addition, statistically, in 89% of tasks the target location is also visible during initialization. An interesting statistical observation is that the average distance from the robot's initial position to the object is 3.5 meters, which is greater than the 2.0 meters from the object's initial position to the target location. Geometrically, considering the triangle formed by these three points, the robot starts from a distant point, orients toward the object, and moves toward it, which makes it more likely that the closer target location is identified. Note that the camera has a large field of view. In corner cases where the target location is not visible during initialization, it can be specified through the language instruction, such as a description of the spatial relationship of the target receptacle. We will revise the text and include more details in the final version. Thank you for your suggestion.
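Using only the average distances quoted above, and ignoring occlusion, camera height, and per-task variation, a quick bound illustrates why facing the object tends to bring the target into the 90-degree field of view (this is our own illustrative calculation, not an analysis from the paper):

```python
import numpy as np

# Average distances quoted above (meters); illustrative only.
robot_to_obj = 3.5
obj_to_target = 2.0

# If the robot looks straight at the object, a target anywhere on a circle of
# radius obj_to_target around the object is at most arcsin(d2/d1) away from the
# viewing direction (the tangent-line bound).
max_offset_deg = np.degrees(np.arcsin(obj_to_target / robot_to_obj))
print(f"max angular offset of the target: {max_offset_deg:.1f} deg (< 45 deg half-FoV)")
```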

A2: Following your suggestion, we have conducted an additional experiment using geometric features instead of vision. Results are reported in the table below. We found that behavior cloning suffers from an ambiguous learning process because the object state (position, rotation) is not available to the humanoid when using geometric features alone. This unawareness of the object state causes failures in long-horizon tasks with multiple steps, where proprioception serves as the only discriminative input for control at different steps. The policy is hard to converge due to this action ambiguity, resulting in poor performance.

            Success Rate (%)   Precision (cm)   Execution Time (s)
Train set   5.6                176.4            9.7
Test set    0                  189.8            10.0

Thanks again for your insightful and constructive suggestions. We are happy to address any further concerns about the work.

Comment

Thanks for the answers. A1 addresses my concerns. I think including the information that "we ensure that the object's initial location is visible during the initialization process" and that "in 89% of tasks the target location is also visible during initialization" is important. "Note that the camera has a large field of view. In corner cases where the target location is not visible during initialization, it can be specified through the language instruction, such as a description of the spatial relationship of the target receptacle." -- this observation is not really supported by any experiment/evaluation.

As for A2, sorry for the confusion. It is clear that the object state can be replaced by the vision part; I was referring to adding back the object state and geometric information, but now I realize that would be the oracle policy. A better experiment would be to limit the field of view and answer the question "does the model actually leverage vision for locating the target position".

I would keep my original score of weak accept and recommend accepting this paper nonetheless.

Comment

Thank you again for your insightful and constructive review, which has helped a lot in further improving our paper. We will revise the paper, clarify the task setting, and add experimental results as suggested in the final version.

Comment

The reviewer thanks the authors for the detailed response!

"The initial orientation in HITR is randomly sampled within 30 offset degrees from the object's initial position." This is a very important detail that must be included in the paper. Does this mean that the target location is always in the view of the humanoid during initialization?

Having a 90-degree FoV does not mean that the model actually leverages vision for locating the target position. I am wondering whether the authors have tested removing the vision part (but adding back the object geometric information) and evaluated its performance?

Author Response

General Response

We sincerely thank all reviewers for dedicating their time to reviewing our work, and we highly appreciate their positive ratings and recognition of it:

  • Our work addresses fundamental challenges for humanoids and points out a very promising research direction for embodied AI.
  • We propose new, well-motivated techniques with small but effective innovations.
  • Our dataset efforts are valuable for facilitating future research.

We also welcome the reviewers' constructive comments and have added additional results disclosing the failures and limitations of our system in unseen generalization. We will add these discussions in the final version. In addition, as the very first work on vision-language-directed humanoids, we leave research on ad hoc policy module design, scaling to large data and large models, long-horizon tasks, dexterous manipulation, and multi-humanoid collaboration to future work.

Final Decision

The general idea of the paper is to develop a vision-language-action model to control a human avatar performing rearrangement tasks. The paper also proposes the Human-in-the-Room dataset which includes various rearrangement tasks. The reviewers unanimously rated the paper as Weak Accept. The reviewers had concerns such as the gap between simulation and reality, diversity of the proposed dataset, and generalizability of the pipeline. Generally, the rebuttal addressed the concerns well. However, there are some inaccurate statements such as “there is no other public dataset to support humanoid-based object rearrangement research”. Frameworks such as Habitat 3.0 study rearrangement tasks with humanoids. Overall, the paper has interesting contributions, and the AC follows the recommendation of the reviewers.