Grounding Multimodal Large Language Models in Actions
Abstract
Reviews and Discussion
This paper studies how to "ground" multimodal LLMs (MLLMs) to the action spaces of agents with various embodiments. The authors examine "Action Space Adapters" (parameterization strategies; ASAs) for various embodiments and MLLMs and identify principles for constructing ASAs based on the target action space. The authors consider 5 embodied AI environments (3 continuous and 2 discrete) and over 100 tasks.
Strengths
The paper addresses an important issue for the future generalization of MLLMs to embodied environments and different action spaces. The authors conduct a variety of experiments in different environments and action spaces. The motivation for the method is given a theoretical grounding, and the experimental results provide concrete recommendations for adapting MLLMs to novel action spaces and environments. The combination of theoretical motivation and actionable takeaways is a strong contribution.
Weaknesses
Overall the paper is very dense. This is a complex topic, so understandably there is a lot packed into 9 pages, and as a result many parts of the paper are hard to follow. A fair number of passages are not clearly written, which impacts my scores. If the authors' rebuttal clarifies these points satisfactorily, I would be willing to reconsider my evaluation. The major points are given in Questions.
Some other issues are: Some terms are not clearly defined for the reader. For example, "codes" as used in Figure 3. (Again, if I simply missed where this was provided, please point it out). Similarly for "codebook".
No specific example outputs are provided. Some examples of environments are provided in the appendix, but there is no clear discussion of, for example, adaptations that the best performing methods get right or wrong. Providing these would help clarify the definitions issue and make the problem much more grounded (ironically) than the plots and charts in the paper currently allow.
Questions
- Can you be clear about the difference between SemLang and Lang? As written, it reads as if the only difference between the two is that different numbers appear in the sequence being predicted. But since SemLang predicts tokens that correspond to words, I assume there is some underlying embedding that, when softmaxed, produces the token index. In that case, where do the numbers in the vocabulary of Lang come from, and how are they being predicted if not via an underlying semantic (embedding) representation like SemLang?
- Section 3.3: "In the complex environments we consider, the zero-shot performance of MLLM is poor across all ASAs, even with detailed prompts." Is this demonstrated anywhere, either in this paper (in the appendix, perhaps?) or elsewhere (if elsewhere, please provide the citation)?
- "We train with supervised finetuning for CALVIN, Meta-World, HabPick, and BabyAI. We train with reinforcement learning on Language Rearrangement." Can you please explain why this does not result in an invalid comparison between the environments, if they are trained for each environment using a) different methods and b) to the specifications of the environment themselves? Does this not risk overfitting the ASA to the environment and the training method? Or am I misunderstanding the goal here?
- How precisely do the differences in embodiments affect the results? Are the embodiments those provided out of the box in each environment, or do you do any adjustment or controlling for the effects of the differing embodiments?
Limitations
The only limitations the authors mention are that they use only a single MLLM and the data collection requirements of RVQ and SemLang. These are good to mention but do not constitute a serious discussion of limitations IMO. There are differences in the way the models are trained/fine-tuned (see Q3 above) that complicate the axes of comparison, as do the wide variety of experiments conducted. Some discussion on the limitations of the analysis would also be welcome.
We thank the reviewer for appreciating the technical motivation, importance of the problem, and the exhaustive experimentation. The reviewer's clarification questions are addressed below and will be added to the final manuscript, which we believe will greatly improve it.
1. No specific example outputs are provided.
In the rebuttal PDF, we added visualizations of success and failure examples for the best performing action space adapter in the environments. We will include these visualizations in the paper.
We also highlight the existing Figure 4 which analyzes adapting to different interaction types in CALVIN and Meta-World. This plot shows that RVQ struggles to adapt to rotating and pushing tasks in CALVIN. Additionally, in Appendix E.1, we provide a per-task breakdown of performance, illustrating which tasks models struggle on. For example, Table 4 shows that SemLang struggles with generalizing to new instructions with spatial relationships. We will update Section 4 of the paper to summarize these qualitative analyses.
2. Some terms are not clearly defined for the reader. For example, "codes" and "codebook".
Thank you for pointing this out. We will update L164-179 to clarify the definition of "codes" and "codebook". In the VQ-VAE architecture [3], the "codebook" is defined as a matrix $C \in \mathbb{R}^{K \times D}$, where $K$ is the size of the codebook and $D$ is the dimension of each codebook element. The "codes" are the $D$-dimensional row vectors of $C$. The VQ-VAE encoder network maps a continuous action to a vector in $\mathbb{R}^{D}$ and then quantizes this vector to the closest of the $K$ row vectors in $C$. Thus, each action is mapped to a single "code" from the "codebook" by the VQ-VAE encoder (or several codes in the case of Residual VQ-VAE).
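As a minimal illustrative sketch of this quantization step (the tensor names and shapes are placeholders, not our exact implementation):

```python
import torch

def quantize(z, codebook):
    """Map encoder outputs to their nearest codebook entries.

    z:        (batch, D) continuous vectors from the VQ-VAE encoder.
    codebook: (K, D) matrix whose rows are the K codes.
    Returns the discrete code indices and the quantized vectors.
    """
    dists = torch.cdist(z, codebook)   # (batch, K) pairwise L2 distances
    indices = dists.argmin(dim=-1)     # (batch,) index of the closest code
    z_q = codebook[indices]            # (batch, D) quantized vectors
    return indices, z_q
```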
3. Can you be clear about the difference between SemLang and Lang?
We will incorporate in L136-L149 the following clarification: "For Lang and SemLang, we represent each action $a$ as a short text. The difference between the two options is how this text is selected. In the case of SemLang, we pick a text that is representative of the action itself. For example, if we work with high-level actions and action $a$ represents picking an apple, then the corresponding text is 'pick apple'. In the case of Lang, we map each action to a randomly chosen text. In our implementation, this is a sequence of integers. Note that we can pick any text for this mapping and the choice of integers is arbitrary; however, the selected text is not semantically representative of the action. To connect it with the underlying MLLM, we tokenize the action text with the MLLM text tokenizer."
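As an illustrative sketch of the two mappings (the action names and integer strings below are hypothetical, not our exact mapping):

```python
# SemLang: each discrete action maps to text describing the action itself.
semlang_map = {0: "pick apple", 1: "open drawer", 2: "go to the red door"}

# Lang: each discrete action maps to an arbitrary, non-semantic text,
# e.g. a fixed sequence of integers.
lang_map = {0: "7 3 1", 1: "4 9 2", 2: "0 5 8"}

def action_to_tokens(action_id, mapping, tokenizer):
    # Both ASAs reuse the MLLM's own text tokenizer; the model is trained
    # to emit these token sequences as its action prediction.
    return tokenizer(mapping[action_id]).input_ids
```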
4. Is poor zero-shot MLLM performance demonstrated?
The reviewer is correct that our paper does not empirically validate this claim, so we will remove it from the paper. We note that prior works [1,2] that adapt MLLMs for continuous control also only show results of finetuning MLLMs without considering zero-shot performance.
5. Why does training with supervised and reinforcement learning not result in an invalid comparison between environments?
The reviewer is correct that we do not compare across environments. We only compare across different ASAs while keeping environments, learning settings, and learning algorithms fixed for these comparisons. However, we provide a large number of environments (6 altogether) and two training schemes (RL and supervised learning) for which the different ASAs exhibit similar comparative properties, which reinforces the correctness of our conclusions. For example, when we compare which method performs best on LangR, all methods are trained with reinforcement learning, and on BabyAI, all methods are trained with supervised learning. We do not compare LangR to BabyAI success rates; we only compare success rates in the same setting. Our goal is to analyze the relative rankings of ASAs across a variety of environments to derive a conclusion about which ASA is best.
6. How do the differences in embodiments affect the results?
The embodiments are those provided out of the box for each environment. These affect the results by shaping how the agent interacts with the environment via the action space. The three continuous control environments have different robots with different degrees of freedom (DoF) in the action space, which affects how methods perform. For example, the "Uniform" ASA is greatly affected by the embodiment's action dimension: it performs well on Meta-World with 4D actions but worse as the action dimension increases in CALVIN and Habitat Pick. RVQ performs the best consistently, regardless of the action dimension. We will explicitly add these embodiment properties to Section 4.1.
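For reference, a minimal sketch of the kind of per-dimension uniform binning the "Uniform" ASA uses (the bin count and action ranges are placeholders, not our exact settings); the number of tokens emitted per action grows with the action dimension, which is one reason this adapter degrades as DoF increases:

```python
import numpy as np

def uniform_discretize(action, low, high, num_bins=256):
    """Discretize each action dimension independently into uniform bins.

    action, low, high: arrays of shape (action_dim,).
    Returns one bin index per dimension, so a 4D Meta-World action yields
    4 tokens while a higher-DoF CALVIN or Habitat Pick action yields more.
    """
    action = np.clip(action, low, high)
    frac = (action - low) / (high - low)                      # normalize to [0, 1]
    bins = np.minimum((frac * num_bins).astype(int), num_bins - 1)
    return bins
```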
7. Lacking serious discussion of limitations.
We thank the reviewer for pointing this out and will update our Limitations in Section 5 with the following text: "While our investigation of ASAs enables connecting a MLLM to various action spaces, the performance of these methods is still subpar for real-robot deployment where high success and safety are critical. MLLMs with the best ASA still struggle on simple environments like BabyAI, only achieving 40% success rate. Further work is needed to improve the performance of these methods for real-world usage. Our investigation also only studies adapting MLLMs through behavioral cloning or on-policy RL. Future work can investigate if the choice of ASA varies when adapting the MLLM with other learning algorithms such as off-policy RL or offline RL."
References:
[1] Li, Xinghang, et al. "Vision-language foundation models as effective robot imitators." 2023.
[2] Brohan, Anthony, et al. "RT-2: Vision-language-action models transfer web knowledge to robotic control." 2023.
[3] Van Den Oord, Aaron, and Oriol Vinyals. "Neural discrete representation learning." 2017.
Dear Reviewer eGot, we would be grateful if you can comment on whether our response addressed your concerns or if issues remain. To summarize, we included qualitative examples, clarified details and limitations which will be added to the final version of the paper, and demonstrated RVQ achieves even better performance with PaliGemma and full finetuning.
In this paper the authors present a way to adapt a Vision and Language model to perform action execution tasks in embodied environments. Specifically, they systematically evaluate different ways of predicting actions on tasks with both discrete and continuous action spaces. Thanks to this evaluation, it is possible to assess which action-prediction losses are most performant for different use cases. According to the authors' results, approaches based on VQ-VAEs obtain the best performance in many tasks.
Strengths
- One of the first papers that finally sheds light on the different approaches to performing action prediction in embodied environments. This is the most important contribution of this paper, and I believe it will be really useful to refer to this set of experiments for future research.
- They propose a VQ-VAE variant to generate latent codes for encoding actions. These codes can be seen as a way to learn "latent bins" that cluster the action space. Additionally, they propose a variant of this model based on the RVQ-VAE architecture to model a set of codebooks that are used to generate more precise actions.
Weaknesses
- The VQ-VAE variants are indeed really interesting and novel. However, I find the description of this method a bit unsatisfactory because it omits some details regarding "how" you train these models. See my question below for details.
- The authors chose a good set of tasks for their evaluation; however, I believe that a benchmark that would have been perfect for this work is VIMA-Bench because of its focus on systematic generalisation. CALVIN somewhat offers this, but I don't think it's as systematic as VIMA-Bench.
- For discrete action spaces they use BabyAI; however, they only test with a very limited grid size and with few tasks. Please see my questions related to this point as well.
- Some related work is missing:
  - Team, Octo Model, et al. "Octo: An open-source generalist robot policy." arXiv preprint arXiv:2405.12213 (2024).
  - Pantazopoulos, G., Nikandrou, M., Parekh, A., Hemanthage, B., Eshghi, A., Konstas, I., ... & Suglia, A. (2023, December). Multitask Multimodal Prompted Training for Interactive Embodied Task Completion. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (pp. 768-789).
Questions
- I believe that a reader would appreciate the details of how you have arranged the dataset to train the VQ-VAE variants. The appendix contains some information but it doesn't fully address the nature of the data (e.g., "What is each example?").
- Why did you decide to follow the work from Carta et al., considering that it uses only a subset of the tasks, instead of the full BabyAI benchmark, which contains a range of tasks of different complexity? Additionally, when you evaluate, are you sure that 100 episodes are enough to experience different configurations from the ones you used at training time?
- Could you please clarify what the language instructions are for environments that do not have them, such as Meta-World?
Limitations
The authors acknowledge the limitations of their experimental setup rather than the limitations of their work more broadly. Specifically, I would recommend the authors consider the societal impact of developing embodied AI models. For instance, it would be important to acknowledge that these models are still far from being deployed in real-world scenarios considering that their performance, even on simple grid-world-like environments, is barely over the 30% success rate bar.
We thank the reviewer for the comments and suggestions. We address the reviewer’s points below.
1. Details on VQ-VAE training.
Thank you for the suggestion; we will add these details to the paper. We train the VQ-VAE learned tokenizers to reconstruct actions from the same dataset used for supervised finetuning. Specifically, we randomly sample a batch of actions from the supervised finetuning dataset. A batch of single actions is then encoded by the VQ-VAE into the codebook elements. These discretized codes are then used to predict the original continuous action. Thus, the “examples” the VQ-VAE reconstructs are single continuous actions. The VQ-VAE is trained with the mean-squared error over the action reconstruction and commitment error. We train for 3 epochs over the dataset with a batch size of 32 and use the same AdamW optimizer and learning rate schedule as described in L248-250 of the paper. We will update Section 4.2 with all these details.
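A condensed sketch of this training loop is below; the `ActionVQVAE`-style `encode`/`quantize`/`decode` interface and the learning rate are placeholders, not our exact code, and the loss follows the standard VQ-VAE objective of reconstruction plus commitment error:

```python
import torch
import torch.nn.functional as F

def train_action_vqvae(vqvae, action_dataset, epochs=3, batch_size=32, beta=0.25):
    """Train a VQ-VAE tokenizer to reconstruct single continuous actions."""
    opt = torch.optim.AdamW(vqvae.parameters(), lr=1e-4)
    loader = torch.utils.data.DataLoader(action_dataset, batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for actions in loader:                              # batch of single continuous actions
            z = vqvae.encode(actions)                       # continuous latents
            z_q = vqvae.quantize(z)                         # nearest codebook entries
            recon = vqvae.decode(z + (z_q - z).detach())    # straight-through gradient
            recon_loss = F.mse_loss(recon, actions)         # action reconstruction error
            codebook_loss = F.mse_loss(z_q, z.detach())     # move codes toward encodings
            commit_loss = F.mse_loss(z, z_q.detach())       # commitment error
            loss = recon_loss + codebook_loss + beta * commit_loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```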
2. Using VIMA-Bench as a benchmark.
We don’t benchmark on VIMA-Bench because its task specifications are multimodal, meaning they include videos and images, whereas our study focuses on language instruction-specified tasks. Future work can investigate how MLLMs can interpret multimodal instructions with text and images to complete tasks.
3. Clarify what are the language instructions for environments that do not have one such as Meta-World.
For Habitat Pick, the instruction is "pick a" followed by the name of the object, e.g. "pick a mug" or "pick a plate" (L637). For Meta-World, we use the task descriptions provided by the Meta-World authors in Appendix Section A of the original Meta-World paper, e.g. "Sweep a puck off the table" is the instruction for the sweep task (L607). For CALVIN, LangR, and BabyAI, we use the instructions exactly as provided by the benchmark implementation.
4. Acknowledge broader limitations of work.
We thank the reviewer for pointing this out and will add the following to the limitations in Section 6: "While our investigation of ASAs enables connecting a MLLM to various action spaces, the performance of these methods is still subpar for real-robot deployment where high success and safety are critical. MLLMs with the best ASA still struggle on simple environments like BabyAI, only achieving 40% success rate. Further work is needed to improve the performance of these methods for real-world usage."
5. Why evaluate on Carta et al. BabyAI variant?
We evaluate with this BabyAI variant [1] because it is built for generalization to unseen language instructions. The standard BabyAI benchmark has no separate train and test split for language instructions. The Carta et al. variant evaluates generalization to unseen synonyms. We use this version to better test the language understanding capabilities of MLLMs.
6. Are you sure that 100 episodes are enough to experience different configurations from training ones?
Yes, we believe 100 episodes sufficiently cover different configurations to test generalization, since all of these episodes use instructions unseen during training, formed by replacing words with synonyms. We also change the environment random seed from the one used to generate the training demonstrations.
7. Missing related work.
We thank the reviewer for pointing out these relevant works and will update our related work to include them.
References:
[1] Carta, Thomas, et al. "Grounding large language models in interactive environments with online reinforcement learning." International Conference on Machine Learning. PMLR, 2023.
Thanks for the clarification. Just a quick follow-up on some pointers that I don't think are fully addressed:
On VIMA-Bench: I reported VIMA-bench as an archetype of a benchmark that truly challenges models at different generalisation levels which I believe is a must to showcase that we're training truly robust and versatile models. Also, VIMA-bench instructions are multimodal in the sense that certain instructions contain visual references to the target objects (i.e., in the form of bounding boxes). This can be easily changed by replacing the bounding box with the corresponding name of the objects (easily accessible from the metadata). Additionally, authors could use the COLOSSEUM benchmark which is another relevant benchmark for testing the generalisation skills of agents: https://arxiv.org/abs/2402.08191.
BabyAI variant: I'm a bit sceptical about the fact that only 100 episodes can cover different generalisation levels. Particularly, I don't think that testing only on 100 examples is a suitable way to demonstrate the capabilities of these models in grid worlds. This becomes particularly important because the BabyAI variant is the benchmark with the worst performance in the paper. There is a chance that increasing the number of examples might bring your success rate further down. I invite the authors to reconsider this choice.
Finally, I would like to ask the authors to please provide all the details that are currently missing from the paper. I think this would make your paper much stronger and generally more useful to anybody interested in Embodied AI.
Thank you for the response. We added the details of VQ-VAE training, clarified language instructions for all environments, and updated the limitations discussion, but the paper PDF cannot be updated during the rebuttal, and we will add these changes to the camera ready version. Let us know what other information is missing and we are glad to provide it.
1. On VIMA-Bench: We agree that VIMA-Bench and COLOSSEUM are great benchmarks for systematically testing the generalization capabilities of models and will investigate their use in future works. Converting the visual references to target objects in VIMA-Bench is also a good idea to make it compatible with our text instruction conditioned policies.
2. BabyAI variant: We re-ran our BabyAI evaluations for 1,000 episodes for all action space adapters and found the results are almost the same as evaluating on 100 episodes, with SemLang still performing the best with 41% success rate compared to the previous 40% under 100 evaluation episodes. With 1,000 evaluation episodes, the success rate of the MLP ASA is 32%, and the Lang ASA is 30% compared to 32% and 29% success rate with 100 episodes. We updated the BabyAI numbers in the paper with these new results.
Let us know if there are any remaining unanswered questions or new reservations.
Apologies for the delay. I really appreciate your effort. It would be great to also clearly specify the distribution of tasks during training and eval just to make your number more sound.
Assuming that the authors will refine the manuscript following the reviewers' suggestions, I'm happy to increase my score and support this paper.
I would suggest the authors provide a list of the changes they plan to make / have made to address the comments, in order to help the AC decide.
Dear Reviewer YrxP, we would be grateful if you can comment on whether our response addressed your concerns or if issues remain. To summarize, we increased the number of BabyAI evaluation episodes to 1000, clarified paper details and limitations which will be added to the final version of the paper, included qualitative examples, and demonstrated that RVQ achieves even better performance with PaliGemma and full finetuning.
Thank you for the suggestion; we posted a message to all reviewers listing the changes that will be included in the final paper. For BabyAI training, we uniformly sample grid states and collect an equal number of demonstrations for each of the 5 instruction types. For BabyAI evaluation, we uniformly sample grid states with a different random seed, replace nouns and adjectives in each instruction with the out-of-vocabulary substitutions defined by Carta et al., and evaluate on 200 episodes for each of the 5 instruction types, for 1,000 total evaluation episodes.
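To make the evaluation protocol concrete, here is a rough sketch of how the episodes are constructed; the instruction-type names and synonym map below are placeholders (the actual out-of-vocabulary substitutions are those defined by Carta et al.):

```python
import random

# Placeholder names; the evaluation uses 5 BabyAI instruction types.
INSTRUCTION_TYPES = ["type_1", "type_2", "type_3", "type_4", "type_5"]
# Hypothetical examples; the real substitutions are defined by Carta et al.
OOV_SUBSTITUTIONS = {"ball": "sphere", "purple": "violet"}

def build_eval_episodes(sample_grid_and_instruction, episodes_per_type=200, seed=1):
    rng = random.Random(seed)  # seed differs from the one used for training demos
    episodes = []
    for task in INSTRUCTION_TYPES:
        for _ in range(episodes_per_type):
            grid, instruction = sample_grid_and_instruction(task, rng)
            for word, synonym in OOV_SUBSTITUTIONS.items():
                instruction = instruction.replace(word, synonym)
            episodes.append((grid, instruction))
    return episodes  # 5 types x 200 episodes = 1,000 evaluation episodes
```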
The paper empirically studies how to properly ground MLLMs into embodiments, with a particular focus on the action representations, including the continuous and discrete solutions. The authors conduct a thorough study on 7 methods, encompassing over 114 tasks. The research indicates that for continuous actions, optimal results are achieved by learning action tokens that precisely represent the action distribution (RVQ). For discrete actions, superior outcomes are obtained by aligning actions with the original semantic token space.
Strengths
- How to properly ground MLLMs in embodied tasks is important and under-explored.
- The paper is the first to systematically and comprehensively study the optimal recipe for action tokens.
- The conclusions drawn from the empirical studies could provide guidance for subsequent research.
Weaknesses
- All experiments are performed on LLaVA and LoRA. Conclusions may not be applicable to other MLLM architectures or scales.
- Based on existing conclusions, the quality of the paper could be further enhanced if the authors could suggest whether to use continuous or discrete methods, how to break through the accuracy limit of current methods, or where future work should focus on improvements.
Questions
Please see the Weaknesses.
Limitations
The authors discussed limitations in Sec. 5.
We thank the reviewer for the comments and suggestions. We address the reviewer’s points below.
1. Conclusions may not be applicable to other MLLM architectures or scales.
In the main rebuttal PDF, we show results with a different MLLM (PaliGemma [1]) and a different finetuning method (full finetuning as opposed to LoRA). Firstly, we swap the LLaVA MLLM with the recently released PaliGemma model, which uses a different LLM and visual encoder architecture and a higher visual resolution at 448 pixels per dimension. We finetune all 3B parameters of the PaliGemma MLLM and achieve 92% success rate in Meta-World, higher than 84% success rate with LLaVA and LoRA finetuning. Secondly, we finetune all 7 billion parameters of the LLaVA MLLM with the best performing RVQ ASA in the same Meta-World setup as from Section 4. This full finetuning achieves 85% success rate, which is slightly higher than the 84% success rate when finetuning 140 million parameters with LoRA.
These initial results in Meta-World indicate that our conclusions around RVQ being a strong action space adapter extend to other MLLMs and finetuning methods.
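To make the comparison between the two finetuning regimes concrete, a minimal sketch using the Hugging Face peft library is shown below; the model identifier, rank, and target modules are illustrative placeholders rather than our exact configuration:

```python
from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

# Illustrative checkpoint; swap in the MLLM being adapted (LLaVA or PaliGemma).
model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")

USE_LORA = True
if USE_LORA:
    # LoRA: only low-rank adapter weights are trained (~140M parameters in our LLaVA setup).
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"])
    model = get_peft_model(model, lora_cfg)
else:
    # Full finetuning: every parameter receives gradients (7B for LLaVA, 3B for PaliGemma).
    for p in model.parameters():
        p.requires_grad = True
```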
2. Where should future work focus on improvements?
One area for improvement in the current analysis is generalization to unseen instructions through the core MLLM's language understanding. For example, even the best action space adapters struggle with generalizing to new spatial instructions (see the low "putnext" task performance in BabyAI in Table 6 and the low "Spatial" split for LangR in Table 4). Another area of improvement is the ability to handle very precise action sequences. As seen in Figure 4, which breaks down performance by interaction type, MLLMs struggle with precise manipulation tasks like rotating or pushing. We will add these points to the discussion at the end of the paper.
References:
[1] Beyer, Lucas, et al. "PaliGemma: A versatile 3B VLM for transfer." arXiv preprint arXiv:2407.07726 (2024).
Dear Reviewer f4rS, we would be grateful if you can comment on whether our response addressed your concerns or if issues remain. To summarize, we demonstrated RVQ achieves even better performance with PaliGemma and full finetuning, included qualitative examples, and clarified paper details and limitations that will be added to the final version of the paper.
We thank the reviewers for their useful and insightful feedback. Reviewers highlighted that our study is one of the first on how to properly ground MLLMs in embodied action (f4rS, YrxP) and provides concrete and actionable takeaways that will be useful for future research (YrxP, eGot). We highlight that our rebuttal added:
- New results showing our RVQ tokenization works for PaliGemma in addition to the LLaVA results in the paper, and for full finetuning in addition to the LoRA results in the paper, in Table 1 of the rebuttal PDF (f4rS).
- Expanded the limitations section regarding the learning algorithm (eGot), performance for real-world deployment (YrxP), and how future work can improve performance (f4rS).
- Included qualitative examples of successes and failures in Figure 1 of the rebuttal PDF (eGot).
- Clarified the details of the VQ-VAE training and architecture (eGot, YrxP).
We will add these new results and details to the main paper, and believe these strongly address all of the major concerns of the reviewers. We again thank the reviewers for their great suggestions.
We again thank the reviewers for their useful feedback. Below, we summarize the changes we will include in the final paper based on this feedback:
- Extend the limitations in Sec. 5 regarding the learning algorithm (eGot), performance for real-world deployment (YrxP), and how future work can improve performance (f4rS).
- Update Sec. 4.2 of the paper with the details of VQ-VAE training and architecture (eGot, YrxP), along with clarifying the difference between SemLang and Lang (eGot).
- Add the qualitative examples from the rebuttal PDF and summarize the existing qualitative analyses in Sec. 4 (eGot).
- Include the results showing the RVQ action space adapter works for PaliGemma and full finetuning in Meta-World.
- Further discuss how different embodiments affect results in Sec. 4.1 and remove the claim about MLLM zero-shot performance (eGot).
- Update all BabyAI evaluation numbers for 1,000 evaluation episodes and clarify the train and evaluation distributions (YrxP). Also, clarify the language instructions for all environments in Sec. 4.1.
- Discuss the two works mentioned by YrxP in the related work section.
This paper provides a comparative analysis of various action space adaptors that enables the use of VLMs for embodied agents. The reviewers have agreed that this work will be valuable for future research in the field, leading to the decision to accept this paper.
However, the reviewers have raised several concerns about the current form of the paper. The authors are strongly encouraged to address these issues as outlined in their official comment. Addressing these concerns will enhance the paper's clarity, rigor, and overall impact on the research community.