Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better
Abstract
Review and Discussion
This paper proposes a method for training generalist VLA models by co-training on large robot and VLM datasets. To address the challenge that naive co-training may hinder effective knowledge transfer from the VLM to robotics, this paper proposes a stop-gradient strategy. Specifically, it stops the gradient flow from the action expert to the backbone, allowing the backbone to be fine-tuned solely through a next-token prediction loss on both discretized actions and general language tokens, while the action expert is independently trained using flow-matching on continuous actions.
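A minimal sketch of this training scheme (illustrative PyTorch, not the authors' implementation; the `backbone` and `action_expert` interfaces, tensor shapes, and the linear interpolation path for flow matching are assumptions based on the description above):

```python
import torch
import torch.nn.functional as F

def knowledge_insulation_loss(backbone, action_expert, obs_tokens, target_tokens,
                              actions, noise, t):
    """Backbone: next-token prediction on language + discretized action tokens.
    Action expert: flow matching on features detached from the backbone."""
    feats, logits = backbone(obs_tokens)              # shared VLM forward pass
    # 1) Next-token prediction on discretized actions and general language
    #    tokens; these gradients DO flow into the VLM backbone.
    ntp_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               target_tokens.reshape(-1))
    # 2) Flow matching on continuous actions trains only the expert:
    #    detach() stops the gradient from reaching the backbone ("insulation").
    x_t = (1 - t) * noise + t * actions               # linear interpolation path
    v_target = actions - noise                        # target velocity field
    v_pred = action_expert(feats.detach(), x_t, t)
    fm_loss = F.mse_loss(v_pred, v_target)
    return ntp_loss + fm_loss
```

The single `detach()` call is the entire "insulation": both losses are computed in one forward pass, but only the discrete next-token loss updates the backbone.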
Strengths and Weaknesses
Strengths:
- This paper conducts thorough experiments to demonstrate the importance of co-training in improving the language-following ability of VLAs. The experiments are challenging, and the performance of the model is impressive.
- The authors propose a novel training recipe, termed as knowledge insulation, which effectively mitigates gradient interference between the VLM backbone and the action expert in naive co-training strategies.
Weaknesses:
- The training data used in the experiments is outlined in the paper, but further clarification or elaboration may be beneficial.
- The concept of knowledge insulation is intriguing, and similar phenomena have been explored in prior work. For example, ChatVLA[1] identifies that training exclusively on robot data causes the VLA to lose its multimodal capabilities entirely after fine-tuning, and it proposes a method for unifying multimodal understanding with robot control. ChatVLA-2[2] enables the VLA to preserve pretrained knowledge from the VLM while following reasoning in manipulation tasks. A discussion on how this approach relates to and differs from these works would be valuable.
[1] ChatVLA: Unified Multimodal Understanding and Robot Control with Vision-Language-Action Model [2] ChatVLA-2: Vision-Language-Action Model with Open-World Embodied Reasoning from Pretrained Knowledge
Questions
- Are reasoning language tokens generated at test time?
- The example of reasoning is missing in the demo. Can you provide some demo samples that include the reasoning generated during inference?
Limitations
Yes
Formatting Issues
No
Thank you for the positive review and feedback!
The training data used in the experiments is outlined in the paper, but further clarification or elaboration may be beneficial.
We will expand the description of the training data for a revised version of the paper and include more detailed descriptions in the appendix.
The concept of knowledge insulation is intriguing, and similar phenomena have been explored in prior work. For example, ChatVLA[1] identifies that training exclusively on robot data causes the VLA to lose its multimodal capabilities entirely after fine-tuning, and it proposes a method for unifying multimodal understanding with robot control. ChatVLA-2[2] enables the VLA to preserve pretrained knowledge from the VLM while following reasoning in manipulation tasks. A discussion on how this approach relates to and differs from these works would be valuable.
Thank you for pointing us to these papers, we agree that they are relevant and will include a discussion of them in a revised version of the paper.
- Are reasoning language tokens generated at test time?
- The example of reasoning is missing in the demo. Can you provide some demo samples that include the reasoning generated during inference?
For the experiments with the mobile manipulators, the model gets as input a “high-level” task such as “clean the bedroom”, and then first outputs a “low-level” task in language such as “pick up the pillow”. The model then predicts an action corresponding to the “pick up the pillow” sub-task. Currently, those are two separate inference calls to the same model (one call for autoregressive generation of the subtask followed by a call to generate the actions through integrating the flow field). We will be more explicit in a revised version of the paper.
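This two-call procedure could be sketched as follows (hypothetical interfaces: `generate_text` and `predict_velocity` are assumed stand-ins for the model's autoregressive decoding and flow-field prediction, and the fixed-step Euler integrator is one simple choice for integrating the flow):

```python
import torch

def hierarchical_inference(model, images, high_level_task, num_steps=10,
                           action_dim=7, horizon=8):
    # Call 1: autoregressively decode a low-level subtask in language,
    # e.g. "clean the bedroom" -> "pick up the pillow".
    subtask = model.generate_text(images, prompt=high_level_task)
    # Call 2: generate a chunk of continuous actions by Euler integration
    # of the learned flow field, from Gaussian noise at t=0 to t=1.
    x = torch.randn(horizon, action_dim)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(i * dt)
        v = model.predict_velocity(images, subtask, x, t)
        x = x + dt * v                    # one Euler step along the flow
    return subtask, x                     # action chunk for the subtask
```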
Thank you for thoughtfully and thoroughly addressing the two concerns I raised about this work, and for agreeing to make further improvements. I have carefully reviewed the authors' responses and would like to express my appreciation for their clarifications and updates.
This paper primarily investigates how to better train VLA models that incorporate diffusion or flow matching action experts. The proposed approach involves training the VLM backbone with discrete actions, training the action expert with continuous actions, and stopping gradient flow from the action expert to the VLM backbone. The authors conduct extensive real-world experiments to validate both the performance of their model and the effectiveness of the proposed training scheme.
Strengths and Weaknesses
Strengths
- The paper clearly states its motivation and is well-written and easy to follow.
- It proposes an effective training method for VLA models that use diffusion or flow matching action experts.
- The experiments are thorough and well-designed, providing strong empirical evidence to support the paper's claims and demonstrating the advantages of the proposed model.
Weaknesses
- It appears that part of the robot action data used to train the model is not open-sourced, which may hinder reproducibility.
Questions
- If the VLM backbone is first pretrained using discrete action data (with VLM data), and then frozen while only finetuning the action expert, would this yield similar results to jointly training the VLM backbone and action expert with gradient stopping?
- In Figure 7, why do Pi0 and joint training without VLM data perform better in the OOD setting than in the ID setting? Additionally, the performance gap between ID and OOD for the proposed method is relatively small. Does this suggest that the OOD tasks are too simple or not sufficiently distinct from the ID tasks?
- Section A.3 of the supplementary material lists several VLM tasks used during training. To what extent does the object localization task contribute to the model's overall performance?
Limitations
Yes
Final Justification
The paper is technically solid, and the authors have addressed my concerns. Therefore, I am keeping my score.
Formatting Issues
None
Thank you for your positive review and questions!
It appears that part of the robot action data used to train the model is not open-sourced, which may hinder reproducibility.
We present experiments on LIBERO and DROID, datasets and setups that are fully open source. We will open-source a model checkpoint trained on the DROID dataset. Further, while our whole dataset is very large and hard to open-source, we will also open-source a model checkpoint trained on our whole data. The release will be accompanied by fine-tuning scripts for further model training. We hope this addresses your concerns.
If the VLM backbone is first pretrained using discrete action data (with VLM data), and then frozen while only finetuning the action expert, would this yield similar results to jointly training the VLM backbone and action expert with gradient stopping?
This is an excellent point, thanks for raising it. In principle, it is possible to do what you suggest, and we found that it yields good results. However, the total training time in this case, both in terms of wall-clock time and total number of gradient steps, increases significantly. For example, we tried first training only the VLM backbone and then training the action expert on the full data mixture, but even after 160% of the training time (i.e., allowing another 60% of total steps to train only the action expert) the models are not yet fully comparable (even though we would expect convergence to eventually yield a policy that is as good). In contrast, the computational overhead of adding the action expert with stop-gradient is minimal and does not significantly reduce the training steps/s, hence it is significantly more efficient to have the action expert in the model from the beginning.
In Figure 7, why do Pi0 and joint training without VLM data perform better in the OOD setting than in the ID setting? Additionally, the performance gap between ID and OOD for the proposed method is relatively small. Does this suggest that the OOD tasks are too simple or not sufficiently distinct from the ID tasks?
The OOD tasks contain objects not seen in the robot training data and were designed prior to knowing the performance of the policies on these objects. In the OOD setting, we prompt the model with both the object category and object color, while in the ID setting with the object category only. Adding the color reference indeed seems to make the task easier for “Pi0” and “joint training without VLM data”. It turns out that for our method, removing the color reference leads to the same performance. We will include details in the appendix of each set of objects and a picture of the two tasks for clarity.
Section A.3 of the supplementary material lists several VLM tasks used during training. To what extent does the object localization task contribute to the model’s overall performance?
Performing this ablation cleanly would involve a full pretraining run, which is very resource-intensive. Some of the web datasets, such as Cambrian, also include localization as part of their diverse mixture. Preliminary experiments have qualitatively shown that object localization helps with the robustness of the mobile policies in unseen environments.
Thank you for your informative response, I appreciate it. My concerns have been addressed, and I will maintain my score.
This paper investigates the integration of continuous action prediction with knowledge preservation of a pretrained VLM. It demonstrates that the current approach significantly harms both training speed and knowledge transfer. The paper provides an extensive analysis of various design choices and their impact on performance and knowledge transfer, and proposes a technique for insulating the VLM backbone during VLA training that mitigates this issue.
Strengths and Weaknesses
pros:
- This paper investigates an interesting question: the integration of continuous action prediction and vision-language ability.
- This paper demonstrates a clean improvement compared to the recent state-of-the-art pi0.
- The robot experiments are extensive and solid.
cons: The paper claims the proposed approach achieves better knowledge insulation for VLA models from the VLM. However, the evaluation is mainly based on manipulation performance. Meanwhile, the paper does not quantitatively demonstrate the preservation of VLM knowledge, e.g., by evaluating the model on the corresponding benchmarks.
Overall, this paper is good and expands the knowledge boundary of vision-language-action models.
Questions
- L80-L81 state: "The core idea of VLAs is to fine-tune pre-trained vision-language models (VLMs) for action prediction." This is confusing. Do you think VLAs are limited to fine-tuned VLMs? What about VLA models trained from scratch? It should also be possible to jointly train a VLA model from scratch on both VLM datasets and robot control data. Most current VLA work does focus on fine-tuning a VLM, but the claim in L80-L81 is therefore too strong.
- This paper proposes to stop the gradient propagation from the continuous action expert while keeping gradient propagation from the discretized actions. Did you evaluate a training scheme without the discretized actions but with VLM supervision? I found that "text state" refers to the discretized actions, and thus could not figure out which ablation corresponds to the experiment without discretized actions.
Limitations
yes
Final Justification
The response has well addressed my concerns. I maintain my rating.
Formatting Issues
n/a
Thank you for your positive and detailed feedback!
L80-L81 state: "The core idea of VLAs is to fine-tune pre-trained vision-language models (VLMs) for action prediction." This is confusing. Do you think VLAs are limited to fine-tuned VLMs? What about VLA models trained from scratch? It should also be possible to jointly train a VLA model from scratch on both VLM datasets and robot control data. Most current VLA work does focus on fine-tuning a VLM, but the claim in L80-L81 is therefore too strong.
In our understanding, the term VLA refers to a model that combines large scale language, vision-language, and robot action data. While it is possible to train such a model from scratch, we are not aware of this ever being done at scale comparable to VLM pretraining from scratch.
We will reword this sentence to say that “The idea of most VLAs is to fine-tune pretrained …”.
This paper proposes to stop the gradient propagation from the continuous action expert while keeping gradient propagation from the discretized actions. Did you evaluate a training scheme without the discretized actions but with VLM supervision?
In Figure 4a and Figure 8, the "frozen backbone" experiment can be interpreted as a training scheme with (web) VLM supervision only. Since the model has been pretrained on web VLM data, training it with VLM supervision only (as you suggest) is very similar to just freezing the backbone. For the "items in drawer" task, this can have some success (around 30%), while for the "shirt folding" task, the success rate is 0, both indicating that VLM supervision alone is not sufficient.
I found that "text state" refers to the discretized actions, and thus could not figure out which ablation corresponds to the experiment without discretized actions.
The ablation that corresponds to training without discretized actions is the pi0 ablation in all figures. “Text state” refers to the representation of the robot’s proprioceptive state (joint angles) as text.
Thanks for your response, and it has marginally addressed my concerns.
For the second point, I mean that we do not utilize the discretized actions but rather common QA supervision (e.g., grounding) for the same robot observations.
Best,
For the second point, I mean that we do not utilize the discretized actions but rather common QA supervision (e.g., grounding) for the same robot observations.
Thank you so much for the clarification and for raising this excellent point. Training the VLM backbone with QA supervision on the same robot observations as a representation learning objective actually performs worse than freezing the backbone entirely in our experiments. The reason is that this QA supervision is comparatively easy for the model to learn, and hence we hypothesize that the resulting backbone representations are no longer expressive enough for the more complex action prediction task. It may well be that more complex QA supervision would be sufficient, but we found the type of robot planning and object detection tasks on robot observations we are using to be insufficient.
Does this address your point?
Just checking in to ask whether you have seen this comment. Thank you so much again for raising this point.
The paper investigates the integration of VLM into robotics, noting that to adapt the VLM architecture to a continuous robotic control, often ad-hoc adapters are trained from scratch, and fine-tuning the VLM backbone results in performance degradation. The paper finds that the gradient coming from continuous actions does not adapt well to the VLM's discrete nature, and proposes to train a continuous flow matching expert while stopping the gradient flow between it and the VLM. Such a technique is presented with the name of "knowledge insulation", and is tested on both simulated and real-world tasks.
Reviewers have unanimously positive opinions: the research question has been found relevant and the experiments satisfactory, with a clear improvement over previous approaches, and overall the paper is well motivated and written. The main weaknesses that emerged during the reviewing process appear to be minor clarification requests (e.g., discussion of previous works) and the fact that the dataset will not be entirely open-sourced. During the discussion phase, all the reviewers expressed satisfaction with the provided answers, and the general opinion remained unanimously in favour of acceptance.
After careful examination, the AC does not find reasons to overturn the reviewers' decision. The paper can be considered for a spotlight, since the addressed problem is of interest to a large audience, and the simplicity of the idea could easily be picked up by future works.