Thank you for your follow-up question. Below, we provide additional technical details:

Input Modalities
- Both the high-level policy and low-level policy are conditioned on two or three images, depending on the specific task. For each task, we use:

One third-person camera view.
One wrist-mounted camera per robot arm (one or two arms).
- Each image has a resolution of 224×224 pixels. These images are separately processed by the model’s vision encoder, and their resulting latent tokens are then concatenated.

Language Conditioning
- We also condition the policies on natural language instructions, tokenized using the language tokenizer from the underlying LLM. The language tokens are concatenated with the vision tokens inside the model to enable multimodal reasoning.
Model Initialization
- While our method can be trained from scratch or finetuned from any VLM backbone, in practice we use PaliGemma [1] as the base model. This is an open-source, 3-billion-parameter VLM that offers a good balance between performance and computational efficiency.
- We unfreeze the full model for finetuning.
Optimizer and Hyperparameters
- We use the AdamW optimizer [2] with , , and no weight decay.
- Gradient norm is clipped to a maximum magnitude of 1.
- We maintain an Exponential Moving Average (EMA) of network weights with a decay factor of 0.999.
- The learning rate starts with a short warm-up (1,000 steps) and then remains constant at .
- Batch size is 512.
Training Duration and Resources
- Training the high-level policy is highly efficient, taking about 2 hours on 8×H100 GPUs.
- The low-level policy follows a similar training pipeline, though training times can vary depending on the dataset size and complexity of the target tasks for action prediction.

We hope these details clarify our training pipeline and hyperparameters. Please let us know if there is any other information we can provide.

References

[1] Beyer, Lucas, et al. “PaliGemma: A versatile 3B VLM for transfer.” 2024.

[2] Loshchilov, Ilya, and Frank Hutter. “Decoupled weight decay regularization.” 2017.