Diffusion Transformer Policy
Abstract
We present Diffusion Transformer Policy, which utilizes large transformers as a denoising network to denoise continuous actions conditioned on language instructions and image observations.
Reviews and Discussion
This paper identifies the challenges posed by large-scale robotic datasets such as the Open X-Embodiment Dataset, where variations in camera perspectives and changes across different robot embodiments create an enormous action space that traditional MLP action heads struggle to capture, leading to limited generalization capabilities. To address this, the authors introduce a causal transformer as the backbone of their conditional diffusion model. This model denoises action blocks using both language instructions and visual observations as input in an end-to-end fashion. This approach maintains scalability across large, cross-embodiment datasets while achieving superior generalization performance in new camera perspectives and environments.
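For reference, under the standard DDPM noise-prediction formulation (the paper's exact parameterization may differ), the objective for an action chunk $a_{t:t+T}$ conditioned on a language instruction $\ell$ and image observations $o_t$ would take the form

$$
\mathcal{L}(\theta) = \mathbb{E}_{k,\,\epsilon \sim \mathcal{N}(0, I)}\Big[\big\|\epsilon - \epsilon_\theta\big(\sqrt{\bar{\alpha}_k}\, a_{t:t+T} + \sqrt{1-\bar{\alpha}_k}\,\epsilon,\; k,\; \ell,\; o_t\big)\big\|^2\Big],
$$

where $k$ is the diffusion timestep and $\epsilon_\theta$ is the causal-transformer denoising network.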
Strengths
The paper shows remarkable generalization capabilities across different simulated and real-world environments. Utilizing a large-scale Transformer model for action denoising demonstrates superior scalability compared to previous small-scale MLP-based denoising heads and discretized action representations, and shows significant advantages over existing methods even when facing changes in camera perspectives or environmental variations.
Weaknesses
I greatly respect the experimental design and the impressive generalization results presented in this paper. This work offers meaningful contributions on the engineering front. However, I have some questions regarding the academic novelty of the research.
Regarding the overall pipeline, employing a causal Transformer that accepts both text tokens and image tokens as multimodal inputs is an approach that has been used extensively in prior research. Moreover, the paper's encoding of natural language and image features relies directly on existing models (e.g., CLIP and DINOv2) without contributing any distinct insights of its own.
The core idea of this work is to use a Transformer as the denoising network, replacing the conventional MLP head, in order to capture complex action spaces and achieve stronger generalization. However, the advantages of the Transformer architecture for capturing complex action spaces have already been explored extensively in the "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion" paper. How does introducing the Transformer to improve scalability in this work differ from previous approaches?
In summary, this work contributes significantly to engineering and applications in the related field. However, its academic novelty, particularly in terms of theoretical contributions and architectural design, is not particularly strong.
Questions
Can the authors clarify how their transformer-based approach differs from and improves upon previous work such as the Diffusion Policy paper and other transformer-based policies, especially in terms of architectural innovation?
This paper presents Diffusion Transformer Policy, a DiT-based generalist diffusion policy. Different from previous diffusion policies, the DiT policy directly uses a causal transformer to perform diffusion denoising conditioned on in-context multimodal tokens. Specifically, the language instruction and image observations are tokenized by CLIP and DINOv2, respectively. The image tokens are then compressed with a Q-Former, conditioned on the language tokens via FiLM. After that, the language tokens, compressed image tokens, diffusion timestep, and noisy actions are passed to a causal transformer for action denoising. Experiments in simulation environments (ManiSkill2, CALVIN) and on a real-world robot platform (Franka arm) demonstrate the performance of the proposed policy.
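To make the described pipeline concrete, here is a minimal sketch of the in-context denoising pass; all module choices, feature dimensions, and names are illustrative assumptions rather than the authors' implementation:

```python
# Minimal sketch of the in-context denoising pass described above, assuming a
# standard epsilon-prediction target; module choices, feature dimensions, and
# names are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn

class DiTPolicySketch(nn.Module):
    def __init__(self, dim=768, action_dim=7, n_queries=16):
        super().__init__()
        self.lang_proj = nn.Linear(512, dim)    # CLIP text features (assumed 512-d)
        self.img_proj = nn.Linear(1024, dim)    # DINOv2 patch features (assumed 1024-d)
        self.queries = nn.Parameter(torch.randn(n_queries, dim))  # Q-Former-style learned queries
        self.qformer = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.film = nn.Linear(dim, 2 * dim)     # FiLM scale/shift predicted from language
        self.time_emb = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.action_in = nn.Linear(action_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.causal_tf = nn.TransformerEncoder(layer, num_layers=12)
        self.noise_head = nn.Linear(dim, action_dim)

    def forward(self, lang_feats, img_feats, noisy_actions, t):
        lang = self.lang_proj(lang_feats)                        # (B, L, D)
        img = self.img_proj(img_feats)                           # (B, P, D)
        # Compress image tokens with learned queries, then FiLM-condition on language.
        q = self.queries.unsqueeze(0).expand(img.size(0), -1, -1)
        img_c, _ = self.qformer(q, img, img)                     # (B, Q, D)
        scale, shift = self.film(lang.mean(1, keepdim=True)).chunk(2, dim=-1)
        img_c = img_c * (1 + scale) + shift
        # In-context conditioning: concatenate all tokens and denoise with a causal mask.
        t_tok = self.time_emb(t.float().view(-1, 1, 1))          # (B, 1, D)
        act = self.action_in(noisy_actions)                      # (B, T, D)
        tokens = torch.cat([lang, img_c, t_tok, act], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.causal_tf(tokens, mask=mask)
        return self.noise_head(h[:, -noisy_actions.size(1):, :])  # predicted noise per action step
```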
Strengths
- The paper utilizes DiT in an in-context conditioning style for continuous action-chunk denoising, incorporating the strong scaling ability of Transformers for generalist robot policy learning. While several concurrent works also use DiT to improve manipulation policies, the idea has its own novelty. The policy architecture is also well explained and illustrated.
- Thorough ablations on trajectory length, observation length and execution steps.
Weaknesses
- Real-World Experiments.
- Task Setting. The task settings are too easy, involving only “picking” operations. Although the authors claim that it is a challenging setup with small objects (L370-371), I believe that picking small objects alone is not that difficult. I would suggest adding real-world manipulation experiments beyond picking (and pick-and-place), e.g., opening a drawer/door, pouring (with rotation actions), and long-horizon tasks.
- Few-Shot Setting. Collecting 50 trajectories for each of the first three tasks (L368-369) results in 150 demonstrations in total, which is not a few-shot setting. Previous works [1, 2] have shown that 50 real-world demonstrations are sufficient for a policy to learn complex behaviors far beyond picking, even without pre-training. Generally, fewer than 10 demonstrations is considered “few-shot” [3, 4] in robot learning.
- Few-Shot Baselines. While the authors employ Octo [5] and OpenVLA [6] as zero-shot generalization baselines, it would be more consistent and convincing to also include both methods as few-shot generalization baselines.
- Few-Shot Results. The current “few-shot fine-tuning” results in Table 4 seem a little low for such simple picking operations. Please provide additional object information (Questions 4 & 5) so that task difficulty can be assessed. The videos in the supplementary material also lack unseen-object evaluations (task-4 and task-5) and baseline performance.
- Improper Baseline Choices in the CALVIN Experiment. Developing an expressive generalist policy with better generalization performance appears to be the key motivation of the paper. The generalization ability of the DiT policy is heavily boosted by large-scale OXE [7] pre-training (in the CALVIN experiment, the policy outperforms its non-pre-trained variant by 1.2 in success sequence length, L022-023), which is reasonable since large-scale pre-training exposes the policy to more robot data. Therefore, the baselines in the CALVIN experiment should also have access to a similar amount of robot data. Specifically,
- (a) If the authors want to emphasize the generalization ability of the DiT policy itself (not boosted by large-scale pre-training), they should compare the DiT policy without pre-training against some of the following policies: 3D Diffuser Actor [8], MDT [9], MT-ACT [10], Multitask-DP [11], etc., all trained with a similar amount of data.
- (b) If the authors want to emphasize that the DiT policy is suited to large-scale pre-training and gains better generalization performance from it, they should compare the DiT policy with pre-training against some of the following policies: RT-1/2-X [7], Octo [5], OpenVLA [6], MT-ACT [10] with OXE pre-training, and Multitask-DP [11] with OXE pre-training, all fine-tuned with the same amount of data.
Also, it is worth noting that MDT [9] performs well on the CALVIN benchmark even without large-scale robot-data pre-training, though it was not evaluated under the ABC→D setting in its paper. Evaluating MDT under this setting would provide a strong baseline on CALVIN, even for situation (b). For most image-based policies [5, 6, 7, 9, 10, 11], the observation modalities can easily be aligned with the DiT policy (single-view RGB images) for fairness.
- Writing. There are several typos and unclear statements in the paper:
- L137: QFormer is not properly cited for the first appearance.
- L276: Table I, Disc ActionHead, SingleYCB “22.%”. Missing number.
- L311: “…, and 1 dimension for gripper pose.” Is it “gripper position”?
- L358: the first letter of “setup” should be capitalized.
- L361: “… different from the scenes in Open X-Embodiment (Stone et al.)” Incorrect citation.
[1] Tony Z. Zhao, et al., “Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware”, RSS 2023.
[2] Cheng Chi, et al., “Diffusion Policy: Visuomotor Policy Learning via Action Diffusion”, RSS 2023.
[3] Vivek Myers, et al., “Policy Adaptation via Language Optimization: Decomposing Tasks for Few-Shot Imitation”, CoRL 2024.
[4] Kourosh Hakhamaneshi, et al., “Hierarchical Few-Shot Imitation with Skill Transition Models”, ICLR 2022.
[5] Octo Model Team et al., “Octo: An Open-Source Generalist Robot Policy”, RSS 2024.
[6] Moo Jin Kim, et al., “OpenVLA: An Open-Source Vision-Language-Action Model”, arXiv 2024.
[7] Open X-Embodiment Collaboration, et al., “Open X-Embodiment: Robotic Learning Datasets and RT-X Models”, ICRA 2024.
[8] Tsung-Wei Ke, et al., “3D Diffuser Actor: Policy Diffusion with 3D Scene Representations”, CoRL 2024.
[9] Moritz Reuss et al., “Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals”, RSS 2024.
[10] Homanga Bharadhwaj, et al., “RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking”, ICRA 2024.
[11] Huy Ha, et al., “Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition”, CoRL 2023.
Questions
- For the ManiSkill2 experiment, are all policies OXE pre-trained or not? Please specify the experimental settings clearly.
- MDT [9] also explores a multimodal-conditioned diffusion transformer model. Please clarify the difference between MDT and the DiT policy.
- In the real-world experiments, why are all networks fine-tuned with one observation and one-step prediction (L375-376)? This is especially surprising for one-step prediction, since action chunking [8] (i.e., multi-step action prediction) has been shown to be an effective technique for robotic manipulation policies [6, 7, 8, 9, 10], and the trajectory-length ablation (L418-420) also supports this.
- For the real-world experiments, “it is a challenging setup with small object (L370-371)”. How small are the objects?
- For the real-world experiments, how many evaluation rollouts are conducted for zero-shot generalization and few-shot generalization, especially for the few-shot generalization experiments? What is the success metric: does the policy need to grasp successfully on the first attempt, or are multiple attempts allowed within one rollout?
This paper presents Diffusion Transformer Policy, which directly utilizes large transformers as a denoising network to denoise continuous actions conditioned on language instructions and image observations.
The proposed model, pre-trained on the Open X-Embodiment dataset, achieves better generalization performance.
The authors validate the effectiveness of diffusion modeling and the large transformer policy in two simulation environments.
Strengths
The writing and organization of this paper are good.
The simulation experiments on CALVIN and ManiSkill are solid.
Weaknesses
Conducting pre-training on the Open X-Embodiment dataset is a significant advantage of this paper. However, the real-world evaluation is too simple, covering only the picking skill, which is not sufficient to demonstrate the effectiveness of the method compared to OpenVLA and Octo. It would be better if the evaluation scenarios included additional skills.
The paper lacks parameter-scaling experiments for the causal transformer part, for example the impact of different parameter counts and amounts of pre-training data on the results.
Why is it necessary to perform action decoding in a causal transformer with shared parameters? What if we simply used a larger DiT to replace the diffusion action head on top of Octo (as in [1])? One advantage of doing so is that there is no need to pad the action vector. The paper lacks these experiments, and I want to know which of the two design paradigms is better; a rough sketch of the contrast is given below.
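For concreteness, a rough, runnable sketch of the two paradigms as I understand them; all module choices and dimensions here are illustrative assumptions, not either paper's actual code:

```python
# Illustrative contrast of the two paradigms in question; module choices and
# dimensions are assumptions for the sake of the sketch.
import torch
import torch.nn as nn

dim, act_dim, T = 256, 7, 16
lang, img = torch.randn(1, 8, dim), torch.randn(1, 16, dim)
noisy_act, t_tok = torch.randn(1, T, dim), torch.randn(1, 1, dim)
out_proj = nn.Linear(dim, act_dim)

# (A) This paper: one shared causal transformer sees language, image, timestep,
# and noisy-action tokens together and denoises the actions in-context.
shared = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 8, batch_first=True), 6)
tokens = torch.cat([lang, img, t_tok, noisy_act], dim=1)
mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
eps_a = out_proj(shared(tokens, mask=mask)[:, -T:])

# (B) Octo/RDT-style: an observation backbone produces conditioning tokens, and a
# separate DiT-like head denoises the actions by cross-attending to them.
backbone = nn.TransformerEncoder(nn.TransformerEncoderLayer(dim, 8, batch_first=True), 6)
cond = backbone(torch.cat([lang, img], dim=1))
dit_head = nn.TransformerDecoder(nn.TransformerDecoderLayer(dim, 8, batch_first=True), 6)
eps_b = out_proj(dit_head(noisy_act + t_tok, cond))
```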
From the perspective of contribution to the community, this paper does not deliver a powerful foundation model; at least the real-world experiments do not support this claim. On the other hand, the technique presented does not seem especially innovative; the original Diffusion Policy also provided a transformer variant.
[1] RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation.
Questions
Please see the weaknesses section.
This paper presents a different architecture / training paradigm for learning generalist robot policies from large-scale data: a Causal Transformer that denoises actions using a diffusion training objective. Unlike other robot VLAs, their approach uses the entire model to denoise actions, whereas prior work denoises with only a small MLP head. The authors present results in the ManiSkill2 and CALVIN simulation environments, as well as real-world results on a Franka arm after pre-training on the Open X-Embodiment dataset.
Strengths
- Extensive experimental results in several simulation environments and a real-world task
- Reasonable set of baselines for comparison
- Reports state-of-the-art performance on the CALVIN benchmark
- Clearly states training hyperparameters and experimental design to replicate their experiments
Weaknesses
- Several typos and sentence style issues throughout the paper
- The motivation for the architecture modification is not clearly stated and is not apparent from the introduction.
- Figure 2 is easy to follow, but visually poorly constructed.
- The action tokenization procedure is unclear. Are the continuous actions converted into discrete bins, as in some prior works? If not, what does “action tokenization” refer to in Section 3.1?
- Some architectural design choices are not motivated, e.g., what is the purpose of the Q-Former?
- If the issue is scalability, what happens if you increase the size of the MLP in the diffusion action head setup? A 3-layer MLP seems small, and making the MLP head larger could provide a fairer comparison.
- The improvement in Table 1 seems to come mainly from the SingleYCB case, while the improvements on other tasks are modest and the result is even worse on the PickCube task.
- The zero success rates for OpenVLA and Octo in Table 3 seem questionable.
Questions
- What is the point of using the Q-former?
- What happens if the diffusion action head is made larger? It is not exactly clear what the main benefit of using the full causal Transformer as the denoising model is if the two approaches have roughly the same parameter count.
We thank the reviewer for the comments.