What's the Move? Hybrid Imitation Learning via Salient Points
We propose an imitation learning algorithm for complex robot manipulation with visuospatial generalization; it substantially outperforms existing state-of-the-art methods across 4 real-world tasks and 2 simulated benchmarks.
Abstract
Reviews and Discussion
This paper presents SPHINX, a novel framework for imitation learning in robotics. SPHINX integrates a hybrid approach by leveraging multimodal inputs (point clouds and wrist camera images) and using both waypoint-based and dense action predictions. It introduces a method for switching between these modalities and action types depending on the task phase, which enables better spatial and visual generalization. The paper demonstrates how SPHINX can handle long-horizon tasks with high precision, achieving superior results compared to state-of-the-art baselines, with an average performance improvement of 41.1%.
Strengths
- Effective Multimodal Input Usage: The use of point clouds for spatial awareness in long-range movements and wrist-camera images for precision is well-conceived. By switching between input modalities at the appropriate phases of a task, the system exploits the strengths of each modality to improve both efficiency and accuracy.
- Empirical Performance: The paper presents strong experimental evidence, showing significant performance improvements across multiple real-world and simulated tasks. Specifically, it demonstrates SPHINX’s ability to generalize across novel configurations, camera viewpoints, and visual distractors, which is critical for practical deployment in real-world environments.
Weaknesses
- Similarity to SGRv2: I noticed that the waypoint policy in this paper is quite similar to the one proposed in SGRv2 [1]. However, the authors do not discuss this in the related work section, nor do they compare their method against SGRv2 in the experiments. Including a detailed comparison or discussion would help clarify how SPHINX improves upon or differs from existing approaches in waypoint-based policies.
[1] Zhang, Tong, et al. "Leveraging Locality to Boost Sample Efficiency in Robotic Manipulation." arXiv preprint arXiv:2406.10615 (2024).
- Insufficient Visual Generalization Experiments: The experiments on visual generalization are somewhat limited. It remains unclear whether SPHINX’s salient point-based design allows the policy to generalize to different objects within the same category (e.g., cups of varying colors or shapes). Additionally, it would be valuable to test SPHINX’s robustness to changes in environmental factors such as background, lighting, or table textures, which are common in real-world settings. Expanding the generalization tests to cover these variations would provide a more comprehensive evaluation of SPHINX’s capabilities.
- Limited Exploration of Dynamic Tasks: While SPHINX excels at quasistatic tasks, its performance on dynamic tasks (e.g., tasks requiring fast, reactive control) is not explored. The reliance on a linear controller for waypoints may limit the system’s applicability in environments requiring fast, non-linear movements.
Questions
For questions and concerns, please refer to the weaknesses section above. Overall, I believe this is a strong paper that effectively leverages the advantages of both waypoint and dense policies. The experimental results significantly outperform the next best state-of-the-art IL baseline, and the writing is clear, making the paper easy to follow and understand.
Thanks for your insightful review and positive feedback! We have incorporated a discussion of SGRv2 in our updated paper (Appendix C) following your suggestion.
Similarity to SGRv2: I noticed that the waypoint policy in this paper is quite similar to the one proposed in SGRv2 ...
Thanks for pointing out this missing related work. SGRv2 is a very recent work from CoRL 2024 and we were unaware of it at the time of writing. We can definitely extend the related work to cover it. Here we briefly list the differences between SGRv2 and our work, as well as some thoughts on how the two would compare in terms of performance:
- SGRv2 operates on a mixture of point cloud and image data. It predicts actions as per-point offsets for all points, and uses a weighted average to get the final action output.
- Although both works represent actions as offsets w.r.t. points, Sphinx utilizes “salient points” (learned from human labels) as the anchor of the action, and casts part of the action prediction, a typically regression-only problem, as classification + regression to make it more precise and easier to learn. The ablation in the right panel of Figure 4 shows that the classification loss is helpful: even having it only as an auxiliary task significantly improves the performance of the “vanilla waypoint” baseline.
- Due to the existence of salient points, Sphinx applies the offset loss only to the points proximal to the “demonstrator-specified salient point”, i.e. the red points in Figure 3 and Eq. (3); a minimal sketch of this loss follows the list below. This makes it easier for the neural network to learn, since it does not need to allocate capacity to predicting offsets for points far from the salient point. It also allows the network to have different position targets for different regions of the point cloud, e.g. the offsets near the top drawer correspond to actions for opening the top drawer while the offsets near the bottom drawer correspond to actions for opening the bottom drawer. In early development, we found that applying the offset loss to all points noticeably hurt performance, as it lowers the precision for the most critical points that are used during inference.
- On a minor note, Sphinx uses a GPT-2-style Transformer while SGRv2 uses PointNeXt as its backbone. This is an implementation detail and both methods should work well with either architecture choice, but we think it is interesting to see a generic architecture not designed for point clouds perform quite well on these tasks.
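For concreteness, the decomposed waypoint loss described above behaves roughly like the following sketch (illustrative tensor names and proximity radius, not our exact implementation):

```python
import torch
import torch.nn.functional as F

def waypoint_loss(sp_logits, offsets, points, sp_index, target_pos, radius=0.03):
    """Sketch of the salient-point waypoint loss (illustrative only).

    sp_logits:  (N,) per-point salient point classification logits
    offsets:    (N, 3) per-point predicted offsets to the waypoint position
    points:     (N, 3) xyz coordinates of the point cloud
    sp_index:   index of the demonstrator-labeled salient point
    target_pos: (3,) ground-truth waypoint position
    """
    # Classification over physical points: softmax competition among candidates.
    cls_loss = F.cross_entropy(sp_logits.unsqueeze(0), torch.tensor([sp_index]))
    # Offset regression only on points proximal to the labeled salient point,
    # so the network need not predict offsets for far-away points.
    dist = (points - points[sp_index]).norm(dim=-1)
    mask = dist < radius
    target_offsets = target_pos - points  # per-point offset targets
    offset_loss = F.mse_loss(offsets[mask], target_offsets[mask])
    return cls_loss + offset_loss
```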
We notice that SGRv2 also evaluates on the Square task from MimicGen, which is the same as the one in Robomimic. With 50 demonstrations, SGRv2 only achieves a 2.8% success rate, which is significantly lower than Sphinx (86.5%) and the popular diffusion policy baseline (44%). We have trained the diffusion policy using 50 demonstrations from the original Robomimic dataset and it achieves comparable performance (45%) to the one trained on our data, indicating that the data quality is similar. Therefore, judging from the published results, there would be a big gap between SGRv2 and Sphinx on Square.
Insufficient Visual Generalization Experiments
Although we demonstrate SPHINX across perturbed viewpoints and in the presence of distractor objects, we agree that we have yet to explore more drastic visual variations. One promising avenue for future work is leveraging recent foundation models like MOLMO [1] which have demonstrated impressive performance for zero-shot semantic keypoint detection (i.e. “point to the ___” in an image, as shown here). Instead of directly predicting salient points and offsets jointly, one possible extension is to first infer salient points via models like MOLMO, and condition the waypoint actions on these. We could still collect data for and train SPHINX largely in the same way. However, this would open up SPHINX’s visual generalization capabilities at test time, since these foundation models could generalize better for salient point detection on completely unseen object instances, on which we could then condition the policy.
[1] Deitke, Matt, et al. "Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models." arXiv preprint arXiv:2409.17146 (2024).
Limited Exploration of Dynamic Tasks
We acknowledge that in our current work SPHINX uses a linear controller for waypoint reaching, limiting its capacity for dynamic tasks. To relax this, future work could predict spline-based trajectories to reach a waypoint, or replace SPHINX’s linear controller with a learned [1] or engineered motion planner for dynamic, collision-aware reaching. SPHINX’s dense policy can in theory handle dynamic motions as long as the hardware permits, so the truly dynamic parts of a task (i.e. wiping, sweeping, higher-velocity actions) can also be learned by the dense policy in the future.
My main question was about the comparison between the waypoint policy and SGRv2. The author responded comprehensively, and I have no further questions.
The paper introduces SPHINX, which interleaves 3D saliency-based waypoint prediction with a fine-grained policy that uses wrist camera images. The proposed method, SPHINX, is compared against SOTA imitation learning methods for robotic manipulation like Diffusion Policy, 3D Diffusion Policy, HYDRA, and also a fine-tuned OpenVLA. The authors also compare SPHINX with ablated waypoint models without salient points. The experiments were done on 4 real-world tasks (cup stack, train track, coffee, open drawer) and two simulated tasks in the Robomimic environment.
Strengths
The proposed method seamlessly integrates multiple concepts in robotic manipulation, such as waypoints, affordances, and a coarse-to-fine, 2D+3D hybrid approach. These techniques were often explored separately in prior works. Leveraging these techniques, SPHINX effectively solves long-horizon tasks (due to the waypoint approach) and is robust to OoD settings (due to the 3D + saliency structure). The empirical results are strong enough to support these claims. SPHINX shows substantial improvements over previous SOTA methods. Overall, the paper is of high quality.
Weaknesses
The primary weakness of this paper lies in its lack of novelty. While the paper successfully integrates many concepts into one, each concept is not novel. For instance, guiding fine-grained manipulation policies with coarse waypoints is not a new approach. Similarly, using saliency prediction to assist manipulation has been widely addressed in prior works, often categorized as an affordance-based approach. It is well known that 3D-based policies are inherently more robust than 2D models in OoD scenarios such as the presence of distracting objects.
Additionally, there are several keypose-based methods (which are similar to waypoints) that use 3D point clouds, including PerAct, Act3D, 3D Diffuser-Actor, and RVT-1/2. It is unclear why the waypoints of SPHINX are better than those of these models.
Lastly, the saliency-based waypoint approach has clear limitations. While it is indeed a very useful inductive bias for many tasks, it may harm the performance in certain cases, particularly when the grippers are far away from any point cloud in the scene. This isn't true for more general methods like Diffusion Policy. Thus, it’s unsurprising that SPHINX outperforms Diffusion Policy, given its added structure and inductive bias. Furthermore, SPHINX relies on additional human-annotated keypoints. While collecting such data is not particularly difficult, this limits SPHINX's applicability to tasks in which the concept of a keypoint doesn't really make sense or is unidentifiable for humans.
Overall, while this paper presents a well-executed combination of existing ideas, it appears rather incremental. The experimental results are promising, making it a valuable contribution to a robotics-focused conference. However, the contribution may not be sufficient to recommend acceptance at an ML conference, where significant methodological novelty is typically expected.
Questions
- Is there any positive transfer between the dense policy part and the waypoint prediction part? Or are these just separate models with no synergy in terms of learning?
- What specific advantage does SPHINX’s waypoint approach offer? I would expect SOTA keypose models like PerAct, Act3D, 3D Diffuser-Actor, and RVT-1/2 to outperform the proposed waypoint transformer. These models are more structurally sophisticated and battle-tested. They also incorporate locality bias, which I believe would provide benefits similar to those of the paper's saliency-based approach. Adding displacement to saliency essentially assigns affordance to specific, localized areas. Furthermore, these models do not require human-annotated saliency, and output SE(3) keyposes, which would be much stronger than R^3 waypoints.
Re. Novelty:
The primary weakness of this paper lies in its lack of novelty. Guiding fine-grained manipulation policies with coarse waypoints is not a new approach. Similarly, using saliency prediction to assist manipulation has been widely addressed in prior works, often categorized as an affordance-based approach.
We appreciate your feedback, though we don’t agree with the argument about lack of novelty. Our novelty primarily lies in these two aspects:
Waypoint Action Decomposition
- We decompose waypoint action prediction (often a global regression task which can be difficult to learn precisely for far-away targets) into a salient point classification (a softmax over points in the point cloud) and an offset relative to the salient point (an easier, more local regression task); a minimal sketch follows this list. In comparison:
- PerAct [1] omits offsets and instead casts action prediction as a classification problem over voxels, and the precision of the prediction depends on the granularity of the voxel grid. Increasing this granularity not only significantly increases the training cost, but also makes the prediction task much harder as the number of action candidates grows cubically in the number of voxels.
- Act3D [2] performs coarse-to-fine scoring for “ghost points”, which addresses the granularity issue of PerAct, but still lacks any kind of explicit intermediate representation such as salient points.
- Affordance learning can refer to classifying the nature of interaction for certain object points (e.g., a tool handle is "graspable,” a button is "pressable") [3], or, more broadly, to action candidates or success possibilities associated with objects [4]. Salient points in SPHINX can be thought of as a form of per-point affordance, but must be combined with offsets to specify how the end-effector should interact with a given point. Critically, we demonstrate that the per-point affordances we learn in SPHINX allow the policy to focus on task-relevant features and avoid paying attention to arbitrary objects, leading to robust execution in the presence of visual distractors.
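At inference time, the two predictions combine as in the rough sketch below (hypothetical `policy` interface, for illustration only):

```python
import torch

def predict_waypoint_position(policy, point_cloud):
    """Sketch: salient point classification + local offset regression.

    point_cloud: (N, 6) colorized points (x, y, z, r, g, b).
    Assumes `policy` returns per-point logits (N,) and offsets (N, 3).
    """
    sp_logits, offsets = policy(point_cloud)
    # Classification over physically existing points in the cloud.
    sp_idx = torch.argmax(sp_logits)
    # The waypoint position is the chosen point plus a small local offset,
    # an easier regression target than the absolute position.
    return point_cloud[sp_idx, :3] + offsets[sp_idx]
```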
Hybrid Input AND Output Representations
- To our knowledge, SPHINX is the first IL agent to be able to interchange both its input/output representations on the fly depending on the task phase, which is crucial for success in precise, long-horizon tasks.
- Prior works such as Hydra have proposed using a hybrid action space of waypoint/dense actions for IL, but only consider a fixed input space of images. SPHINX performs 5.4 times better than Hydra on complex real-world tasks, demonstrating the benefits of using point cloud observations in combination with our novel waypoint action representation.
- Several prior works propose waypoint-only policies for manipulation with 3D representations as input; although such representations provide rich spatial information, they do not capture object details at a sufficient resolution necessary for accomplishing any one of the real world tasks we consider in this paper. For example, in the train track task (better viewed through the video from website https://sphinx-il.github.io/), the policy needs to identify the “jigsaw puzzle”-like structure on the bridge and track to properly connect them. Point clouds from consumer-grade depth cameras do not provide a sufficiently high resolution to capture such information and there is also the occlusion issue that prevents the 3rd person camera from seeing the gripper-object interaction clearly. Any policy solely based on point clouds will struggle with tasks that require close-up observation, motivating SPHINX’s dense policy.
Given these clarifications, we respectfully ask the reviewer to reevaluate our novelty contribution and kindly point out missing prior works if either one of the contributions mentioned above has been well-studied previously. We also plan to add these discussions to the paper to better emphasize our contribution.
[1] Shridhar, Mohit, Lucas Manuelli, and Dieter Fox. "Perceiver-actor: A multi-task transformer for robotic manipulation." Conference on Robot Learning. PMLR, 2023.
[2] Gervet, Theophile, et al. "Act3d: Infinite resolution action detection transformer for robotic manipulation." arXiv preprint arXiv:2306.17817 (2023).
[3] Borja-Diaz, Jessica, et al. "Affordance learning from play for sample-efficient policy learning." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.
[4] Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." arXiv preprint arXiv:2204.01691 (2022).
Re. OOD Generalization
It is well known that 3D-based policies are inherently more robust than 2D models in OoD scenarios such as the existence of distracting objects.
We completely agree that 3D-based policies inherently provide richer spatial information that can promote better spatial generalization. Several prior works, however, have shown that visual distractors can cause any visuomotor policy to degrade significantly in performance, whether the policy is image-based [5] or point-cloud based [6,7]. For instance, [6] trains a point-cloud and language-conditioned IL agent for tabletop manipulation which drops in performance from 80% to 47.5% success across 8 real-world tasks when faced with unseen clutter. Similarly, a point-cloud based BC agent in [7] drops in performance by 14% subject to distractors for a range of kitchen tasks. We posit that SPHINX’s visual generalization capabilities are thus not only enabled by point clouds, but crucially also by enforcing attention to salient points during training. This encourages the policy to learn to recognize task-relevant features, even in the presence of distractors.
[5] Burns, Kaylee, et al. "What Makes Pre-Trained Visual Representations Successful for Robust Manipulation?." arXiv preprint arXiv:2312.12444 (2023).
[6] Parashar, Priyam, et al. "SLAP: Spatial-Language Attention Policies." 7th Annual Conference on Robot Learning. 2023.
[7] Torne, Marcel, et al. "Reconciling reality through simulation: A real-to-sim-to-real approach for robust manipulation." CoRL 2024.
Re. Comparison to Keypose-Based Methods:
Additionally, there are several keypose-based methods (which are similar to waypoints) that use 3D point clouds, including PerAct, Act3D, 3D Diffuser-Actor, RVT-1/2. It is unclear why the waypoints of SPHINX are better than those of these models.
Thank you for pointing out these additional methods; we have since added discussion of them to an expanded related works section in Appendix C (changes highlighted in blue).
We have added a comparison with PerAct, also requested by Reviewer 4Uf5, and we report the following results on the RoboMimic Can/Square tasks.
| Method | RoboMimic Can | RoboMimic Square |
|---|---|---|
| PerAct | 0.77 | 0.12 |
| Vanilla Waypoint | 0.70 | 0.13 |
| Vanilla Waypoint + Aux SP | 0.78 | 0.655 |
| SPHINX | 0.935 | 0.865 |
For the simpler Can task, PerAct achieves a success rate of 77%, comparable to the Vanilla Waypoint baseline. However, the Square task, which requires higher precision for grasping the tool and aligning it carefully over the pole, presents a greater challenge. Here, PerAct achieves only a 12% success rate, likely due to the voxelized input space introducing imprecision from discretization. Even the Vanilla Waypoint approach, which uses point clouds and directly predicts waypoints, achieves a similar success rate of 13%. Incorporating salient point classification as an auxiliary task significantly improves performance to 65.5%. This large performance gap suggests the benefit of salient point prediction as an auxiliary task, which PerAct, Act3D, 3D Diffuser-Actor, and RVT-1/2 critically lack. Finally, SPHINX, which leverages both salient point detection and offset prediction, achieves a success rate of 86.5%, demonstrating the effectiveness of our waypoint parameterization in tasks requiring a high degree of precision.
We also note that while we empirically found our waypoint policy architecture to work well for a range of high-precision tasks, SPHINX is agnostic to the exact choice of waypoint policy implementation. In the future, we can take advantage of the most performant waypoint architectures and adapt them to handle salient points.
Questions
Is there any positive transfer between the dense policy part and the waypoint prediction part? Or are these just separate models with no synergy in terms of learning?
The waypoint and dense policies are separate models, each designed for distinct input modalities—point clouds for waypoint predictions and wrist camera images for dense actions. These input modalities inherently require different architectures, so we treat the policies as separate. Although the policies operate independently during inference, their synergy lies in the overall framework. Each policy addresses specific aspects of the task: waypoint policies handle long-range spatial movements, while dense policies manage precise, local manipulation. Each policy also predicts the next mode of execution, enabling seamless transitions that make progress by leveraging the strengths of both at the right moments.
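Schematically, the hybrid execution loop looks like the following sketch (hypothetical environment and policy interfaces, not our actual code):

```python
def run_sphinx_episode(env, waypoint_policy, dense_policy, max_steps=500):
    """Sketch of mode-switched execution: each policy also predicts
    the mode for the next step, so transitions are learned from data."""
    obs, mode = env.reset(), "waypoint"
    for _ in range(max_steps):
        if mode == "waypoint":
            # Point cloud input; coarse, long-range motion.
            waypoint, mode = waypoint_policy(obs["point_cloud"])
            obs = env.reach_waypoint(waypoint)  # linear controller
        else:
            # Wrist camera input; precise, closed-loop dense control.
            action, mode = dense_policy(obs["wrist_image"])
            obs = env.step(action)
        if env.done():
            break
```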
Questions (cont.)
What specific advantage does SPHINX’s waypoint approach offer? I would expect SOTA keypose models like PerAct, Act3D, 3D Diffuser-Actor, and RVT-1/2 to outperform the proposed waypoint transformer.
PerAct, Act3D, 3DDA (3D Diffuser-Actor), and RVT-1/2 directly output waypoint actions with no intermediate representation; 3D Diffuser-Actor exploits a locality bias, but still does not leverage any explicit representation like salient points. Instead, SPHINX breaks the waypoint prediction task into two steps: salient point classification and offset regression. This design takes advantage of the following:
- Neural networks are often more effective at classification than regression.
- Salient points encourage the policy to recognize task-relevant features of the point cloud, even in the presence of visual distractors.
- Salient point classification as an auxiliary task improves waypoint prediction quality, leading to significantly higher task success rates; this is demonstrated by the substantially higher performance of the Vanilla Waypoint + Aux SP approach over the Vanilla Waypoint baseline, which omits salient points and offsets, in RoboMimic.
We have since clarified the distinction between these approaches and SPHINX in Appendix C (Extended Related Work).
How can SPHINX be adapted to avoid human-annotated saliency?
Although our data collection interface makes salient points easy to specify on the fly (see the “SPHINX Data Collection Interface” section on website https://sphinx-il.github.io/), we agree that avoiding manual annotation would be preferable. In the future, we are excited to leverage foundation models to potentially automate salient point annotation. For example, recent models such as MOLMO [8] demonstrate strong performance in zero-shot semantic keypoint detection (i.e. “point to the ___” in an image, as shown here). This could be used to infer salient points automatically, and only have the demonstrator specify the corresponding waypoint action (offset + orientation + gripper state).
[8] Deitke, Matt, et al. "Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models." arXiv preprint arXiv:2409.17146 (2024).
Furthermore, these models (PerAct, Act3D, 3D Diffuser-Actor, and RVT-1/2) do not require human-annotated saliency, and output SE(3) keyposes, which would be much stronger than R^3 waypoints.
We would like to clarify that SPHINX does output full 6DoF end-effector poses just as these methods do, but we simply parameterize the positional component of waypoint actions via salient points and offsets. Apologies for the confusion – though we used the term “waypoint” for consistency with Hydra, we can instead refer to these as “keyframes,” etc. in a camera-ready version to convey they capture the 6D pose.
These models are more structurally sophisticated and battle-tested
We’d like to better understand which aspects of these works you are referring to as more structurally sophisticated and battle-tested. If the latter refers to rigor of evaluation, we make a substantial effort to go beyond the quasistatic tabletop settings explored in PerAct, Act3D, 3DDA, and RVT-1/2 and demonstrate SPHINX’s impressive real-world performance on challenging tasks that demand higher precision, reactivity, and longer-horizon reasoning, such as coffee-making and train track assembly (best viewed here).
I appreciate your clarifications and the additional experiments. I have raised my scores as the following major concerns have been addressed:
- Benefit of hybrid approach over existing keypose-based approaches. The authors conducted several additional experiments, including a comparison with PerAct and an ablation study on a vanilla waypoint model without saliency prediction. These experiments demonstrate the benefits of the saliency-based approach. Furthermore, the authors added an extended related works section in Appendix C. This clarifies the motivation behind SPHINX over existing methods. Lastly, the authors made a fair argument that classification is easier than regression, postulating that SPHINX's generalization capability might not be solely attributable to the 3D representation but also to saliency modeling.
- Lack of Novelty. I agree with the authors that switching between a fine-grained image-based policy and coarse-grained keypose/waypoint models in a real-time closed loop is underexplored. I had underestimated this novelty. It provides a more natural solution to precision tasks than the multiresolution modeling found in recent keypose-based approaches.
- Limitation of saliency-based waypoint approach. This concern was mainly due to my misunderstanding that the waypoints were 3D points rather than 6-DoF (or 6+action_dim) poses. The authors have promised to address this potential confusion by using the term "keyframes" instead of "waypoints":
... though we used the term “waypoint” for consistency with Hydra, we can instead refer to these as “keyframes,” etc. in a camera-ready version to convey they capture the 6D pose.
I would, however, suggest using "7-DoF waypoint" or "(6+1)-DoF waypoint," as "keyframe" could also be confused as a video frame. That said, the decision is entirely up to the authors.
[Remaining Concerns]: I am still not fully convinced of the necessity of explicit saliency modeling. The authors claim that explicitly modeling saliency in the (6+1)-DoF waypoint prediction provides three benefits over existing keypose prediction/generation methods:
- Classification is easier than regression.
- Saliency points encourage recognition of task-relevant local points, improving robustness to visual distractors.
- Saliency prediction serves as an auxiliary learning objective.
While I agree with these claims, I am still not sure whether explicit saliency modeling is truly necessary. The authors benchmarked PerAct and attributed its low success rate to its low spatial resolution. However, this has already been addressed in more recent works like RVT-2, Act3D, and 3D Diffuser-Actor, which leverage multiresolution approaches and/or image-3D hybrid modeling. It remains unclear whether saliency modeling would still provide significant benefits over these more advanced models, in which spatial precision is not a major concern.
In addition, unlike PerAct, these recent models are more sensitive to locality. This can also "encourage the policy to recognize task-relevant features of the point cloud," achieving similar effects to what explicit saliency modeling aims to accomplish, but in a much more flexible way. Given that these SOTA keypose models are far more competitive than PerAct, there remains a strong possibility that they could outperform the simple waypoint transformer in SPHINX, even without explicit saliency modeling.
[Conclusion]: I am more inclined to accept the paper after the rebuttal, as several important concerns have been addressed by the authors. Accordingly, I am revising my recommendation. That said, the concept of switching between waypoint policy and dense policy seems orthogonal to saliency modeling, and I am uncertain about the necessity of the latter when more advanced keypose methods are used for waypoint inference. The authors conducted additional experiments with PerAct, but PerAct is significantly behind recent SOTA methods.
This paper builds on the HYDRA approach by splitting motion into two modes: waypoint-reaching and dense manipulation. The key improvements are: (1) using point clouds instead of RGB images for more robust waypoint localization, processed by a separate PointNet, (2) upgrading the dense manipulation module from a simple MLP to a diffusion policy for smoother control, and (3) introducing "salient points" that explicitly signal when to switch between modes.
The method is evaluated on three challenging, long-horizon real-world tasks, with results showing clear performance gains over baseline approaches. The paper is clearly written, and the experiments are thoughtful, including interesting tests on out-of-distribution scenarios.
Strengths
- Enhanced Localization with Point Clouds: The use of point clouds bolsters waypoint localization, offering a more resilient approach than RGB data alone.
- Improved Precision in Dense Manipulation: Replacing the MLP with a diffusion policy for dense manipulation leads to greater accuracy during complex manipulation phases.
- Streamlined Mode Switching: By introducing salient points, the authors provide a clear and novel mechanism for mode switching, simplifying task segmentation and improving execution flow.
- Comprehensive Real-World Tests: The real-world experiments are particularly strong, presenting complex tasks that highlight the robustness and versatility of the approach.
Weaknesses
- Reliance on Salient Points: The approach depends on expert-determined salient points, which may limit scalability, especially if multiple annotators are involved, as this could introduce inconsistencies in training data.
- Lacks Details and Analysis in Experiments: The baseline results are too low to tell whether the baselines were correctly run and fairly compared. It would be better to include some simpler tasks on which the baselines do not perform that badly, or to provide a failure analysis of how they fail.
Questions
- Real-World Experiment Details: Could the authors share more on how they measure success rates? Do you count partial progress as OpenVLA does, or does the metric only reflect the final task outcome? A failure analysis would also be insightful.
- Point Cloud Format: What specific format of point cloud data does the waypoint transformer use? Are both RGB and XYZ coordinates involved, or only XYZ? I am curious how the salient point prediction can distinguish the pink and yellow cups.
- Handling Distant Salient Points: When a salient point isn’t near a physical point—like opening an oven or a door—how does the method handle this? If there’s a substantial distance to any nearby object, does the model have to use the dense manipulation mode?
- OpenVLA Training Clarifications: The performance of OpenVLA seems unexpectedly low in some tasks, like cup stacking. Could the authors clarify their training setup for OpenVLA? For example, was image augmentation applied, and was it trained on a single or multiple tasks?
- HYDRA Performance Insights: It would be interesting to understand why HYDRA, despite using a similar waypoint and dense manipulation modes, performs way less effectively. Does it struggle more in the waypoint phase or the dense manipulation phase?
- Camera Viewpoint Calibration In Novel Viewpoint: Were novel camera viewpoints recalibrated during inference? Additionally, are the training and inference data in the same coordinate system, like the robot’s base?
- Waypoint Transformer: It’s interesting that the waypoint transformer model can effectively handle unordered points with relatively small data. Could you provide more details on the network structure and training? What's more, were any data augmentations applied?
HYDRA Performance Insights: It would be interesting to understand why HYDRA, despite using a similar waypoint and dense manipulation modes, performs way less effectively. Does it struggle more in the waypoint phase or the dense manipulation phase?
As discussed above, the waypoint head in Hydra is similar to the vanilla waypoint in our ablation: both are trained to directly regress a target pose without considering the salient point. The vanilla waypoint struggles in the Square task, suggesting that the waypoint-dense decomposition alone is not helpful compared to modern imitation algorithms like diffusion policy. This highlights the value of our novel action representation (salient point classification + offset regression). Hydra struggles a lot in the waypoint phase as its waypoint predictions are much worse than Sphinx’s. The error of Hydra’s waypoint predictions is often a few centimeters while the error of Sphinx’s waypoint predictions is less than 1 cm.
Real-World Experiment Details: Could the authors share more on how they measure success rates? Do you count partial progress as OpenVLA does, or does the metric only reflect the final task outcome? A failure analysis would also be insightful.
We measure success rate as follows: if the robot successfully finishes all manipulation steps for a task, the score is 1; otherwise it is 0. We do not follow OpenVLA’s way of computing scores via partial progress. We have video recordings for all the evaluations, best viewed at https://sphinx-il.github.io/ (you can click on any bar in the “Hybrid Task Results” section to view rollouts for any policy). Different methods fail for all sorts of reasons: failing to pick up the object, failing to navigate to the correct location, etc.
Point Cloud Format: What specific format of point cloud data does the waypoint transformer use? Are both RGB and XYZ coordinates involved, or only XYZ? I am curious how the salient point prediction can distinguish the pink and yellow cups.
As mentioned in the paper, we use colorized point clouds, i.e. each point is represented by (x, y, z, r, g, b), a tuple of 6 floating-point values.
Camera Viewpoint Calibration In Novel Viewpoint: Were novel camera viewpoints recalibrated during inference? Additionally, are the training and inference data in the same coordinate system, like the robot’s base?
Yes, the novel camera viewpoints are calibrated. Despite calibration, the novel camera view leads to quite different images (Figure 7, middle panel), as the robot body is mostly out of view for the training camera while occupying half of the image (and thus accounting for a significant portion of the point cloud input) for the novel inference camera. Yes, the coordinate system is the same throughout our paper.
Waypoint Transformer: It’s interesting that the waypoint transformer model can effectively handle unordered points with relatively small data. Could you provide more details on the network structure and training? What's more, were any data augmentations applied?
Yes, it is indeed interesting that a GPT-2-style transformer with no point-cloud-specific structure can handle points. During early development we also found it worked better than PointNet++. The network structure is similar to GPT-2, using nanoGPT (https://github.com/karpathy/nanoGPT) as the reference implementation. As described in Appendix B1, the waypoint transformer has 6 transformer blocks, has no positional embedding for the points, and uses dropout=0.1 for all blocks. We train it with a cosine learning rate schedule and return the exponential running average as the final policy.
As described in the last paragraph of Section 4.2 (right above Section 4.3 on page 6), we use “temporal data augmentation”. At time step t, to reach a waypoint w_{t+k}, we use a controller that moves the arm over the interval [t, t+1, …, t+k]. Observations at intermediate steps share the same waypoint label w_{t+k}, allowing us to use these steps for data augmentation. In all experiments, we use the first 20% of interpolated steps (t, …, t + 0.2k) as additional training data for the waypoint policy. For example, in the Square task, this technique augments the ~300 waypoint data points into a dataset of approximately 1,800 data points. On top of the temporal data augmentation, we apply random translations to the input and target, a typical data augmentation technique for point cloud based neural networks. We have since added these clarifications to Appendix B (Implementation Details of SPHINX).
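In rough pseudocode, the temporal augmentation works as follows (assumed data layout; the 20% fraction matches the paper):

```python
def temporal_augment(segments, frac=0.2):
    """Sketch: intermediate steps en route to a waypoint share its label.

    segments: list of (obs_seq, waypoint) pairs, where obs_seq holds the k
    interpolated observations recorded while the controller moves to w_{t+k}.
    """
    augmented = []
    for obs_seq, waypoint in segments:
        k = len(obs_seq)
        # Keep the first 20% of interpolated steps as extra training pairs,
        # all labeled with the same target waypoint.
        for obs in obs_seq[: max(1, int(frac * k))]:
            augmented.append((obs, waypoint))
    return augmented
```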
We would like to thank you again for your constructive feedback. As the review period is approaching its end, we would like to know if there are any outstanding questions that we have not clarified.
As a recap, our rebuttal focuses on:
- Clarifying that the salient point must be a physical point that exists in the point cloud, i.e. it handles far-away objects with no problem.
- We have added additional details on how we ensure our baseline implementations are valid, and have commented on why we think the low performance of some baselines is reasonable given either ablations from our paper or evidence from other recent papers. We have video recordings for all the evaluations, best viewed at https://sphinx-il.github.io/ (you can click on any bar in the “Hybrid Task Results” section to view rollouts for any policy).
- Other experimental details, such as the absence of partial scores and the use of colorized point cloud input.
Thanks to your suggestions, this new information should increase the confidence and significance of our experimental results.
Thanks for your insightful feedback! We are happy to add more details and clarify some misunderstandings. We address all of your concerns below, but have changed the order of some questions to make the explanation more coherent.
Handling Distant Salient Points: When a salient point isn’t near a physical point—like opening an oven or a door—how does the method handle this? If there’s a substantial distance to any nearby object, does the model have to use the dense manipulation mode?
It seems that there is a misunderstanding here which we will address in the revised version through a clearer explanation of salient points versus waypoints. In our design, the salient point must be a physical point that exists in the point cloud. For example, when opening an oven/door, the demonstrator clicks on the handle to specify the salient point. The eventual effector pose can be anywhere, like 3 cm or 10 cm away from the salient point along a certain direction. We record the waypoint action as a combination of salient points (a point in the point cloud that physically exists) and offset w.r.t. that point. It is a core part of Sphinx that the salient points are physical points in the point cloud and that is what allows us to cast part of the action prediction problem into classification over a set of points. Classification is generally more robust as it encourages competition among the candidates, and it also makes residual offset prediction simpler. It is one of the main insights that motivates the Sphinx method.
Reliance on Salient Points
It is true that Sphinx relies on the annotated salient point for training. But the salient point annotation is organically integrated into the data collection process (better illustrated through videos in the “SPHINX Data Collection Interface” section on the website https://sphinx-il.github.io/). With our GUI, it is not necessarily an overhead to add the salient point. Instead, it may simplify the data collection process compared to the very common “dense” teleoperation interfaces like a spacemouse or VR controller that require constant attention to control the robot precisely. For example, opening a drawer is just 3 mouse clicks, each followed by some click-and-drag to adjust the offset and angle, which is considerably simpler than commanding the robot using a VR controller.
Lacks Details and Analysis in Experiments: The baseline results are too low to tell whether the baselines were correctly run and fairly compared.
We apologize if this wasn’t clear in our paper. We have added more discussion on the performance of the baselines in Appendix D. Here we provide a brief comment on the performance of each baseline used in the paper.
Diffusion policy (DP): Our DP implementation is able to reproduce the reported results from the original paper, achieving a 97% success rate on Can and 92% on Square with 200 demonstrations from the original Robomimic dataset. It uses the same set of cameras (3rd person and wrist) as Sphinx. In the Sphinx experiments we collected 50 demonstrations for the Square task, and we have verified that the DP trained on the Sphinx dataset (44%, 50 demonstrations) performs similarly to the one trained on 50 demonstrations from the Robomimic dataset (45%, 50 demonstrations). So we are confident that this baseline works as expected.
3D diffusion policy: Our implementation closely follows the one from the original codebase. In the original paper, 3D diffusion policy was not evaluated on the Robomimic benchmark. We find it to perform similarly to DP on the real-world Drawer task but slightly worse than DP on the Robomimic benchmarks and on other more complicated real-world tasks. We believe this is reasonable given that 3D diffusion policy conditions purely on point clouds constructed from 3rd-person cameras and does not use wrist camera images. The close-up information from the wrist camera images is crucial for the Robomimic tasks as well as our real-world tasks. In comparison, Sphinx switches to a wrist-view image based diffusion policy for fine manipulation, resolving the lack-of-detail issue of point cloud inputs.
Hydra: The low performance of Hydra seems unreasonable at first glance. However, this can be explained via a close look at their original results and our ablations. Our coffee-making task is similar to the one in Hydra. However, the original Hydra paper collected 100 demonstrations with no variation on the location of the coffee machine, and put the cup and coffee pod on a shelf to make them easier to pick up. In our case, we randomize the initial location and orientation of the cup, pod and coffee machine, and only use 60 demonstrations. Therefore, it is reasonable to expect a much lower performance for Hydra on this task. Hydra’s waypoint branch is similar to our vanilla waypoint, i.e. directly predicting target pose via regression, but uses only images instead of point clouds. From the ablation (Figure 4, right panel), we see that this vanilla waypoint policy is noticeably worse than plain diffusion policy on the hard task (Square) that requires precision. Therefore, considering that Hydra’s waypoint policy is worse than diffusion policy, it is not surprising to see that the full Hydra policy (vanilla waypoint + diffusion policy) performs worse than diffusion policy in the real world tasks.
Re. OpenVLA specifically:
The performance of OpenVLA seems unexpectedly low in some tasks, like cup stacking. Could the authors clarify their training setup for OpenVLA? For example, was image augmentation applied, and was it trained on a single or multiple tasks?
OpenVLA: We follow the instructions and code provided by the OpenVLA authors and apply everything as in their fine-tuning script (which does use image augmentation). We believe the poor performance of OpenVLA is not surprising for a few reasons. From Figure 5 (left half) of the OpenVLA paper, we see that fine-tuned OpenVLA is worse than Diffusion Policy in 2 out of the 3 cases in the single-task setting on Franka Panda hardware. In general, OpenVLA’s distribution of pre-training data relies heavily on the Bridge Dataset, which has a major embodiment gap compared to our Franka dataset. OpenVLA is only compatible with a single 3rd-person view, and cannot leverage a close-up wrist image view. On all of the real-world tasks we ran, the wrist camera view is crucial for sensing end-effector-object alignment for precise manipulation. For example, for cup stack (the easiest of the hybrid tasks), even being off by a centimeter can lead to knocking over one or both of the cups. Thus, OpenVLA’s poor performance is unsurprising given its limited observation space. Other recent works report similar trends for OpenVLA. For instance, the recently released pi policy [1] (see Figure 7 from https://www.physicalintelligence.company/download/pi0.pdf), with some co-authors from the OpenVLA paper, shows that a fine-tuned OpenVLA performs poorly: 0% success rate in 4 out of 5 settings and ~35% in 1 setting.
[1] pi0: A Vision-Language-Action Flow Model for General Robot Control [Pre-print]
This paper introduces an imitation learning framework, SPHINX (Salient Point-Based Hybrid ImitatioN and eXecution), designed to improve the generalization and efficiency of robotic control in complex manipulation tasks. Traditional imitation learning approaches often struggle with diverse visual and spatial variations, especially in tasks requiring long-term planning and precision. SPHINX addresses these challenges by utilizing a hybrid approach that combines two modes of execution: waypoints for large, coarse movements and dense actions for precise, close-up interactions. They show extensive empirical results that demonstrate SPHINX's effectiveness across both real-world and simulated tasks, showing a substantial performance improvement over existing IL baselines.
Strengths
- This paper proposes a two-stage framework. The first stage leverages salient points for generating long-range movements, specifically as waypoints, followed by interpolation for motion planning and control. The second stage employs an image-based action policy similar to a diffusion policy, where actions are based on EE poses. This framework is reasonable for multi-stage manipulation tasks that require contact.
- The paper is relatively complete, covering the proposed method, dataset collection, simulations, and real-world experiments. The baselines are adequately diverse, and the real-world experimental demonstrations are effectively presented.
- The writing in this paper is clear, with accurate illustrations that aid understanding of the proposed framework. Explanations of the data collection module and experiments are well-supported with corresponding videos.
Weaknesses
- The novelty of the paper could be further clarified. Overall, applying a hierarchical approach to improve generalization and performance in IL is reasonable. This is novel within IL (this is only my view; if other reviewers can provide references, I will defer to them). However, using a two-stage approach (i.e., long-range navigation + fine-grained manipulation) to handle long-term manipulation tasks is a common strategy, and I suggest the authors add a discussion on this.
- There are some concerns regarding the visual module. The first stage’s visual module appears to use a transformer for salient point extraction, which resembles affordance learning. It would be beneficial to include a discussion on affordance learning. Additionally, the paper seems to use a limited training set for salient point training, such as 30 samples for cup stacking. Without pre-training, I am skeptical that these datasets can achieve accurate and generalizable visual perception.
- Concerns related to the robotics aspect. The paper does not seem to mention object segmentation or part-level pose estimation, which also appear to be integrated within the waypoint transformer. It is unclear if this approach is effective or if additional processing was performed in the real-world experiments.
Questions
- There seems to be a very basic error in the paper: I believe the EE pose should be 6 DoF, but the paper frequently mentions 7 DoF. Is this due to the use of quaternions for rotation, or does it include the EE’s open/close state as an extra DoF?
- Generalization problem. Does the generalization reported in the paper include generalization across objects of different appearances and shapes, or is it limited to generalization over the initial pose?
- Mode switching and subtask decomposition. The paper does not seem to address subtask decomposition for long-term tasks, appearing to rely on mode switching. For the two proposed modes, how does the model ensure continuity across temporal sub-steps, and how does it detect the salient point corresponding to the next subtask (distinct from the previous step)?
Overall, this paper is in good shape, with a reasonable method, sufficient experiments, and clear implementation details. If the above concerns and questions can be addressed, I would raise the score.
Thank you for your thoughtful review, as it truly helps us to improve the paper. Hopefully we can address all of your concerns.
The novelty of the paper could be further clarified.
Our novelty primarily lies in these two aspects:
- The novelty of our waypoint policy lies in decomposing action prediction into salient point classification (softmax over points in the point cloud) and offset regression (predicting a local offset relative to the salient point).
  - The salient point classification encourages the policy to focus on task-relevant objects and avoid paying attention to visual distractors.
  - Classifying salient points in a point cloud makes the learning objective easier. Classification is generally an easier task for neural networks to learn compared to regression. Additionally, we now only need to regress a small offset within a local neighborhood to recover the positional component of the waypoint, which is easier for objectives like MSE compared to directly trying to regress potentially far-away targets. PerAct [1] also casts waypoint prediction as classification, but over voxels, which is notably harder: the precision of the prediction depends on the granularity of the voxel grid, and the number of action candidates grows cubically in the number of voxels.
  - The benefit of this action decomposition for the waypoint policy is clearly demonstrated in ablations (Figure 4, right panel), where we show that SPHINX performs significantly better than the vanilla way of predicting waypoints through regression (vanilla waypoint), even when the baseline has the same amount of supervision signal (vanilla waypoint + aux prediction).
- To our knowledge, SPHINX is the first IL agent to be able to interchange both its input/output representations on the fly depending on the task phase, which is crucial for success in precise, long-horizon tasks.
  - Prior works such as Hydra have proposed using a hybrid action space of waypoint/dense actions for IL, but only consider a fixed input space of images. SPHINX performs better than Hydra on complex real-world tasks, demonstrating the benefits of using point cloud observations in combination with our novel waypoint action representation.
  - Several prior works propose waypoint-only policies for manipulation with 3D representations as input; although such representations provide rich spatial information, they do not capture close-up details at a sufficient resolution necessary for accomplishing any one of the real world tasks we consider in this paper. For example, in the train track task (better viewed through the video from website https://sphinx-il.github.io/), the policy needs to identify the “jigsaw puzzle”-like structure on the bridge and track to properly connect them. Point clouds from consumer-grade depth cameras do not provide a sufficiently high resolution to capture such information and there is also the occlusion issue that prevents the 3rd person camera from seeing the gripper-object interaction clearly. Any policy solely based on point clouds will struggle with tasks that require close-up observation, motivating SPHINX’s dense policy.
We also plan to further clarify these two points in the introduction section in a camera-ready version if accepted.
[1] Shridhar, Mohit, Lucas Manuelli, and Dieter Fox. "Perceiver-actor: A multi-task transformer for robotic manipulation." Conference on Robot Learning. PMLR, 2023.
The first stage’s visual module appears to use a transformer for salient point extraction, which resembles affordance learning. It would be beneficial to include a discussion on affordance learning.
The connection between "affordances" and "salient points" is a valid question. In robotics, "affordances" can characterize the nature of interaction for certain object points (e.g., a tool handle is "graspable,” a button is "pressable") [2], or, more broadly, action candidates or success possibilities associated with object points [3]. Salient points in SPHINX can be seen as a form of per-point affordance, but must be combined with offsets to specify how the end-effector should interact with a given point.
For instance, in the coffee-making task, after picking up the mug, the next salient point might lie on the center of the platform of the machine. This provides a meaningful spatial prior over where the policy should interact to place the mug, but does not specify how the robot should orient or position its arm to do so (which a waypoint action specifies). Thus we learn to predict waypoints where the positional component is parameterized as an offset to a salient point.
[2] Borja-Diaz, Jessica, et al. "Affordance learning from play for sample-efficient policy learning." 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022.
[3] Ahn, Michael, et al. "Do as i can, not as i say: Grounding language in robotic affordances." CoRL 2022.
Additionally, the paper seems to use a limited training set for salient point training, such as 30 samples for cup stacking. Without pre-training, I am skeptical that these datasets can achieve accurate and generalizable visual perception.
Good question. We actually leverage much more than 30 samples via data augmentation techniques, clarified here:
- Raw Dataset Format: Each demonstration consists of multiple waypoint actions, each associated with a particular salient point. For example, to collect a single demonstration for the simulated Square task, the demonstrator first specifies an initial waypoint above the square tool handle, a second waypoint to grasp the tool, a third waypoint to move it above the pole, etc. Thus, a dataset of 50 Square demonstrations yields ~300 paired data points of the form (observation, waypoint, salient point) with which to train the policy.
- Temporal Augmentation (discussed in Section 4.2, page 6): At time step t, to reach a waypoint w_{t+k}, we use a controller that moves the arm over the interval [t, t+1, …, t+k]. Observations at intermediate steps share the same waypoint label w_{t+k}, allowing us to use these steps for data augmentation. In all experiments, we use the first 20% of interpolated steps (t, …, t + 0.2k) as additional training data for the waypoint policy. For example, in the Square task, this technique augments the ~300 waypoint data points into a dataset of approximately 1,800 data points. The effect of this temporal augmentation is shown in Table 2 of Appendix B.
- Positional Augmentation: We also apply random translations to the point cloud input before feeding it to the network, a common augmentation technique for point clouds (a sketch follows below). The preprocessing step of FPS (furthest point sampling) can also be seen as a minor form of data augmentation, as the network sees different subsets of the points in different epochs.
We have further added the above clarifications to Appendix B.1 (changes highlighted in blue).
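The positional augmentation mentioned in the last bullet amounts to roughly the following sketch (the translation range is illustrative, not our exact setting):

```python
import numpy as np

def translate_augment(points, target, max_shift=0.05):
    """Apply the same random translation to the xyz channels of the
    colorized point cloud and to the target waypoint position."""
    shift = np.random.uniform(-max_shift, max_shift, size=3)
    points = points.copy()
    points[:, :3] += shift  # leave the rgb channels untouched
    return points, target + shift
```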
Concerns related to the robotics aspect. The paper does not seem to mention object segmentation or part-level pose estimation, which also appear to be integrated within the waypoint transformer. It is unclear if this approach is effective or if additional processing was performed in the real-world experiments.
There is no object segmentation nor part-level pose estimation in this paper. We simply crop out points that are not part of the robot's workspace, i.e. we only keep points whose x and y coordinates fall within the workspace bounds and whose z coordinate is greater than 0 m.
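In code, this cropping is roughly the following (the bounds shown are illustrative placeholders, not our actual workspace limits):

```python
import numpy as np

def crop_workspace(points, x_lim=(0.1, 0.9), y_lim=(-0.5, 0.5), z_min=0.0):
    """Keep only points inside the robot's workspace (illustrative bounds)."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    mask = (
        (x > x_lim[0]) & (x < x_lim[1])
        & (y > y_lim[0]) & (y < y_lim[1])
        & (z > z_min)
    )
    return points[mask]
```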
I believe the EE pose should be 6 DoF, but the paper frequently mentions 7 DoF. Is this due to the use of quaternions for rotation, or does it include the EE’s open/close state as an extra DoF?
We count the gripper open/close action dimension as an additional DoF for simplicity.
Generalization problem. Does the generalization reported in the paper include generalization across objects of different appearances and shapes, or is it limited to generalization over the initial pose?
We assume that the reviewer refers to generalization over the initial pose of the robot end-effector. We did not randomize initial end-effector poses, but we did test generalization over unseen object poses, new camera poses, and visual distractors, all of which are significantly harder since the robot needs to generalize across vastly different spatial arrangements even if starting from the same end-effector pose.
We have not tested generalization over different objects yet, as all of the models, including baselines, are not trained to do so, but we fully agree that this is an exciting avenue for future work. In particular, we are excited to potentially offload salient point annotation to LLM/VLMs, which could allow SPHINX to generalize at the object instance or category level.
Mode switching and subtask decomposition. The paper does not seem to address subtask decomposition for long-term tasks, appearing to rely on mode switching. For the two proposed modes, how does the model ensure continuity across temporal sub-steps, and how does it detect the salient point corresponding to the next subtask (distinct from the previous step)?
During data collection, the demonstrator can seamlessly switch between teleoperating waypoint actions using the web UI and providing dense actions using the Spacemouse. For each timestep, we detect whether movement is sensed on the computer mouse or the Spacemouse, allowing us to obtain mode labels for each timestep automatically. From these mode labels, the model learns when to switch modes: every time the waypoint or dense policy makes a prediction, it also predicts what the mode should be for the next time step, and it learns this reliably from the data. The dense policy sometimes switches to the waypoint mode a bit early or late, but the waypoint policy is able to generalize over different starting conditions thanks to the temporal data augmentation mentioned earlier.
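A minimal sketch of this automatic labeling, with hypothetical input-polling values standing in for the real interface:

```python
def label_mode(spacemouse_delta, web_ui_waypoint):
    """Assign a per-timestep mode label based on which device produced input."""
    if spacemouse_delta is not None and any(abs(d) > 1e-6 for d in spacemouse_delta):
        return "dense"      # fine-grained Spacemouse teleoperation
    if web_ui_waypoint is not None:
        return "waypoint"   # a waypoint was specified via the web UI
    return None             # no new demonstrator input at this timestep
```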
Thanks for the explanation. I think most of my concerns are addressed (e.g., visual perception training, generalizability), but some remain tentative. After reading these explanations and other reviews, I will slightly increase the score. Lastly, I encourage the authors to use 6-DoF or (6+1)-DoF, as "7-DoF EE" is less common in robotics.
In this work, the authors consider the task of imitation learning in the context of tabletop robot manipulation tasks. Specifically, they investigate the well-known tension between the need for both long-horizon and short-horizon control: e.g. non-dexterous freespace motion for a large fraction of a trajectory, followed by a short dexterous stretch. They articulate the hierarchical nature of many manipulation tasks, and propose to address the hierarchy with a hybrid policy which can switch between two modes: a high-level waypoint prediction which is grounded in a global observation point cloud, and a low-level fine-grained control which operates on wrist camera inputs. They describe a policy architecture and training procedure to fit this paradigm, as well as a data collection/annotation method for generating datasets with required annotations suitable for training. They then demonstrate the efficacy of their policy on 3 custom real-world manipulation tasks.
Strengths
Originality
- While hierarchical approaches have been proposed, for instance in HACMan, NDF, TAX-Pose, etc., and explicit mode-switching between policies has been explored, I haven't really seen this sort of mode switching in imitation learning for long-ish horizon manipulation.
- Giving the policy a mechanism to change its inputs is a neat design consideration
- First time seeing diffusion policy in a hierarchical context.
Quality
- The real-world experiments are well-designed, and demonstrate good performance on a relatively small number of demos
- The idea is straightforward and a reasonable solution to this class of quasistatic manipulation problems.
- Decent ablation coverage.
- The design of the visual generalization experiments was interesting.
Clarity
- Very clear and well-written.
Significance
- Reasonable factorization of the problem, so it might be useful for practitioners looking to solve this class of problems expediently who don't need to generalize outside of quasi-static pick-and-place tasks.
Weaknesses
- The model’s hypothesis class (e.g. mode switching between long-distance free space motion, and short horizon fine-grained manipulation) imposes a major constraint on the kinds of problems it can represent effectively. Of course, in the extreme case, either one or the other mode can always be predicted, but in these settings no benefit / insight is offered, and hard attention switching across the inputs limits flexibility.
- The assumptions about dataset preparation are a major weakness/limitation here, for two reasons:
- Adding a different label for mode is a major annotation requirement, and can be quite noisy/arbitrary.
- The notion of labeled “salient points” may make some sense because humans do attend to salient points reasonably consistently, but generally speaking they are not a consistent/principled thing to expect a human to label - who is to say which specific points should be annotated? Seems quite noisy, and a major source of error.
- Comparisons were somewhat weak. The authors assert that long-horizon goal prediction is not precise enough for this kind of imitation learning, but evidence from e.g. 3DDA, RPDiff, NDF, TAX-Pose, etc. demonstrates that the quasistatic problems considered in the real-world tasks are probably solvable with goal prediction.
- Benchmark tasks are also somewhat weak. While the real-world experiments are good, the authors should compare against other methods on simulated benchmarks (including annotating using their system or a synthetic annotation scheme) such as RLBench where some medium-horizon tasks have been examined with a litany of different approaches.
- There’s not much analysis of sample complexity.
- Several related works are missing / not discussed:
- Spatial grounded policies: Spatial Action Maps, Transporter Nets, HACMan, 3D Diffuser Actor
- Relative placement / reasoning: NDF, TAX-Pose, etc.
- Keyframing: Many of Stephen James’ works on top of RLBench look at long-horizon gripper positional reasoning and keyframing.
- The “soft salience” mechanism has been used before (e.g. TossingBot, many others), so while it’s pretty standard it would be good to compare / reference
- The variation for initial conditions in the real-world experiments appear somewhat limited, and having the high-level policy run on a global world frame is a bit unfair as a selling point because it doesn’t have to generalize as much by construction (probably could modify the other methods to have similar properties).
Questions
- Can the authors speak to sample complexity analysis for the different methods considered?
- How could this be used / modified to not require human annotation of mode/salient points directly?
- How does this method perform on simulated benchmarks w/ published baselines/metrics?
We are glad to hear that you appreciate our hybrid policy architecture and the real-world experimental results!
Re. flexibility of mode switching
Mode switching between long-distance free space motion, and short horizon fine-grained manipulation imposes a major constraint on the kinds of problems it can represent effectively; hard attention switching across the inputs limits flexibility.
Our hybrid action space is actually designed to accommodate a broad range of manipulation tasks that waypoint-only or dense-only policies traditionally struggle with. Waypoint-only policies predict movements too sparsely to handle parts of a task requiring reactive, precise manipulation, like inserting a coffee pod into a slot or fitting puzzle-like pieces such as the train track and bridge together. This limitation also applies to the class of waypoint policies mentioned (3DDA, RPDiff, NDF, TAX-Pose). Conversely, dense-only policies, while flexible, often fail to handle longer-horizon tasks with significant visuospatial variations. For example, in our evaluations, dense policies achieved only 5% and 35% success rates on the sequential train track and coffee-making tasks, respectively, compared to 80% and 70% for SPHINX, demonstrating the effectiveness of hybrid actions. Having access to these two distinct modes actually provides more flexibility, as SPHINX can arbitrarily switch back and forth between dense and waypoint modes at any time.
That said, we acknowledge that in our current work SPHINX uses a linear controller for waypoint reaching, limiting its capacity for dynamic tasks. To relax this, future work could predict spline-based trajectories to reach a waypoint or replace SPHINX’s linear controller with a learned [1] or engineered motion planner for dynamic, collision-aware reaching.
[1] Dalal, Murtaza, et al. "Neural mp: A generalist neural motion planner." arXiv preprint arXiv:2409.05864 (2024).
Re. Data Collection Assumptions
Adding a different label for mode is a major annotation requirement, and can be quite noisy/arbitrary
We would like to clarify that demonstrators do not need to manually annotate mode labels. Modes are automatically determined based on input: dense mode is triggered by joystick movements on the Spacemouse, and waypoint mode by mouse movements in the web UI.
Generally speaking they are not a consistent/principled thing to expect a human to label - who is to say which specific points should be annotated? Seems quite noisy, and a major source of error.
Re. noise in salient point annotations: The intent is for the demonstrator to label salient points that are meaningful to the task (e.g., the drawer handle for opening, the mug handle for coffee making, etc.). This is often a clearer and more direct annotation than more free-form human annotations such as language instructions or narrations. In our case, as our datasets are collected by a single demonstrator, we did not encounter label inconsistency. Additionally, since SPHINX predicts actions as offsets relative to salient points, it can theoretically compensate for somewhat noisy salient points by adjusting the offset predictions accordingly.
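For intuition, a minimal sketch of this decoupled parameterization (the names and the argmax readout are illustrative assumptions, not the authors' exact implementation):

```python
import numpy as np

def compose_waypoint(point_cloud: np.ndarray,
                     salient_logits: np.ndarray,
                     offset: np.ndarray) -> np.ndarray:
    """Waypoint position = most likely salient point + regressed offset.

    point_cloud: (N, 3) input points; salient_logits: (N,) per-point
    classification scores; offset: (3,) regressed displacement.
    """
    salient_point = point_cloud[np.argmax(salient_logits)]
    return salient_point + offset
```

A salient point labeled slightly off the "ideal" location can thus be absorbed by the learned offset, since only their sum must match the demonstrated waypoint.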
That said, we agree that consistent salient point labeling is crucial for scaling up SPHINX data collection. A promising direction for future work is offloading salient point detection to foundation models. For instance, recent models like MOLMO [2] excel at zero-shot semantic keypoint detection (i.e. “point to the ___” in an image, as shown here). This could be used to annotate salient points automatically and only have the demonstrator specify the corresponding waypoint action (offset + orientation + gripper state).
Re. noise in offset specification: Multimodality in teleoperated data is in general an open challenge for works that crowd-source demonstrations, such as [3-5], regardless of the interface used (VR, puppeteering, Spacemouse), but we believe our data collection interface has several features in place to promote more consistent action labels. Our UI allows the demonstrator to adjust a waypoint action (offset/orientation) as much as they would like via simple clicks and drags on a gripper CAD model in the scene, before sending the controller to reach the final specified waypoint. We see this as a strength of our approach, as it allows the demonstrator to be precise with specification before any waypoint actions are executed (as opposed to retrying after being imprecise). The resulting trajectories are also relatively clean due to the linear controller.
[2] Deitke, Matt, et al. "Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models." arXiv preprint arXiv:2409.17146 (2024).
[3] Mirchandani, Suvir, et al. "So You Think You Can Scale Up Autonomous Robot Data Collection?." CoRL 2024.
[4] Khazatsky, Alexander, et al. "Droid: A large-scale in-the-wild robot manipulation dataset." RSS 2024.
[5] Black, Kevin, et al. "π0: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).
Re. Baselines
Comparisons were somewhat weak. The authors assert that long-horizon goal prediction is not precise enough for this kind of imitation learning, but evidence from e.g. 3DDA, RPDiff, NDF, TAX-Pose, etc. demonstrates that the quasistatic problems considered in the real-world tasks are probably solvable with goal prediction.
The authors should compare against other methods on simulated benchmarks
Thank you for the suggestion! We note that NDF and TAX-Pose make two important assumptions which we do not make: (1) that only rigid objects are manipulated, which allows for the computation of cross-object poses, and (2) access to multi-object segmentation. SPHINX makes neither of these assumptions, and specifically considers articulated object tasks such as opening a drawer or closing the coffee machine lid. For this rebuttal, we add a comparison with PerAct, which does not make these assumptions (also requested by Reviewer 4), and we report the following results on the waypoint-only RoboMimic Can/Square tasks:
| Method | RoboMimic Can | RoboMimic Square |
|---|---|---|
| PerAct | 0.77 | 0.12 |
| Vanilla Waypoint | 0.70 | 0.13 |
| Vanilla Waypoint + Aux SP | 0.78 | 0.655 |
| SPHINX | 0.935 | 0.865 |
For the simpler Can task, PerAct achieves a success rate of 77%, comparable to the Vanilla Waypoint baseline. However, the Square task, which demands higher precision for grasping the tool and aligning it carefully with the pole, presents a greater challenge. Here, PerAct achieves only a 12% success rate, likely due to the voxelized input space introducing quantization errors. This is comparable with the Vanilla Waypoint approach (13%), which uses point clouds and directly predicts waypoints. For the same architecture, incorporating salient point classification as an auxiliary task significantly improves performance to 65.5% (Vanilla Waypoint + Aux SP). This large performance gap suggests the benefit of salient point prediction as an auxiliary task, which 3DDA, RPDiff, NDF, and TAX-Pose critically lack. Finally, SPHINX, which leverages both salient point classification and offset prediction, achieves the highest success rate of 86.5%, demonstrating the combined effectiveness of both in tasks requiring a high degree of precision.
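As a concrete illustration of the "Aux SP" variant, one could add a per-point salient point classification term to the waypoint regression loss; the sketch below is an assumed formulation (the exact loss choice and weighting are not specified at this level of detail):

```python
import torch.nn.functional as F

def loss_with_aux_sp(pred_waypoint, gt_waypoint,
                     salient_logits, gt_salient_idx,
                     aux_weight: float = 1.0):
    """Waypoint regression plus auxiliary salient point classification.

    pred_waypoint/gt_waypoint: (B, 3) positions; salient_logits: (B, N)
    scores over point cloud points; gt_salient_idx: (B,) index of the
    annotated salient point in the cloud.
    """
    main = F.mse_loss(pred_waypoint, gt_waypoint)          # direct waypoint loss
    aux = F.cross_entropy(salient_logits, gt_salient_idx)  # auxiliary SP loss
    return main + aux_weight * aux
```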
In general, we also note that waypoint-style policies (including 3DDA, RPDiff, NDF, TAX-Pose) by design cannot address tasks requiring precise and reactive movements, such as coffee-making (pod insertion) or train track assembly (placing the train on the bridge). We found that these kinds of fine-grained adjustments are not even reliably teleoperable in waypoint mode alone, due to sensor noise, controller imprecision, and the linear controller limiting reactivity. This highlights the necessity of our hybrid action modes.
Re. Performance & Generalization
There’s not much analysis of sample complexity.
While we do not explicitly analyze sample complexity, our empirical results show that, for the same amount of demonstrations, SPHINX achieves a 53.3% higher success rate than dense-only policies on the hybrid real-world tasks. Additionally, for a fixed number of demonstrations, SPHINX outperformed Vanilla Waypoint policies that did not leverage salient points and offsets on 3 waypoint-only tasks. This suggests that salient point classification and hybrid action modes promote better sample efficiency.
The variation for initial conditions in the real-world experiments appear somewhat limited
We do attempt to vary the initial object poses fairly broadly across the workspace, best visualized at https://sphinx-il.github.io/ (Spatial Generalization Section), though we do acknowledge that the arrangements are more limited for significantly harder tasks like coffee making and train track assembly. We additionally consider a range of visual changes such as novel camera viewpoints, object heights, and distractors. We are excited to further stress-test SPHINX’s generalization capabilities in the future, and are optimistic that using foundation models to detect salient points can allow for more drastic category or instance-level object generalization.
Re. Performance & Generalization (cont.)
Having the high-level policy run on a global world frame is a bit unfair as a selling point because it doesn’t have to generalize as much by construction (probably could modify the other methods to have similar properties)
We argue that leveraging point clouds, which naturally encourage better spatial awareness than third-person images, is a strength of our approach – a choice that is simply overlooked in many prior works. Modifying prior methods in this way requires the same level of care as in SPHINX, and this choice results in far better generalization and performance across different object spatial arrangements. We do acknowledge and mention that the downside to point clouds is reliance on known camera extrinsics. In the future, it would be interesting to leverage recent works like Dust3r [6] and Monst3r [7], which predict camera extrinsics given RGB images. This would allow us to recover the global world frame for unseen viewpoints at test time, retaining the generalization benefits of our approach without the overhead of camera calibration.
[6] Wang, Shuzhe, et al. "Dust3r: Geometric 3d vision made easy." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[7] Zhang, Junyi, et al. "Monst3r: A simple approach for estimating geometry in the presence of motion." arXiv preprint arXiv:2410.03825 (2024).
Re. Related Works:
Thank you for putting these additional works on our radar! We have since added a discussion of Spatial Action Maps, Transporter Nets, HACMan, 3D Diffuser Actor, NDF, TAX-Pose, 3DDA, RPDiff, and existing waypoint/keyframing policies to Appendix C (Extended Related Work).
I appreciate the authors' response. However, I am still a bit unconvinced w.r.t. the generalizability of salient point annotation fundamentally from an imitation learning perspective. While I recognize that this decomposition leads to increased performance on the chosen benchmark, I suspect it also would limit larger-scale generalizability. My rationale is this: the human annotator may choose a specific point to anchor to, but there are likely many different salient points in a scene one could anchor on for whatever reason one chooses. If a policy were able to emergently learn this point saliency during interaction, then I'd buy it. But the demonstration space is just too arbitrary - I could choose any point on an object, say that I was anchoring on the rim of the mug instead of the handle of the mug when grasping, in order to come up with a trajectory. What's more, in cluttered scenes, there may be many different possible salient regions affecting the manipulation itself (e.g. grasping a mug under occlusion/infeasible angles/etc). I don't have a problem with annotating saliency per se (a weak/noisy signal in demonstrations is fine) - I simply have a problem that the actual policy itself is GROUNDED at that point (reference frame, coordinates, etc.), and under the IL paradigm this presents a very challenging learning target for generalization under noisy/arbitrary salience points. I would need to see substantially more evidence (of the generalization sort mentioned) to feel comfortable with this sort of demonstration anchoring as a general-purpose solution.
We thank all the reviewers for their valuable and constructive feedback! We are glad to hear that reviewers appreciate SPHINX's hybrid policy architecture, our impressive performance on challenging real-world manipulation tasks, our competitive performance against SoTA baselines, and our thorough ablations. We have updated the paper (changes in blue) to include more training details on data augmentation (Appendix B), more related work (Appendix C), and a discussion of the poor performance of some baselines along with the steps we took to ensure that their reported performance is valid (Appendix D).
Below, we highlight shared concerns and our attempts to address them:
Lack of Novelty
We want to clarify that the novelty of our approach is two-fold:
- Our decoupled waypoint action parameterization into salient point classification + offset regression enables more precise waypoint actions, spatial generalization, and robustness to distractors
- SPHINX is, to our knowledge, the first IL agent able to interchange both its input space and output action space on the fly during execution, which is critical for progressing through different stages of long-horizon tasks (a minimal rollout sketch follows this list)
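A minimal rollout sketch of this on-the-fly switching; the environment and policy interfaces below are hypothetical stand-ins, not the authors' actual API:

```python
def rollout(env, waypoint_policy, dense_policy, max_steps=500):
    """Alternate between waypoint and dense modes as the policy dictates."""
    obs, mode = env.reset(), "waypoint"   # tasks typically begin with a reach
    for _ in range(max_steps):
        if mode == "waypoint":
            waypoint, mode = waypoint_policy(obs.point_cloud)
            obs = env.reach(waypoint)     # linear controller drives to waypoint
        else:
            action, mode = dense_policy(obs.wrist_image)
            obs = env.step(action)        # one fine-grained end-effector delta
```

Each policy call returns both its action and the predicted mode for the next step, mirroring the mode prediction described in the author responses above.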
Comparisons Against Alternative Waypoint Architectures
- Reviewers 57XM and 4Uf5 asked about how SPHINX compares against other waypoint policy architectures. Although we have already shown SPHINX’s competitive performance compared to 4 SoTA approaches (Diffusion Policy, 3D Diffusion Policy, Hydra, OpenVLA), we have since added a comparison with PerAct, a SoTA waypoint-only policy which directly outputs waypoint actions (omitting salient points and offsets) given voxelized input. SPHINX performs 4.2x better than PerAct on the RoboMimic Can and Square tasks.
- We hypothesize that these performance gains, especially for more precise tasks, are only made possible by SPHINX’s decoupled way of representing waypoint actions as salient points and offsets, which several existing waypoint policy architectures (Act3D, 3DDA, NDF, TAX-Pose) explicitly lack.
- We have since added a discussion around the distinction between these approaches and SPHINX to Appendix C (Extended Related Work).
Manual Annotation of Salient Points
- Specifying salient points (and corresponding waypoints) in our data collection UI is as simple as a few clicks and drags in seconds. We argue that this is both very straightforward to do and worth it for two reasons:
- The demonstrator no longer has to pay attention to their every single move (as is the case with VR/Spacemouse), since reaching long distances is delegated to a controller.
- Salient point annotation is minimal in terms of overhead but serves as a rich source of supervision for the policy.
- Still, we appreciate the reviewer's suggestions for relaxing the assumption of manual annotation. In the future, we are excited about potentially offloading salient point detection to recent foundation models such as MOLMO [1], which excels at zero-shot semantic keypoint detection (i.e., "point to the ___" in an image, as shown here). This could be used during data collection to remove manual salient point annotation, such that the demonstrator only needs to specify the corresponding waypoint action for a predicted salient point, or at test time to infer salient points on unseen objects and open up SPHINX's visual generalization capabilities.
[1] Deitke, Matt, et al. "Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models." arXiv preprint arXiv:2409.17146 (2024).
In the sections below, we further address each reviewer’s comments and questions point-by-point.
The paper introduces SPHINX (Salient Point-Based Hybrid ImitatioN and eXecution), an imitation learning method for manipulation tasks. It is a hybrid execution model combining coarse waypoint-based navigation and dense actions for precise manipulation. It first predicts salient points for long-range movement; once near a salient point, it switches to predicting dense end-effector movements for the precise phases of a task. The method is validated empirically on real-world and simulated tasks, demonstrating performance improvements over baseline methods.
Strengths:
- The approach is innovative. It provides a clear mechanism for automatic mode switching, potentially addressing limitations other methods face in long-horizon and precision tasks.
- Real-world experiments are extensive and well covered.
- The writing is clear, and the results are well presented.
Weaknesses:
- Dependence on expert-determined salient points may limit scalability and introduce inconsistencies across annotators, particularly in multimodal or occluded scenarios.
- While the experiments are compelling, the paper lacks comparison with stronger waypoint-based methods (e.g., RVT-2, 3D Diffuser Actor). This is especially important because one of the major weaknesses the authors point out for waypoint-based methods is lack of precision -- "PerAct achieves only a 12% success rate, likely due to the voxelized input space introducing imprecision from discretization." Hence, for this approach to shine in the long run, it must be compared to SOTA waypoint-based methods that are supposed to be much more precise.
Final Decision: Four reviewers tend toward acceptance and one toward rejection. The concerns pointed out by the reviewers are reasonable. However, I think the strengths of the innovative approach outweigh the weaknesses, and so I would recommend acceptance. Also, the paper would benefit greatly from better comparison, as pointed out in Weakness 2, and I hope the authors can add such a comparison in the final version.
Additional Comments from Reviewer Discussion
Overall, this paper was extensively discussed. Key points included:
- Salient Point Annotation (Reviewer 4Uf5, Reviewer B6a8): The authors explained that mode labels are set automatically during data collection and that salient point annotation is lightweight, reducing annotation noise.
- Baseline Comparisons (Reviewer 4Uf5, Reviewer 57XM): Added PerAct comparisons show SPHINX's precision advantages. Gaps remain in benchmarking against RVT-2 and 3D Diffuser Actor.
- Visual Generalization (Reviewer gz3J): The authors plan to use foundation models for better generalization, a promising but future direction.
The paper makes strong contributions to imitation learning. Despite some gaps, the empirical evidence supports its value. I recommend acceptance.
Accept (Poster)