3D Equivariant Visuomotor Policy Learning via Spherical Projection
We introduce Image-to-Sphere Policy (ISP), a novel SO(3)-equivariant policy learning framework that learns from single eye-in-hand RGB inputs.
Abstract
Reviews and Discussion
This paper proposes an SO(3)-equivariant visuomotor policy learning framework for robot manipulation using an eye-in-hand monocular camera.
This approach first extracts SO(2)-equivariant features from an in-hand camera using an equivariant encoder. These features are then projected onto a sphere using spherical harmonics, enabling SO(3)-equivariant policy learning from monocular images. Next, an equivariance correction is performed by rotating the invariant wrist-mounted image representation by the gripper orientation. Finally, the authors extend EquiDiff from SO(2) to SO(3) to achieve SO(3)-equivariant visuomotor policy learning.
The main contribution is the equivariance correction step, which bridges the gap between 2D equivariant policy learning from top-down images and 3D equivariant policy learning from point cloud data.
Strengths and Weaknesses
Strengths
- The paper is clearly written and well motivated, with an extensive literature review of current robot manipulation policy learning. Substantial performance improvements are shown on the MimicGen benchmark with only 100 demos and on real-world tasks with around 60 demos.
- The proposed spherical projection and equivariance correction present a practical method for extending 2D image-based manipulation policy learning to 3D equivariant settings.
- The paper includes detailed background on necessary concepts such as spherical harmonics and diffusion policy.
Weaknesses
- The introduction and literature review of equivariant robotic manipulation describe the current research gap in terms of point cloud inputs, whereas the proposed method uses monocular images. There could be some discussion on how the proposed method compares with or complements point cloud-based approaches. Additionally, Figure 2 is potentially misleading, as it shows point clouds as an example even though the paper focuses on images.
- The projection from image features to the sphere is described only as “orthographic projection”. Compared to the detailed background on representations of SO(3), the projection step lacks detail for the reader to fully understand. For example, what defines the center of the sphere, and how is the projection spatially interpreted?
- The description in Section 4.3 is confusing. Earlier sections focus on the construction of SO(3)-equivariant features, but this section starts with an SO(2)-equivariant encoder and SO(2)-invariant features without clearly explaining how these are integrated. The benefit of combining invariance and equivariance (as shown in Figure 3) is also not convincing, especially since the network does not separate translation (which needs invariance) from rotation (which needs equivariance or invariance). In addition, equivariance is the general form of invariance. The role of invariant features could be better justified.
- The paper mentions that the SO(3) version is more computationally intensive than SO(2). Adding runtime or memory usage would help quantify the trade-off between the two versions.
- The literature review overlooks a recent SE(3)-equivariant point cloud-based method: Zhu, Xupeng, Fan Wang, Robin Walters, and Jane Shi. "SE (3)-Equivariant Diffusion Policy in Spherical Fourier Space." In Forty-second International Conference on Machine Learning.
Questions
- What are the trade-offs between using SO(3)-equivariant features from image and using equivariant features from point cloud inputs?
- The real-world experiments are conducted using a similar robot platform as in simulation. A discussion of sim-to-real transfer, hardware generalizability, and data collection requirements would improve the paper’s practical relevance. For instance, how does the system perform with different camera configurations, image resolutions, or frame rates? Is retraining required when switching hardware?
- For spherical projection, what defines the center of the sphere, and how is the projection spatially interpreted?
- The paper mentions that the SO(3) version is more computationally intensive than SO(2). Could using a sparser symmetry group improve efficiency?
Limitations
The authors have listed some critical limitations of this work. Additionally, monocular images inherently suffer from scale ambiguity, and the paper does not address how this issue is mitigated. A discussion of this limitation would help readers be aware of it.
Justification for Final Rating
All of my questions have been addressed. I appreciate the clarification on the comparison between learning SO(3)-equivariant features from images versus equivariant features from point clouds. With all the corrections and additional information, I maintain my positive rating for this paper.
Formatting Issues
N/A
We thank the reviewer for the helpful comments. Please see our response below.
The introduction and literature review of equivariant robotics manipulation describes the current research gap of using the point cloud input whereas the proposed method uses monocular image. There could be some discussion on how the proposed method compares or complements the point cloud-based approach. Additionally, Figure 2 is potentially misleading as it shows point clouds as an example even though the paper focuses on image.
Our method is designed to address the unique challenge of learning SO(3)-equivariant policies directly from monocular RGB inputs, in contrast to prior SO(3)-equivariant approaches that rely on point clouds [a,b]. Point cloud methods typically require accurate depth sensing and 3D calibration, and are computationally expensive due to heavy geometric processing. Our method instead provides a lightweight, calibration-free alternative that achieves strong performance using only monocular images. We view these approaches as complementary: point clouds offer richer geometry, while image-based methods are more scalable and easier to deploy. We will revise the related work section to clarify this distinction. Additionally, we appreciate the feedback on Figure 2 and will update it to avoid confusion.
The projection from image features to the sphere is described only as “orthographic projection”. Compared to the detailed background on representations of SO(3), the projection step lacks detail for the reader to fully understand. For example, what defines the center of the sphere, and how is the projection spatially interpreted? ( For spherical projection, what defines the center of the sphere, and how is the projection spatially interpreted?)
The orthographic projection in our paper refers to a signal remapping operation that lifts 2D feature maps onto the unit sphere in the camera frame. Unlike traditional geometric projections that require depth or camera calibration, our approach does not rely on a 3D center or physical camera parameters. The projection is learned. In practice, the spherical signal is sampled using a HEALPix grid, which provides an equal-area, hierarchical discretization of the sphere. For each point on the sphere, we apply a learnable weighted aggregation over the full 2D feature map to compute the signal value. This allows the model to flexibly adapt how spatial features are projected onto the sphere, rather than relying on fixed kernels.
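To make the lifting concrete, here is a minimal PyTorch sketch of a learnable weighted aggregation from a 2D feature map onto a HEALPix-sampled sphere. This is our own illustration rather than the authors' implementation; the feature-map size and the nside value are assumed hyperparameters.

```python
import torch
import torch.nn as nn

class LearnableSphericalLifting(nn.Module):
    """Lift a 2D feature map onto a HEALPix-sampled sphere via a learnable,
    per-sphere-pixel weighted aggregation over all image locations (illustrative sketch)."""

    def __init__(self, feat_hw: int, nside: int = 4):
        super().__init__()
        self.npix = 12 * nside ** 2                        # HEALPix pixel count: 12 * nside^2
        # One learnable weight vector over the H*W grid for every sphere pixel.
        self.logits = nn.Parameter(torch.zeros(self.npix, feat_hw * feat_hw))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) features from the image encoder
        flat = feats.flatten(2)                            # (B, C, H*W)
        weights = self.logits.softmax(dim=-1)              # (npix, H*W), convex aggregation
        # Each sphere pixel receives a learned mixture of all image locations.
        return torch.einsum("bcn,pn->bcp", flat, weights)  # (B, C, npix)

# Example: lift an 8x8, 64-channel feature map onto an nside=4 sphere (192 pixels).
sphere_signal = LearnableSphericalLifting(feat_hw=8)(torch.randn(2, 64, 8, 8))
```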
We acknowledge that our initial explanation was insufficient, and we will expand this section in the final version to describe the projection mechanism. Thank you for pointing this out.
The description in section 4.3 is confusing. Earlier sections focus on construction of SO(3) equivariant features, but this section starts with SO(2)-equivariant encoder and SO(2)-invariant features without clearly explaining how these are integrated. The benefit of combining invariant and equivariance (as shown in Figure 3) is also not convincing, especially since the network does not separate translation (which needs invariant) and rotation (needs equivariant to invariant). In addition, equivariance is the general form of invariance. The role of invariant features could be better justified.
Before addressing your question, we would like to clarify a typo in Section 4.3 (line 240): “Combining the SO(2)-equivariant encoder (Proposition 1).” This should instead be an SO(3)-equivariant encoder, as our method is designed to achieve end-to-end SO(3) equivariance. We acknowledge that this error may have caused confusion regarding the motivation and design described in this section, and we will correct it in the final version.
Our method maintains end-to-end SO(3)-equivariance by combining an SO(3)-equivariant encoder with an SO(3)-equivariant diffusion U-Net. This ensures that predicted actions transform consistently under global scene rotations, enabling strong generalization and sample efficiency.
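For reference, a rotation g acts on a spherical feature degree by degree through Wigner-D matrices. The minimal sketch below (using the e3nn library, with hypothetical Euler angles) only illustrates that action and is not the authors' code.

```python
import torch
from e3nn import o3  # provides real Wigner-D matrices; assumed dependency

lmax = 2
# A spherical feature as one coefficient block per degree l (random placeholder values).
coeffs = [torch.randn(2 * l + 1) for l in range(lmax + 1)]

# Gripper orientation as ZYZ Euler angles (hypothetical values).
alpha, beta, gamma = torch.tensor(0.3), torch.tensor(1.1), torch.tensor(-0.4)

# Rotating the spherical signal multiplies each degree-l block by D^l(g).
rotated = [o3.wigner_D(l, alpha, beta, gamma) @ c for l, c in enumerate(coeffs)]
```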
In parallel, we use an SO(2)-equivariant image encoder to extract SO(2)-invariant features prior to spherical projection. These invariant features are not responsible for preserving the overall SO(3)-equivariance of the policy. Rather, we adopt this design because the resulting image encoder performs better at extracting robust and consistent visual features. In practice, this leads to improved image representations without compromising the model’s equivariant structure.
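As a point of comparison, SO(2)-invariant features can also be obtained generically by pooling a standard encoder over the cyclic group of image-plane rotations. The sketch below shows that recipe only as an illustration; the paper instead uses a steerable SO(2)-equivariant CNN.

```python
import torch
import torchvision.transforms.functional as TF

def so2_invariant_features(encoder: torch.nn.Module, img: torch.Tensor, n: int = 8) -> torch.Tensor:
    """Average encoder features over the cyclic group C_n of in-plane rotations.
    This yields (approximately) rotation-invariant features; illustrative stand-in only."""
    feats = [encoder(TF.rotate(img, angle=360.0 * k / n)) for k in range(n)]
    # Averaging over the group orbit removes the dependence on in-plane rotation.
    return torch.stack(feats, dim=0).mean(dim=0)

# Usage with any CNN mapping (B, 3, H, W) -> (B, D):
# z = so2_invariant_features(my_cnn, wrist_images)
```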
This design choice is empirically supported: as shown in the ablation study, replacing the SO(2)-equivariant image encoder (EquiEnc) with a standard CNN leads to a 6.8% average drop in success rate across four tasks. This confirms the practical benefit of using invariant features for image encoding. We will revise Section 4.3 to clarify this point and better explain the motivation behind using SO(2)-invariant image features.
The paper mentioned that the SO(3) version is more computationally intensive than SO(2). Adding runtime or memory usage could be helpful to quantify the trade-off between the two versions. (The paper mentioned that the SO(3) version is more computationally intensive than SO(2). Could using a sparser symmetry group improve efficiency?)
The SO(3) version indeed incurs higher computational cost than the SO(2) variant. Specifically, the SO(2) model trains about 2x slower than the original Diffusion Policy, while the SO(3) model is roughly 2x slower and requires 1.8x more memory than the SO(2) version (i.e., about 4x slower than Diffusion Policy).
However, for inference, our method remains highly efficient and suitable for real-time deployment. On a single RTX 4090 GPU:
- Diffusion Policy: ~65 ms per inference
- Ours (SO(2)): ~58 ms
- Ours (SO(3)): ~123 ms
The SO(2) variant is even faster than the baseline due to our use of a lighter-weight diffusion U-Net. While the SO(3) model is slower, its inference speed is still well within real-time requirements (e.g., a 10 Hz control frequency), and the use of multi-step action chunks further supports efficient closed-loop execution.
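As a rough illustration of why 123 ms per inference fits a 10 Hz control loop, consider the back-of-the-envelope check below; the chunk length of 8 actions is an assumed value, not one reported above.

```python
# Illustrative latency budget; chunk_len = 8 is an assumption.
inference_ms = 123                                     # SO(3) model on an RTX 4090
control_hz = 10                                        # target control frequency
chunk_len = 8                                          # actions returned per inference call (assumed)

execution_window_ms = 1000 / control_hz * chunk_len    # 800 ms to execute one chunk
assert inference_ms < execution_window_ms              # inference hides inside execution
```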
For the question of whether using a sparser symmetry group could improve efficiency: yes, our framework is fully compatible with such groups. We use the icosahedral group (I₆₀) as the default due to its strong trade-off between rotational coverage and performance. However, sparser groups such as the octahedral group can be employed to further reduce computational overhead.
We will include detailed comparisons of training time, inference time, and memory usage in the final version.
The literature review overlooks a recent SE(3)-equivariant point cloud-based method: Zhu, Xupeng, Fan Wang, Robin Walters, and Jane Shi. "SE (3)-Equivariant Diffusion Policy in Spherical Fourier Space." In Forty-second International Conference on Machine Learning.
Thank you for pointing out this important related work. We acknowledge that Zhu et al. (ICML 2025) also study SE(3)-equivariant policy learning via diffusion in spherical Fourier space, but using point cloud input. We will add a discussion of this method in the final version and clarify how our approach differs.
What are the trade-offs between using SO(3)-equivariant features from image and using equivariant features from point cloud inputs?
Compared to point cloud-based SO(3)-equivariant methods, our image-based SO(3)-equivariant framework requires no depth sensing, camera calibration, or 3D reconstruction. This makes it easier to deploy in real-world scenarios, particularly for eye-in-hand setups with limited sensing.
While point clouds provide richer 3D geometry and enable more precise SO(3)-equivariance, they typically rely on accurate depth sensors and careful calibration. Nonetheless, we believe that monocular RGB and point cloud inputs are not mutually exclusive but complementary for achieving SO(3)-equivariance.
The real-world experiments are conducted using a similar robot platform as in simulation. A discussion of sim-to-real transfer, hardware generalizability, and data collection requirements would improve the paper’s practical relevance. For instance, how does the system perform with different camera configurations, image resolutions, or frame rates? Is retraining required when switching hardware?
We agree that discussing hardware generalizability and deployment conditions improves the practical relevance of our method. However, our method does not perform sim-to-real transfer; we focus on same-domain training and evaluation.
A key advantage of our approach is that it does not rely on any camera intrinsics, depth sensing, or calibration. The spherical projection is a model-agnostic transformation applied to image features and operates independently of specific sensor parameters. This design makes our method naturally adaptable to different camera configurations.
Furthermore, our method can generalize across different robot arms without retraining, as long as the same gripper is used. Since part of the gripper naturally appears in the in-hand camera view, maintaining a consistent gripper appearance should suffice to preserve the validity of the learned policy. If the gripper changes, a small number of additional demonstrations may be required to adapt the model.
Regarding data collection requirements, we will include additional discussion based on our extensive experimental experience. This will cover practical guidelines for efficiently collecting demonstration data, strategies for increasing diversity and informativeness of samples, and preprocessing steps for trajectory handling.
[a] Tie, C., Chen, Y., Wu, R., Dong, B., Li, Z., Gao, C., & Dong, H. (2024). Et-seed: Efficient trajectory-level se (3) equivariant diffusion policy. arXiv preprint arXiv:2411.03990.
[b] Yang, J., Cao, Z. A., Deng, C., Antonova, R., Song, S., & Bohg, J. (2024). Equibot: Sim (3)-equivariant diffusion policy for generalizable and data efficient learning. arXiv preprint arXiv:2407.01479.
Thank you for the detailed responses. All of my questions have been addressed. I appreciate the clarification on the comparison between learning SO(3)-equivariant features from images versus equivariant features from point clouds. With all the corrections and additional information, I maintain my positive rating for this paper.
Dear Reviewer GXof,
As the reviewer-author discussion period comes to a close, we would greatly appreciate it if you could kindly let us know whether our responses have addressed your concerns and if you have any additional questions about our work.
Thank you again for your time and feedback!
The paper introduces a framework that projects monocular RGB images onto a sphere and corrects for camera pose to encode SO(3)-equivariance for visuomotor policy learning. It applies spherical convolutions to these world-aligned features and uses an equivariant diffusion process to predict multi-step action sequences conditioned on observations. The authors prove that their encoder maintains global SO(3)-equivariance and local SO(2)-invariance under camera rotations by leveraging spherical harmonics and Wigner D-matrices. In simulation across twelve manipulation tasks, the SO(3)-equivariant policy outperforms diffusion policy, planar-equivariant diffusion, and transformer-based baselines in terms of success rate and sample efficiency. On a UR5 robot with a wrist-mounted GoPro, the method yields higher success rates on pipe disassembly and grocery bag retrieval tasks using 60–65 demonstrations.
Strengths and Weaknesses
+++ The paper is very well-written, with a clear thesis, problem formulation, proofs, and experiments. It's a very solid effort and self-contained, regardless of how significant or impactful the idea is.
+++ The paper formally defines an “equivariance correction” mechanism using the end-effector rotation, proves global SO(3)-equivariance (Proposition 1) and local SO(2)-invariance (Proposition 2) via spherical harmonics and Wigner D-matrices, and provides concrete spectral proofs in the appendix.
+++ Evaluation includes twelve diverse simulation tasks and four real-robot tasks, with comparisons against diffusion policy, planar-equivariant diffusion, and transformer baselines—showing consistent, large gains in both success rate and data efficiency. The real-world robot demo looks quite convincing.
+++ Ablation studies isolate the impact of spherical projection, the SO(2)-equivariant encoder, and the equivariant diffusion U-Net, demonstrating that each component contributes significantly to overall performance (drops of 6–9% when removed).
--- The method enforces equivariance by sampling a finite subgroup (e.g. I₆₀ or C₈) of SO(3) and applying a shared U-Net over those elements. This discretization may fail to capture rotations outside the sampled set, potentially degrading performance under arbitrary continuous rotations. Is this an issue during experiments?
--- The orthographic lifting of a wide-FOV image onto the sphere can introduce distortions—especially near the edges—that the model must learn to correct. The paper does not evaluate robustness to changes in camera intrinsics (e.g. different FOVs or lens distortion), which could affect real-world deployment.
--- One potential drawback of enforcing strong inductive biases such as symmetry and equivariance is the potential loss of generalization ability. All experiments (simulation and real) use environments and objects drawn from the same benchmarks. The paper does not assess generalization to novel tasks, varied backgrounds, lighting conditions, or object textures. I believe experiments under more diverse setting and contexts will be very helpful in assessing the significance of the proposed method.
Questions
- How sensitive are the results to calibrations?
- How were the finite subgroup and the sampling set chosen in practice, and how sensitive is performance to the size and structure of them?
- How are camera intrinsics and lens distortions handled during the orthographic projection onto the sphere, and what happens if the FOV or intrinsics change at test time?
- Can you clarify the preprocessing steps applied to the GoPro images and whether they differ between simulation and real-world experiments?
Limitations
The authors have provided high-level discussions of the limitations of the method.
Justification for Final Rating
I appreciate the authors for the really detailed response. The response has addressed most of my concerns, especially the ones regarding calibration and camera intrinsic. However, my concerns for the generalization issue with newly added structure to the policy still hold. I will maintain my current rating.
Formatting Issues
N/A
We thank the reviewer for the careful review. Please see our response below.
The method enforces equivariance by sampling a finite subgroup (e.g. I₆₀ or C₈) of SO(3) and applying a shared U-Net over those elements. This discretization may fail to capture rotations outside the sampled set, potentially degrading performance under arbitrary continuous rotations.
We did not observe performance degradation due to discretization in our experiments. We empirically found that using discrete subgroups such as I₆₀ and C₈ yielded better performance than continuous irreducible representations (irreps), which is consistent with prior findings in equivariant learning [a,b]. While discrete groups reduce theoretical symmetry, they enable scalable and expressive models by avoiding the computational overhead and the constraints on activation functions associated with continuous irreps. These subgroups are also empirically effective in robotics applications [c,d].
To further mitigate potential limitations of discretization and improve generalization, we apply random SO(2) rotations to both observations and robot states during training to augment the data, as described in Appendix D (Training Details). Moreover, although each sampled set contains a finite number of rotations (e.g., 60 in I₆₀), we are not limited to a single set; in practice, we can sample multiple such sets to improve angular coverage when necessary.
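A minimal sketch of such a shared random SO(2) augmentation is shown below. It is our own illustration under the assumption that one planar rotation angle is applied jointly to the wrist image and to the planar (x, y) components of the robot state; the authors' exact scheme is described in their Appendix D and is not reproduced here.

```python
import math
import random
import torch
import torchvision.transforms.functional as TF

def random_so2_augment(img: torch.Tensor, xy: torch.Tensor):
    """Apply one shared random planar rotation to an image observation and to the
    (x, y) components of the robot state/action. Illustrative sketch only."""
    theta = random.uniform(0.0, 2 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    rot2d = torch.tensor([[c, -s], [s, c]], dtype=xy.dtype)
    img_aug = TF.rotate(img, angle=math.degrees(theta))   # in-plane image rotation
    xy_aug = xy @ rot2d.T                                  # rotate planar coordinates
    return img_aug, xy_aug
```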
Finally, while equivariant modeling imposes useful inductive structures, it does not prevent the network from generalizing beyond the sampled rotations. The model retains the capacity to interpolate smoothly across SO(3), especially when supported by diverse data and augmentations.
We will clarify these points in the final version. Thank you for highlighting this.
The orthographic lifting of a wide-FOV image onto the sphere can introduce distortions—especially near the edges—that the model must learn to correct. The paper does not evaluate robustness to changes in camera intrinsics (e.g. different FOVs or lens distortion), which could affect real-world deployment.
This is a very insightful question. Unlike traditional methods relying on geometric modeling, our approach uses a learnable orthographic lifting mechanism that operates directly on feature maps. After extracting 2D features via an equivariant CNN backbone, we lift them to the sphere using a learnable weighted aggregation kernel. The lifting operation is a purely signal-level transformation that does not depend on camera intrinsics, depth, or calibration, making it naturally agnostic to FOV and lens distortion. As a result, the model can learn to compensate for distortions such as wide-FOV warping or lens variation during training.
We also provide empirical support to address this concern. In our experiments, we use different cameras for simulation and real-world setups, each with different focal lengths and fields of view. Despite this discrepancy, our method performs well in both domains, suggesting that the model successfully learns to compensate for intrinsic variations. In real-world deployment, we tested our method with several different FOV settings and observed reliable performance across these configurations. While we ultimately chose the best-performing FOV setting for evaluation, our experience suggests that the learned projection remains effective across these optical conditions. In future work, we plan to explore training with mixed camera intrinsics to improve generalization to arbitrary test-time camera configurations.
We believe this learnable and sensor-agnostic design is a strength of our method, enabling robust deployment across varied hardware setups. While we initially omitted a detailed explanation due to prior coverage in the I2S paper [e], we agree that adding clarification in the final version would be beneficial.
One potential drawback of enforcing strong inductive biases such as symmetry and equivariance is the potential loss of generalization ability. All experiments (simulation and real) use environments and objects drawn from the same benchmarks. The paper does not assess generalization to novel tasks, varied backgrounds, lighting conditions, or object textures. I believe experiments under more diverse setting and contexts will be very helpful in assessing the significance of the proposed method.
To answer this question, we conducted additional real-world experiments to evaluate the robustness of our method under domain shifts. Specifically, we varied lighting conditions by introducing a strong white point light source near the table, replacing the original overhead warm ambient lighting. This significantly altered the shadows and color temperature. Additionally, we introduced background perturbations by placing various household objects on the table to create a cluttered scene, and further increased the complexity by manually moving these objects during task execution. Finally, we introduced occlusions by briefly blocking part of the in-hand camera view with a human hand during execution. For the Box-Pipe Disassembly task, our policy achieved 85% success under lighting changes, 75% with noisy backgrounds, and 85% under partial camera occlusion (each evaluated with 20 rollouts). Compared with the 80% of no perturbation reported in Section 5.2, our method demonstrates strong generalization to realistic perturbations. We will include these results in the final version.
Moreover, we would like to clarify that the primary form of generalization targeted by our method is spatial generalization—specifically, generalizing across object and scene orientations via SO(3)-equivariance. Although task/instance-level generalization (e.g., novel tasks or objects) is not the focus of this work, our approach is fully compatible with such extensions. For instance, integrating pretrained encoders such as CLIP or DINO, or incorporating language instructions, could provide strong generalization across domains. This is a promising direction we plan to explore.
How sensitive are the results to calibrations?
Our method does not require any calibration, as it only requires an eye-in-hand image input. In both simulation and real-world experiments, no camera calibration is performed. This independence from camera intrinsics and distortion parameters is a key strength of our approach, contributing to its robustness and ease of deployment across different camera setups.
How were the finite subgroup and the sampling set chosen in practice, and how sensitive is performance to the size and structure of them?
We chose the finite subgroups based on a trade-off between rotational coverage and computational efficiency. Specifically, we use the icosahedral group I₆₀ for SO(3), which provides one of the densest and most uniformly distributed finite approximations of the rotation group, and the cyclic group C₈ for SO(2), which is lightweight and commonly used in prior works. For each subgroup, we use all group elements as the sampling set.
In practice, we find that performance is not highly sensitive to the exact choice of subgroup, as long as the sampling provides reasonable angular coverage. Additionally, we apply random rotation augmentations during training, which expose the model to diverse data and enhance both generalization and continuity.
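To illustrate what "all group elements as the sampling set" means in the cyclic case, the sketch below enumerates C₈ as rotation matrices about a single axis; the 60-element icosahedral group used for SO(3) would be built from its generators in the same spirit. This is our illustration, not the authors' code.

```python
import math
import torch

def cyclic_group_z(n: int = 8) -> torch.Tensor:
    """All n elements of C_n as 3x3 rotation matrices about the z-axis (illustrative)."""
    mats = []
    for k in range(n):
        t = 2 * math.pi * k / n
        c, s = math.cos(t), math.sin(t)
        mats.append(torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]))
    return torch.stack(mats)                       # (n, 3, 3)

C8 = cyclic_group_z(8)
# Closure check: composing the 45-degree and 90-degree elements gives the 135-degree one.
assert torch.allclose(C8[1] @ C8[2], C8[3], atol=1e-6)
```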
How are camera intrinsics and lens distortions handled during the orthographic projection onto the sphere, and what happens if the FOV or intrinsics change at test time?
Our method does not require or process camera intrinsics or lens distortion during the orthographic projection. The lifting step is a learnable mapping from the feature map onto a unit sphere, and does not rely on camera calibration, depth, or distortion parameters. This design allows the method to be sensor-agnostic and simplifies deployment.
Regarding FOV or intrinsics changes at test time: similar to other behavior cloning-based visuomotor methods, our model is trained and evaluated under the same camera setup. Significant changes in FOV at test time would introduce a domain shift in the input distribution, which would affect any policy trained under fixed-camera data. However, future work could explore training with mixed camera FOVs or intrinsics, enabling the model to generalize to arbitrary test-time camera settings.
Can you clarify the preprocessing steps applied to the GoPro images and whether they differ between simulation and real-world experiments?
The preprocessing steps are detailed in Appendix D (see “Training Details”). For real-world experiments, we capture 720×720 RGB images using a GoPro with the Max Lens Mod, then apply a center crop to 480×480 and resize to 224×224 to reduce spatial resolution under the wide field of view. In simulation, we use a lower resolution of 76×76, which is sufficient given the simpler visual complexity and is consistent with baseline settings. Data augmentations, including cropping and rotation, are applied in both domains, while color jitter augmentation is used only in the real-world setting to improve robustness against lighting variation.
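The described pipeline could look roughly as follows in torchvision; the color-jitter magnitudes below are assumed values, not taken from the paper.

```python
from torchvision import transforms

def make_preprocess(real_world: bool) -> transforms.Compose:
    """Center-crop 720x720 GoPro frames to 480x480, resize to 224x224, and apply
    color jitter only for real-world data. Jitter strengths below are assumptions."""
    ops = [transforms.CenterCrop(480), transforms.Resize(224)]
    if real_world:
        ops.append(transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2))
    return transforms.Compose(ops)

# preprocess = make_preprocess(real_world=True)
# img_224 = preprocess(img_720)   # img_720: (3, 720, 720) float tensor
```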
[a] Cesa, G., Lang, L., & Weiler, M. (2022). A program to build E (N)-equivariant steerable CNNs. In International conference on learning representations.
[b] Weiler, M., & Cesa, G. (2019). General e (2)-equivariant steerable cnns. Advances in neural information processing systems, 32.
[c] Wang, D., Hart, S., Surovik, D., Kelestemur, T., Huang, H., Zhao, H., ...& Platt, R. (2024). Equivariant diffusion policy. arXiv preprint arXiv:2407.01812.
[d] Huang, H., Howell, O., Wang, D., Zhu, X., Walters, R., & Platt, R. (2024). Fourier transporter: Bi-equivariant robotic manipulation in 3d. arXiv preprint arXiv:2401.12046.
[e] Klee, D. M., Biza, O., Platt, R., & Walters, R. (2023). Image to sphere: Learning equivariant features for efficient pose prediction. arXiv preprint arXiv:2302.13926.
Dear Reviewer yUji,
As the reviewer-author discussion period comes to a close, we would greatly appreciate it if you could kindly let us know whether our responses have addressed your concerns and if you have any additional questions about our work.
Thank you again for your time and feedback!
In “3D Equivariant Visuomotor Policy Learning via Spherical Projection” the authors propose a novel approach for SO(3)-equivariant policies using only monocular RGB input from the robot's wrist camera. Specifically, the authors propose to project 2D equivariant features onto a sphere. The sphere is further rotated using the gripper orientation before being used as a conditioning variable for an equivariant 3D diffusion policy. The authors compare their approach against various baselines and ablate components, showing an average performance gain of more than 10%.
Strengths and Weaknesses
Strengths:
- The paper is well written.
- The proposed method is technically sound.
- The authors show the applicability of their method both in simulated experiments and on a real-world robot.
Weaknesses:
- My biggest concern is the necessity of the rotation correction. If I understand the paper correctly, the authors perform the correction to align the observations with the robot in the world frame. This introduces a mismatch in that (x) stays constant while the world and actions rotate, thereby breaking global SO(3)-equivariance (L. 161f.). I would argue that this step is not necessarily needed if the action space is also changed to be within the gripper rotation frame. With such a change the whole system becomes invariant to g. As this is a fundamental pillar of the paper, I would be interested in a comparison against this change. If I am not mistaken, this would be similar to EquiDiff with relative control as described in their paper. The same change should also be applied to Diffusion Policy. A discussion about this, or other potential ways to achieve equivariance, could also be relevant.
- While I generally agree with eye-in-hand cameras being useful for robotic tasks, I dislike the overly bold statements that it is either eye-in-hand RGB or point clouds. There is no reason why the two could not be combined [1]. Also, calibrating a camera is quite easy.
- I lack a comparison against visual servoing baselines.
- The anonymous repository seems broken.
- No motivation is given for the chosen tasks; the authors only reference related work.
- "evaluated with 50 rollouts per task with 3 random seeds": If I understand this sentence correctly, this means 3 different evaluation seeds were used. I would consider this insufficient, as essentially this amounts to 150 rollouts per task with binary results, for which only a confidence interval can be provided.
Questions
- What is the performance of a policy trained using no correction but relative control?
- What are the true statistical variances? i.e. the model should be trained with at least 3 different seeds and be evaluated across the same seed.
- Can you provide a numerical toy example that highlights the claims made in Sec. 4.3?
- I am lacking a differentiated discussion/analysis about tasks. For example, when you want to screw or grasp something, you want to do that independent of the global gripper rotation but rather in the local frame. Differently though, when pouring something you want to follow the gripper orientation in the world frame.
Limitations
The authors addressed reasonable limitations in the text. In addition to the mentioned limitations, I would like to point out the dependency on a constant global robot pose for demonstration generation and evaluation.
Justification for Final Rating
The authors clearly clarified my doubts about using relative actions in the gripper frame and provided solid arguments for future integration.
Formatting Issues
- Mixing absolute and delta improvements in the same column is confusing and distracting; consider adding a separate column for the deltas.
- There is an issue with the PDF in which text boxes of adjacent lines overlap.
- Table 1, second row: “Equi Diff” has an extra space.
We thank the reviewer for the insightful suggestions. Please see our response below.
My biggest concern is the necessity of the rotation correction. The authors perform the correction to align the observations with the robot in the world frame. I would argue that this step is not necessarily needed if the action space is also changed to be within the gripper rotation frame. In such changes the whole system becomes invariant to g. As this is a fundamental pillar of the paper, I would be interested in comparing against this change. If I am not mistaken, this would be similar to EquiDiff with relative control. The same change should also be applied to Diffusion Policy. A discussion about it could also be relevant or other potential ways to achieve equivariance. (What is the performance of a policy trained using no correction but relative control?)
We agree that a relative action leads to an invariant system, potentially achieving similar symmetry as our model. However, we argue that aligning both observations and actions to the world frame provides practical and theoretical advantages, supported by the following evidence:
Empirical Evidence
To directly compare the effectiveness of relative vs. absolute control, we conducted additional experiments on two tasks from MimicGen: Square D2 and Nut Assembly D0. We evaluated a variation of our method, as the reviewer suggested, which removes the rotation correction and uses relative action. In addition, we also test the original Diffusion Policy with relative control. The results (with 100 demos) are summarized below:
| 100 demos | Square D2 | Nut Assembly D0 |
|---|---|---|
| Ours SO(2) (abs) | 32 | 74 |
| Ours SO(2) (rel) | 22 (-10) | 57 (-17) |
| Diffusion Policy (abs) | 12 (-20) | 51 (-23) |
| Diffusion Policy (rel) | 14 (-18) | 22 (-52) |
These results suggest that absolute actions outperform relative actions, despite the symmetric benefit of the relative actions. Notice that similar findings were reported in both EquiDiff and the original Diffusion Policy paper, where relative (or velocity) control often leads to a worse performance compared to absolute control.
Theoretical Perspective
While gripper-relative control maintains invariance under single-camera conditions, it does not generalize seamlessly to multi-camera or hybrid sensing configurations. In such settings, external cameras break the invariance assumption of the whole system. However, our world-frame alignment strategy circumvents this limitation by maintaining a consistent global reference frame across all observations and actions, enabling broader applicability and more robust generalization in complex environments.
I dislike the overly bold statements that it is either eye-in-hand RGBs or PointCloud. There are no reasons why not both of them could be combined. Also calibrating a camera is quite easy.
We agree that monocular RGB and point cloud inputs are not mutually exclusive and can be complementary. A key focus of this paper is to demonstrate how SO(3)-equivariance can be achieved purely from monocular RGB inputs, without requiring depth sensing or camera calibration. Nevertheless, it does not preclude the integration of richer sensors. In fact, a key advantage of our method is that all equivariant features are explicitly transformed into the world frame. This alignment enables seamless integration of multiple viewpoints/input modalities while preserving SO(3)-equivariance across the full observation space.
We will clarify these points and revise the bold statement in the final version to better present our method as a modular and extensible approach.
lack a comparison against visual servoing baselines
Visual servoing methods fundamentally differ from our approach in both capability and scope, making them an unsuitable baseline for comparison. While these methods are a key part of the vision-based control literature, they are typically limited to free-space motion guided by visual error minimization and are not designed to handle contact-rich manipulation. For example, visual servoing can move an end-effector to touch a drawer handle, but not execute the multi-step contact dynamics required to open the drawer. In contrast, diffusion policy can learn complex manipulation policies that involve repeated contact and task-specific interactions. A now-classic example of this is the Push-T task that involves making and breaking contact many times in order to move the T into the target pose. Standard visual servo methods are not directly applicable to the Push-T task without significant additional engineering.
As a consequence of the above, and because of the additional engineering that would be required, we do not directly compare against visual servo methods. We would also note that this omission is consistent with prior diffusion literature.
No motivation about the used tasks, authors reference related work.
We chose MimicGen because it offers a high-quality simulation platform with well-designed tasks that span diverse manipulation challenges. It is widely adopted in recent closed-loop policy learning works [a,b,c] and enables fair and reproducible comparisons across methods.
In the real-world setting, our four tasks are designed to emphasize different aspects of closed-loop control. The first three involve pipe disassembly tasks that test precise 2D manipulation requiring accurate pulling actions, long-horizon sequential execution across multiple pipes, and precise 3D alignment involving complex geometries, respectively. The fourth task focuses on retrieving objects from a deformable grocery bag, where egocentric wrist-mounted vision is crucial. These tasks highlight the strengths of our method in handling varied scenarios.
“evaluated with 50 rollouts per task with 3 random seeds.” If I understand this sentence correctly this means 3 different evaluation seeds were used. I would consider this is insufficient as essentially this will be a 150 rollouts per task with binary results for which in that case only a confidence interval can be provided. (What are the true statistical variances? i.e. the model should be trained with at least 3 different seeds and be evaluated across the same seed.)
The reviewer's suggestion is exactly what we did in our experiments. We apologize for the confusion caused by our original wording. To clarify: for each task, we train three independent models using different random seeds (0, 1, and 2), and each model is evaluated using the same seed. The success rates reported in Table 1 are averaged across these three runs, and the full results with corresponding variance are provided in Appendix E (Table 4). We appreciate the opportunity to clarify this and will revise the description in the final version to avoid ambiguity.
provide a numerical toy example that highlights the claims made in Sec 4.3.
We provided a numerical example in our ablation study to support the core claim in this section. As shown in Table 2 of the main paper, replacing the SO(2)-equivariant image encoder (EquiEnc), which is used to extract SO(2)-invariant features, with a standard CNN results in a 6.8% average drop in success rate across four tasks. This result empirically supports Proposition 2 by demonstrating the benefit of using SO(2)-invariant image features.
lack a differentiated discussion/analysis about tasks. For example, when you want to screw or grasp something, you want to do that in the local frame. Differently though, when pouring something you want to follow the gripper orientation in the world frame.
We agree with the reviewer that analyzing task requirements from a local/global frame perspective could provide additional insight into the capabilities of our method. Importantly, our approach avoids the need for explicit canonicalization or task-specific frame selection by unifying all observations in the world frame, enabling a consistent and generalizable formulation across tasks.
Inspired by the reviewer's comment, we categorize the 12 simulation tasks into two groups: those more naturally aligned with the local frame (Stack D1, Stack Three D1, Square D2, and Coffee D2), and those that depend more on global orientation (Threading D0, Three-Piece Assembly D0, Hammer Cleanup D1, Mug Cleanup D1, Kitchen D1, Pick and Place D0, Coffee Preparation D1, and Nut Assembly D0).
Using 100 demonstrations for training, our method, despite using a unified world-frame representation, achieves significant improvements over Diffusion Policy across both categories: a 17.25% gain for tasks better suited to local frames, and an 8.88% gain for tasks that rely more on global orientation. Compared with the baseline, the benefit of our approach is stronger on the tasks more naturally aligned with local-frame reasoning. This empirical result suggests that our SO(3)-equivariant world-frame representation is expressive enough to capture diverse task constraints while maintaining a unified and consistent formulation without requiring manual frame selection.
Paper Formatting Concerns
Thank you for pointing out the formatting issues. We will fix the related errors in the final version.
[a] Wang, Y., Zhang, Y., Huo, M., Tian, R., Zhang, X., Xie, Y., ... & Tomizuka, M. (2024). Sparse diffusion policy: A sparse, reusable, and flexible policy for robot learning. arXiv preprint arXiv:2407.01531.
[b] Wang, D., Hart, S., Surovik, D., Kelestemur, T., Huang, H., Zhao, H., ... & Platt, R. (2024). Equivariant diffusion policy. arXiv preprint arXiv:2407.01812.
[c] Funk, N., Urain, J., Carvalho, J., Prasad, V., Chalvatzaki, G., & Peters, J. (2024). Actionflow: Equivariant, accurate, and efficient policies with spatially symmetric flow matching. arXiv preprint arXiv:2409.04576
Thank you for the additional information. I appreciate the thorough answers and the discussion about my raised points. I agree with all of your statements and would love to see if you put these inside the paper or the appendix to make the work well rounded and related to the different fields of robotics.
These results suggest that absolute actions outperform relative actions, despite the symmetric benefit of the relative actions. Notice that similar findings were reported in both EquiDiff and the original Diffusion Policy paper, where relative (or velocity) control often leads to a worse performance compared to absolute control.
Thanks so much for running this experiment and showing yet another time that Diffusion policies work best in the world frame.
all equivariant features are explicitly transformed into the world frame.
One minor note: I find your arguments about the multi-camera setup and global reference frame slightly contradictory. In the main paper (L. 50) you motivate one of your method's advantages as not requiring explicit calibration. In the proposed multi-camera setup, though, you would perform extrinsic calibration; but as stated in my initial review, IMO performing an extrinsic camera calibration is a straightforward process. If you have the time to run another set of experiments combining equivariant wrist and external cameras to show a performance increase, that could be interesting. And, aligned with my initial review, I would still love it if you did not paint a black-and-white picture of wrist versus external cameras but rather highlighted the complementary aspect more.
We are glad to hear that the additional analyses were helpful, and we will incorporate these results and explanations into the final version of the paper or appendix as appropriate, to make the work more complete.
We also appreciate your comment on absolute vs. relative actions. We will include the corresponding results in the final version, as they reinforce our conclusion and align with prior findings in EquiDiff and Diffusion Policy.
Regarding your point about multi-camera setups and the need for calibration, you are right. In our current setting, which uses only an eye-in-hand camera, no extrinsic calibration is required. This is the context under which we made the “no calibration” claim, and we believe it is valid and self-consistent within the scope of the paper. That said, we fully acknowledge that when external cameras are introduced, extrinsic calibration becomes necessary. More broadly, we agree that monocular RGB and point cloud inputs can be complementary.
While we may not be able to complete these additional experiments before the discussion period ends, we plan to explore this extension in the final version. We are confident that our framework would benefit from combining eye-in-hand and external views, and we will present preliminary results and discussion accordingly.
We hope this addresses your concerns, and we appreciate your thoughtful engagement throughout the review process.
Sound good to me. Thank you for the discussion. I will raise my score to accept.
This paper proposes a diffusion policy framework that incorporates SO(3)-equivariance into eye-in-hand visuomotor learning (whereas previous works mostly leveraged 3D point clouds), by projecting RGB inputs onto the sphere with rotation corrections for camera motion. Experiments demonstrate the performance of the proposed framework.
Strengths and Weaknesses
The motivation of leveraging equivariance of monocular RGB images is very good, with potential impact on the field. This setting will facilitate policy learning with broader applications. The experiments are generally good and the results are promising.
Weaknesses:
(1) Sim2Real Transfer and Robustness: There is limited discussion/experimental evidence on how robust the representation is under severe domain shift, e.g., lighting changes, partial occlusions, or unexpected camera movement/perturbations. Most real-world tasks could be regarded as within well-prepared settings.
(2) Computational Efficiency Analysis: The supplementary mentions that equivariant networks increase training time. However, the main paper lacks quantitative analysis of inference or training speed relative to baselines, and does not address potential impacts on deployment in real-time applications.
(3) Requirement for Segmentation: Many recent equivariant robotic manipulation models exhibit strong robustness to unsegmented inputs. In contrast, this model relies on object segmentation. While segmenting target objects is not particularly difficult these days due to open-vocabulary segmentation models like SAM, this requirement restricts the model's applicability to object-centric tasks. With segmentation, the model is unable to manipulate non-object entities (e.g., a pile of tiny objects as a whole) or understand global scene context.
(4) Failure Cases and Limits: Are there specific types of manipulation (e.g., heavy occlusion, transparent objects) where the method's SO(3)-equivariance leads to failure or decreased performance?
Limitations: The authors provide a clear limitations section, candidly discussing the restrictions on the camera, the lack of translational equivariance, the single-arm policy scope, and compute cost. Suggestions for improvement (e.g., incorporating more failure-case studies) are constructive.
Questions
See weaknesses.
Limitations
Yes. The authors provide a clear limitations section, candidly discussing the restrictions in camera, lack of translational equivariance, single-arm policy scope, and compute cost.
Justification for Final Rating
Please see the comment the reviewer has raised.
Formatting Issues
N/A
The authors thank the reviewer for the thoughtful review. Please see our response below:
Sim2Real Transfer and Robustness: There is limited discussion/experimental evidence on how robust the representation is under severe domain shift, eg, lighting changes, partial occlusions, or unexpected camera movement/perturbations. Most real-world tasks could be regarded as within well-prepared settings.
Our paper does not do any sim-to-real transfer; we consider only sim2sim and real2real train/test scenarios. This is the same setting that is used in our baselines. That said, we agree with the reviewer that analyzing the domain shift between training and testing is important. Toward that end, we have performed additional real-world experiments as follows. Specifically, we varied lighting conditions by introducing a strong white point light source near the table, replacing the original overhead warm ambient lighting. This significantly altered the shadows and color temperature. Additionally, we introduced background perturbations by placing various household objects on the table to create a cluttered scene, and further increased the complexity by manually moving these objects during task execution. Finally, we introduced occlusions by briefly blocking part of the in-hand camera view with a human hand during execution. For the Box-Pipe Disassembly task, our policy achieved an 85% success rate under the lighting change, 75% with the noisy backgrounds, and 85% under partial camera occlusion (each evaluated with 20 rollouts). Compared with the 80% of no perturbation reported in Section 5.2, our method demonstrates strong generalization to realistic perturbations. We will include these results in the final version of this manuscript.
Computational Efficiency Analysis: The supplementary mentions that equivariant networks increase training time. However, the main paper lacks quantitative analysis of inference or training speed relative to baselines, and does not address potential impacts on deployment in real-time applications.
We provide the following quantitative comparisons to clarify the computational efficiency of our approach. For training, our SO(2) version is approximately 2 times slower than the original Diffusion Policy, while the SO(3) version is around 4 times slower, due to the added complexity of equivariant layers. However, for inference, our method remains highly efficient and suitable for real-time deployment. On a single RTX 4090 GPU:
- Diffusion Policy: ~65 ms per inference
- Ours (SO(2)): ~58 ms
- Ours (SO(3)): ~123 ms
The SO(2) variant is even faster than the baseline due to our use of a lighter-weight diffusion U-Net. While the SO(3) model is slower, its inference speed is still well within real-time requirements (e.g., a 10 Hz control frequency), and the use of multi-step action chunks further supports efficient closed-loop execution. We will add these results to the final version to more fully analyze the computational efficiency of both training and inference.
Requirement for Segmentation: Many recent equivariant robotic manipulation models exhibit strong robustness to unsegmented inputs. In contrast, this model relies on object segmentation. While segmenting target objects is not particularly difficult these days due to open-vocab segmentation models like SAM, this requirement restricts model’s applicability to object-centric tasks. With segmentation, the model is unable to manipulate non-object entities (e.g., a pile of tiny objects as a whole) or understand global scene context.
Our model does not rely on any form of object segmentation, as clarified in the final paragraph of the Related Work section. We believe this is a misunderstanding regarding our model's assumptions. A core advantage of our approach is its ability to achieve SO(3)-equivariance directly from raw in-hand RGB inputs, without relying on object-centric representations such as segmented masks or point clouds. This stands in contrast to many prior works, where SO(3)-equivariance is typically achieved through carefully processed object-level inputs or multiple fixed, calibrated depth cameras [a, b, c]. In our framework, the in-hand camera captures the scene, and the model learns to act holistically without any manual pre-processing or object isolation. Contrary to the concern raised, we believe this design significantly enhances — rather than limits — the flexibility and generality of our approach, particularly for tasks beyond neatly segmented, object-centric scenarios. We appreciate the opportunity to clarify this point, and we will make sure to highlight this distinction more clearly in the final version.
Failure Cases and Limits: Are there specific types of manipulation (e.g., heavy occlusion, transparent objects) where the method's SO(3)-equivariance leads to failure or decreased performance?
The suggested failure cases (e.g., transparent objects or partial occlusions) are not issues for our method. In fact, in our real-world experiments, the policy successfully detects randomly placed transparent boxes and completes the task of placing objects inside them, which demonstrates robustness under these challenging conditions.

Nevertheless, it is useful to analyze potential limitations of our method, and SO(3)-equivariance may introduce challenges in certain manipulation scenarios where asymmetries in the physical world are important. One example is tasks involving left-right distinctions in world coordinates. Although equivariance allows the model to generalize across rotated scenes, the robot's kinematic constraints (e.g., joint limits on different arms) may not remain symmetric under rotation, potentially leading to failures. Another example is manipulation involving heavy objects, where gravity breaks rotational symmetry in practice: an object that is easy to manipulate in one orientation might become unstable when rotated.

Nonetheless, as noted in [d], such equivariant models can still perform well in the presence of symmetry-breaking factors. This is because the input (e.g., visual appearance, pose, shadows) already encodes those asymmetries, allowing the model to learn appropriate behaviors without violating the equivariant structure. While such cases are relatively uncommon, they do highlight scenarios where symmetry-breaking mechanisms may be beneficial. A promising direction is to augment SO(3)-equivariant models with non-equivariant components or inductive biases (e.g., gravity-aware priors or joint-limit encodings) that capture task-specific asymmetries. We leave this as a promising direction for future research.
[a] Tie, C., Chen, Y., Wu, R., Dong, B., Li, Z., Gao, C., & Dong, H. (2024). Et-seed: Efficient trajectory-level se (3) equivariant diffusion policy. arXiv preprint arXiv:2411.03990.
[b] Yang, J., Cao, Z. A., Deng, C., Antonova, R., Song, S., & Bohg, J. (2024). Equibot: Sim (3)-equivariant diffusion policy for generalizable and data efficient learning. arXiv preprint arXiv:2407.01479.
[c] Wang, D., Hart, S., Surovik, D., Kelestemur, T., Huang, H., Zhao, H., ... & Platt, R. (2024). Equivariant diffusion policy. arXiv preprint arXiv:2407.01812.
[d] Wang, D., Park, J. Y., Sortur, N., Wong, L. L., Walters, R., & Platt, R. (2022). The surprising effectiveness of equivariant models in domains with latent symmetry. arXiv preprint arXiv:2211.09231.
Dear Reviewer SQeg,
As the reviewer-author discussion period comes to a close, we would greatly appreciate it if you could kindly let us know whether our responses have addressed your concerns and if you have any additional questions about our work.
Thank you again for your time and feedback!
The reviewer thanks the authors for the clarifications. After carefully reading the rebuttal, and the comments of other reviewers, most concerns have been clarified and the reviewer still stays positive to this paper.
The idea of leveraging SO(3) equivariance to monocular RGB images is particularly relevant to robotics settings, where cameras are attached to moving arms. The authors develop a method based on spherical projection and an equivariant diffusion process to predict multistep action sequences conditioned on observations. The method and experiments are comprehensive and described clearly, and all reviewers have provided substantive technical reviews and engaged with the authors in resolving any issues and concerns. I believe the review scores accurately indicate the technical impact and significance of the contributions.