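Overall rating: 6.4/10

Decision: Poster · 4 reviewers

Lowest rating 4 · Highest rating 4 · Standard deviation 0.0

Average confidence: 2.5

Novelty 2.3 · Quality 2.8 · Clarity 3.3 · Significance 2.5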

NeurIPS 2025

A Practical Guide for Incorporating Symmetry in Diffusion Policy

Dian Wang,Boce Hu,Shuran Song,Robin Walters,Robert Platt

Submitted: 2025-05-12 · Updated: 2025-10-29

TL;DR

We propose practical methods to incorporate symmetry into Diffusion Policy, improving performance while maintaining simplicity.

Keywords

Robotic Manipulation, Equivariance, Diffusion Model

Reviews and Discussion

Review

Rating: 4 · Confidence: 2 · 2025-06-16

This paper systematically studies different approaches to incorporate symmetries (invariance/equivariance) into diffusion policies. Specifically, the authors compared three action representations (abs, rel, delta), and four image encoders (cnn, equi cnn, pretrained cnn, frame averaging) using 12 robotic manipulation tasks and found that, in general, incorporating symmetries improves the performance of diffusion policies.

Strengths and Weaknesses

Strengths:

The paper is clearly written and easy to follow. The systematic evaluation of various techniques for incorporating symmetries provides practical value and can help guide practitioners in the design of network architectures and data collection pipelines. The empirical results showing that equivariant diffusion policies using a single view can perform competitively with more complex voxel-based baselines are particularly interesting and suggest promising directions for future work.

Weaknesses:

The action representations proposed in this paper appear to be specifically designed for eye-in-hand robotic setups, and it remains unclear how well these techniques would generalize to other types of tasks or embodiments.
One common problem with incorporating symmetries is that the tasks usually do not fully respect the group structures in general, due to factors like imperfections of the observations. I think it would be beneficial to consider soft variants like [1].

[1] Finzi, Marc, Gregory Benton, and Andrew G. Wilson. "Residual pathway priors for soft equivariance constraints." Advances in Neural Information Processing Systems 34 (2021): 30037-30049.

Questions

1. How important is FOV in terms of performance? It appears that there is no direct comparison between normal FOV and large FOV in the experiments, only between large FOV and normal FOV + external/voxel.
2. Related to (1), would it be possible to also incorporate equivariant CNNs in the normal FOV + external setting? Would it require significant changes to the architecture?
3. How sensitive is the method's performance to the discretization of SO(2) used in equivariant CNNs? (e.g., would $C_4$ improve or degrade performance?)

Limitations

Yes

Final Justification

All my questions were adequately addressed.

Formatting Concerns

Author Response

2025-07-31

We thank the reviewer for their helpful comments. Please see our response below.

The action representations proposed in this paper appear to be specifically designed for eye-in-hand robotic setups, and it remains unclear how well these techniques would generalize to other types of tasks or embodiments.

Although relative actions with eye-in-hand perception achieve a theoretical benefit, as discussed in Proposition 2, our action representations are not limited to eye-in-hand robotic setups. As shown in Table 1 and Table 4, relative trajectory works very well in the in-hand + external observation setting. Regarding other types of embodiments, the eye-in-hand setting might actually be better for cross-embodiment generalization, as the observation would remain invariant across different robot embodiments.

One common problem with incorporating symmetries is that the tasks usually do not fully respect the group structures in general, due to factors like imperfections of the observations. I think it would be beneficial to consider soft variants like [1]. [1] Finzi, Marc, Gregory Benton, and Andrew G. Wilson. "Residual pathway priors for soft equivariance constraints." Advances in Neural Information Processing Systems 34 (2021): 30037-30049.

This is a good thought. However, unlike prior works in equivariant policy learning, which use end-to-end equivariant networks and might encounter the symmetry-breaking problems you mention, our work uses an equivariant encoder followed by a standard diffusion head (see Figure 2). As a result, our policy is not fully constrained by the symmetry of the network and thus will not suffer from the symmetry mismatch problem. That said, we agree with the reviewer that methods like the residual pathway would be an interesting future direction for exploring fused equivariant and non-equivariant feature encoding. We will discuss this in the final version of our paper.

How important is FOV in terms of performance? It appears that there is no direct comparison between normal FOV and large FOV in the experiments, only between large FOV and normal FOV + external/voxel.

We performed a new experiment comparing the large FOV with the normal FOV. As shown in the table below, a normal FOV can work for simple tasks like Stack D1. However, on more challenging tasks with multiple objects in the scene, like Stack Three D1, it suffers from incomplete information in the observation.

Camera setup	Stack D1	Stack Three D1
Large FOV	98.0	72.0
Normal FOV	90.0 (-8.0)	30.0 (-42.0)

Related to (1), would it be possible to also incorporate equivariant CNNs in the normal FOV + external setting? Would it require significant changes to the architecture?

It is possible, and no changes to the architecture are required. Please note that we did perform experiments using the proposed method with external cameras in Appendix E, where our method still outperforms the baselines by a significant margin. In fact, our implementation supports an arbitrary number of cameras, and the experiments in Table 4 and Table 2 were run using the same encoder class.

How sensitive is the method's performance to the discretization of SO(2) used in equivariant CNNs? (e.g., would $C_4$ improve/degrade performance)

Using a coarser discretization degrades performance. We validated this with our Equi Enc variation, where we changed the group associated with the network from C8 (containing 8 discrete rotations ranging from 0 to $2\pi$) to C4 (containing 4 discrete rotations) or C2. Please see the results in the table below.

Group	Stack Three D1	Square D2
C8	75.3	32.0
C4	68.0	30.7
C2	63.3	20.7

We suspect that increasing the order of the subgroup beyond C8 might not further improve performance. As shown experimentally in the e2cnn paper, a cyclic group of order 8 is typically a sweet spot among the discrete subgroups of SO(2).
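
For readers less familiar with the notation, the following is a minimal sketch of what the cyclic groups compared above look like as sets of planar rotations. It is an illustration only and not the authors' implementation: the function name `cyclic_group_rotations` is hypothetical, and the sketch assumes the standard definition of $C_N$ as the $N$ rotations by multiples of $2\pi/N$.

```python
import numpy as np

def cyclic_group_rotations(order: int) -> np.ndarray:
    """Return the `order` 2x2 rotation matrices of the cyclic group C_order.

    Hypothetical helper for illustration: element k rotates the plane by
    2*pi*k/order, so a larger order means a finer discretization of SO(2).
    """
    angles = 2.0 * np.pi * np.arange(order) / order
    cos, sin = np.cos(angles), np.sin(angles)
    top_rows = np.stack([cos, -sin], axis=-1)      # shape (order, 2)
    bottom_rows = np.stack([sin, cos], axis=-1)    # shape (order, 2)
    return np.stack([top_rows, bottom_rows], axis=-2)  # shape (order, 2, 2)

for order in (8, 4, 2):
    rotations = cyclic_group_rotations(order)
    # Angular spacing between neighbouring group elements, in degrees.
    print(f"C{order}: {rotations.shape[0]} rotations, step {360.0 / order} degrees")
```

Because the elements of C4 and C2 are subsets of the elements of C8, the three rows of the table can be read as progressively coarser samplings of the same rotation symmetry.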

2025-08-05

I thank the authors for the detailed reply. All my concerns have been addressed, and I maintain my positive outlook on the paper.

2025-08-05

Dear Reviewer jnxb,

As the reviewer-author discussion period is coming to an end, we would greatly appreciate it if you could kindly let us know if our rebuttal has resolved your concerns, and if you have any additional questions about our work. Thank you!

Review

Rating: 4 · Confidence: 2 · 2025-07-02

The paper proposes and evaluates several methods to gain the benefits of symmetry-awareness in robotic manipulation policies without the high implementation complexity of fully equivariant models. The authors show that through a combination of invariant representations (eye-in-hand camera with relative actions), equivariant vision encoders and symmetrizing pretrained encoders with Frame Averaging, they can achieve performance comparable to state of the art equivariant architectures while using a much simpler and more practical setup.

Strengths and Weaknesses

STRENGTHS:

The authors formally prove that the combination of eye-in-hand perception and relative action parametrisation leads to inherent SE(3)-invariance. While the components are not new, this specific formalisation is novel as far as I am aware.
The paper does a good job of systematically analysing these known techniques and how they can be practically integrated into a feasible and effective setup for robotic learning.
The authors obtain good results given the simpler setup. Effectively, they show that for a minor drop in performance, they can disregard three RGBD cameras and the complex voxelisation / equivariant network pipeline.
As the title suggests, the paper provides a clear, practical guide backed by systematic experimental / empirical findings.

WEAKNESSES:

The novelty is limited, as the authors are simply combining and analysing different known methods.
The comparisons in Table 2 are not completely fair, as they are comparing their method (which uses an eye-in-hand camera) to EquiDiff (using absolute actions and multiple external cameras). Yes, it is a simpler setup, and the authors argue the setup is a key part of their solution, but I still find the comparison a bit awkward.

Questions

I think it would be interesting to see an experiment where the proposed Pretrain+FA was applied to the same multi-camera, absolute action setting as EquiDiff. Additionally, would it be possible to run EquiDiff in your eye-in-hand setup? I think having these two experiments would strengthen the paper

Limitations

Yes

Final Justification

I appreciate the new ablations and I think they really add value to the insights of the paper. I'm still on the fence about the originality of the work, but I think the paper presents the information well and the experiments are well done.

Formatting Concerns

None

Author Response

2025-07-31

We thank the reviewer for their insightful suggestions. Please see our response below.

The novelty is limited, as the authors are simply combining and analysing different known methods

We would like to clarify three novelties of our paper:

A theoretical analysis of the symmetry of relative trajectory and eye-in-hand perception.
The use of an equivariant encoder with a non-equivariant policy, rather than a fully equivariant policy as is common in prior works, for incorporating symmetry.
The use of Frame Averaging to bridge the benefit of equivariant encoding and pre-trained vision encoders.

To the best of our knowledge, we are the first to accomplish each of these three points in the robot learning literature. We thank the reviewer for raising this concern and will clarify the novelty of our work in the revision.

The comparisons in Table 2 are not completely fair, as they are comparing their method (which uses an eye-in-hand camera) to EquiDiff (using absolute actions and multiple external cameras). Yes, it is a simpler setup, and the authors argue the setup is a key part of their solution, but I still find the comparison a bit awkward.

This is a good point, thanks for bringing it up. The reason for using the external camera for EquiDiff is twofold: (1) this is the setting proposed by the prior work, and (2) using EquiDiff with only a large FOV in-hand camera decreases its performance, due to the lack of an equivariant signal from the external camera. However, we agree with the reviewer that using a different camera setting can raise fairness concerns. To address this, we performed an additional experiment for EquiDiff using a large FOV in-hand camera. As shown in the table below, the performance of EquiDiff dropped significantly without access to the external camera.

Method	Mean	Stack	Stack Three	Square	Threading	Coffee	Three Piece Assembly	Hammer Cleanup	Mug Cleanup	Kitchen	Nut Assembly	Pick Place	Coffee Preparation
EquiDiff IH + External	53.7	93.3	54.7	25.3	22.0	60.0	15.3	65.3	49.3	67.3	74.0	41.7	76.7
EquiDiff Large FOV IH	40.9 (-12.8)	96.0	61.3	8.7	13.3	47.3	3.3	59.3	50.7	55.3	40.0	27.7	27.3

I think it would be interesting to see an experiment where the proposed Pretrain+FA was applied to the same multi-camera, absolute action setting as EquiDiff. Additionally, would it be possible to run EquiDiff in your eye-in-hand setup? I think having these two experiments would strengthen the paper

This is also a great point, thank you for the suggestion. Evaluating Pretrain + FA in the same multi-camera, absolute action setting as EquiDiff would indeed demonstrate the generalizability of our proposal. We performed a new experiment in this setting for two methods: (1) EquiDiff + Pretrain + FA, which replaces the encoder in EquiDiff with our Pretrain + FA encoder; and (2) Diff + Pretrain + FA, which uses our Pretrain + FA encoder with the original diffusion head, similar to our method in Table 2. As shown in the table below, the Pretrain + FA encoder can be used with either EquiDiff or the standard diffusion head, and both variants perform better than the baseline EquiDiff. In addition, Appendix E includes an experiment with Pretrain + FA in the multi-camera setting (though with relative actions). We believe these results demonstrate the generalizability of our Pretrain + FA encoder, which improves baseline performance in a variety of settings.

Method	Stack Three D1	Square D2
EquiDiff	54.7	25.3
EquiDiff + Pretrain + FA	68.7 (+14.0)	29.3 (+4.0)
Diff + Pretrain + FA	58.7 (+4.0)	38.0 (+12.7)

We also ran the requested experiment with EquiDiff in the eye-in-hand setup. Please see our response to the question above.

2025-08-05

Dear Reviewer zdCn,

Thank you for acknowledging that you’ve read our rebuttal. We would greatly appreciate it if you could kindly let us know if our rebuttal has addressed your concerns about our work, and if you have any remaining questions about the paper. Thank you!

2025-08-06

Thank you, I've updated my score.

Review

Rating: 4 · Confidence: 2 · 2025-07-02

This paper proposes practical strategies to incorporate spatial symmetry into diffusion-based policy learning frameworks without the overhead of fully equivariant architectures. The authors demonstrate that using SE(3)-invariant representations—specifically relative trajectory actions and eye-in-hand perception—naturally leads to improved generalization. Additionally, the work introduces the use of equivariant vision encoders and Frame Averaging to leverage symmetry-aware features from pretrained networks. Across 12 tasks in the MimicGen benchmark, the proposed methods achieve performance competitive with or better than state-of-the-art equivariant policies while maintaining simplicity in design and implementation.

Strengths and Weaknesses

Strengths:

The paper presents a thorough theoretical analysis backed by extensive empirical validation on a standard benchmark. Results are clearly reported and systematically compared.
The writing is clear and well-organized.
The work addresses a critical trade-off in robot learning: the complexity of incorporating symmetry versus the practical need for deployable, performant policies. This makes the contribution broadly relevant to practitioners and researchers.

Weaknesses:

Proposition 2 assumes the policy does not depend on the gripper pose $T_t$, yet the authors themselves acknowledge this is not strictly true in practice. However, no ablation is provided to quantify how much performance degrades due to residual $T_t$-dependence. As a result, the claim that “approximate invariance” suffices remains qualitative and lacks experimental support.
The experimental sections evaluate combinations of techniques (e.g., relative trajectories with equivariant encoders), but do not fully isolate the contributions of each component via comprehensive ablations.

Questions

How much additional computational cost is introduced by Frame Averaging, and how does this compare to training from scratch with a full equivariant architecture?

Limitations

Yes

Final Justification

The authors’ response has addressed most of my concerns, and I have accordingly adjusted my score.

Formatting Concerns

No major formatting issues are noted.

Author Response

2025-07-31

We thank the reviewer for their careful review. Please see our response below.

Proposition 2 assumes the policy does not depend on the gripper pose $T_t$, yet the authors themselves acknowledge this is not strictly true in practice. However, no ablation is provided to quantify how much performance degrades due to residual $T_t$-dependence. As a result, the claim that “approximate invariance” suffices remains qualitative and lacks experimental support.

We performed a new ablation study that removes the gripper pose from the policy input. The resulting ablation better satisfies Proposition 2 at the cost of the proprioceptive information. As shown in the table below, removing the gripper pose decreases performance, justifying our design choice of keeping the pose in the input.

Policy input	Stack Three D1	Square D2
With Gripper Pose	72.0	16.0
No Gripper Pose	64.7 (-7.3)	14.7 (-1.3)

The experimental sections evaluate combinations of techniques (e.g., relative trajectories with equivariant encoders), but do not fully isolate the contributions of each component via comprehensive ablations.

This is a good point. Although some of the ablations already exist in the paper, they are not organized as an ablation table. We add a new ablation baseline that uses the Pretrain + FA encoder with absolute trajectory, and explicitly ablate the relative trajectory and the equivariant encoder against the full model. As shown in the table below, both components contribute significantly to the overall performance.

Variant	Stack Three D1	Square D2
Full (Rel Traj + Pretrain + FA in Table 2)	86.7	43.3
- Pretrain - FA (Rel Traj + CNN Enc in Table 2)	72.0 (-14.7)	16.0 (-27.3)
- Rel Traj (new, Abs Traj + Pretrain + FA)	68.7 (-18.0)	26.7 (-16.6)

How much additional computational cost is introduced by Frame Averaging, and how does this compare to training from scratch with a full equivariant architecture?

Frame averaging increases GPU memory consumption because it multiplies the batch size by the frame size. In our case (8 frames), adding frame averaging to a pretrained encoder increases GPU memory usage from 2970 MB to 6476 MB. Compared to Equi Enc’s 4942 MB, frame averaging is slightly heavier (partially because the pretrained ResNet-18 is slightly larger than our custom equivariant ResNet-18). However, it remains relatively light and can be trained on most existing gaming GPUs. For larger base models, one can reduce the size of the frame or the training batch size to reduce GPU memory consumption.
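
To make the batch-size argument concrete, here is a minimal sketch of frame averaging over image rotations. It is not the authors' code. It assumes a frame of 4 rotations (multiples of 90 degrees, which `np.rot90` applies exactly) rather than the 8 frames used in the paper, it produces invariant features by plain averaging, and `toy_encoder` is a hypothetical stand-in for a pretrained, non-equivariant encoder.

```python
import numpy as np

NUM_FRAMES = 4  # assumption: C4 frame; the paper reports using 8 frames

def toy_encoder(batch: np.ndarray) -> np.ndarray:
    """Hypothetical non-equivariant encoder: (B, H, W) -> (B, 16)."""
    rng = np.random.default_rng(0)  # fixed seed, so the weights are deterministic
    weights = rng.normal(size=(batch.shape[1] * batch.shape[2], 16))
    return np.tanh(batch.reshape(batch.shape[0], -1) @ weights)

def frame_average(images: np.ndarray) -> np.ndarray:
    """Average encoder features over all rotated copies of each square image."""
    batch_size = images.shape[0]
    # Stack the rotated copies along the batch axis: (NUM_FRAMES * B, H, W).
    # This is where the memory cost comes from: the encoder sees a batch
    # that is NUM_FRAMES times larger than the original one.
    frames = np.concatenate(
        [np.rot90(images, k, axes=(1, 2)) for k in range(NUM_FRAMES)], axis=0
    )
    features = toy_encoder(frames)  # a single forward pass on the larger batch
    return features.reshape(NUM_FRAMES, batch_size, -1).mean(axis=0)

images = np.random.default_rng(1).normal(size=(2, 8, 8))
rotated = np.rot90(images, 1, axes=(1, 2))
# The averaged features do not change when the input is rotated by 90 degrees.
print(np.allclose(frame_average(images), frame_average(rotated)))
```

In this sketch the encoder weights are untouched, which is the property that lets frame averaging wrap a pretrained network; the cost is the larger effective batch described above.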

2025-08-05

Dear Reviewer yST9,

Review

Rating: 4 · Confidence: 4 · 2025-07-03

This paper presents a guide on how to easily incorporate symmetry into diffusion policies. They propose a combination of in-hand camera observation, relative action commands, and equivariant image encoders. The authors evaluate their approach in simulation benchmarks and show that their simple approach offers comparable performance to fully equivariant network designs.

Strengths and Weaknesses

Strength:

Proposes a simple yet effective way of designing diffusion policies that incorporate symmetry.
Clearly documents experiment designs and results. Overall, the paper is clear and well presented.
Offers an interesting proof of how in-hand observations + relative actions lead to SE(3) equivariance under assumptions.

Weakness:

No real-world experiment to evaluate the proposed approach.
Limited experimental results. The result from using relative actions is not very interesting, as it was already proposed in the original DP paper and is widely used in diffusion policies these days.
Action selection is well explored in the community, and the use of equivariant encoders and in-hand observations are not uncommon in robot policy learning.

Questions

Since the design incorporates symmetry, will it perform better in low-data regimes? All of the datasets used are fairly abundant in the experiments.
As mentioned in the paper, frame averaging slows down the training process. Does it also slow down inference? Will it limit the performance of the policy in real-world settings if the latency of the model increases?
It would be interesting to see how this work can be extended to external observation cameras.

Limitations

Yes.

Final Justification

I gave the paper a score of 4. The paper is well written, and the experiments and methods are clearly documented and explained. The authors' rebuttal also addressed almost all of my questions, particularly clarifying my misunderstanding of action selection, and included explanations of the lack of real-world experiments, as well as how this method could be applied to external observations. The only drawback I think is the novelty and performance of the method. It utilizes only equivariant encoders + relative action representation, which are both existing methods, and the final performance is slightly lower than that of more sophisticated methods. However, I believe the paper does provide a good contribution in guiding how to incorporate symmetry in diffusion policies with just simple methods.

Formatting Concerns

None.

Author Response

2025-07-31

The authors thank the reviewer for their thoughtful review. Please see our response below:

No real-world experiment to evaluate the proposed approach.

While we do not include real-world experiments in this paper, the assumptions that we make in our simulation experiments are well aligned with standard real-world experimental settings (i.e., the only information that we acquire from the simulation is the eye-in-hand image and the proprioceptive information of the gripper). Consequently, our findings should transfer easily to the real world. Moreover, one important goal of this paper is to systematically evaluate a number of practical methods for incorporating symmetry into Diffusion Policy. This requires abundant and rigorous comparisons that are more controllable in simulation than in the real world.

Limited experimental results. The result from using relative actions is not very interesting, as it was already proposed in the original DP paper and is widely used in diffusion policies these days.

The original DP did not use the same relative action definition as our paper. Instead, their relative action is a delta action defined in the world frame (referred to as velocity control in the original DP paper), following the RoboMimic action space convention. This action formulation does not have any inherent symmetry, and thus it underperforms the absolute action in the original DP paper. The relative action studied in this paper, however, is defined in the gripper frame, and its symmetry property discussed in Section 4.1 enables it to outperform absolute control.
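
The distinction between the two definitions can be checked numerically. The sketch below is illustrative only: it assumes that a gripper-frame relative action has the form $T_t^{-1} T_{t+k}$ and that a world-frame delta is the position difference expressed in world coordinates. The helper names and the example poses are hypothetical, and the transformation `g` is restricted to a yaw rotation plus a translation for brevity.

```python
import numpy as np

def make_pose(yaw: float, translation) -> np.ndarray:
    """4x4 homogeneous transform: rotation about z by `yaw` (radians) plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose

def relative_action(current: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Target pose expressed in the current gripper frame (assumed form T_t^{-1} T_{t+k})."""
    return np.linalg.inv(current) @ target

def world_delta(current: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Position change expressed in the world frame."""
    return target[:3, 3] - current[:3, 3]

T_t = make_pose(0.3, [0.5, 0.1, 0.2])       # current gripper pose (made-up values)
T_next = make_pose(0.5, [0.6, 0.1, 0.25])   # future gripper pose (made-up values)
g = make_pose(1.2, [-0.4, 0.7, 0.0])        # rigid transformation of the whole scene

# Gripper-frame action: (g T_t)^{-1} (g T_next) = T_t^{-1} T_next, so it is unchanged.
same_relative = np.allclose(relative_action(T_t, T_next),
                            relative_action(g @ T_t, g @ T_next))
# World-frame delta: it is rotated along with the scene, so it changes.
same_delta = np.allclose(world_delta(T_t, T_next),
                         world_delta(g @ T_t, g @ T_next))
print(same_relative, same_delta)
```

The cancellation of `g` in the first comparison holds for any rigid transformation, which is the algebraic reason a gripper-frame action can be invariant while a world-frame delta is not.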

Action selection is well explored in the community, and the use of equivariant encoders and in-hand observations are not uncommon in robot policy learning.

Our method uses an equivariant encoder with a non-equivariant policy, as well as frame averaging for bridging pretrained encoders and equivariance, neither of which has been done before in robot learning. Prior works in equivariant robot learning all focus on defining an end-to-end equivariant policy, which is much more complex both mathematically and in implementation than our approach. Regarding action selection, although this question has indeed been studied extensively in the community, it remains unclear which action formulation is best in the context of Diffusion Policy. Moreover, we believe that rethinking this question from a symmetry perspective will offer the community some new insights.

Since the design incorporates symmetry, will it perform better in low-data regimes? All of the datasets used are fairly abundant in the experiments.

Yes, it will perform better in low-data regimes. In fact, all of our experiments use 100 demonstrations, which is already in the low-data regime (e.g., compared with the 200-300 demonstrations per task used in the original DP paper and the UMI gripper paper). If we reduce the number of demonstrations further, we believe the performance of all methods will decrease, but the impact on symmetric methods will be smaller.

As mentioned in the paper, frame averaging slows down the training process. Does it also slow down inference? Will it limit the performance of the policy in real-world settings if the latency of the model increases?

This is a good question. In our experiments, the inference time with frame averaging is the same as that of the baseline. One advantage of frame averaging is that it typically does not slow down inference, especially in robot policy inference, where the batch size is normally 1. Frame averaging essentially multiplies the batch size by the size of the frame (8 in our case), which does not affect inference time thanks to batch processing.

It would be interesting to see how this work can be extended to external observation cameras.

This is a very important point, and we would like to clarify that we do have this experiment in Appendix E, where we found that the proposed method also improves performance by a significant margin in the external observation setting.

In addition, we performed a new experiment validating the Pretrain + FA encoder (with either an equivariant diffusion head or a standard diffusion head) in the external observation and absolute action space (i.e., the same setting as EquiDiff). As shown in the table below, our method is also effective in this setting.

Method	Stack Three D1	Square D2
EquiDiff	54.7	25.3
EquiDiff + Pretrain + FA	68.7 (+14.0)	29.3 (+4.0)
Diff + Pretrain + FA	58.7 (+4.0)	38.0 (+12.7)

2025-08-05

Dear Reviewer ikDP,

2025-08-06

Thank you for your rebuttal. It has addressed most of my questions. I will maintain my score.

Final Decision: Accept (poster)

2025-09-17

This paper explores a number of straightforward and practical approaches to incorporate symmetry benefits into diffusion policies without the overhead of full equivariant designs. This paper proves that combining eye-in-hand perception with relative or delta action parameterization yields inherent SE(3)-invariance, performs a systematic experimental study on those design choices for integrating symmetry in diffusion policies, and concludes that an invariant representation with equivariant feature extraction significantly improves the policy performance.

Based on comments from reviewers, the merits of this paper include: (1) It is clearly written and easy to follow. (2) It provides a contribution in guiding how to incorporate symmetry in diffusion policies with just simple methods. (3) It offers a proof of how in-hand observations + relative actions lead to SE(3) equivariance under assumptions.

While there are some merits as above, some reviewers also have concerns about the novelty and originality, and about the performance of the method. On performance, one reviewer's concern is that the method utilizes only equivariant encoders + relative action representation, which are both existing methods, and that the final performance is slightly lower than that of more sophisticated methods.

All reviewers recommend Weak Accept, which suggests that the work is valuable but may not meet the highest bar for novelty or performance. There are no recommendations for rejection. Given that the contribution is clean, well-supported, and of practical interest—even if somewhat incremental—I would lean slightly toward acceptance if room permits. However, I acknowledge that the final decision may depend on the overall selectivity and comparative strength of submissions, and I defer to the SAC or program chairs in borderline cases.