PaperHub
Rating: 4.8/10 · Poster · ICML 2025
3 reviewers, individual ratings 1, 3, 4 (min 1, max 4, std dev 1.2)

One-Step Diffusion Policy: Fast Visuomotor Policies via Diffusion Distillation

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24
TL;DR

We introduce One-Step Diffusion Policy (OneDP), a distilled diffusion-based policy that accelerates robotic action generation by 40×.

Abstract

Diffusion models, praised for their success in generative tasks, are increasingly being applied to robotics, demonstrating exceptional performance in behavior cloning. However, their slow generation process, stemming from iterative denoising steps, poses a challenge for real-time applications in resource-constrained robotics setups and dynamically changing environments. In this paper, we introduce the One-Step Diffusion Policy (OneDP), a novel approach that distills knowledge from pre-trained diffusion policies into a single-step action generator, significantly accelerating response times for robotic control tasks. We ensure the distilled generator closely aligns with the original policy distribution by minimizing the Kullback-Leibler (KL) divergence along the diffusion chain, requiring only 2%–10% additional pre-training cost for convergence. We evaluated OneDP on 6 challenging simulation tasks as well as 4 self-designed real-world tasks using the Franka robot. The results demonstrate that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz, establishing its potential for dynamic and computationally constrained robotic applications. A video demo is provided at our project page, and the code will be publicly available.
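To make the mechanics of this kind of distillation concrete, the sketch below shows one common way score-distillation objectives are implemented (an SDS-style surrogate with the weighting term omitted). It is a minimal illustration under assumed names (the identifiers `generator`, `teacher_eps`, and `alpha_bar` are placeholders, not the paper's code), and the paper's exact KL-based objective may differ in detail.

```python
# Minimal PyTorch-style sketch of distilling a pretrained diffusion policy
# (teacher) into a one-step action generator. All names and shapes are
# illustrative assumptions; the paper's exact objective may differ.
import torch

def distillation_loss(generator, teacher_eps, obs, alpha_bar,
                      action_dim, t_min=2, t_max=95):
    batch = obs.shape[0]

    # One-step generation: a single forward pass maps (observation, noise)
    # to an action.
    z = torch.randn(batch, action_dim, device=obs.device)
    action = generator(obs, z)

    # Diffuse the generated action to a random intermediate timestep of the
    # teacher's chain, avoiding the extreme ends (cf. the [2, 95] timestep
    # range discussed in the reviews below).
    t = torch.randint(t_min, t_max + 1, (batch,), device=obs.device)
    eps = torch.randn_like(action)
    a_bar = alpha_bar[t].view(-1, 1)
    noisy_action = a_bar.sqrt() * action + (1.0 - a_bar).sqrt() * eps

    # The teacher's predicted noise points toward high-density regions of the
    # pretrained policy; its difference from the injected noise acts as a
    # stop-gradient guidance signal, as in SDS-style surrogates.
    with torch.no_grad():
        guidance = teacher_eps(noisy_action, t, obs) - eps
    return (guidance * action).sum(dim=-1).mean()
```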
Keywords
diffusion policies; diffusion distillation; one-step diffusion policy

Reviews and Discussion

Review 1 by Reviewer Bfs2 (Rating: 1)

The authors introduce One-Step Diffusion Policy (OneDP), which distills pre-trained diffusion policies into single-step generators for robotic control. OneDP achieves 42× faster inference (62.5Hz vs 1.5Hz). Evaluation is performed on six simulation and four real-world tasks.

Questions for the Authors

N/A

Claims and Evidence

The results are promising, but I am not entirely convinced. See below.

Methods and Evaluation Criteria

Ok

Theoretical Claims

N/A

Experimental Designs or Analyses

Ok-ish

Supplementary Material

Ok

Relation to Existing Literature

Ok

Missing Essential References

N/A

Other Strengths and Weaknesses

  • My main concern with the method is that it may partly or fully lose the multimodality afforded by diffusion policy, which is perhaps diffusion policy's biggest strength. It would have been interesting to see experiments where multimodality is required, and to see how the distilled one-step policy handles them.

  • The idea is simple, but not novel (same trick used, e.g., in image generation).

Other Comments or Suggestions

  • Figure 1, training cost: I think this is highly misleading. I understand that the distillation part is much cheaper than the full training of the diffusion policy, showing that OneDP adds little overhead. However, it still requires training the original diffusion policy first.
Author Response

We find it disheartening that Reviewer Bfs2 rated our paper a 1, accompanied by a very brief review and a complete dismissal of our contributions, despite our focus on a critical problem in learning fast visuomotor policies for robotic control and our efforts to advance the state of the art in this area.

We respectfully urge Reviewer Bfs2 to reassess our paper more carefully and, if time permits, consider the evaluations provided by the other reviewers. We address your concerns in detail below:

  1. Mode-Seeking Behavior in Reverse KL Minimization: While reverse KL minimization encourages mode-seeking, this does not necessarily lead to mode collapse. In our real-world coffee machine manipulation experiments, we deliberately collected data for closing the lid from two different angles. After distillation, the distilled policy retained this multimodal behavior. Additionally, mode-seeking can be beneficial in reinforcement learning tasks—once the policy identifies an optimal solution, it helps eliminate noisy low-density actions and enhances stability. In our case, imitation learning data is human-demonstrated and high-quality, ensuring that the pretrained policy converges to optimal solutions. Therefore, mode-seeking is not a critical concern for our approach (an illustrative form of the reverse-KL objective is given after this list).

  2. Our submission's primary area is Applications → Robotics. Diffusion distillation has been applied across various domains, including 2D/3D generation, video generation, and diffusion-based language models. However, our contribution focuses on demonstrating its effectiveness in robotics, a field where efficient policy execution is critical. We have put significant effort into showing the effectiveness of diffusion distillation in robotics and its wide potential for the field. We also kindly request that Reviewer Bfs2 consider the potential noted by Reviewer kXET.

  3. The pretraining + distillation paradigm is widely adopted across AI research, including language models (e.g., GPT-4o with its 4o-mini and o1-mini variants). Distillation plays a crucial role in making inference efficient, reducing computational costs, and enabling deployment on constrained hardware. For robotic control, where policies must run onboard with limited compute, achieving fast inference without sacrificing performance is highly practical and necessary. Given its broad adoption in other domains, distillation should not be viewed as a drawback but rather as a crucial technique for real-world applicability. We will update Figure 1 to make this clearer.
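As a brief illustration of the mode-seeking argument in item 1 (notation chosen here for exposition, not taken from the paper), the reverse-KL objective between a student policy $q_\theta$ and the pretrained teacher $p$ has the form

$$
D_{\mathrm{KL}}\big(q_\theta \,\|\, p\big) \;=\; \mathbb{E}_{a \sim q_\theta}\big[\log q_\theta(a) - \log p(a)\big],
$$

which heavily penalizes the student for placing probability mass where the teacher has low density. The student therefore concentrates on high-density modes of $p$ rather than averaging across them, which is what the forward KL $D_{\mathrm{KL}}(p \,\|\, q_\theta)$ would encourage.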

Review 2 by Reviewer 5rTc (Rating: 3)

The paper introduces the One-Step Diffusion Policy, a novel approach that distills a pre-trained multi-step diffusion-based visuomotor policy into a single-step action generator for robot control​. This one-step policy greatly accelerates inference (boosting action output frequency from ~1.5 Hz to 62 Hz) while maintaining state-of-the-art task success rates.

Strengths:

  • Order-of-magnitude speed improvement in action inference without performance loss.
  • The distillation process adds only 2%–10% additional training cost.
  • Efficient task completion: the policy not only reacts faster but also completes tasks in significantly less time.

Weaknesses:

  • The study does not assess long-horizon tasks, where cumulative decision-making plays a larger role.
  • Some mode-seeking behaviors inherent in reverse KL minimization could lead to suboptimal exploration in stochastic tasks.
  • No analysis of failure cases.

Questions for the Authors

  1. Would it be possible to include experiments on other imitation learning methods cited in the paper for comparison?
  2. Is there a more in-depth discussion on the insights behind why the distilled policy can achieve better performance compared to the original diffusion policy?

Claims and Evidence

  • The paper presents quantitative results comparing the inference speed of OneDP (62 Hz) vs. Diffusion Policy (1.49 Hz).
  • A wall-clock time comparison (Table 5) shows that OneDP's single-step action generation is much faster than the iterative denoising process in traditional diffusion models (a rough timing sketch follows these points).

  • While the improvement is clear, the robot's control frequency was limited to 20 Hz in real-world settings.
  • Success rates are evaluated on a fixed set of tasks. The results may not generalize to more complex or unseen environments.
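The order-of-magnitude gap reported above follows directly from the number of network evaluations per action. The toy benchmark below (a stand-in MLP and a placeholder update rule, not the paper's model; absolute numbers depend entirely on hardware) illustrates how such wall-clock comparisons are typically measured.

```python
# Rough timing sketch: K denoising steps cost roughly K forward passes,
# versus a single pass for a distilled one-step policy.
import time
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(),
                    nn.Linear(1024, 1024), nn.ReLU(),
                    nn.Linear(1024, 128))
obs = torch.randn(1, 128)

@torch.no_grad()
def sample(num_steps):
    a = torch.randn(1, 128)
    for _ in range(num_steps):        # one network call per denoising step
        a = a - 0.01 * net(a + obs)   # placeholder update rule
    return a

for steps in (100, 1):                # e.g. DDPM-style vs. one-step policy
    start = time.perf_counter()
    for _ in range(20):
        sample(steps)
    dt = (time.perf_counter() - start) / 20
    print(f"{steps:>3}-step sampling: {dt * 1e3:6.1f} ms/action ({1 / dt:6.1f} Hz)")
```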

Methods and Evaluation Criteria

Strengths:

  • OneDP effectively distills a slow diffusion policy into a single-step generator, achieving ~42× faster inference while maintaining high success rates.
  • Benchmark evaluation on Robomimic and real-world tasks with meaningful metrics (success rate, inference speed, task time).
  • Fair comparison with Diffusion Policy and Consistency Policy, demonstrating superior speed and efficiency.

Weaknesses:

  • KL divergence distillation may limit policy diversity; alternative techniques are unexplored.
  • Evaluation is limited to short-horizon tasks; long-term planning and generalization to unseen environments are not tested.

Theoretical Claims

  • The paper’s theoretical foundation is based on KL divergence minimization for distillation.
  • The mathematical derivation follows from established diffusion model literature (Ho et al., 2020; Song et al., 2020), ensuring theoretical validity.
  • No formal proof or bound is given for the error introduced by the single-step approximation compared to multi-step diffusion policies.

Experimental Designs or Analyses

  • Evaluations conducted on six Robomimic tasks (simulation) and four real-world robotic tasks (Franka arm), ensuring relevance to visuomotor learning.
  • While OneDP theoretically achieves 62 Hz, real-world experiments cap execution at 20 Hz for stability.

Supplementary Material

Most demos show that the policies will work in a real-world setting.

Relation to Existing Literature

  1. Diffusion Models in Robotics & Policy Learning.
  2. The work is related to Consistency Policy (CP) (Prasad et al., 2024), which also attempts to accelerate Diffusion Policies.
  3. KL-Based Distillation in Generative Models.

Missing Essential References

No.

Other Strengths and Weaknesses

No other strengths and weaknesses.

Other Comments or Suggestions

No other comments or suggestions.

Author Response

We sincerely appreciate Reviewer 5rTc's positive feedback and address your remaining concerns below:

Weaknesses:

  1. Mode-Seeking Behavior in Reverse KL Minimization
    While reverse KL minimization encourages mode-seeking, this does not necessarily lead to mode collapse. In our real-world coffee machine manipulation experiments, we deliberately collected data for closing the lid from two different angles. After distillation, the distilled policy retained this multimodal behavior. Additionally, mode-seeking can be beneficial in imitation learning tasks—once the policy identifies an optimal solution, it helps eliminate noisy low-density actions and enhances stability. In our case, imitation learning data is human-demonstrated and high-quality, ensuring that the pretrained policy converges to optimal solutions. Therefore, mode-seeking is not a critical concern for our approach.
  2. Evaluation on Long-Horizon Tasks
    We emphasize that our study explores both simulation and real-world tasks. However, long-horizon tasks require a diffusion policy trained explicitly for long-horizon planning. We acknowledge the importance of extending our approach to such tasks and consider it a promising direction for future work.

Questions:

  1. Comparison with Other Imitation Learning Methods
    Implementing all cited imitation learning baselines within our framework is highly resource-intensive and beyond the scope of this work. Instead, we selected Diffusion Policy and Consistency Policy as our primary baselines, as they are the most representative diffusion-based approaches.
  2. Why the Distilled Policy Outperforms the Pretrained Policy
    The distilled policy demonstrates slightly better performance than the original diffusion policy due to two main factors:
    • Error Accumulation in Multi-Step Sampling: Traditional diffusion policies suffer from accumulated errors over multiple denoising steps, while our one-step approach eliminates this issue.
    • Mode-Seeking Behavior Enhancing Stability: As previously mentioned, the distilled policy focuses more on successful trajectories, reducing the influence of low-density noisy actions. This results in a more stable and confident policy during evaluation.
Review 3 by Reviewer kXET (Rating: 4)

This paper adapts diffusion distillation techniques from text-to-3D generation to achieve one-step diffusion policies for robotics. The authors compare two types of distillation: 1) distilling to a stochastic one-step diffusion policy, and 2) distilling to a deterministic one-step policy. They show that the stochastic policy has task performance slightly superior to that of the original DDPM teacher, with a ~40× inference-time speedup. The method significantly outperforms a consistency policy baseline, both in terms of final performance and training speed.

Questions for the Authors

  1. In Algorithm 1, should not the sampling of $A_\theta$ be deterministic in the OneDP-D case?
  2. Can the authors elaborate on "distillation occurs over [2, 95] diffusion timesteps to avoid edge cases" in line 248? I haven't seen this before.
  3. I'm somewhat surprised that the stochastic policy outperformed the deterministic one. I get that the stochasticity during pre-training is valuable, as human demonstrations aren't deterministic and thus it's difficult to fit a deterministic policy to them. But when we're distilling, it seems that converging to the action mode should be beneficial (as briefly discussed in the "Distillation Discussion" section). Could the authors elaborate on why OneDP-S outperforms OneDP-D? The authors write that: "A stochastic policy, which encompasses deterministic policies, is more versatile and better suited to scenarios requiring exploration, potentially leading to better convergence at a global optimum." In RL, I see why exploration is important; but here we're doing imitation learning, so I don't see how this applies.

Claims and Evidence

The main claim is that the OneDP dramatically speeds up inference: "that OneDP not only achieves state-of-the-art success rates but also delivers an order-of-magnitude improvement in inference speed, boosting action prediction frequency from 1.5 Hz to 62 Hz." This is well-supported by the results (although see the caveat in "Experimental Designs Or Analyses").

Methods and Evaluation Criteria

The evaluation settings are thorough, involving both simulated and real-world experiments.

Theoretical Claims

N/A

Experimental Designs or Analyses

Overall, the experiments look sound. The one comment I'd make is that the underpowered hardware might accentuate performance differences. A DDPM frequency of 1.5 Hz is much lower than Diffusion Policy (10 Hz diffusion), Octo (10 Hz diffusion), or $\pi_0$ (~6 Hz flow matching). Nevertheless, I think that ultimately these visuomotor policies will need to be executed onboard with resource-constrained hardware, and this work is correspondingly more valuable.

Supplementary Material

I didn't review the code.

Relation to Existing Literature

I think that the importance of faster action inference is pretty clear for robotics, especially for systems where compute has to be entirely onboard.

I'd like to comment on another reason I see this paper as being important to the field that I don't see mentioned by the authors. One important challenge with these diffusion policies lies in transitioning from imitation learning pre-training to fine-tuning with online RL. Obvious online RL candidates like PPO don't immediately apply, as they require the gradient of the action probability w.r.t. the policy parameters; this isn't available in closed form because the diffusion action probability is a) defined implicitly, and b) involves iterative denoising, which would be hard to differentiate through anyway. This paper solves problem (b), and I think it opens up a few interesting possibilities for addressing (a), such as having the policy network output a normal distribution centered at the distilled $G_\theta(O)$, or perhaps this idea can be adapted to flow matching, where the density at a point can be evaluated. While understandably this extension isn't in scope for the current paper, I think the paper opens up some exciting next steps.
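To make this suggestion concrete, here is a hedged sketch of one way such a wrapper could look: the distilled generator supplies the mean of a Gaussian whose log-probability is then available in closed form for PPO-style updates. The class name, the `generator(obs)` signature, and the shapes are assumptions for illustration only, not anything defined in the paper.

```python
# Sketch of wrapping a (hypothetical) distilled one-step generator G_theta
# in a Gaussian head so that action log-probabilities become tractable for
# PPO-style fine-tuning. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianDistilledPolicy(nn.Module):
    def __init__(self, generator: nn.Module, action_dim: int):
        super().__init__()
        self.generator = generator                 # distilled G_theta(O)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, obs: torch.Tensor) -> torch.distributions.Normal:
        mean = self.generator(obs)                 # one forward pass
        return torch.distributions.Normal(mean, self.log_std.exp())

# PPO-style usage (sketch):
#   dist = policy(obs)
#   action = dist.sample()
#   log_prob = dist.log_prob(action).sum(-1)       # closed form, differentiable
```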

Missing Essential References

I'd say that $\pi_0$ is a fairly important paper that should be discussed: Black, Kevin, et al. "$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control." arXiv preprint arXiv:2410.24164 (2024).

Other Strengths and Weaknesses

The primary weakness is that the paper is a little lacking in novelty, as it largely adapts ideas from the generative AI literature to robotic policies. But I still think that the paper is a valuable contribution for both its practical impact on the feasibility of diffusion policies and it setting the groundwork for incorporating online RL. The paper is overall well-written and easy to follow.

Other Comments or Suggestions

"Initializaiton" in Algorithm 1.

Author Response

We sincerely appreciate Reviewer kXET's recognition of our work's value and your insightful suggestion on extending our distillation technique to facilitate online RL fine-tuning. We agree this is a promising future direction. Below, we address your remaining concerns and questions.

References

We will include a discussion of π₀: A Vision-Language-Action Flow Model for General Robot Control in the related work section.

Questions & Responses

  1. Deterministic Sampling in OneDP-D (Algorithm 1)
    Thank you for pointing this out. In the OneDP-D case, $z$ is indeed fixed at zero. We will correct the typo in Algorithm 1 accordingly (a short sketch of both sampling variants is given after this list).
  2. Distillation Over [2, 95] Diffusion Timesteps
    The reason for avoiding the extreme timesteps (near 0 and 100) is that the distilled policy’s distribution may initially differ significantly from the pretrained diffusion policy.
    • If nearly no noise is added, the distilled policy may generate actions that fall in low-density regions of the pretrained policy, leading to unstable distillation loss.
    • If excessive noise is added, the score guidance from the pretrained policy becomes nearly uniform across actions, reducing its utility for training.
      This aligns with observations from DreamFusion, where extreme noise levels were avoided. We quote the original sentence here: “We sample t ∼ U(0.02, 0.98), avoiding very high and low noise levels due to numerical instabilities”.
  3. Why OneDP-S Outperforms OneDP-D
    The pretrained diffusion policy is strong at learning multi-modal action distributions, and the stochastic one-step policy inherits this multimodality. As discussed in our response to Reviewer 5rTc, reverse KL divergence encourages mode-seeking behavior, allowing the stochastic policy to converge to multiple optimal action modes.
    • In our coffee machine manipulation tasks, we designed scenarios with two distinct successful strategies. The stochastic policy successfully captured both, whereas the deterministic policy followed only one.
    • Exhibiting multi-modality enables the robot to solve a task through multiple viable solutions, making it more robust than a purely deterministic policy that can follow only one specific path.
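For clarity, a minimal sketch of the two sampling variants discussed above follows; the generator signature and shapes are illustrative assumptions, not the paper's code.

```python
# OneDP-S vs. OneDP-D sampling, as clarified in item 1 above: the only
# difference is whether the latent z is drawn from a Gaussian or fixed at 0.
import torch

def sample_action(generator, obs, action_dim, stochastic=True):
    if stochastic:                                  # OneDP-S
        z = torch.randn(obs.shape[0], action_dim, device=obs.device)
    else:                                           # OneDP-D: z fixed at zero
        z = torch.zeros(obs.shape[0], action_dim, device=obs.device)
    return generator(obs, z)                        # single forward pass either way
```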

We appreciate your thoughtful feedback and will incorporate these clarifications in the final revision.

Reviewer Comment

Thank you to the authors for their reply. I am comfortable maintaining my accept recommendation.

Author Comment

We again thank Reviewer kXET for the thoughtful comments and the recommendation to accept our paper.

Final Decision

The method shows a clear 40–42× inference-speed improvement over prior diffusion policies (1.5 Hz → 62 Hz) while matching or slightly surpassing task success rates on six simulated and four real-world Franka tasks.

Reviewers agree on the strong empirical evaluation, modest extra training cost, and potential impact on onboard robotics.

Reviewer Bfs2 raises two main objections: (i) limited novelty because one-step distillation has appeared in very recent work, and (ii) risk of losing multimodality. The authors convincingly rebut (ii) with evidence that the stochastic variant retains multiple modes in a coffee-machine task and explain why reverse-KL does not force mode collapse. On (i), although one-step distillation itself is not new, applying it rigorously to visuomotor control and demonstrating real-world speedups is novel and practically important for resource-constrained robotics; both other reviewers agree.

Overall, I recommend acceptance.