GenPO: Generative Diffusion Models Meet On-Policy Reinforcement Learning
We propose GenPO, which effectively incorporates an invertible diffusion model into on-policy RL and addresses the challenge of log-likelihood computation in diffusion policies.
Abstract
Reviews and Discussion
The paper introduces the GenPO framework, which combines a diffusion policy with the on-policy method PPO. The core contribution is to adapt EDICT's exact diffusion inversion to RL via a "double dummy action" mechanism, enabling exact likelihood computation. More specifically, the approach reformulates the MDP with a doubled action space in which the average of the two actions is the action actually executed. The authors add a compression loss to prevent unnecessary exploration. The method is evaluated on 8 IsaacLab tasks against model-free baselines (DDPG, TD3, SAC, PPO) and two diffusion-based off-policy algorithms (DACER and QVPO), and achieves performance gains.
Strengths and Weaknesses
Strengths:
- achieves performance gains over the baselines, with a comprehensive experimental evaluation
- well-written with a well-structured presentation
- nice cross-domain adaptation from image to RL
Weaknesses:
- Computational inefficiency: the doubled action space doubles the computation and introduces an extra loss and an extra hyperparameter to tune.
- Likelihood computation: In Yang Song's SDE framework, the probability flow ODE already enables exact likelihood computation via deterministic ODEs. In the experiments the number of diffusion steps is only 5, so why not use that method? Why make things more complex by doubling the action space? Also, although DPPO is mainly used in offline-to-online settings, it can be applied in online settings. The DPPO GitHub project provides two ways of calculating the likelihood. In my view, it is necessary to add these likelihood calculations as baselines.
- Biased experimental setup: IsaacLab's parallel environments are not well suited to off-policy diffusion methods like DACER and QVPO.
Questions
- Can you provide a comparison with Yang Song's exact likelihood method, including the computation cost and performance? This is critical to justify the necessity of your doubled action space approach.
- In the appendix, the 5-step diffusion model seems to be enough for your tasks. What do you think is the reason behind this?
- If standard RL benchmarks (not just IsaacLab) were evaluated, how do you think the results would look? This would give a fairer comparison environment for all methods.
Limitations
Novelty and motivation: Yang Song et al.'s probability flow ODE already provides exact likelihood computation for diffusion models. Why is your doubled action space approach necessary when this established method exists, especially when the number of diffusion steps is only 5?
Method comparison: The experimental comparison lacks the most relevant baseline: DPPO applied in an online setting.
Justification for Final Rating
The authors' responses and the other reviewers' comments resolved my concerns.
Formatting Concerns
Thanks for your valuable comments and suggestions. We address your detailed comments below:
Q1: Computational inefficiency: double action space doubles the computation and introduces one more loss and one more hyperparameter to tune.
A1: Thank you for raising this concern. We need to clarify that doubling the action space and the compression loss does not introduce significant computation cost. Training time remains acceptable under multi-GPU settings (see Appendix Figures 11–12), and inference takes only about 2 ms (see Appendix Figure 10), meeting real-time demands of most robotics tasks [R4, R5]. Compared to Jacobian computation, the added loss is negligible. The hyperparameter is easy to tune and fixed to 0.01 in all experiments.
Q2(1): Likelihood computation: In Yang Song's SDE framework, the probability flow ODE already enables exact likelihood computation via deterministic ODEs. In the experiments, the number of diffusion steps is only 5, so why not use that method? Why make it more complex by doubling the action space?
A2(1): That is a good question. Although Yang Song's SDE framework [R6] guarantees exact likelihood in theory, it cannot provide the exact probability density in practice. This is because the diffusion process is implemented by discretizing the underlying probability flow ODE, and the computation method proposed by Song et al. does not guarantee exact probability density estimation under such discretization. The detailed explanation can be found in lines 206-210 and Eq. (6). Therefore, we must use the exact diffusion inversion technique with a doubled action space to compute the exact diffusion probability density.
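To make this concrete, below is a minimal numerical sketch of EDICT-style alternating (coupled) updates of the kind GenPO builds on; the toy denoiser, coefficient values, and function names are illustrative assumptions rather than our actual networks or noise schedule. Because each sub-step is an affine map of one copy given the other, the whole step can be undone exactly, which is what allows an exact change-of-variables likelihood instead of a discretized ODE approximation.

```python
import torch

def toy_denoiser(z, t):
    # stand-in for the learned noise/velocity network (illustrative only)
    return torch.tanh(z + t)

def coupled_step(x, y, a, b, p, t):
    # alternating update: each assignment depends only on quantities already
    # computed, so the full step is a sequence of invertible affine maps
    x_i = a * x + b * toy_denoiser(y, t)
    y_i = a * y + b * toy_denoiser(x_i, t)
    x_next = p * x_i + (1 - p) * y_i
    y_next = p * y_i + (1 - p) * x_next
    return x_next, y_next

def coupled_step_inverse(x_next, y_next, a, b, p, t):
    # exact algebraic inverse of coupled_step, undoing the sub-steps in reverse
    y_i = (y_next - (1 - p) * x_next) / p
    x_i = (x_next - (1 - p) * y_i) / p
    y = (y_i - b * toy_denoiser(x_i, t)) / a
    x = (x_i - b * toy_denoiser(y, t)) / a
    return x, y

x0, y0 = torch.randn(6), torch.randn(6)
x1, y1 = coupled_step(x0, y0, a=0.9, b=0.1, p=0.93, t=0.2)
xr, yr = coupled_step_inverse(x1, y1, a=0.9, b=0.1, p=0.93, t=0.2)
print(torch.allclose(xr, x0, atol=1e-6), torch.allclose(yr, y0, atol=1e-6))
```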
Q2(2): Also, although DPPO is mainly used in offline-to-online settings, it can be applied in online settings. The DPPO GitHub project provides two ways of calculating the likelihood. In my view, it is necessary to add these likelihood calculations as baselines.
A2(2): Yes, we certainly can do that, but it is not necessary. We view DPPO more as a fine-tuning method than an on-policy method, because DPPO performs much worse than PPO in online reinforcement learning tasks learned from scratch (see Table 1 in [R9]).
Moreover, using the two implementations provided by the DPPO project [R12] actually performs worse than the original DPPO (see Appendix C.6, Figure A10 of DPPO [R10]). These two implementations estimate the Jacobian trace using Gaussian or Rademacher noise via the Skilling-Hutchinson estimator. Although theoretically unbiased, both suffer from high variance, which undermines the accuracy of the estimated likelihood. If you consider it necessary, we will add it to the final version.
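For reference, here is a generic sketch of the Skilling-Hutchinson divergence (Jacobian-trace) estimator that such likelihood scripts rely on; the function below is our own illustrative code, not the DPPO implementation. The estimate is unbiased, but with only a few probe vectors it is high-variance in high-dimensional action spaces, which is the weakness we refer to above.

```python
import torch

def hutchinson_divergence(f, x, n_probes=1, rademacher=True):
    # unbiased estimate of tr(df/dx) via E[v^T (df/dx) v]
    x = x.clone().requires_grad_(True)
    fx = f(x)
    est = torch.zeros(())
    for _ in range(n_probes):
        if rademacher:
            v = (torch.rand_like(x) < 0.5).to(x.dtype) * 2 - 1
        else:
            v = torch.randn_like(x)
        # vector-Jacobian product v^T J, then dotted with v again
        (vjp,) = torch.autograd.grad(fx, x, grad_outputs=v, retain_graph=True)
        est = est + (vjp * v).sum()
    return est / n_probes
```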
Q3: Biased experimental setup: IsaacLab's parallel environments are not well suited to off-policy diffusion methods like DACER and QVPO.
A3: Thank you for raising the concern. Since GenPO is the first method designed for on-policy diffusion-based RL (existing diffusion-based online RL algorithms are all off-policy), we have no choice but to use DACER and QVPO as diffusion-based baselines, both to demonstrate the superiority of our approach and to keep the experiments complete.
Q4: Can you provide a comparison with Yang Song's exact likelihood method, including the computation cost and performance? This is critical to justify the necessity of your doubled action space approach.
A4: Thanks for your question. As explained in A2, Yang Song's method cannot provide an exact likelihood for diffusion models with discrete steps in practice. Therefore, it is necessary for GenPO to double the action space to make the diffusion model reversible and then compute the exact policy likelihood.
Q5: In the appendix, the 5-step diffusion model seems to be enough for your tasks. What do you think is the reason behind this?
A5: Thanks for your question. Our method actually uses a special case of diffusion models known as Flow Matching, for which 5 denoising steps are sufficient to learn a powerful policy mapping states to actions (see Figure 10 in [R7]).
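To illustrate why so few steps can suffice, here is a minimal sketch of few-step Euler sampling for a flow-matching policy; velocity_net, the tensor shapes, and the uniform time grid are illustrative assumptions, not our exact architecture or scheduler. Each step costs only one network evaluation, so 5 steps keep inference cheap.

```python
import torch

@torch.no_grad()
def sample_action(velocity_net, state, action_dim, n_steps=5):
    # integrate the learned velocity field from noise (t=0) to an action (t=1)
    # with a coarse Euler scheme
    z = torch.randn(state.shape[0], action_dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = torch.full((state.shape[0], 1), k * dt)
        z = z + dt * velocity_net(state, z, t)
    return z
```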
Q6: If standard RL benchmarks (not just IsaacLab) were evaluated, how do you think the results would look? This would give a fairer comparison environment for all methods.
A6: Thanks for your question. We believe GenPO would still perform well on other standard RL benchmarks. However, simulators capable of massive environment-level parallelism (e.g., IsaacGym/IsaacLab [R8]) are much better suited to evaluating the potential of on-policy methods, while off-policy methods often suffer from poor value-network learning efficiency under massive parallelism [R13, R15]. This makes IsaacLab the most appropriate platform for benchmarking and validating GenPO.
Q7: Novelty and motivation: Yang Song et al.'s probability flow ODE already provides exact likelihood computation for diffusion models. Why is your doubled action space approach necessary when this established method exists, especially when the number of diffusion steps is only 5?
A7: Thank you for raising the concern. As mentioned in A2, since the diffusion model in practice discretizes the reverse ODE, that method cannot compute the exact policy likelihood, a key component for importance sampling in GenPO. Therefore, our motivation and novelty lie in making the diffusion policy reversible through a doubled action space, which allows the exact policy likelihood to be calculated.
Q8: Method comparison: The experimental comparison lacks the most relevant baseline: DPPO applied in an online setting.
A8: Thanks for your suggestion. DPPO cannot be viewed as an online RL method: it performs poorly when learned from scratch, as shown in Figure A7 of [R10], and performs worse than PPO on many tasks in Table 1 of [R9]. It is also mentioned in the DPPO GitHub repo [R11] that DPPO's advantage lies in fine-tuning for robot manipulation tasks, because the policy does not need to explore extensively during training. Therefore, it is appropriate to regard DPPO as a fine-tuning method rather than an online reinforcement learning algorithm.
[R4] He, T., Luo, Z., Xiao, W., Zhang, C., Kitani, K., Liu, C., & Shi, G. (2024, October). Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 8944-8951). IEEE.
[R5] Xue, Y., Dong, W., Liu, M., Zhang, W., & Pang, J. (2025). A Unified and General Humanoid Whole-Body Controller for Versatile Locomotion. arXiv preprint arXiv:2502.03206.
[R6] Song, Y., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2020). Score-Based Generative Modeling through Stochastic Differential Equations. arXiv preprint arXiv:2011.13456.
[R7] Park, S., Li, Q., & Levine, S. (2025). Flow q-learning. arXiv preprint arXiv:2502.02538.
[R8] Makoviychuk, V., Wawrzyniak, L., Guo, Y., Lu, M., Storey, K., Macklin, M., ... & State, G. (2021). Isaac gym: High performance gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470.
[R9] Ma, H., Chen, T., Wang, K., Li, N., & Dai, B. (2025). Efficient Online Reinforcement Learning for Diffusion Policy. arXiv preprint arXiv:2502.00361.
[R10] Ren, A. Z., Lidard, J., Ankile, L. L., Simeonov, A., Agrawal, P., Majumdar, A., ... & Simchowitz, M. (2024). Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588.
[R11] https://github.com/irom-princeton/dppo/issues/46#issuecomment-2817918130
[R12] https://github.com/irom-princeton/dppo/blob/main/model/diffusion/exact_likelihood.py
[R13] Liang, E., Liaw, R., Nishihara, R., Moritz, P., Fox, R., Goldberg, K., ... & Stoica, I. (2018, July). RLlib: Abstractions for distributed reinforcement learning. In International conference on machine learning (pp. 3053-3062). PMLR.
[R15] Li, Z., Chen, T., Hong, Z. W., Ajay, A., & Agrawal, P. (2023, July). Parallel Q-Learning: Scaling Off-policy Reinforcement Learning under Massively Parallel Simulation. In International Conference on Machine Learning (pp. 19440-19459). PMLR.
Thanks to the authors for their rebuttal! Since this is the lowest recommendation among the reviewers, a critical evaluation of whether the rebuttal addresses the reviewer's questions and criticisms is very important. Could the reviewer please check whether the rebuttal still leaves them with open questions, whether their concerns are alleviated enough to raise their score, or simply acknowledge that neither is the case?
Dear Reviewer tqrN,
I hope this message finds you well. As the discussion/rebuttal period is nearing its close, we wanted to kindly check whether you have reviewed our clarifications. It is worth noting that Reviewers rVSg and kGqW have expressed a positive assessment of GenPO after reading our rebuttal. Your feedback is valuable to us, and we are eager to ensure all points are fully resolved.
Best regards,
The Authors
This paper presents GenPO, a framework integrating generative diffusion models into on-policy reinforcement learning (RL). Existing diffusion-based policies have focused on offline and off-policy RL, leaving a gap in on-policy integration. GenPO addresses the challenge of computing state-action log-likelihoods in diffusion policies by introducing a doubled dummy action mechanism for invertible mappings, enabling exact likelihood calculation, unbiased entropy, and KL divergence estimation. Experiments on 8 IsaacLab benchmarks show GenPO outperforms baselines, demonstrating an effective integration of diffusion policies into on-policy RL.
Strengths and Weaknesses
Strengths:
- GenPO is the first framework to combine generative diffusion models with on-policy RL (e.g., PPO), addressing the gap in leveraging diffusion's exploration and multimodality for large-scale parallel training in simulators like IsaacLab. By introducing a doubled dummy action mechanism and the exact diffusion inversion proposed by EDICT, GenPO overcomes the intractability of computing state-action likelihoods in diffusion policies.
- Superior empirical performance across diverse tasks: experiments on 8 IsaacLab benchmarks (e.g., legged locomotion, dexterous manipulation) show GenPO outperforms traditional RL baselines (PPO, SAC) and diffusion-based methods. The evaluation is comprehensive.
Weakness:
- The presentation of this paper needs improvement:
(1) The Preliminaries section lacks a brief introduction to EDICT, which hinders the reader’s ability to contextualize the proposed exact diffusion inversion mechanism.
(2) The final loss function in Eq. (12) omits the KL divergence regularizer from Eq. (10).
(3) The meaning of "this issue" and "ineffective exploration" in Line 250 is ambiguous, and the motivation for the compression loss in Line 250 is unclear. The loss aims to reduce discrepancies between the two sampled action sequences, which contradicts the entropy maximization regularizer (e.g., the compression loss encourages low-entropy actions while entropy regularization encourages exploration).
Questions
- Clarify the questions mentioned in the above weakness.
- An evaluation of the effectiveness of the adaptive learning rate mechanism is missing. How does it compare with a constant learning rate?
Minor problem: (1) Notation ambiguity in Eq. (7) and Eq. (8): the coefficient used in the mixing scheme may be misinterpreted as a probability distribution, since both are denoted by the same symbol. It is recommended to use an alternative notation to avoid confusion.
Limitations
yes
Justification for Final Rating
My concerns are well clarified and addressed. I think the paper is technically solid with comprehensive evaluation.
Formatting Concerns
None
We gratefully thank Reviewer kGqW for recommending acceptance of our submission and for the valuable suggestions that help improve it. We answer each of the reviewer's questions below.
Q1: The Preliminaries section lacks a brief introduction to EDICT, which hinders the reader’s ability to contextualize the proposed exact diffusion inversion mechanism.
A1: Thank you for your suggestion. We will integrate a description of EDICT into the Preliminaries section in the final version to improve clarity and completeness.
Q2: The final loss function in Eq. (12) omits the KL divergence regularizer from Eq. (10).
A2: Thanks for your question. On the one hand, GenPO adopts the clipped surrogate objective defined in Eq. (7) of PPO [R3] and does not explicitly incorporate a KL divergence penalty term in the loss function.
On the other hand, the KL divergence estimated during training is used to adaptively schedule the learning rate, as shown in Algorithm 1, line 11. To avoid ambiguity, we will explain the purpose of estimating the KL divergence in Eq. (10) in Section 4.3 of the final version.
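For illustration, here is a sketch of the common KL-driven learning-rate rule used in many PPO implementations; the target, thresholds, and bounds below are typical defaults and are assumptions on our part, not necessarily the exact constants in Algorithm 1.

```python
def adapt_learning_rate(lr, estimated_kl, kl_target=0.01,
                        lr_min=1e-5, lr_max=1e-2):
    # shrink the step size when the new policy drifted too far from the old one,
    # and grow it when the update was overly conservative
    if estimated_kl > 2.0 * kl_target:
        lr = max(lr / 1.5, lr_min)
    elif estimated_kl < 0.5 * kl_target:
        lr = min(lr * 1.5, lr_max)
    return lr
```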
Q3(1): The meaning of "this issue" and "ineffective exploration" in Line 250 is ambiguous.
A3(1): Thanks for your careful review. We accidentally deleted a sentence before "this issue ...". The original text should read: "Since the MDP problem to optimize is reformulated from the original one, featuring an enlarged but practically unnecessary exploration space, we introduce a compression loss to mitigate this issue and prevent ineffective exploration of the action space." We appreciate your comment and will fix this problem in the final version.
Q3(2): The motivation for the compression loss in Line 250 is unclear. The loss aims to reduce discrepancies between the two sampled action sequences, which contradicts the entropy maximization regularizer (e.g., the compression loss encourages low-entropy actions while entropy regularization encourages exploration).
A3(2): Thank you for raising this critical question. The compression loss does not contradict the entropy regularization term. Specifically, as stated in line 250, the entropy regularizer encourages sufficient exploration, while the compression loss reduces ineffective exploration. Ineffective exploration refers to high action diversity in the doubled action space but low diversity in the actions actually used to interact with the environment.
For example, without the compression loss, the policy may sample very different dummy action pairs (see Algorithm 1, line 4) that all average to similar executed actions (see Algorithm 1, line 5); in that case the entropy regularizer does not contribute to effective exploration. The compression loss encourages the two dummy actions to stay close, so that exploration happens in the actual action space, which does not contradict the entropy regularizer.
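As a concrete illustration, here is a minimal sketch of how the executed action and the compression term could be computed; the 0.01 weight matches the value quoted in A1, while the function name and the exact form of the penalty are illustrative assumptions.

```python
import torch

def action_and_compression_loss(a1, a2, coef=0.01):
    # the environment only sees the average of the two dummy actions, so the
    # penalty pulls the copies together and pushes entropy-driven exploration
    # into the executed action rather than into their difference
    action = 0.5 * (a1 + a2)
    compression_loss = coef * ((a1 - a2) ** 2).mean()
    return action, compression_loss
```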
Q4: An evaluation of the effectiveness of the adaptive learning rate mechanism is missing. How does it compare with a constant learning rate?
A4: Thank you for raising the concern. The effectiveness of the adaptive learning rate mechanism is shown in Figure 4(b); the inaccurate subtitle in line 297 may have misled you. That paragraph is actually the ablation of the adaptive learning rate. The reason we used the subtitle "KL divergence" is that GenPO uses the KL divergence to schedule the learning rate adaptively, as mentioned in A2. We will replace "KL divergence" with "adaptive learning rate" in the final version to avoid this confusion.
Specifically, in Figure 4(b), "wo_KL" denotes training without KL estimation (constant learning rate), and "w_KL" denotes training with KL-divergence-based adaptive learning rate. The results show that the adaptive learning rate in GenPO leads to faster and more stable convergence.
Q5: Notation ambiguity in Eq. (7) and Eq. (8): the coefficient used in the mixing scheme may be misinterpreted as a probability distribution, since both are denoted by the same symbol. It is recommended to use an alternative notation to avoid confusion.
A5: Thanks for your suggestion. We will use an alternative symbol for the mixing coefficient in the final version.
[R3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
I appreciate this thorough rebuttal. It resolved my questions. Therefore, my positive assessment of the paper stands and my rating remains unchanged.
The paper introduces GenPO, an on-policy reinforcement-learning (RL) algorithm that replaces the policy network with a conditional diffusion generator. GenPO tackles the non-invertibility of diffusion policies by doubling the action space with a "dummy" copy and applying alternating updates that make the forward and reverse processes analytically tractable. Two additional regularisers stabilise training: (i) a compression loss that discourages redundant dummy actions and (ii) an entropy bonus that preserves exploration. Experiments on eight IsaacLab robotic benchmarks show that GenPO consistently outperforms standard model-free on-policy baselines and two diffusion-based off-policy methods.
Strengths and Weaknesses
Strengths:
- The paper is well written and clear, with a lot of details on the method, and ablations on its components.
- The resulting method is able to outperform baselines on average on the 8 IsaacLab environments.
Weaknesses:
- GenPO is a computationally intensive method.
Questions
- Which specific Franka-arm tasks were used? If any involve multimodal action distributions (e.g., pick-and-place with multiple grasp configurations), please highlight the corresponding performance curves. Demonstrating gains in explicitly multimodal settings would strengthen the argument that diffusion models excel where unimodal Gaussian policies struggle.
Limitations
The authors adequately addressed the limitations of their work.
Formatting Concerns
I have no major paper formatting concerns.
We would like to thank Reviewer rVSg for the detailed and constructive comments. In the following, we provide an item-by-item response.
Q1: GenPO is a computationally intensive method.
A1: Thank you for raising this concern. We would like to clarify that training time is acceptable under multi-GPU settings (see Figures 11-12 in the Appendix), and inference time is about 2 ms (see Figure 10 in the Appendix), which is fast enough to meet the real-time needs of real-world robotics tasks [R1, R2].
Q2(1): Which specific Franka-arm tasks were used? If any involve multimodal action distributions (e.g., pick-and-place with multiple grasp configurations), please highlight the corresponding performance curves.
A2(1): Thanks for your question. In Line 262, we specify the exact environment as Isaac-Lift-Cube-Franka-v0, one of the most challenging and representative Franka arm tasks in IsaacLab, which involves picking up a cube and placing it at a randomly sampled target position.
Q2(2): Demonstrating gains in explicitly multimodal settings would strengthen the argument that diffusion models excel where unimodal Gaussian policies struggle.
A2(2): Thanks for your suggestion. We would like to clarify that the multimodality of the final converged policy is not the focus of our method. In online RL, the multimodality of the diffusion model is leveraged to achieve better exploration, rather than to obtain a multimodal converged policy as in [R14].
With better exploration capability, GenPO can prevent the policy from converging to a local optimum. This remains meaningful even when the oracle optimal policy is actually unimodal. This advantage is confirmed by the superior performance of GenPO compared with PPO (with a Gaussian policy) across the IsaacLab benchmarks in Figure 3.
[R1] He, T., Luo, Z., Xiao, W., Zhang, C., Kitani, K., Liu, C., & Shi, G. (2024, October). Learning human-to-humanoid real-time whole-body teleoperation. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (pp. 8944-8951). IEEE.
[R2] Xue, Y., Dong, W., Liu, M., Zhang, W., & Pang, J. (2025). A Unified and General Humanoid Whole-Body Controller for Versatile Locomotion. arXiv preprint arXiv:2502.03206.
[R14] Chi, C., Xu, Z., Feng, S., Cousineau, E., Du, Y., Burchfiel, B., ... & Song, S. (2023). Diffusion policy: Visuomotor policy learning via action diffusion. The International Journal of Robotics Research, 02783649241273668.
Thank you for your response and clarifications. I am satisfied with your replies and will maintain my current positive evaluation of the paper.
This paper presents a novel adaptation of generative diffusion from image generation to PPO, including a non-trivial double-action likelihood computation. The authors test many ablations and outperform strong RL baselines. A downside is the authors' preference for massively parallel setups, which biases the evaluated environments somewhat against the tested baselines.
However, all reviewers like the paper and recommend acceptance. It seems to me that there is really nothing substantial wrong with this paper, and I do not see why it should be rejected.