Value Diffusion Reinforcement Learning
Abstract
Reviews and Discussion
This paper proposes Value Diffusion Reinforcement Learning (VDRL), a model-free, off-policy reinforcement learning algorithm that employs diffusion models to represent value distributions. To further improve stability, they propose double value diffusion learning with sample selection, inspired by double Q-learning. VDRL is evaluated on eight MuJoCo benchmark tasks and compared against eleven baseline RL methods spanning traditional, distributional, and diffusion-policy RL paradigms.
Strengths and Weaknesses
Strengths:
- Addressing Q-value estimation bias is a core challenge in RL. VDRL has the potential to significantly impact future work in distributional RL.
- Value estimation bias is quantitatively analyzed, which is an uncommon but insightful inclusion.
- The paper makes a compelling case for moving beyond unimodal Gaussian value approximations, providing both empirical and conceptual justification.
Weaknesses:
- The proof in Line 500 relies on the assumption that the policy has fully converged (the authors do not mention this in Theorem 1). However, in practice, it is extremely hard for the policy to fully converge, especially when a neural network is used as the policy. Therefore, the equality in Line 500 does not hold in general, which means Theorem 1 does not hold.
- There are no concrete data for training and inference time. The training and inference processes of diffusion models are notably time-consuming. Whether the performance improvements are worth such a high computational cost is debatable.
- The x-axis in Figure 1 should be changed from training steps to wall time; otherwise, the comparison is unfair because diffusion models need more time for training in general.
- Line 40 mentions quantile-based methods [27–32], which also model multimodal value distributions, but no experiments are conducted to compare against these baselines.
Moreover, I just noticed a paper [1] that has already been accepted but is not cited in your work. Its background and method are highly similar to yours. Both use diffusion models to fit multimodal value networks, and the loss functions are identical—i.e., Equation (10) in your paper and Equation (15) in [1]. Could you clarify the differences between your work and this accepted paper?
[1] Tong Liu, et al. Distributional Soft Actor-Critic with Diffusion Policy. ITSC 2025. (arXiv)
Questions
- Can you provide wall-clock training and inference times per step for VDRL compared to baselines?
- After changing the x-axis of Figure 1 to wall time, does VDRL still retain its advantage?
- How does VDRL compare to the quantile-based methods [27–32]?
- I noticed that the performance of your DACER is quite different from that in the original paper. The distributional Q-learning part of DACER is the same as in DSACT, yet its performance in your paper is lower than DSACT's. Can you run the experiment using the JAX version of the code?
- What are the differences between your work and the accepted paper (Liu et al, ITSC 2025)?
Limitations
yes
Final Justification
This work did not achieve the claimed state-of-the-art results, and the performance of the compared algorithms was biased. For example, DACER is based on DSACT, but its reported performance was worse than DSACT's.
Formatting Issues
Typo:
dose not improve -> does not improve (Line 297)
We would like to thank you for your valuable comments. Below we respond to your concerns:
Q1: The proof in Line 500 ..........
A1: Thanks for your comments. We make the clarification as follows:
(i) The assumption of the proof in Line 500 of the manuscript is that the learning policy converges, which is stated in Line 152 (Theorem 1) of the original manuscript (i.e., the equality holds when the policy converges).
(ii) Regarding your concern that the policy is difficult to converge in practice, we experimentally observe that VDRL converges well and performs strongly, as shown in Figure 1 and Table 1 of the original manuscript. Note that the equality describes the tightest case (i.e., an ideal lower bound), analogous to the extreme case of the lower bound of a variational auto-encoder (VAE). In other words, whether the tightest case of the lower bound is attained does not affect the validity of the theorem, in analogy to the bound theory of the VAE.
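To make the VAE analogy concrete, recall the standard ELBO identity (a known result, stated here only for illustration and not taken from the manuscript): the bound is tight exactly when the final KL term vanishes, just as our bound becomes tight when the policy converges.

$$
\log p_\theta(x) \;=\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right] \;-\; \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)}_{\text{ELBO}} \;+\; \mathrm{KL}\!\left(q_\phi(z\mid x)\,\|\,p_\theta(z\mid x)\right)
$$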
Q2: There are no concrete data for .....
A2: Thanks for your question. The training and inference times of our method (VDRL) and the other baselines on three MuJoCo tasks, i.e., Ant-v3, Humanoid-v3, and Walker2d-v3, are shown in the following tables. Notably, the official code of QSM and DACER is based on the JAX framework, whereas the implementations of the other baselines and our method are based on PyTorch. In practice, an algorithm implemented in JAX is 6-10 times faster than the same algorithm implemented in PyTorch [1]. To ensure a fair comparison, we use PyTorch implementations of QSM and DACER that follow their official JAX code. Furthermore, the number of diffusion steps for all diffusion-based online RL baselines (DIPO, QSM, QVPO, and DACER) is set to 20.
Table 1: The training time (h) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 7.9 | 11.1 | 10.2 | 9.1 | 9.8 | 5.3 | 5.8 | 0.4 | 0.6 | 3.8 |
| Humanoid-v3 | 8.2 | 11.5 | 10.7 | 9.4 | 10.1 | 5.5 | 6.1 | 0.5 | 0.6 | 3.9 |
| Walker2d-v3 | 7.8 | 11.0 | 9.9 | 8.9 | 9.6 | 5.2 | 5.7 | 0.4 | 0.5 | 3.8 |
Table 2: The inference time (ms) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 3.6 | 5.8 | 6.4 | 6.2 | 5.4 | 1.9 | 2.1 | 0.2 | 0.2 | 0.4 |
| Humanoid-v3 | 3.7 | 6.0 | 6.5 | 6.3 | 5.5 | 2.0 | 2.2 | 0.2 | 0.3 | 0.5 |
| Walker2d-v3 | 3.6 | 5.7 | 6.3 | 6.1 | 5.3 | 1.9 | 2.1 | 0.2 | 0.3 | 0.4 |
From the results in the above tables, we can observe:
(i) Our method trains and infers faster than the diffusion-based online RL methods (i.e., DIPO, QSM, QVPO, and DACER), because VDRL uses fewer diffusion steps (T=10) than these baselines (T=20), which reduces the computational cost;
(ii) All diffusion-based RL methods (i.e., DIPO, QSM, QVPO, DACER, and VDRL) are slower than the distributional (i.e., DSAC and DSACT) or traditional (i.e., SAC, TD3, and PPO) ones, due to the iterative multi-step denoising process. Although VDRL is slower than the traditional or distributional RL methods at inference, its inference time (3.7 ms) is still acceptable for real-time applications. As discussed in QVPO, most existing real robots only require a 50-Hz control policy, i.e., one action output every 20 ms. Moreover, the training and inference time of VDRL can be further reduced by using the JAX framework if necessary, as in the official code of QSM and DACER. Hence, the inference time is not a bottleneck for applying our method to real-time applications.
Overall, we believe the performance improvements of our method justify the moderately longer training and inference time.
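To illustrate why the number of diffusion steps dominates the cost, below is a minimal, generic DDPM-style reverse-sampling sketch in PyTorch (not our actual implementation; `denoiser` is a hypothetical noise-prediction network). Each step requires one network call, so inference cost grows linearly with T, which is why T=10 roughly halves the denoising cost relative to T=20.

```python
import torch

@torch.no_grad()
def sample_value(denoiser, state_action, T, betas):
    """Generic DDPM-style reverse sampling for one value sample (sketch only)."""
    alphas = 1.0 - betas                        # per-step retention factors
    alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products
    z = torch.randn(state_action.shape[0], 1)   # start from pure Gaussian noise
    for t in reversed(range(T)):                # T network calls in total
        eps = denoiser(z, t, state_action)      # predict the injected noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        z = (z - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                               # no extra noise at the final step
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z                                    # a sample from the return distribution
```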
Q3: The x-axis in Figure 1 should be ...... Can you provide wall-clock training ..........? After changing the x-axis of Figure 1 to wall time, ...
A3: Thanks for your comments. We make the clarification as follows:
(i) The x-axis setting (training steps) in Figure 1 of our work is the same as in existing diffusion-based online RL methods as well as traditional online RL methods, including DIPO, QSM, QVPO, and SAC. So the performance comparison between VDRL and the baselines is reasonable and fair.
(ii) For the comparison of training and inference time for VDRL and baselines, please refer to the A2.
(iii) Following your suggestion, we conduct experiments comparing VDRL and the baselines on three MuJoCo tasks as a function of wall-clock time. Table 3 shows the mean returns and standard deviations for our method (VDRL) and the baselines, which further demonstrate that VDRL achieves better or competitive performance compared with these baselines.
Table 3: The final performance of VDRL and baselines on three MuJoCo tasks over 8 hours of training time.
| Task | SAC | QVPO | DSACT | VDRL (ours) |
|---|---|---|---|---|
| Ant-v3 | 4386.35 ± 50.42 | 5324.96 ± 83.41 | 5784.20 ± 105.59 | 7248.18 ± 89.14 |
| Halfcheetah-v3 | 10294.37 ± 105.73 | 9013.16 ± 77.42 | 11740.82 ± 130.08 | 12840.61 ± 120.97 |
| Walker-v3 | 4669.25 ± 38.11 | 4398.74 ± 46.01 | 5167.18 ± 29.67 | 5248.36 ± 11.56 |
Q4: Line 40 mentions quantile-based methods, ........?
A4: Thanks for your comments. Following your suggestion, we conduct experiments on ten Atari games using the Arcade Learning Environment to compare our method with four distributional value-based methods, including C51, IQN, ER-DQN, and FQF. The raw scores are shown in Table 4, which demonstrates that VDRL achieves better or competitive performance on the ten Atari games.
Table 4: The comparison of raw scores between VDRL and four baselines across ten Atari games, starting with 30 no-op actions.
| Game | Human | C51 | IQN | ER-DQN | FQF | VDRL (ours) |
|---|---|---|---|---|---|---|
| Alien | 7127.7 | 3166 | 7022 | 6212 | 16754.6 | 16216 |
| Bowling | 160.7 | 81.8 | 86.5 | 53.1 | 102.3 | 125.8 |
| Centipede | 12017.0 | 9646 | 11561 | 22505.9 | 11526.0 | 26249.2 |
| Enduro | 860.5 | 3454 | 2359 | 2339.5 | 2370.8 | 3278 |
| Gopher | 2412.5 | 33641 | 118365 | 115828.3 | 121144 | 121034.7 |
| Krull | 2665.5 | 9735 | 10707 | 11318.5 | 10706.8 | 11652 |
| Pong | 14.6 | 20.9 | 21.0 | 21.0 | 21.0 | 22.1 |
| Seaquest | 42054.7 | 26434 | 30140 | 19401 | 29383.3 | 38061 |
| Tennis | -8.3 | 23.1 | 23.6 | 5.8 | 22.6 | 27.4 |
| Venture | 1187.5 | 1520 | 1318 | 1107 | 1112 | 1429 |
Q5: Moreover, I just noticed a paper... Could you clarify the differences .........
A5: Thanks for your comments. We make the clarification as below:
(i) The paper was not yet publicly available when our paper was submitted (it was accepted on July 1st, while our work was submitted on May 15th). Hence, it was not cited in our work.
(ii) The differences between our method and DSAC-D are clear: our method (VDRL) establishes a theoretical connection between the variational bound objective (VBO) of the diffusion model and the value distribution learning objective (VDLO). Building upon this theoretical foundation, we develop our method with its optimization process and sample selection, which distinguishes it from the DSAC-D paper you mentioned.
Q6: I noticed that the performance ..................
A6: Thanks for your comments. We clarify this question from the following three aspects:
(i) The performance difference of DACER [10] between its original paper and our work comes from different implementation settings. As shown in Figure 1 of the DACER paper, its x-axis is not the number of online interaction steps as in our work (i.e., training steps) but the number of training iterations, with 20 interaction steps per iteration. Thus, to compare DACER and VDRL fairly, all x-axis coordinates in the DACER paper need to be multiplied by 20. With this conversion, the DACER results become similar to ours.
(ii) Our setting strictly follows the existing works: SAC, TD3, DIPO, QSM, QVPO, etc. Our experimental results on all other baselines including SAC are also consistent with these works.
(iii) Following your suggestion, we run experiments on three MuJoCo tasks using the official DACER code (JAX), as shown in Table 5. In particular, for a fair comparison, we set one training iteration per online interaction step, which is the same implementation setting as most existing works (i.e., SAC, TD3, DIPO, QVPO). The performance of DACER (JAX) is similar to the results of DACER (PyTorch) in Table 1 of the original manuscript.
Table 5: The performance of DACER on three MuJoCo tasks, showing the best mean returns and standard deviations over 800K training steps.
| Method | Ant-v3 | Halfcheetah-v3 | Walker-v3 |
|---|---|---|---|
| DACER (JAX) | 5493.57 ± 96.31 | 11428.16 ± 131.68 | 4062.09 ± 56.22 |
| DACER (PyTorch) | 5506.40 ± 99.67 | 11250.96 ± 119.48 | 4053.14 ± 62.19 |
[1] Shutong Ding, et al. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. NeurIPS, 2024.
Thanks to the authors for the response. Some issues have been resolved. I recently noticed that DIME [1], published at ICML 2025, is your direct competitor. Please reproduce its results on MuJoCo or weaken your SOTA claim. I have raised the score to 3.
[1] Celik et al. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning. ICML 2025
Dear Reviewer WsdU,
We sincerely appreciate your recognition of our rebuttal efforts. We are delighted to learn that some of your concerns have been addressed, and you have raised your score.
(i) Thank you for recommending this excellent SOTA work DIME [1] (recently published at ICML 2025), which we had not previously explored. For comparison, we conduct experiments comparing our work (VDRL) with DIME, as you suggested, on six tasks of the MuJoCo benchmark over 800,000 training steps, as shown in Table 1. From Table 1, we observe that VDRL achieves more competitive performance in most cases. Notably, our method is fundamentally different from DIME in its technical approach, as elaborated in the next point.
Table 1: The final performance of VDRL and DIME on six tasks of MuJoCo benchmark.
| Task | DIME | VDRL (ours) |
|---|---|---|
| Ant-v3 | 6692.52 ± 138.35 | 7246.18 ± 89.14 |
| Halfcheetah-v3 | 12424.73 ± 96.39 | 12820.48 ± 139.02 |
| Hopper-v3 | 3748.90 ± 47.15 | 3791.56 ± 16.37 |
| InvertedDoublePendulum-v2 | 9359.81 ± 0.63 | 9359.87 ± 0.12 |
| Swimmer-v3 | 142.27 ± 6.42 | 146.85 ± 2.55 |
| Walker2d-v3 | 5309.82 ± 68.49 | 5263.16 ± 24.36 |
(ii) Main difference: our method belongs to value diffusion RL, establishing a (both theoretical and methodological) connection between distributional RL and diffusion models, while DIME [1] (as well as DIPO, QSM, QVPO, and DACER) falls into the category of diffusion-based policy RL, leveraging advances in approximate inference with diffusion models to derive a lower bound on the maximum entropy objective (as stated in the original DIME paper).
It has been a pleasure communicating with you. We sincerely hope you could re-evaluate this work and we are looking forward to discussions if you have any other concerns. Thank you so much again for your valuable feedback and suggestions!
Authors
[1] Celik et al. DIME: Diffusion-Based Maximum Entropy Reinforcement Learning. ICML 2025
Thank you for your reply; I will consider increasing the score. The final version of the paper should ideally include these experiments and citations.
Dear Reviewer yrGJ,
We sincerely appreciate your recognition of our rebuttal efforts and the time taken to re-evaluate our work. We are delighted to learn that you have considered raising your score.
We will make sure to incorporate the additional experimental results and citations you suggested into the revised manuscript to further strengthen the validity and clarity of the paper. Your insights have been invaluable in enhancing the quality of our work.
Thank you so much again for your valuable feedback and suggestions!
Authors
This paper aims to address the distributional bias in value distributions via diffusion models (multi-modal value distributions). Specifically, the proposed variational loss of diffusion-based value distribution is theoretically proven to be the lower bound of the optimization objective. Experimental results on the MuJoCo benchmark demonstrate that the proposed VDRL outperforms existing baselines significantly.
Strengths and Weaknesses
Strengths
- It is an interesting idea to integrate the diffusion model and distributional RL.
- The writing is clear, but the structure should be improved.
Weakness
- Lines 171-174: The statements of underestimation and overestimation are not correct. These estimation biases have different effects in different task settings [1]. Additionally, it is possible to control the estimation bias via maxmin value estimation.
- There are no related experiments of value estimation accuracy that validate the paper's statements: "utilizing the generative capacity of diffusion models to represent expressive and multimodal value distribution"
Minor
- For coherency of reading, the related work section is usually placed after the introduction section or before the conclusion section.
[1] Lan, Qingfeng, et al. "Maxmin q-learning: Controlling the estimation bias of q-learning." arXiv preprint arXiv:2002.06487 (2020).
Questions
- Could the authors provide the training and inference speed comparison? One of my concerns is the computational budget of the diffusion value function.
- Experimental results on value estimation accuracy.
- Experimental results on discrete control tasks (e.g., Atari-10). Comparison of the proposed diffusion value method with the IQN.
Limitations
yes
Final Justification
I have carefully read the response, other reviews, and the paper. This paper has a novel motivation for advancing RL with the diffusion model, and comprehensive experiments demonstrating the superior performance. Thus, I tend to raise my score to 4.
Formatting Issues
None
We would like to thank you for recognizing our innovation and for your valuable comments. Below we respond to your concerns:
Q1: Lines 171-174: The statements of underestimation and overestimation are not correct. These estimation biases have different effects in different task settings [1]. Additionally, it is possible to control the estimation bias via maxmin value estimation.
A1: Thanks for your comments. We make the clarification as follows:
(i) For the first sentence of Lines 171-174, we conduct additional experiments to verify the effectiveness of the double value diffusion learning rule on three MuJoCo tasks, i.e., Ant-v3, Hopper-v3, and Walker2d-v3, as shown in Tables 1 and 2. The results in Table 1 indicate that this learning rule further reduces value overestimation and induces only a relatively minor value underestimation in some tasks.
(ii) From the experimental results, we observe that the slight value underestimation induced by this learning rule can improve performance on the MuJoCo benchmark, which may be because the underestimated values do not propagate explicitly through the policy update during training. Besides, the effects of underestimation and overestimation, i.e., the second sentence of Lines 171-174, have also been similarly discussed in TD3 [1] and DSACT [2].
(iii) As you suggested, and as shown in Maxmin Q-learning [3], these estimation biases have different effects in different tasks. Thus, one promising future direction is to integrate maxmin value estimation for different environment settings (a minimal sketch of a maxmin-style target is given after Table 2 below).
Table 1: The comparison of average value estimation bias between VDRL with and without the double value diffusion learning rule.
| Task | VDRL | VDRL w/o double value diffusion learning |
|---|---|---|
| Ant-v3 | -7.36 | 9.24 |
| Hopper-v3 | -113.70 | 151.70 |
| Walker2d-v3 | -0.95 | 12.09 |
Table 2: The comparison of final performance between VDRL with and without the double value diffusion mechanism.
| Task | VDRL | VDRL w/o double value diffusion learning |
|---|---|---|
| Ant-v3 | 7236.67 ± 89.44 | 7113.44 ± 130.06 |
| Hopper-v3 | 3639.41 ± 8.73 | 3526.17 ± 21.48 |
| Walker2d-v3 | 5216.24 ± 13.62 | 5098.87 ± 22.86 |
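For reference, below is a minimal sketch (assuming an actor-critic adaptation; not the authors' code) of the maxmin-style target from [3], where `q_nets` is a hypothetical ensemble of critic networks and the ensemble size controls how far the estimate is pushed from over- toward under-estimation.

```python
import torch

def maxmin_td_target(q_nets, reward, next_state, next_action, gamma=0.99):
    """TD target using the elementwise minimum over an ensemble of critics (illustrative)."""
    q_values = torch.stack([q(next_state, next_action) for q in q_nets], dim=0)
    q_min = q_values.min(dim=0).values   # larger ensembles push toward underestimation
    return reward + gamma * q_min        # one-step bootstrapped target
```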
Q2: There are no related experiments of value estimation accuracy that validate the paper's statements: "utilizing the generative capacity of diffusion models to represent expressive and multimodal value distribution". Experimental results on value estimation accuracy.
A2: We clarify the point you raised. The experimental results on value estimation accuracy were provided in Table 2 of the original manuscript. That table reports the average value estimation bias, i.e., the difference between the estimated Q-value and the true Q-value, for VDRL, traditional model-free online RL methods, and distributional RL methods. The results show that VDRL consistently achieves lower bias than DSAC and DSACT, highlighting the advantage of leveraging diffusion models to capture multimodal and complex value distributions for improved estimation accuracy.
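As a rough sketch of how such a bias metric can be computed (our illustrative reading of the metric, not necessarily the exact protocol of the manuscript), one compares the critic's estimate against the discounted Monte Carlo return obtained from the same state-action pair:

```python
import numpy as np

def average_value_estimation_bias(critic_q, rollouts, gamma=0.99):
    """Mean of (estimated Q - Monte Carlo return) over sampled state-action pairs."""
    biases = []
    for states, actions, rewards in rollouts:          # one trajectory per entry
        returns = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):        # discounted return-to-go
            running = rewards[t] + gamma * running
            returns[t] = running
        for s, a, g in zip(states, actions, returns):
            biases.append(critic_q(s, a) - g)          # positive => overestimation
    return float(np.mean(biases))
```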
Q3: For coherency of reading, the related work section is usually placed after the introduction section or before the conclusion section.
A3: Thanks for your suggestion. To improve the reading coherence, we will adjust the related work section to precede the conclusion section in the final version.
Q4: Could the authors provide the training and inference speed comparison? One of my concerns is the computational budget of the diffusion value function.
A4: Thanks for your question. The training and inference times of our method (VDRL) and the other baselines on three MuJoCo tasks, i.e., Ant-v3, Humanoid-v3, and Walker2d-v3, are shown in the following tables. Notably, the official code of QSM and DACER is based on the JAX framework, whereas the implementations of the other baselines and our method are based on PyTorch. In practice, an algorithm implemented in JAX is 6-10 times faster than the same algorithm implemented in PyTorch [4]. To ensure a fair comparison, we use PyTorch implementations of QSM and DACER that follow their official JAX code. Furthermore, the number of diffusion steps for all diffusion-based online RL baselines (DIPO, QSM, QVPO, and DACER) is set to 20.
Table 3: The training time (h) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 7.9 | 11.1 | 10.2 | 9.1 | 9.8 | 5.3 | 5.8 | 0.4 | 0.6 | 3.8 |
| Humanoid-v3 | 8.2 | 11.5 | 10.7 | 9.4 | 10.1 | 5.5 | 6.1 | 0.5 | 0.6 | 3.9 |
| Walker2d-v3 | 7.8 | 11.0 | 9.9 | 8.9 | 9.6 | 5.2 | 5.7 | 0.4 | 0.5 | 3.8 |
Table 4: The inference time (ms) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 3.6 | 5.8 | 6.4 | 6.2 | 5.4 | 1.9 | 2.1 | 0.2 | 0.2 | 0.4 |
| Humanoid-v3 | 3.7 | 6.0 | 6.5 | 6.3 | 5.5 | 2.0 | 2.2 | 0.2 | 0.3 | 0.5 |
| Walker2d-v3 | 3.6 | 5.7 | 6.3 | 6.1 | 5.3 | 1.9 | 2.1 | 0.2 | 0.3 | 0.4 |
From the results in the above tables, we can observe:
(i) Our method trains and infers faster than the diffusion-based online RL methods (i.e., DIPO, QSM, QVPO, and DACER), because VDRL uses fewer diffusion steps (T=10) than these baselines (T=20), which reduces the computational cost;
(ii) All diffusion-based RL methods (i.e., DIPO, QSM, QVPO, DACER, and VDRL) are slower than the distributional (i.e., DSAC and DSACT) or traditional (i.e., SAC, TD3, and PPO) ones, due to the iterative multi-step denoising process. Although VDRL is slower than the traditional or distributional RL methods at inference, its inference time (3.7 ms) is still acceptable for real-time applications. As discussed in QVPO [4], most existing real robots only require a 50-Hz control policy, i.e., one action output every 20 ms. Moreover, the training and inference time of VDRL can be further reduced by using the JAX framework if necessary, as in the official code of QSM and DACER. Hence, the inference time is not a bottleneck for applying our method to real-time applications.
Q5: Experimental results on discrete control tasks (e.g., Atari-10). Comparison of the proposed diffusion value method with the IQN.
A5: Following your suggestion, we conduct experiments on ten Atari games using the Arcade Learning Environment to compare our method with IQN [5]. The raw scores of VDRL and IQN are shown in Table 5, which demonstrates that our proposed diffusion value method achieves better or competitive performance on the ten Atari games.
Table 5: The comparison of raw scores between VDRL and IQN across ten Atari games, starting with 30 no-op actions, where the reference values are from IQN [5].
| Game | Human | IQN | VDRL (ours) |
|---|---|---|---|
| Alien | 7127.7 | 7022 | 16216 |
| Bowling | 160.7 | 86.5 | 125.8 |
| Centipede | 12017.0 | 11561 | 26249.2 |
| Enduro | 860.5 | 2359 | 3278 |
| Gopher | 2412.5 | 118365 | 121034.7 |
| Krull | 2665.5 | 10707 | 11652 |
| Pong | 14.6 | 21.0 | 22.1 |
| Seaquest | 42054.7 | 30140 | 38061 |
| Tennis | -8.3 | 23.6 | 27.4 |
| Venture | 1187.5 | 1318 | 1429 |
[1] Scott Fujimoto, et al. Addressing function approximation error in actor-critic methods. ICML, 2018.
[2] Jingliang Duan, et al. Distributional soft actor-critic with three refinements. TPAMI, 2025.
[3] Qingfeng Lan, et al. Maxmin Q-learning: Controlling the estimation bias of Q-learning. ICLR, 2020.
[4] Shutong Ding, et al. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. NeurIPS, 2024.
[5] Will Dabney, et al. Implicit quantile networks for distributional reinforcement learning. ICML, 2018.
Dear Reviewer KCCf,
Thanks for your timely reply! We would like to know whether our responses have addressed your concerns. We would appreciate it if you could re-evaluate our submission, and we look forward to further discussion if you have any other concerns. Thank you so much again!
Authors
Thanks for your response. I have carefully read the response, other reviews, and the paper. Please consider including these additional experimental results in the revised version, making the paper stronger. My concerns have been addressed properly, and I tend to raise my score to 4.
Dear Reviewer KCCf,
We sincerely appreciate your recognition of our rebuttal efforts and the time taken to re-evaluate our work. We are delighted to learn that your concerns have been addressed, and you are inclined to raise your score.
We will make sure to incorporate the additional experimental results you suggested into the revised manuscript to further strengthen the validity of this paper. Your insights have been invaluable in enhancing the quality of our work.
Thank you for recognizing our innovation and proposing these valuable comments/suggestions again!
Authors
This paper introduces Value Diffusion Reinforcement Learning (VDRL) which employs diffusion models to learn expressive and multimodal returns in an off-policy distributional RL framework. The authors prove that a diffusion variational lower bound (VLB) is a tight lower bound on the KL divergence between the Bellman-updated target distribution and the learned diffusion critic (Theorem 1). They then build an algorithm incorporating mainstream RL techniques—namely a double-critic sampling method—to mitigate overestimation and stabilize training. Experimental results on MuJoCo locomotion tasks provide some statistical evidence that VDRL consistently outperforms baselines (PPO, TD3, SAC, DSAC, DSACT, and diffusion-based methods DIPO, QSM, QVPO, DACER) in both mean return and Q-estimation bias.
Strengths and Weaknesses
- Quality
- Strength: Theorem 1 cleanly links diffusion VLB to the RL KL-loss and gives a concrete algorithmic foundation.
- Weakness: Theorem 1 follows directly from standard DDPM variational bounds (e.g., Eq. 5 in ref [41]), thus offering limited new insight. As part of their theoretical exploration it would have been nice to see an analysis of Bellman-operator contraction/convergence (as in C51 [Bellemare et al. 2017] and IQN [Dabney et al. 2018]), although this is not strictly necessary as long as learning curves show stability and convergence (which they do). But this brings me to the next point, which is the very low number of seeds (4) used to make statistical arguments for the superiority of the method. As a rule of thumb, 10 seeds is a requirement to make such statements; any less would be questionable, so I would not take their empirical tests seriously unless they fix it in the rebuttal. I also looked at one of the main baselines, QVPO [1], and the final returns of QVPO do not seem consistent between the two papers, judging by Figure 1 in your paper and Figure 3 in [1].
- Clarity
- Strength: Well-motivated exposition with clear pseudocode (Algorithm 1).
- Weakness: Minor typos (e.g., “introducs”, “alogirthms”); Eq. 10 needs a revision. Hyperparameters (diffusion steps, noise schedules, seed count) are scattered across the main text, Section 5.4, and Appendix D instead of being centralized.
- Significance
- Strength: Addresses the limitation of unimodal Gaussian or quantile critics by modeling full return distributions, leading to better representation of complex reward landscapes.
- Weakness: Evaluation is confined to deterministic MuJoCo environments; lacks tests under sensor/motor noise, partial observability, or on real-world robotic platforms. The paper also omits reporting compute–performance tradeoffs (e.g., reverse-sampling cost, wall-clock metrics).
- Originality
- Strength: First to apply diffusion generative modeling directly to value distributions in RL rather than to policies.
- Weakness: The work overlaps heavily with recent diffusion-value works (Mazoure et al. 2023; QSM; QVPO). The core DDPM VLB machinery is reused rather than extended, so the conceptual novelty is incremental.
Ref: [1] Ding, S. et al . Diffusion-based Reinforcement Learning via Q-weighted Variational Policy Optimization. NeurIPS 2024.
Questions
- Do MuJoCo locomotion tasks exhibit true multimodal return distributions? (any refs if yes)? Have you considered inherently multimodal benchmarks (e.g., Atari with stochastic rewards, hierarchical tasks, synthetic bimodal MDPs) to directly test your critic’s multimodal modeling claims?
- How does VDRL’s objective, derivation, and performance differ from QSM and QVPO? Please provide theoretical distinctions and empirical comparisons on a common benchmark.
- Could you consider reporting the per-iteration wall-clock time and total GPU-hours for VDRL vs. DSAC/DSACT.
- Will you consider testing VDRL under observation/action noise? If not, discuss anticipated failure modes and possible mitigation strategies.
Limitations
- Compute Latency: Appendix E notes computational overhead but fails to provide concrete latency or throughput metrics, which are critical for high-frequency control applications.
Final Justification
Thank you for addressing most of my points (improving seed count and the benchmark multimodality (the 10 Atari examples)). I think the paper is well-executed at this point given the ideas proposed, but the contribution (in terms of the conceptual novelty and possible significance) is still, in my opinion, incremental. To recognize the efforts and the new results you presented in the rebuttal, I will raise my score to 4.
Formatting Issues
N/A
We would like to thank you for your valuable comments. Below we respond to your concerns:
Weakness 1
A: Thanks for your comments. We make the clarification as below:
(i) The core theoretical contribution of Theorem 1 lies in establishing the formal connection between the Variational Bound Objective (VBO) of diffusion models and the Value Distribution Learning Objective (VDLO) in distributional RL. It is inspired by the DDPM variational bound but differs from it: i) the problem being solved is different; ii) the derivation process is different from that of the DDPM variational bound; iii) the resulting variational bounds (for DDPM, VAE, and ours) only look similar in style, which may be why Theorem 1 appears to "follow directly from standard DDPM variational bounds".
(ii) We conduct experiments comparing VDRL and the baselines on three MuJoCo tasks with 10 runs using different random seeds. Table 1 confirms the conclusions claimed in the manuscript.
Table 1: The final performance of VDRL and baselines on three tasks of MuJoCo benchmark with ten seeds over 800,000 training steps.
| Task | SAC | QVPO | DSACT | VDRL (ours) |
|---|---|---|---|---|
| Ant-v3 | 4286.35 ± 135.19 | 5230.64 ± 80.32 | 5314.06 ± 141.58 | 7368.67 ± 81.27 |
| Halfcheetah-v3 | 10308.47 ± 90.84 | 9104.28 ± 76.11 | 12069.31 ± 152.02 | 12643.09 ± 156.44 |
| Walker-v3 | 4493.60 ± 64.31 | 4197.09 ± 47.77 | 5114.26 ± 32.15 | 5340.83 ± 24.95 |
(iii) Regarding the inconsistency of QVPO results between our work and its original paper, we clarify that all QVPO comparison results on the MuJoCo benchmark are produced strictly with the official code (https://github.com/wadx2019/qvpo).
Weakness 2
A: Thanks for your review. We will correct these typos and consolidate the hyperparameters into Table 3 of the manuscript in the final version, as you suggested.
Weakness 3 (Part 1) and Question 4
A: Thanks for your comments. The MuJoCo environments have standard protocols for evaluating algorithms, and previous online RL methods follow these protocols; for fair comparison, we adopt the same configurations. We appreciate the points you raised, which are interesting and promising future directions, but they seem beyond the scope of this work. Due to time limitations, we conducted an experiment adding noise to the observations. We observe that noise can disrupt value learning in some tasks by distorting state-action representations. Techniques such as data augmentation or adversarial training could improve noise resilience, and we will explore these ideas in future work.
Weakness 3 (Part 2)
A: Thanks for your question. The training and inference times of our method and the other baselines on three MuJoCo tasks are shown in the following tables. Notably, the official code of QSM and DACER is based on the JAX framework, whereas the implementations of the other baselines and our method are based on PyTorch. In practice, an algorithm implemented in JAX is 6-10 times faster than the same algorithm implemented in PyTorch [1]. To ensure a fair comparison, we use PyTorch implementations of QSM and DACER that follow their official JAX code.
Table 2: The training time (h) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 7.9 | 11.1 | 10.2 | 9.1 | 9.8 | 5.3 | 5.8 | 0.4 | 0.6 | 3.8 |
| Humanoid-v3 | 8.2 | 11.5 | 10.7 | 9.4 | 10.1 | 5.5 | 6.1 | 0.5 | 0.6 | 3.9 |
| Walker2d-v3 | 7.8 | 11.0 | 9.9 | 8.9 | 9.6 | 5.2 | 5.7 | 0.4 | 0.5 | 3.8 |
Table 3: The inference time (ms) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 3.6 | 5.8 | 6.4 | 6.2 | 5.4 | 1.9 | 2.1 | 0.2 | 0.2 | 0.4 |
| Humanoid-v3 | 3.7 | 6.0 | 6.5 | 6.3 | 5.5 | 2.0 | 2.2 | 0.2 | 0.3 | 0.5 |
| Walker2d-v3 | 3.6 | 5.7 | 6.3 | 6.1 | 5.3 | 1.9 | 2.1 | 0.2 | 0.3 | 0.4 |
From the results in the above tables, we can observe:
(i) Our method trains and infers faster than the diffusion-based online RL methods (i.e., DIPO, QSM, QVPO, and DACER), because VDRL uses fewer diffusion steps (T=10) than these baselines (T=20), which reduces the computational cost;
(ii) All diffusion-based RL methods (i.e., DIPO, QSM, QVPO, DACER, and VDRL) are slower than the distributional (i.e., DSAC and DSACT) or traditional (i.e., SAC, TD3, and PPO) ones, due to the iterative multi-step denoising process. Although VDRL is slower than the traditional or distributional RL methods at inference, its inference time (3.7 ms) is still acceptable for real-time applications. As discussed in QVPO, most existing real robots only require a 50-Hz control policy, i.e., one action output every 20 ms. Moreover, the training and inference time of VDRL can be further reduced by using the JAX framework if necessary, as in the official code of QSM and DACER.
Overall, we believe the performance improvements of our method justify the moderately longer training and inference time.
Weakness 4
A: Thanks for your comments. We make the clarification as follows:
(i) Main difference: our work derives and builds a value-distribution diffusion model, while the works you mentioned (QSM and QVPO) fall into the category of policy-based diffusion models. That is, VDRL is fundamentally different from these diffusion-based policy RL methods.
(ii) The core conceptual novelty of our work lies in establishing the (both theoretical and methodological) connection between the Variational Bound Objective (VBO) in diffusion models and the Value Distribution Learning Objective (VDLO) in distributional RL. Crucially, we theoretically prove that the VBO serves as a tight lower bound of the VDLO under KL-divergence minimization. This theoretical foundation underpins our methodological contribution.
Question 1
A: Thanks for your comments. We make the clarification as below:
(i) The MuJoCo locomotion benchmark is widely used by existing online RL methods to verify their performance. In practice, we experimentally fit the value distributions on this benchmark and observe that they are not unimodal.
(ii) We conduct experiments on 10 Atari games with inherently multimodal returns to verify the effectiveness of VDRL compared with four methods, as shown in Table 4. The results demonstrate that VDRL achieves better or competitive performance on the ten Atari games.
Table 4: The comparison of raw scores between VDRL and four baselines across ten Atari games, starting with 30 no-op actions.
| Game | Human | C51 | IQN | ER-DQN | FQF | VDRL (ours) |
|---|---|---|---|---|---|---|
| Alien | 7127.7 | 3166 | 7022 | 6212 | 16754.6 | 16216 |
| Bowling | 160.7 | 81.8 | 86.5 | 53.1 | 102.3 | 125.8 |
| Centipede | 12017.0 | 9646 | 11561 | 22505.9 | 11526.0 | 26249.2 |
| Enduro | 860.5 | 3454 | 2359 | 2339.5 | 2370.8 | 3278 |
| Gopher | 2412.5 | 33641 | 118365 | 115828.3 | 121144 | 121034.7 |
| Krull | 2665.5 | 9735 | 10707 | 11318.5 | 10706.8 | 11652 |
| Pong | 14.6 | 20.9 | 21.0 | 21.0 | 21.0 | 22.1 |
| Seaquest | 42054.7 | 26434 | 30140 | 19401 | 29383.3 | 38061 |
| Tennis | -8.3 | 23.1 | 23.6 | 5.8 | 22.6 | 27.4 |
| Venture | 1187.5 | 1520 | 1318 | 1107 | 1112 | 1429 |
Question 2
A: Thanks for your comments. We clarify the distinctions as follows:
(i) Theoretical objective and derivation: our optimization objective is derived from value distribution learning (i.e., Theorem 1 in Section 3.1 of the original manuscript), while the objective of QSM is to align the score of the learned policy. The differences in derivation and formulation between our method and QVPO were discussed in Remark 1 (Lines 154-160) of the original manuscript.
(ii) Empirical comparison: the empirical comparisons of VDRL, QSM, and QVPO on the MuJoCo benchmark are provided in Figure 1 and Table 1 of the original manuscript.
Question 3
A: We conduct experiments comparing VDRL, DSAC, and DSACT on three MuJoCo tasks as a function of wall-clock time.
(i) Table 5 further demonstrates that VDRL achieves better or competitive performance compared with these methods.
Table 5: The final performance of VDRL and baselines on three MuJoCo tasks over 8 hours of training time.
| Task | DSAC | DSACT | VDRL (ours) |
|---|---|---|---|
| Ant-v3 | 4625.94 ± 226.31 | 5784.20 ± 105.59 | 7248.18 ± 89.14 |
| Halfcheetah-v3 | 11698.55 ± 136.47 | 11740.82 ± 130.08 | 12840.61 ± 120.97 |
| Walker2d-v3 | 4913.82 ± 184.09 | 5167.18 ± 29.67 | 5248.36 ± 11.56 |
(ii) For the comparison of total GPU-hours (training time) between VDRL and these baselines, please refer to the answer to Weakness 3 (Part 2).
[1] Shutong Ding, et al. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. NeurIPS, 2024.
Thank you for addressing most of my points (improving seed count and the benchmark multimodality (the 10 Atari examples)). I think the paper is well-executed at this point given the ideas proposed, but the contribution (in terms of the conceptual novelty and possible significance) is still, in my opinion, incremental. To recognize the efforts and the new results you presented in the rebuttal, I will raise my score to 4.
Dear Reviewer WruH,
We sincerely appreciate your recognition of our rebuttal efforts and the time taken to re-evaluate our work. We are delighted to learn that most of your concerns have been addressed, and you are inclined to raise your score.
Regarding the concern about conceptual novelty, we respectfully highlight that our work, from the value distribution perspective, establishes a (both theoretical and methodological) connection between distributional RL and diffusion models, and then introduces value diffusion reinforcement learning, which fundamentally differs from prior diffusion-based online policy RL methods (i.e., DIPO, QSM, QVPO, and DACER).
Thank you for your time and proposing these valuable comments/suggestions again!
Authors
Accurate Q-value estimation is essential for effective policy optimization in reinforcement learning. This work introduces Value Diffusion Reinforcement Learning (VDRL), a model-free online RL approach that uses diffusion models to capture rich, multimodal value distributions. The key idea is a diffusion-based variational loss, theoretically shown to provide a tight lower bound under KL-divergence, leading to more precise value estimation. To further stabilize training, the authors propose a double value diffusion mechanism with sample selection. Experiments on MuJoCo benchmarks show that VDRL consistently outperforms strong state-of-the-art baselines, demonstrating its effectiveness.
Strengths and Weaknesses
Strengths:
The paper tackles an important problem, which is using diffusion models for value estimation in RL.
The method is motivated by theoretical analysis and results.
The experimental evaluation is rich and well executed, comparing against 11 baseline algorithms on 8 tasks from MuJoCo and reporting both plots and tables.
The paper is well-organized and well-written.
Weaknesses:
The computational overhead of using diffusion in value estimation is not thoroughly analyzed. There is only a brief mention of trade-offs in the sensitivity analysis (Section 5.4) as well as a brief mention in the limitations of the computational overhead during online training, particularly in high-dimensional action space (Appendix E). Runtime analysis and computational complexity analysis are lacking.
While MuJoCo is a representative, standard, and popular benchmark, it remains unclear how VDRL would perform in other tasks.
Sensitivity analysis was only conducted on the Humanoid-v3 task as an example. However, it remains unclear whether the same conclusions hold on the other tasks. The authors should provide ablations on different environments (say, three of them) to confirm the observation.
Questions
(1) Can the authors provide detailed runtime comparisons compared to the other baselines?
(2) Can the authors expand the sensitivity analysis to at least two or three additional environments to confirm the consistency of conclusions?
(3) How critical is the double value diffusion mechanism? Could the authors provide an ablation showing performance with and without this?
Limitations
Yes.
Final Justification
My concerns have been addressed, and I have decided to raise my score to 5. Please incorporate the additional experimental results and discussions into the revised version of the paper.
Formatting Issues
No.
We would like to thank you for recognizing our innovation and for your valuable comments. Below we respond to your concerns:
Q1: The computational overhead of using diffusion in value estimation ..........
A1: Thanks for your question. The training and inference times of our method (VDRL) and the other baselines on three MuJoCo tasks, i.e., Ant-v3, Humanoid-v3, and Walker2d-v3, are shown in the following tables. Notably, the official code of QSM and DACER is based on the JAX framework, whereas the implementations of the other baselines and our method are based on PyTorch. In practice, an algorithm implemented in JAX is 6-10 times faster than the same algorithm implemented in PyTorch [1]. To ensure a fair comparison, we use PyTorch implementations of QSM and DACER that follow their official JAX code. Furthermore, the number of diffusion steps for all diffusion-based online RL baselines (DIPO, QSM, QVPO, and DACER) is set to 20.
Table 1: The training time (h) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 7.9 | 11.1 | 10.2 | 9.1 | 9.8 | 5.3 | 5.8 | 0.4 | 0.6 | 3.8 |
| Humanoid-v3 | 8.2 | 11.5 | 10.7 | 9.4 | 10.1 | 5.5 | 6.1 | 0.5 | 0.6 | 3.9 |
| Walker2d-v3 | 7.8 | 11.0 | 9.9 | 8.9 | 9.6 | 5.2 | 5.7 | 0.4 | 0.5 | 3.8 |
Table 2: The inference time (ms) comparison on three tasks of MuJoCo Benchmarks.
|  | Diffusion-based |  |  |  |  | Distributional |  | Traditional |  |  |
|---|---|---|---|---|---|---|---|---|---|---|
| Method | VDRL (ours) | DIPO | QSM | QVPO | DACER | DSAC | DSACT | PPO | TD3 | SAC |
| Ant-v3 | 3.6 | 5.8 | 6.4 | 6.2 | 5.4 | 1.9 | 2.1 | 0.2 | 0.2 | 0.4 |
| Humanoid-v3 | 3.7 | 6.0 | 6.5 | 6.3 | 5.5 | 2.0 | 2.2 | 0.2 | 0.3 | 0.5 |
| Walker2d-v3 | 3.6 | 5.7 | 6.3 | 6.1 | 5.3 | 1.9 | 2.1 | 0.2 | 0.3 | 0.4 |
From the results in the above tables, we can observe:
(i) Our method trains and infers faster than the diffusion-based online RL methods (i.e., DIPO, QSM, QVPO, and DACER), because VDRL uses fewer diffusion steps (T=10) than these baselines (T=20), which reduces the computational cost;
(ii) All diffusion-based RL methods (i.e., DIPO, QSM, QVPO, DACER, and VDRL) are slower than the distributional (i.e., DSAC and DSACT) or traditional (i.e., SAC, TD3, and PPO) ones, due to the iterative multi-step denoising process. Although VDRL is slower than the traditional or distributional RL methods at inference, its inference time (3.7 ms) is still acceptable for real-time applications. As discussed in QVPO [1], most existing real robots only require a 50-Hz control policy, i.e., one action output every 20 ms. Moreover, the training and inference time of VDRL can be further reduced by using the JAX framework if necessary, as in the official code of QSM and DACER. Hence, the inference time is not a bottleneck for applying our method to real-time applications.
Q2: While MuJoCo is a representative, standard, and popular benchmark, it remains unclear how VDRL would perform in other tasks.
A2: Thanks for your comments. Following your suggestion, we conduct additional experiments on 10 Atari games with inherently multimodal returns, using the Arcade Learning Environment, to verify the effectiveness of VDRL. The raw scores are shown in Table 3, which demonstrates that VDRL achieves better or competitive performance on the ten Atari games.
Table 3: The comparison of raw scores between VDRL and four baselines across ten Atari games, starting with 30 no-op actions, where the reference values are from C51 [2], IQN [3], ER-DQN [4], and FQF [5].
| Game | Human | C51 | IQN | ER-DQN | FQF | VDRL (ours) |
|---|---|---|---|---|---|---|
| Alien | 7127.7 | 3166 | 7022 | 6212 | 16754.6 | 16216 |
| Bowling | 160.7 | 81.8 | 86.5 | 53.1 | 102.3 | 125.8 |
| Centipede | 12017.0 | 9646 | 11561 | 22505.9 | 11526.0 | 26249.2 |
| Enduro | 860.5 | 3454 | 2359 | 2339.5 | 2370.8 | 3278 |
| Gopher | 2412.5 | 33641 | 118365 | 115828.3 | 121144 | 121034.7 |
| Krull | 2665.5 | 9735 | 10707 | 11318.5 | 10706.8 | 11652 |
| Pong | 14.6 | 20.9 | 21.0 | 21.0 | 21.0 | 22.1 |
| Seaquest | 42054.7 | 26434 | 30140 | 19401 | 29383.3 | 38061 |
| Tennis | -8.3 | 23.1 | 23.6 | 5.8 | 22.6 | 27.4 |
| Venture | 1187.5 | 1520 | 1318 | 1107 | 1112 | 1429 |
Q3: Sensitivity analysis was only conducted on Humanoid-v3 task as an example. However, it remains unclear if the same conclusions holds on the other tasks. The authors should provide ablations on different environments (say three of them) to confirm the observation. Can the authors expand the sensitivity analysis to at least two or three additional environments to confirm the consistency of conclusions?
A3: Thanks for your comment. Following your suggestion, we conduct additional experiments on three MuJoCo tasks, including Ant-v3, Hopper-v3, and Walker2d-v3, as shown in Tables 4 and 5. These results further confirm the consistency of our conclusions:
(i) the performance does not improve monotonically with an increasing number of diffusion steps, and the number of diffusion steps should be set to 10 considering the trade-off between performance and computational complexity;
(ii) the cosine and variance_preserve noise schedules perform similarly, and both outperform the linear schedule; hence, we choose the cosine schedule for all tasks (a sketch of the standard cosine schedule is given after Table 5 below).
Table 4: The final performance (mean return and standard deviation) for different numbers of diffusion steps.
| Task | T=5 | T=10 | T=20 | T=30 |
|---|---|---|---|---|
| Ant-v3 | 5824.41 ± 92.66 | 7236.67 ± 89.44 | 6993.57 ± 92.96 | 6452.06 ± 97.20 |
| Hopper-v3 | 3404.68 ± 9.92 | 3639.41 ± 8.73 | 3716.93 ± 13.58 | 3523.13 ± 11.05 |
| Walker2d-v3 | 4938.82 ± 14.27 | 5216.24 ± 13.62 | 5208.74 ± 12.21 | 5127.44 ± 25.86 |
Table 5: The final performance (mean return and standard deviation) for different diffusion noise schedules.
| Task | linear | variance_preserve | cosine |
|---|---|---|---|
| Ant-v3 | 6422.53 ± 82.06 | 7119.02 ± 98.57 | 7236.67 ± 89.44 |
| Hopper-v3 | 3490.76 ± 8.17 | 3551.37 ± 10.19 | 3639.41 ± 8.73 |
| Walker2d-v3 | 4708.92 ± 12.84 | 5203.56 ± 13.99 | 5216.24 ± 13.62 |
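For completeness, the noise schedules above differ only in how the per-step variances beta_t are generated. Below is a sketch of the standard cosine schedule (Nichol & Dhariwal, 2021), which we assume is what the "cosine" option in Table 5 refers to:

```python
import numpy as np

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas chosen so that the cumulative alpha-bar follows a squared-cosine curve."""
    def alpha_bar(u):
        return np.cos((u + s) / (1.0 + s) * np.pi / 2.0) ** 2
    betas = [min(1.0 - alpha_bar((t + 1) / T) / alpha_bar(t / T), max_beta)
             for t in range(T)]
    return np.array(betas)
```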
Q4: How critical is the double value diffusion mechanism? Could the authors provide an ablation showing performance with and without this?
A4: Thanks for your comments. To investigate the significance of the double value diffusion mechanism, we conduct the ablation experiments you suggested on three MuJoCo tasks, i.e., Ant-v3, Hopper-v3, and Walker2d-v3, as shown in Tables 6 and 7. The results show that VDRL with the double value diffusion mechanism achieves better performance than VDRL without it in terms of average returns and variance, which verifies the effectiveness of this mechanism (an illustrative sketch of one possible form of the mechanism is given after Table 7 below).
Table 6: The comparison of final performance between VDRL with and without the double value diffusion mechanism.
| Task | VDRL | VDRL w/o double value diffusion learning |
|---|---|---|
| Ant-v3 | 7236.67 ± 89.44 | 7113.44 ± 130.06 |
| Hopper-v3 | 3639.41 ± 8.73 | 3526.17 ± 21.48 |
| Walker2d-v3 | 5216.24 ± 13.62 | 5098.87 ± 22.86 |
Table 7: The comparison of average value estimation bias between VDRL with and without the double value diffusion learning rule.
| Task | VDRL | VDRL w/o double value diffusion learning |
|---|---|---|
| Ant-v3 | -7.36 | 9.24 |
| Hopper-v3 | -113.70 | 151.70 |
| Walker2d-v3 | -0.95 | 12.09 |
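To make the ablated component concrete, the sketch below shows one plausible form of a double-critic target with sample selection, in the spirit of clipped double Q-learning; the exact rule used in the manuscript may differ. `sample_q1` and `sample_q2` are hypothetical samplers that draw return samples from the two diffusion critics.

```python
import torch

def double_diffusion_target(sample_q1, sample_q2, reward, next_sa,
                            gamma=0.99, n_samples=8):
    """Distributional TD target built from the more conservative of two diffusion critics."""
    z1 = sample_q1(next_sa, n_samples)   # [n_samples, batch] return samples, critic 1
    z2 = sample_q2(next_sa, n_samples)   # [n_samples, batch] return samples, critic 2
    # Per state-action pair, keep the samples of the critic with the lower mean,
    # mirroring the pessimistic choice in clipped double Q-learning.
    use_first = z1.mean(dim=0) <= z2.mean(dim=0)
    selected = torch.where(use_first, z1, z2)
    return reward.unsqueeze(0) + gamma * selected
```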
[1] Shutong Ding, et al. Diffusion-based reinforcement learning via Q-weighted variational policy optimization. NeurIPS, 2024.
[2] Marc G. Bellemare, et al. A distributional perspective on reinforcement learning. ICML, 2017.
[3] Will Dabney, et al. Implicit quantile networks for distributional reinforcement learning. ICML, 2018.
[4] Mark Rowland, et al. Statistics and samples in distributional reinforcement learning. ICML, 2019.
[5] Derek Yang, et al. Fully parameterized quantile function for distributional reinforcement learning. NeurIPS, 2019.
Dear Reviewer yrGJ,
Thanks for your timely reply! We would like to know whether our responses have addressed your concerns. We would appreciate it if you could re-evaluate our submission, and we look forward to further discussion if you have any other concerns. Thank you so much again!
Authors
Are there any further questions for the authors?
Dear Reviewer yrGJ,
We sincerely appreciate the time and effort you have dedicated to reviewing our manuscript and providing valuable feedback. We have provided detailed responses to your comments and hope that they adequately address your concerns.
As the discussion stage deadline approaches, we would be grateful to know whether our responses have resolved the issues you raised. If you need further clarification or have any other questions, please feel free to discuss them with us! We are more than willing to continue our communication with you.
Thank you once again for your thoughtful review. We would be deeply appreciative if you could re-evaluate our submission.
Authors
Dear Authors,
Thank you for your detailed response. I have also reviewed other reviewers’ comments and your replies. Please incorporate the additional experimental results and discussions into the revised version of the paper, as they will strengthen its contribution. My concerns have been addressed, and I have decided to raise my score to 5.
Best Regards.
Dear Reviewer yrGJ,
We sincerely appreciate your recognition of our rebuttal efforts and the time taken to re-evaluate our work. We are delighted to learn that your concerns have been addressed, and you have raised your score.
We will make sure to incorporate the additional experimental results and discussions you suggested into the revised manuscript to further strengthen the validity and clarity of the paper. Your insights have been invaluable in enhancing the quality of our work.
Thank you for recognizing our innovation and proposing these valuable comments/suggestions again!
Authors
The reviewers appreciated the novelty of the paper, but converged on a positive view of this paper only after the authors provided substantial additional results that were not included in the submitted paper. Since it would not be difficult to incorporate these results into the final paper, I am leaning positive on this.