Diffusion Actor-Critic with Entropy Regulator
Abstract
Reviews and Discussion
The paper proposes DACER — an actor-critic that uses the inverse diffusion process as policy. Additionally, some noise is added to actions to increase the entropy (similar motivation as in SAC but implemented differently). Empirical evaluations and ablations show that the method is quite capable.
Strengths
Originality: good. The paper proposes a novel method for using diffusion as a policy representation in a SAC-like algorithm.
Quality: good. The method is presented and evaluated thoroughly.
Clarity: excellent. The paper is easy to follow.
Significance: good. The method will be of interest to the RL community.
Weaknesses
- No comparison to an algorithm with multimodal policy. The key advantage of the diffusion policy compared to the usual Gaussian is its multimodality. But there are other algorithms with multimodal policies, e.g., Reinforcement Learning with Deep Energy-Based Policies and many newer works (check "Cited by" on that paper).
- No clear demonstration of where multimodality helps. A toy task, or better yet a non-toy task, where one sees a clear advantage would be nice to see.
Questions
The proposed entropy regulation scheme seems a bit unusual. Did you try other options (e.g. like in TRPO)?
Limitations
Please add a paragraph discussing the limitations.
We appreciate your careful reading of our paper and the detailed discussion.
> Weakness 1
As you mentioned, it is meaningful to compare with other algorithms that have multimodal characteristics. We chose to compare against the two online RL algorithms with diffusion policies mentioned in the related work (DIPO and QSM). As shown in Figure 3 of the author global rebuttal PDF, DACER achieves the best performance.
> Weakness 2
We answer this question in two ways. First, some non-toy tasks are inherently multimodal. Zou et al. [1] propose a vehicle control scenario in which the optimal policy is multimodal: when traffic regulations are not taken into account and the ego car is directly behind the front vehicle, which is the only other vehicle, overtaking from either the left or the right is optimal. In this case, only a multimodal policy can handle the situation. We add experiments comparing the action (steering wheel angle) distributions of DSAC and DACER at the bifurcation point. We use the diffusion policy to sample 200 different actions and plot a histogram in Figure 5 of the author global rebuttal PDF. The results indicate that DACER learns a bimodal policy, whereas DSAC can only learn a unimodal distribution for overtaking from one side.
Second, we think multimodality enhances exploration. To assess the multi-style capability of DACER, we use the method proposed in [2], which analyzes the diversity of state trajectories by computing the information entropy over 1,100 independent simulation episodes. The experimental results are presented in Table 2 of the author global rebuttal. The results show that in the Humanoid-v3, Ant-v3, Hopper-v3, and Pusher-v2 tasks, DACER achieves the highest entropy value, surpassing DSAC, SAC, DDPG, TD3, PPO, and TRPO.
> Question 1
This question touches on one of the most creative points of DACER. The entropy of the diffusion policy is difficult to calculate analytically, which means we cannot use entropy regularization methods like the one in TRPO to adjust the entropy (the gradient cannot be backpropagated to the diffusion policy). We also tried adaptively adjusting the single-step denoising coefficient based on the estimated entropy during the reverse diffusion process, but the results were not as good as the current method.
> Limitation
In Appendix B we discuss limitations and future work, which we will move into the main text in the official version.
Reference
[1] Zou et al. "Policy Bifurcation in Safe Reinforcement Learning," arXiv, 2403.
[2] Xiao et al. "Multi-Style Distributional Soft Actor-Critic: Learning a Unified Policy for Diverse Control Behaviors," in IEEE TIV, 2024.
I thank the authors for answering my questions and I appreciate the clarifications
The paper presents a novel method using diffusion models as a policy parameterization for online reinforcement learning. The method works by learning a Q function and backpropagating through the reverse diffusion process in order to update the diffusion policy weights, similarly to Diffusion-QL. In order to aid exploration, the authors estimate the entropy of the policy and use it to update a noise scale parameter, which adjusts the noise added to actions after sampling.
Strengths
- The paper's initial results are strong and I think with further experimentation it could be a strong paper
- Few papers have studied the application of diffusion policies to online RL, and this paper presents a fairly straight-forward method in that direction which seems to achieve strong performance compared to baselines.
- The presentation of the method is generally easy to understand
- The use of GMM to estimate the entropy of the diffusion policy is a novel and creative trick
Weaknesses
- A glaring issue with the paper is that the authors chose not to compare to any other papers using diffusion policies for online RL. They claim without citation that these papers perform worse than SAC although these papers claim to outperform SAC. In order to fully contextualize this paper I feel that these comparisons are essential.
- As I discuss in the following section, several key results are missing.
- The presentation is generally pretty sloppy as I discuss at several points in the following section.
- I worry that the method may be very slow, and that the gains in performance may not be enough to justify the added computational cost. It would be useful to see some data about the comparative wall clock times of the algorithm and baselines.
Questions
Random questions and comments
- Section 2 paragraph 2, you frequently cite incompatibility / combinability with past methods, or a lack thereof, as an important feature of an algorithm, using this metric to claim a victory over other methods using diffusion for online RL. I disagree that this is a useful metric for evaluating algorithms and was wondering whether you could elaborate on the logic behind this claim.
- Line 108 you claim that DIPO and QSM perform worse than SAC and thus you don't compare to them. However, both papers claim to outperform SAC so I don't see why you believe they perform worse.
- The method seems to rely on the fact that a GMM can accurately fit your action distribution. Is there a reason you can't simply use that GMM for your policy? Seems to me that that would be more efficient and easy to implement. I think this is an important ablation to run to show that the full expressive power of the diffusion policy is actually useful in the settings you consider.
- Figures 3-5 all have the same caption, "Trainining curves" [sic]. The caption should fully describe what is going on in the figure.
- Your method takes 7 hours to train with the JAX implementation and apparently 4-5 times longer with PyTorch. How long do the baseline algorithms take to train? Assuming that they train for much less time, can you make a case that the extra time is worth it for generally modest performance improvements?
- In Figure 3 what is DAC (I assume that this is the ablated version but you never state the meaning of that acronym)? It seems to outperform your method. Is the legend switched?
- One of the key claims of your paper is that a diffusion policy is necessary due to its ability to accurately model multimodal distributions, and yet your main evaluation suite doesn't include any tasks that require this. You do have a toy problem where this is necessary, but I believe that additional experiments in an environment such as Push-T from Cheng et al. would help you to provide evidence for this claim.
- Why do your plots show learning over iterations rather than the more common metric of the number of environment timesteps?
- The core novelty of your method is using the GMM for adaptive noise injection, but I don't fully understand whether this is necessary. It would be nice to see how the algorithm performs with a wider range of fixed noise values, with and without linear decay, across several environments. If you still show a win over all of these it shows that the adaptive noise injection is necessary. Even if you tie with a few, you can still use this experiment to argue that the adaptive noise injection helps you to avoid an expensive hyperparameter tuning process.
- One symbol is overloaded: you use it both as a variable in your mathematical description of the denoising process and as the variable used to inject noise into the policy.
- You argue that the diffusion model is important because it enables multimodal policies, and yet at the end of the day you add unimodal noise to add exploration. Have you looked into whether this prevents the policy from learning multimodal behaviors?
- How did you tune the hyperparameters for your algorithm and for the comparisons?
Nit picks:
- The text in all of the plots is far too small
- The training curves would be much easier to read if the curves were smoothed.
- The caption of Table 1 written entirely in capital letters is hard to read.
- It is almost impossible to distinguish PPO from DDPG in your plots.
Limitations
The authors discuss the limitations of their work, although they do it in the appendix.
Thank you for the insightful review. We describe below the changes we are making to address your comments.
Q1
Our work emphasizes proposing the reverse diffusion process as a novel policy representation that can be combined with existing online RL algorithms, and Section 2.2 highlights the method's portability. To demonstrate this, Figure 4 of the global rebuttal includes extra experiments combining the diffusion policy with SAC.
However, your suggestion is well taken, and we have modified the relevant statement in Section 2.2 to:
"Yang et al.[1] pioneered this approach by proposing action gradient method. This approach achieves policy improvement by updating the action in the replay buffer through the , followed by mimicry learning of the action post-update using a diffusion model. However, action gradient increased the additional training time. Furthermore, it is difficult to fully learn both the action gradient and imitation learning steps simultaneously, which also resulted in suboptimal performance of this method in MuJoCo tasks. Penka et al.[2] proposes Q-score matching (QSM), a new methodology for off-policy reinforcement learning that leverages the score-based structure of diffusion model policies to align with the . This approach aims to overcome the limitations of simple behavior cloning in actor-critic settings by integrating the policy's score with the Q-function's action gradient. However, QSM needs to accurately learn in most of the action space to achieve optimal guidance. This is difficult to accomplish, resulting in suboptimal performance of QSM."
Q2
As you mentioned, comparing the performance with DIPO and QSM is meaningful. In fact, we have previously conducted relevant experiments. However, using the hyperparameters provided in their papers and testing within our JAX framework, we found that their performance did not surpass SAC as claimed. To avoid any misunderstanding that we intentionally lowered their performance, we chose not to include the experimental results. To address your concerns, we have added the results in the global rebuttal PDF Figure 3. We will proactively contact the authors to discuss why the performance does not exceed SAC.
Q3
First, using a GMM as the policy requires pre-specifying the number of Gaussian components, which limits its ability to approximate arbitrary continuous distributions. Second, in DACER the entropy estimate is only used to update the noise parameter, so some estimation error is acceptable. Finally, we conduct experiments replacing the diffusion policy with a GMM (3 Gaussian components) on MuJoCo tasks, showing that performance is lower than that of the diffusion policy, as illustrated in Figure 4 of the global rebuttal PDF.
Q4
Thank you for your reminder. We have written revised captions for Figures 3-5 in global rebuttal response 4.
Q5
Your concern about training time is understandable, but our starting point is to train in an offline simulation environment and then deploy the trained network online. Therefore, a moderately longer training time is acceptable if performance gains can be realized.
We add some comparative experiments and the results are shown in Table 1 of the author global rebuttal.
Fortunately, DACER's single-batch inference time is under 1 ms, which is acceptable for real-time applications. We find that DACER's training time is 4-5 times longer than DSAC's due to the longer gradient chain through the diffusion policy. Your question is insightful and aligns with our ongoing work: we plan to introduce an ODE-based method to eliminate this bottleneck.
Q6
Thank you for catching this: the DAC and DACER legend entries were indeed swapped, and we have corrected them. DAC denotes the ablated variant without the entropy regulator.
Q7
As you said, it is important to have an experiment that clearly characterizes the need for a multimodal policy. We choose the vehicle bifurcation task presented in [3] (similar in spirit to Push-T) and plot the action distribution at the bifurcation point in Figure 5 of the global rebuttal PDF.
Q8
For the sake of fairness in comparison, we maintained the same experimental setup and plotting standards as used in DSAC. In DSAC, each iteration represents one network update, and 20 samples are sampled per iteration.
Q9
This is a good suggestion. To show that adaptive noise injection can achieve strong performance without parameter tuning, we conducted experiments in three environments, with training curves in Figure 1 of the global rebuttal PDF. The results indicate that while certain linear-decay settings perform comparably to adaptive noise injection, most do not.
Q10
Thank you for pointing this out. We will rename the overloaded symbol in the mathematical description of the denoising process so that it no longer collides with the noise-injection variable.
Q11&12
It does not prevent the policy from learning multimodal behavior. First, the diffusion policy's multimodality comes from the reverse-diffusion action generation, with the added noise acting only as a small perturbation. Second, experiments show that the distance between peaks is generally greater than 0.5, while the injected noise magnitude stays within 0.272 and decays to within 0.15 as training progresses.
For the second question, we used the hyperparameters from the DSAC paper for the comparison algorithms. For DACER, the diffusion steps, the noise parameter, and the learning rate were consistently set to 20, 0.1, and 0.03 for nearly all experiments.
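To make the noise-injection mechanism concrete, here is a minimal sketch. The `sample_diffusion_action` stub, the clipping to [-1, 1], and the SAC-style adjustment of the noise scale toward a target entropy are our illustrative assumptions; the exact update rule in DACER may differ:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_diffusion_action(state):
    # Stub for the reverse-diffusion sampler; replace with the actual diffusion policy.
    return np.tanh(rng.normal(size=2) + 0.1 * state.sum())

def act(state, noise_scale):
    # Final action = diffusion sample plus a small Gaussian perturbation, kept in bounds.
    a = sample_diffusion_action(state)
    return np.clip(a + noise_scale * rng.normal(size=a.shape), -1.0, 1.0)

def update_noise_scale(noise_scale, estimated_entropy, target_entropy, lr=1e-3):
    # Heuristic, SAC-style adjustment: raise the noise when the estimated entropy
    # falls below the target, lower it otherwise; bounded to a small range.
    return float(np.clip(noise_scale + lr * (target_entropy - estimated_entropy), 0.0, 0.3))

noise_scale, target_entropy = 0.1, -2.0
state = np.array([0.4, -0.2])
action = act(state, noise_scale)
noise_scale = update_noise_scale(noise_scale, estimated_entropy=-2.5, target_entropy=target_entropy)
print(action, noise_scale)
```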
Nit picks
Thanks to your suggestions, we will enlarge the text in the figures, add smoothing to the training curves (all curves in the global rebuttal have been smoothed), and change the caption of Table 1 to sentence case.
Reference
[1] Yang et al. "Policy representation via diffusion probability model for reinforcement learning," arXiv, 2305.
[2] Psenka et al. "Learning a diffusion model policy from rewards via q-score matching," arXiv, 2312.
[3] Zou et al. "Policy Bifurcation in Safe Reinforcement Learning," arXiv, 2403.
Thank you for your detailed response and for performing several additional experiments to address some of my concerns. Overall I am feeling more convinced but my concerns about the benefits of the algorithm still persist.
In your setting of first training in simulation and then deploying on a real robot, there are two main metrics that are important to me as a researcher: wall clock training time and final converged performance. The number of training iterations really doesn't matter since simulated data is cheap and the real bottleneck is the time it takes to spit out a strong policy. At the same time, taking longer to spit out a policy is justified if the converged performance is much stronger. So far, none of the results you've shown prove to me that your method outperforms DSAC in either of these metrics. In fact, I'm inclined to believe that if you changed the x-axis in Figure 1 to display wall-clock time instead of the number of training iterations, DSAC would look a lot better than your method. How would the results look if everything were trained to convergence, or if, for example, you trained a well-implemented version of PPO for the same 7 hours as DACER? I think since your method is definitely a lot slower than comparisons I need to understand whether its converged behavior is strong enough to justify using it rather than a more established method.
We thank you for the quick and detailed response to our rebuttal.
We are pleased to see that some of the previous concerns have been addressed, and that the focus has shifted to the final performance gap after algorithm convergence, as well as the experimental results under the same wall time. Regarding the final performance after algorithm convergence, we trained DACER, DSAC, and PPO for 12 hours (Wall time) in the Humanoid-v3 and Ant-v3 environments. It is evident that all three algorithms have converged. Due to limitations imposed by OpenReview, we are unable to upload additional images or links, so we have provided the following table instead.
Table 1 Average final return. Computed as the mean of the highest return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The mean value for each task is bolded. ± corresponds to standard deviation over five runs.
| Task | DACER | DSAC | PPO |
|---|---|---|---|
| Humanoid-v3 | 13209 ± 116 | 11087 ± 193 | 9134 ± 412 |
| Ant-v3 | 11470 ± 244 | 9083 ± 122 | 7645 ± 291 |
Thank you for your question, which allowed us to assess DACER's final performance after convergence. In the Humanoid-v3 and Ant-v3 tasks (the two most complex control tasks in MuJoCo), DACER outperformed DSAC by 2122 and 2387, respectively. This performance has been quite encouraging for us as well.
As training steps increase, DSAC gradually converges. Therefore, even at equal wall-clock time, with sufficiently long training the performance ceiling of DACER is significantly higher than that of DSAC.
Thank you for providing these additional experiments. This definitely strengthens your claims about the algorithm and I've increased my score to a 4. The results I've seen look good, but without seeing learning curves or, especially, results for the rest of the benchmark it's difficult for me to feel extremely confident in them. If the paper is accepted I hope full learning curves for these results across all environments will be considered in the main body of the paper.
We are somewhat disappointed that the remaining reservation about our work stems from not having seen the final training curves. We sincerely hope that this online RL algorithm, with its strong multimodal policy characteristics, can be recognized by the RL community. Therefore, we provide a download link for the images (if the AC sees this, we hope you can understand our rationale for providing the link): Humanoid-v3, Ant-v3
Among the baseline algorithms, the best off-policy and on-policy algorithms are DSAC and PPO, respectively. Therefore, comparing against these two algorithms is highly informative.
In the images, we plotted the training curves of the DACER, DSAC, and PPO algorithms for the Humanoid-v3 and Ant-v3 tasks (all three algorithms have converged). The results indicate that the upper performance bound of the DACER algorithm after convergence is significantly higher than that of DSAC and PPO.
It has been a pleasure communicating with you. We sincerely hope you will reconsider the score, so that the first work combining diffusion models with online RL can receive the attention it deserves.
Thanks for providing the curves. I hope the AC will see that this is an extenuating circumstance and accept that you provided a link as it was very helpful for my assessment of the work.
I've increased my score to a 5. In order to go any higher I'd have to see similar results across the full benchmark.
Pleasant communication with you is our greatest motivation to continue the experiments. Due to the time difference, you may only see this after the discussion phase has ended, but we still look forward to your response.
In the remaining benchmark tasks, Walker2d-v3, Swimmer-v3, Hopper-v3, Pusher-v2, InvertedDoublePendulum-v3, and HalfCheetah-v3, we trained three representative algorithms, DACER, DSAC (off-policy baseline), and PPO (on-policy baseline), until convergence. The following training curves were obtained:
Walker2d-v3, Swimmer-v3, Hopper-v3, Pusher-v2, InvertedDoublePendulum-v3, HalfCheetah-v3.
In the tasks mentioned above, the final performance of DACER (the mean of the highest return values observed in the final 10% of iteration steps per run) was higher than that of DSAC by 11.7%, 25.0%, 32.1%, 5.2%, 0%, and 2.9%, respectively.
In the later stages of training on Hopper-v3, DSAC experienced a significant performance drop, and in the later stages of training on Pusher-v2, DSAC showed considerable performance fluctuations. Additionally, we observed that the final performance of DACER and DSAC was quite similar on the HalfCheetah-v3 and InvertedDoublePendulum-v3 tasks. HalfCheetah-v3 involves controlling a two-legged agent to run, where the optimal control strategy is relatively straightforward, allowing both DACER and DSAC to easily learn the optimal policy. InvertedDoublePendulum-v3 involves keeping a double inverted pendulum upright, which also has a simple optimal control strategy; both DACER and DSAC can keep the pendulum upright and achieve the highest returns. Therefore, the performance upper limit in these two tasks does not differ significantly.
Thank you once again for the communication this week.
This paper proposes using the reverse diffusion process as the policy for actor-critic-based online reinforcement learning. An EM-based mixture model is fitted to estimate current diffusion policies' entropy to balance exploration and exploitation. The proposed method, DACER, demonstrated on-par or improved performance compared with RL with deterministic and single-mode Gaussian policies in various tasks.
Strengths
- It is the first work that I know of to directly train diffusion models with online RL by back-propagating through the diffusion chain.
- The proposed method performed on par with or better than conventional RL methods with deterministic or single-mode Gaussian policies in some continuous control tasks.
- The writing of the paper is easy to follow, and the contributions are clearly outlined.
Weaknesses
- My major concern is that the calculation of entropy using the GMM seems incorrect; Eq. 15 is not equivalent to the entropy of the GMM presented in Eq. 12. The GMM does not have a closed-form solution for its entropy [1]. However, it could be approximated in various ways. If the authors used any of these approximations, the reference should be mentioned in the paper. Furthermore, in the ablation study (Figure 4) comparing DACER with linear-decay entropy, the entropy regularization does not show much performance difference. Considering the extra compute overhead incurred by applying EM in the inner loop, I'm not convinced it is worth the effort.
- Another major concern of mine is that the proposed method seems to be very computationally heavy; for each policy update, the gradient needs to backpropagate through the entire diffusion chain. Furthermore, an inner-loop EM algorithm needs to be implemented to approximate the policy entropy. Although the training time for Humanoid-v3 is mentioned in Section 5, there is a lack of comparison with other baselines implemented in the same framework (JAX) and on the same hardware, which is necessary for understanding the extra computational cost.
- DACER does not show significant performance improvement over DSAC in most environments except Humanoid-v3 and Ant-v3. Given the large computational overhead, it is not clear whether the method is really worth using in the given tasks.
- In Figure 1, the iterations of PPO and TRPO are reported as the number of network update steps, which is uncommon and needs proper justification, as the number of network updates does not necessarily reflect the number of simulation steps for these two algorithms.
- I am also concerned that the policy representation experiments cannot fully show the representation power of the diffusion policy. Diffusion models have rich representation power as they can capture multi-modal action distributions [2] at the same state. For the task with 2D state space presented in Fig. 2, the multi-modality in the action distribution only exists at the state (0, 0), which is also why single-mode policies like DSAC and TD3 can partially solve the task. Although the result of DACER looks slightly better than DSAC, I'm not fully convinced that this result suggests better representation power for the diffusion policy.
[1] Robin S, Scrucca L. Mixture-based estimation of entropy. Computational Statistics & Data Analysis. 2023 Jan 1;177:107582.
[2] Jia X, Blessing D, Jiang X, Reuss M, Donat A, Lioutikov R, Neumann G. Towards diverse behaviors: A benchmark for imitation learning with human demonstrations. arXiv preprint arXiv:2402.14606. 2024 Feb 22.
Questions
- Could the author explain why the value of PPO looks like a constant for all the states in Figure 2? More explanations on why PPO does not work in this task would also help the understanding.
Limitations
The computational complexity is the main limitation of this work. Although it has been mentioned in the appendix, I feel like more discussion and evaluations are needed to better understand this limitation.
We thank you for the careful reading of our paper and the detailed, constructive comments.
> Weakness 1
- As you pointed out, the entropy of a GMM does not have a closed-form solution. Eq. 15 provides an upper bound on the entropy of the GMM [1], which can be used as an approximate estimate. We will update the paper with this reference accordingly.
- In Algorithm 1, we run the EM fit only once every 10,000 steps, so its computational overhead is negligible relative to overall training.
- The reason you did not observe a significant performance difference is that the initial and final values of the linear decay were taken from the results of the entropy adjustment method. We add more experimental results for different linear decay ranges. In Figure 1 of the author global rebuttal PDF, it can be observed that even the best performance after parameter tuning only approaches that of our method, while the computational cost of the parameter tuning itself is significantly higher.
> Weakness 2
As you pointed out, the discussion of training and inference efficiency is very interesting. We acknowledge that DACER has longer training and inference times than DSAC. However, our starting point is to train in an offline simulation environment and then deploy the trained network online. Therefore, a moderately longer training time is acceptable if performance gains can be realized (the Humanoid-v3 task, with the longest training time in MuJoCo, took only 7 hours in JAX).
We add experiments (Task: Ant-v3) comparing the inference and training time using the JAX framework on an AMD Ryzen Threadripper 3960X 24-Core Processor with an NVIDIA 3090Ti. We report the mean and standard deviation over 10 measurements. The results are shown in Table 1 of the author global rebuttal.
Fortunately, DACER's single-batch inference time is under 1 ms, which is acceptable for real-time applications. We also find that the training time of DACER is 4-5 times longer than that of DSAC; the bottleneck is the longer gradient chain through the diffusion policy. The question you raise is insightful and aligns with what we are currently working on: we are going to introduce an ODE-based method to eliminate this bottleneck. We invite you to follow our future work.
> Weakness 3
Based on the above response, a moderately longer training time for DACER is acceptable. As you pointed out, our performance improvements are particularly significant on the tasks with the most complex state spaces, Humanoid-v3 (376-dimensional state, 17-dimensional action) and Ant-v3 (111-dimensional state, 8-dimensional action), demonstrating DACER's strong potential. In the other MuJoCo tasks, DSAC has already approached the performance ceiling, yet DACER still achieves clear improvements of 276 and 440 on the complex Walker2d-v3 and Hopper-v3 tasks, respectively, and outperforms or matches DSAC in all other tasks. We believe these performance gains are substantial.
> Weakness 4
For the sake of fairness in comparison, we maintained the same experimental setup and plotting standards as used in DSAC [2, 3]. In DSAC, each iteration represents one network update, and 20 samples are sampled per iteration. For the implementation of TRPO and PPO, we have also aligned with DSAC. In their paper, they collect 2,000 samples per iteration, with a mini-batch size of 10 and repeat_num set to 10. This setup achieves a network update corresponding to 20 samples.
> Weakness 5
You may have misunderstood. The multimodality of the action distribution is not limited to the point (0, 0); it also exists between adjacent peaks (on the two diagonals). This is why the results of DSAC and TD3 show flat, poorly learned regions between adjacent peaks, as described in Section 1.1.
To demonstrate the powerful multimodality of DACER, we add an experiment. We select five points requiring multimodal policies: (0.5, 0.5), (0.5, -0.5), (-0.5, -0.5), (-0.5, 0.5), and (0, 0). For each point, we sampled 100 trajectories. The trajectories are plotted in Figure 2 of the author rebuttal PDF. The results show that compared with DSAC, DACER exhibits strong multimodality. This also explains why only the Q-function of DACER can learn the nearly perfectly symmetrical four peaks.
In addition to using Multi-goal task, we use the method from [4] to analyze the diversity of state trajectories for MuJoCo tasks, with results presented in Table 2 of the author global rebuttal 2. This indicator can measure the algorithm's ability to explore the state space. The results show that in the Humanoid-v3, Ant-v3, Hopper-v3 and Pusher-v2 tasks, DACER achieves the highest entropy value, surpassing DSAC, SAC, DDPG, TD3, PPO, and TRPO.
> Question 1
During training, we found that PPO always remains around its initial value and does not improve reward performance. "Multi-goal" is an environment with very sparse rewards, where huge rewards are only provided at four symmetrical peaks. In this environment, the state initialization is mostly near (0, 0). We think that PPO's conservative policy update mechanism is the fundamental reason it performs poorly in this multimodal environment.
Reference
[1] M. F. Huber et al. "On entropy approximation for Gaussian mixture random vectors," IEEE MFI, 2008.
[2] Duan et al. "Distributional soft actor-critic: Off-policy reinforcement learning for addressing value estimation errors," IEEE TNNLS, 2021.
[3] Duan et al. "Dsac-t: Distributional soft actor-critic with three refinements," arXiv, 2310.
[4] Xiao et al. "Multi-Style Distributional Soft Actor-Critic: Learning a Unified Policy for Diverse Control Behaviors," in IEEE TIV, 2024.
I appreciate the very detailed response and new experiments. However, my major concerns about the entropy calculation and the long training time remain:
- Could the authors specify which equation from [1] was used to derive the upper bound referenced in Eq. 15? Additionally, please provide an explanation of when this upper bound will be tight. If possible, could the authors include the derivation of the upper bound in the response?
- While I agree that longer training times are acceptable in simulation, a training period that is 4-5 times longer without significant performance improvement raises concerns about the algorithm's efficiency.
- In my opinion, reporting the number of network updates is not a straightforward measure of the training cost, especially considering the authors' starting point is to 'train in an offline simulation environment and then apply the trained network online.' In this case, wall-clock time would be a better measure, as reviewer 3PmU also pointed out.
- I appreciate the inclusion of new evaluations on the diversity of learned behavior. Could the authors provide the equation used to calculate the Information Entropy? Moreover, the advantages of DACER in Table 2 do not appear significant compared to other methods with single-mode Gaussian policies.
[1] M. F. Huber et al. "On entropy approximation for Gaussian mixture random vectors," IEEE MFI, 2008.
Thank you for your quick response; we have worked hard to address your two major concerns.
> Question 1
You can directly see the approximation formula we used in Eq. 8 of [1], without the need for further derivation. Theorem 3 in [1] indicates that this approximation formula serves as an upper bound for the GMM entropy. We provide the proof below. Let $I$ denote the latent indicator of the mixture component, taking value $i$ with probability $w_i$. Then

$$H(x) = H(x, I) - H(I \mid x) \leq H(x, I) = H(I) + H(x \mid I) = \sum_{i} w_i \left(-\log w_i\right) + \sum_{i} w_i\, H\big(\mathcal{N}(x;\mu_i,\Sigma_i)\big),$$

where the entropy of each Gaussian component is $H\big(\mathcal{N}(x;\mu_i,\Sigma_i)\big) = \tfrac{1}{2}\log\big((2\pi e)^{d}\,|\Sigma_i|\big)$.

Since $H(I \mid x)$ is always non-negative, neglecting it yields the desired upper bound.
This approximation method has the following three main advantages [1]:
- It is highly efficient in terms of computation;
- The upper bound is significantly closer to the true entropy value than the well-known bound given by a single Gaussian matching the first two moments of the distribution;
- The bound is exact for the single Gaussian case.
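For reference, here is a minimal numerical sketch of this upper bound computed from fitted GMM parameters. The use of scikit-learn and the synthetic bimodal action samples are illustrative assumptions for this response:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_entropy_upper_bound(gmm):
    # Upper bound on GMM entropy (Eq. 8 of Huber et al., 2008):
    # H(x) <= sum_i w_i * ( -log w_i + 0.5 * log((2*pi*e)^d * |Sigma_i|) ).
    d = gmm.means_.shape[1]
    bound = 0.0
    for w, cov in zip(gmm.weights_, gmm.covariances_):
        gauss_entropy = 0.5 * np.log(((2 * np.pi * np.e) ** d) * np.linalg.det(cov))
        bound += w * (-np.log(w) + gauss_entropy)
    return bound

# Illustrative: 200 sampled "actions" drawn from a bimodal distribution.
rng = np.random.default_rng(0)
actions = np.vstack([rng.normal(-0.6, 0.05, size=(100, 2)),
                     rng.normal(+0.6, 0.05, size=(100, 2))])
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(actions)
print(gmm_entropy_upper_bound(gmm))
```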
We hope that this additional material on the entropy bound addresses your first major concern.
> Question 2 & 3
As you pointed out, only significant performance differences make longer training times meaningful, and it is more appropriate to use Wall time as the horizontal axis in this context. Reviewer 3PmU also emphasized that whether the performance after convergence is strong enough is the core factor determining whether DACER is worth using.
Based on the aforementioned viewpoints, we have changed the horizontal axis to Wall time and trained DACER, DSAC, and PPO for 12 hours in the most complex MuJoCo tasks, Humanoid-v3 and Ant-v3. It is evident that all three algorithms have converged. Due to the limitations imposed by OpenReview, we are unable to upload additional images or links, so we have provided the following table instead.
Table 1 Average final return. Computed as the mean of the highest return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The mean value for each task is bolded. ± corresponds to standard deviation over five runs.
| Task | DACER | DSAC | PPO |
|---|---|---|---|
| Humanoid-v3 | 13209 ± 116 | 11087 ± 193 | 9134 ± 412 |
| Ant-v3 | 11470 ± 244 | 9083 ± 122 | 7645 ± 291 |
Thank you for your question, which allowed us to assess DACER's final performance after convergence. In the Humanoid-v3 and Ant-v3 tasks, DACER outperformed DSAC by 2122 and 2387, respectively. This performance has been quite encouraging for us as well. As training steps increase, DSAC gradually converges. Therefore, under these conditions, even with the same Wall time, if the training time is sufficiently long, the performance ceiling of DACER is significantly higher than that of DSAC.
In summary, given the significant improvement in final performance, a longer training time is acceptable. We are also actively researching a method to accelerate training using ODE, and we hope that our future work will bring you new insights.
Reference
[1] M. F. Huber et al. "On entropy approximation for Gaussian mixture random vectors," IEEE MFI, 2008.
> Question 4
During the Rebuttal phase, we proactively reached out to the authors of [2]. Next, we first introduce the process and formula for calculating information entropy (all details are kept consistent with [2]).
- Data collection. Using the trained policy, sample 1,100 episodes, and for each episode take the first 40 steps, recording the state at each step as a row in a CSV file. This results in a collection of 44,000 data points.
- Data discretization. For each column in the CSV file (each state dimension), use `pd.cut` to divide the data into 11 different intervals.
- State combination. The discretized data columns are treated as independent features. These features are then concatenated into a single string to represent a state.
- Calculation of state frequency and information entropy. Encode the different combined states into an integer array, then calculate the frequency of each state. Finally, use the formula $H = -\sum_i p_i \log p_i$ to calculate the information entropy, where $p_i$ is the probability of the $i$-th combined state.
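A minimal sketch of steps 2-4 above. The synthetic DataFrame and its column names are placeholders standing in for the 44,000 recorded states; the bin count of 11 follows the description:

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the recorded states (rows = steps, columns = state dims).
rng = np.random.default_rng(0)
states = pd.DataFrame(rng.normal(size=(44_000, 5)), columns=[f"s{i}" for i in range(5)])

# Step 2: discretize each state dimension into 11 intervals.
discretized = states.apply(lambda col: pd.cut(col, bins=11, labels=False))

# Step 3: concatenate the discretized features into a single string per state.
combined = discretized.astype(str).apply("_".join, axis=1)

# Step 4: state frequencies -> information entropy H = -sum_i p_i * log(p_i).
probs = combined.value_counts(normalize=True).to_numpy()
entropy = -np.sum(probs * np.log(probs))
print(entropy)
```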
We would like to mention that when calculating information entropy for any MuJoCo task using the method described above, each task will have its own approximate entropy range, and there will not be particularly significant differences in absolute values. Let’s take a look at the results for Humanoid-v3: compared to TRPO, DACER’s information entropy increased by 13.7%, which is already quite significant. You might still have some concerns, so we have applied the method from [2] (Fig. 6 of [2]) to normalize the results from Table 2 of the global rebuttal using the Z-score, resulting in the following table.
Table 1 Normalized trajectory diversity entropy. Columns correspond to different algorithms and rows to different MuJoCo tasks.
| Task | DACER | DSAC | SAC | DDPG | TD3 | PPO | TRPO |
|---|---|---|---|---|---|---|---|
| Humanoid-v3 | 0.560 | 0.511 | 0.450 | 0.486 | -0.111 | 0.495 | -2.393 |
| Ant-v3 | 0.905 | 0.676 | 0.603 | 0.441 | -2.257 | -0.219 | -0.149 |
| Hopper-v3 | 1.778 | 1.102 | -1.321 | -0.354 | -0.698 | 0.011 | -0.517 |
| Pusher-v2 | 1.503 | 0.056 | 0.773 | -0.301 | -2.011 | 0.049 | -0.069 |
From the normalized results, it is evident that DACER exhibits higher entropy than the baseline algorithms across all four tasks, indicating its effectiveness in generating diverse behaviors from the diffusion policy and suggesting a stronger exploration capability.
Reference
[2] Xiao et al. "Multi-Style Distributional Soft Actor-Critic: Learning a Unified Policy for Diverse Control Behaviors," in IEEE TIV, 2024.
Thanks to the authors for providing detailed explanations and additional experiments, which indeed address part of my concerns. I'll raise my score to 4. My reasons for not going higher are mainly twofold:
- In my opinion, entropy scheduling based on the entropy upper bound of an approximating GMM is more of a heuristic than a well-motivated approach.
- Although the final converged performance seems improved on Humanoid and Ant, the training is much slower than the baseline.
However, I will keep following the discussion and potentially adjust my score accordingly.
Thank you again for your response. This kind and responsible communication is our greatest motivation to continue reflecting.
> Question 1
We would like to share the reasons for introducing entropy regulation and why we ultimately used a GMM to estimate the entropy. We believe the design choices behind this decision are creative, enabling a diffusion policy with multimodal properties to simultaneously achieve strong performance.
In previous studies integrating diffusion model with online RL, entropy was not incorporated. However, entropy regulation is essential for balancing exploration and exploitation, and it significantly affects performance. This omission is a core reason why the performance of DIPO and QSM in MuJoCo tasks was relatively weak.
The absence of entropy consideration can be attributed to the complex distributions defined by the diffusion chain itself, which render entropy computation computationally intractable. Consequently, this necessitates the use of density estimation methods.
GMM is a parametric density estimation approach with the following characteristics that align well with our needs:
- It can effectively fit data distributions and estimate entropy when the data quantity is limited. This limitation arises because the diffusion policy requires a long time to draw a large number of samples; in this study we sample 200 actions, and the clearly defined parametric structure of the GMM allows it to provide a relatively accurate entropy estimate under limited data [1].
- The GMM models the data using the weights and covariance matrices of its mixture components, offering greater adaptability and robustness [2].
In fact, we previously compared the use of the non-parametric estimation method Gaussian KDE. The table below presents the results after training to algorithm convergence on the Humanoid-v3, Ant-v3, and Hopper-v3 tasks. DACER-GMM and DACER-KDE represent entropy estimation using GMM and Gaussian KDE, respectively.
Table 1 Average final return. Computed as the mean of the highest return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The mean value for each task is bolded. ± corresponds to standard deviation over five runs.
| Task | DACER-GMM | DACER-KDE |
|---|---|---|
| Humanoid-v3 | 13209 ± 116 | 11806 ± 371 |
| Ant-v3 | 11470 ± 244 | 10070 ± 407 |
| Hopper-v3 | 4110 ± 34 | 3989 ± 128 |
Experimental results indicate that, overall, Gaussian KDE performs slightly worse than GMM, with its final performance being notably lower in high-dimensional control tasks (Humanoid-v3, Ant-v3). This can be explained by the following reasons:
- Gaussian KDE, being non-parametric, relies heavily on the choice of bandwidth, which can lead to unstable density estimates if not carefully tuned [3];
- Gaussian KDE's performance significantly deteriorates as dimensionality increases due to the curse of dimensionality, which makes it prone to overfitting or underfitting in high dimensions [4].
Therefore, we choose to use GMM in our final algorithm design.
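For illustration, here is a minimal sketch comparing the two estimators on synthetic bimodal action samples. The library choices, the resubstitution-style estimates, and the data are our assumptions for this response, not the DACER implementation:

```python
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# 200 synthetic 2-D "actions" drawn from a bimodal distribution.
actions = np.vstack([rng.normal(-0.6, 0.05, size=(100, 2)),
                     rng.normal(+0.6, 0.05, size=(100, 2))])

# GMM: average negative log-likelihood on the samples approximates the entropy.
gmm = GaussianMixture(n_components=2, random_state=0).fit(actions)
h_gmm = -gmm.score(actions)  # score() returns the mean log-likelihood per sample

# Gaussian KDE: resubstitution entropy estimate -mean(log p_hat(x_i)).
kde = gaussian_kde(actions.T)  # scipy expects shape (dims, n_samples)
h_kde = -np.mean(np.log(kde(actions.T)))

print(f"GMM entropy estimate: {h_gmm:.3f}, KDE entropy estimate: {h_kde:.3f}")
```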
> Question 2
- Based on our experience, RL generally follows an offline training and online application model, making longer training times acceptable. Thus, the final policy performance is more important than training time, especially with significant performance gains (19.14% in Humanoid-v3 and 26.30% in Ant-v3). Although DSAC is trained longer and with more steps, its performance has plateaued, with a lower ceiling than DACER's.
- Compared to DSAC, DACER demonstrates higher sample efficiency and a greater performance improvement rate per unit time. At the same sampling steps, Fig. 1 in the paper shows that DACER achieves the fastest performance improvement rate in almost all tasks (i.e., the highest sample efficiency). Furthermore, in the Humanoid-v3 task, DACER reaches DSAC's performance ceiling within 3.5 hours of Wall time, meaning that beyond this point, DACER's performance improvement rate per unit time consistently surpasses that of DSAC.
- Your question prompted us to reassess the relationship between training efficiency and policy performance, and we realized that the number of diffusion steps might be the key to addressing this issue. Although our ablation experiments showed that reducing the diffusion steps slows DACER's performance improvement within 1.5M iterations, we had not tested the final performance after convergence, nor compared the training and sample efficiency of DACER with fewer diffusion steps against DSAC. We therefore ran the following additional experiments:
We reduced DACER's diffusion steps from 20 to 7, which resulted in DACER's training time being 1.1 to 1.5 times that of DSAC. In the Humanoid-v3 and Ant-v3 tasks, we recorded performance metrics starting from 1h Wall time, at 1-hour intervals, up to 8 hours. The final results are shown in the following table:
Table 1 Humanoid-v3 average return. DACER-7 denotes DACER with the diffusion steps set to 7; columns correspond to wall-clock time and rows to different algorithms.
| Algorithm | 1h | 2h | 3h | 4h | 5h | 6h | 7h | 8h |
|---|---|---|---|---|---|---|---|---|
| DACER-7 | 8301 ± 768 | 9950 ± 661 | 10912 ± 248 | 11298 ± 55 | 11882 ± 242 | 12021 ± 293 | 11995 ± 454 | 12403 ± 354 |
| DSAC | 8120 ± 348 | 9782 ± 309 | 9796 ± 331 | 10378 ± 375 | 10724 ± 367 | 10628 ± 205 | 10933 ± 360 | 11012 ± 298 |
Table 2 Ant-v3 average return. DACER-7 denotes DACER with the diffusion steps set to 7; columns correspond to wall-clock time and rows to different algorithms.
| Algorithm | 1h | 2h | 3h | 4h | 5h | 6h | 7h | 8h |
|---|---|---|---|---|---|---|---|---|
| DACER-7 | 5164 ± 136 | 7812 ± 123 | 8961 ± 193 | 9567 ± 186 | 9956 ± 213 | 10732 ± 273 | 11021 ± 107 | 10982 ± 165 |
| DSAC | 6717 ± 367 | 7363 ± 176 | 7919 ± 160 | 8519 ± 197 | 8677 ± 146 | 8840 ± 231 | 8839 ± 68 | 9006 ± 164 |
Table 3 Average final return. Computed as the mean of the highest return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The mean value for each task is bolded. ± corresponds to standard deviation over five runs.
| Task | DACER | DACER-7 | DSAC | PPO |
|---|---|---|---|---|
| Humanoid-v3 | 13209 ± 116 | 12541 ± 159 | 11087 ± 193 | 9134 ± 412 |
| Ant-v3 | 11470 ± 244 | 11039 ± 172 | 9083 ± 122 | 7645 ± 291 |
According to the results in Tables 1 and 2, with only a 1.1-1.5x increase in training time, the performance of DACER-7 exceeded that of DSAC at almost all recorded points (except the 1h point in the Ant-v3 task), with final performance surpassing DSAC by 13.1% and 21.5%, respectively. As shown in Table 3, the final performance of DACER-7 decreased by only 5.1% and 3.8% compared to DACER. Furthermore, we conducted multimodal task experiments on DACER-7 and found that policy multimodality persisted (policy multimodality is currently a focal point in the RL community and was the original motivation of this work). This is an encouraging finding: thanks to your persistent questioning, we discovered that adjusting the number of diffusion steps can indeed balance training efficiency and policy performance, allowing future researchers to choose an appropriate diffusion step count based on their needs.
Reference
[1] Robin et al. "Mixture-based estimation of entropy," Computational Statistics & Data Analysis, 2022.
[2] Magdon et al. "Approximating the Covariance Matrix of GMMs with Low-Rank Perturbations," Intelligent Data Engineering and Automated Learning – IDEAL 2010.
[3] Liu et al. "Kernel density estimation based distributionally robust mean-CVaR portfolio optimization," Journal of Global Optimization, 2022.
[4] Crabbe et al. "Handling the curse of dimensionality in multivariate kernel density estimation," OSU Dissertations, 2013.
Reducing the number of time steps is indeed a promising direction, and I'm glad to see the new evaluation shows improved training efficiency. I think the authors might be interested in recent works [1, 2, 3] that use only a few denoising steps (or even 1 step) in behavior learning.
With the new evaluation, I am glad to improve my score further to 6.
[1] Ding Z, Jin C. Consistency models as a rich and efficient policy class for reinforcement learning. ICLR 2024. (Consistency Model for online RL)
[2] https://drive.google.com/file/d/1eQNhPpACHlzjgEGOYcFS5rAl4GlK_X5w/view (Figure 6, DDPM+DDIM with 2 steps solve some manipulation tasks)
[3] Reuss M, Li M, Jia X, Lioutikov R. Goal-conditioned imitation learning using score-based diffusion policies. RSS 2023. (EDM+DDIM solved kitchen and block push with 3 steps)
Thank you very much for your reply. We believe that the three papers you provided will be of great inspiration to our future work. We will continue to work hard.
The authors propose DACER (Diffusion Actor-Critic with Entropy Regulator), an online RL algorithm that uses the reverse diffusion process as a policy in order to capture multimodal behaviours. To balance exploration and exploitation, the authors propose carrying out entropy regularization by estimating the entropy of the diffusion policy using a Gaussian mixture model. The paper demonstrates the proposed method on a set of simulated control tasks, comparing against a suite of established RL algorithms.
Strengths
- To the best of my knowledge, the method proposed in the paper is a novel combination of new techniques – online actor-critic methods and diffusion models.
- The paper also discusses relevant offline and online RL algorithms that make use of diffusion policies for control, clarifying the differentiating factors of this work.
- The paper compares against a suite of well-established RL algorithms on a range of simulated control tasks (ranging from simple to complex). The experimental results show that the proposed method generally leads to improvements in learned policy performance, with additional studies showing the multimodality of learned policies and ablations on each component of the proposed method.
- Learning multimodal policies is of significance to the RL community, and the proposed parametrization of entropy regularization for diffusion policies seems likely to be built on in subsequent work. The authors release relevant code, enabling future work to build upon this paper.
Weaknesses
- The related works section would be strengthened by adding in references to online RL algorithms used for image generation (e.g. "Training Diffusion Models with Reinforcement Learning" (Black et. al, 2023), or "DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models" (Fan et al., 2023)). While these methods are applied in an image generation setting rather than direct control, the methods for policy improvement are related (i.e. optimizing for maximum reward with a policy based on the reverse diffusion process, learned by backpropagating through the diffusion chain), with some differences (not using actor-critic methods, learning a q-value, etc.).
- It would be more meaningful to see Table 1 as the mean of returns in the final 10% of iterations, rather than mean of highest returns, the latter which may be skewed by luck/noise in the evaluation.
- It would be interesting to see some measures of how computationally efficient the method is – since it requires stepping sequentially through the diffusion process to sample an action at each step. What are the tradeoffs?
- Rather than method iterations, the results in Figure 1 could be better contextualized by also looking at environment steps on the x-axis (i.e. how sample efficient is each method), since the meaning of “iteration” can vary per method.
- Minor note: The definitions of eq. (1) and (2) could be written more clearly (i.e. how do (s,a) affect the rewards on the right hand side of the equation?)
- Minor note: for clarity, the definitions of the symbols introduced here should be stated.
- Minor note: figures 3,4,5 should have more descriptive captions.
Questions
See suggestions and questions in the Weaknesses section.
Limitations
The authors should include a section on the limitations of the proposed methods (e.g. computational inefficiencies?)
We thank you for the careful reading of our paper and the detailed, constructive comments.
> Weakness 1
Thank you for this important information. We have added a paragraph introducing online RL fine-tuning of diffusion models for image generation:
"Diffusion models are widely applied in the field of image generation [1, 2]. Recently, several studies have employed online RL algorithms to fine-tune these models. Black et al. [3] propose an RL-based method for fine-tuning large-scale diffusion models to match potentially non-differentiable reward functions. This approach views the reverse process of diffusion models as a reinforcement learning task with T time steps, where the output of the diffusion model at each step serves both as an action and as the observation for the next step. Fan et al. [4] propose DPOK, an RLHF method for diffusion models in text-to-image generation. Given a reward model, the goal is to fine-tune the parameters of the diffusion model so that the images produced by the sampling process achieve high rewards. The reward is a function of the final output of the entire diffusion process and therefore cannot be optimized independently across timesteps. These works use the reverse diffusion process to maximize reward, but do not learn a Q-function or use the classical actor-critic framework."
> Weakness 2
Our core comparison algorithm is DSAC. To be consistent with it, we report the highest value within the final 10% of iterations. It is worth noting that each evaluation point is the average of 10 episodes, and we use 5 different random seeds for each experiment; the results in Table 1 are the mean and standard deviation over the 5 seeds. These settings mitigate the influence of luck and noise to some extent.
Since you raised the point, we also provide the mean and standard deviation over the 5 seeds of the returns averaged over the final 10% of iterations, shown in Table 3 of the author global rebuttal.
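For clarity, here is a small sketch of how the two summary statistics differ, given per-seed evaluation returns. The array shape and values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
# returns[seed, eval_point]: evaluation returns for 5 seeds over 100 evaluation points.
returns = rng.normal(loc=10_000, scale=500, size=(5, 100))

tail = returns[:, -10:]  # final 10% of evaluation points per seed

# Metric used in Table 1: per-seed maximum over the tail, then mean +/- std across seeds.
highest = tail.max(axis=1)
print(f"{highest.mean():.0f} +/- {highest.std():.0f}")

# Alternative suggested by the reviewer: per-seed mean over the tail, then mean +/- std.
average = tail.mean(axis=1)
print(f"{average.mean():.0f} +/- {average.std():.0f}")
```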
> Weakness 3
As you pointed out, the discussion of training and inference efficiency is very interesting. We acknowledge that DACER has longer training and inference times than DSAC. However, our starting point is to train in an offline simulation environment and then deploy the trained network online. Therefore, a moderately longer training time is acceptable if performance gains can be realized (the Humanoid-v3 task, with the longest training time in MuJoCo, took only 7 hours in JAX).
We add experiments (Task: Ant-v3) comparing the inference and training time using the JAX framework on an AMD Ryzen Threadripper 3960X 24-Core Processor with an NVIDIA 3090Ti. We report the mean and standard deviation over 10 measurements. The results are shown in Table 1 of the author global rebuttal.
Fortunately, DACER's single-batch inference time is under 1 ms, which is acceptable for real-time applications. We also find that the training time of DACER is 4-5 times longer than that of DSAC; the bottleneck is the longer gradient chain through the diffusion policy. The question you raise is insightful and aligns with what we are currently working on: we are going to introduce an ODE-based method to eliminate this bottleneck. We invite you to follow our future work.
> Weakness 4
For the sake of fairness in comparison, we maintained the same experimental setup and plotting standards as used in DSAC. In DSAC, each iteration represents one network update, and 20 samples are collected per iteration. For the implementation of TRPO and PPO, we also aligned with DSAC: in their paper, they collect 2,000 samples per iteration, with a mini-batch size of 10 and repeat_num set to 10, so one network update likewise corresponds to 20 samples. To convert to environment steps, multiply the horizontal axis by 20.
> Minor note 1
We will revise Eq. (1) and Eq. (2) to make explicit how the state-action pair (s, a) enters the rewards on the right-hand side.
> Minor note 2
> Minor note 3
Thanks to your reminder, we have revised the captions of Figures 3-5 as follows:
"Ablation training curves for the entropy regulator mechanism. DAC stands for the variant without the entropy regulator. DACER's performance on Walker2d-v3 is far better than DAC's."
"Ablation experiment curves for the noise factor modulation mechanism. Adaptive tuning of the noise factor based on the estimated entropy achieves the best performance compared with fixing the noise factor or linearly decaying it between initial and final values."
"Ablation experiment curves for different diffusion steps. The best performance is achieved with 20 diffusion steps; training becomes unstable with 30 steps."
> Limitation
In Appendix B we discuss limitations and future work, which we will move into the main text in the official version.
Reference
[1] Xie et al. "Scalable diffusion models with transformers," ICCV, 2023.
[2] Dhariwal et al. "Diffusion models beat gans on image synthesis," NIPS, 2021.
[3] Black et al. "Training Diffusion Models with Reinforcement Learning," ICLR, 2024.
[4] Fan et al. "DPOK: Reinforcement Learning for Fine-tuning Text-to-Image Diffusion Models," NIPS, 2023.
I would like to share some good news with you: we have successfully trained DACER, DSAC, and PPO to convergence on the most complex tasks in MuJoCo, namely Humanoid-v3 and Ant-v3, and have obtained the following results.
Table 1 Average final return. Computed as the mean of the highest return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The mean value for each task is bolded. ± corresponds to standard deviation over five runs.
| Task | DACER | DSAC | PPO |
|---|---|---|---|
| Humanoid-v3 | **13209 ± 116** | 11087 ± 193 | 9134 ± 412 |
| Ant-v3 | **11470 ± 244** | 9083 ± 122 | 7645 ± 291 |
On Humanoid-v3 and Ant-v3, DACER outperforms DSAC by 2122 and 2387, respectively. This result is quite encouraging for us as well: as training progresses and DSAC gradually converges, the performance ceiling of DACER proves significantly higher than that of DSAC.
We firmly believe that this cutting-edge work, which combines diffusion models with online RL, will generate significant interest within the RL community. We look forward to your response.
I thank the authors for responding to my questions and concerns with the additional analyses. I will be keeping my current score.
Having addressed your concerns, we sincerely hope that you will reconsider your score (so that DACER can receive the attention and recognition it deserves within the RL community) for the following reasons:
- The importance of multimodal policies: multimodal policies are a key focus of the current RL community. As noted in [1], certain tasks can only be handled optimally by learning a multimodal policy.
- The scarcity of multimodal approaches: to the best of our knowledge, only DIPO and QSM have combined diffusion models with online RL to obtain multimodal policies. However, their relatively low performance hinders their application to real-world control tasks.
Given that DACER is currently the only online RL algorithm with both multimodality and SOTA performance, we are confident that it will attract significant attention from the RL community.
Through the discussions with the other reviewers, we trust you have observed that the recognition of DACER is gradually increasing. In our discussion with Reviewer NPyG, we reached a consensus on the balance between training efficiency and performance. In the discussion with Reviewer 3PnU, we noted DACER's superior performance ceiling, which far exceeds that of DSAC after training to convergence.
We look forward to continued productive exchanges with you and will continue to work on addressing your concerns.
Overall author rebuttal:
We thank all reviewers for their thoughtful comments. We greatly appreciate the reviewers' acknowledgment that our method is novel, cutting-edge work combining diffusion models with online RL, achieves SOTA performance on MuJoCo tasks, and is easy to follow. To address your concerns about training and inference time, the lack of comparative results with DIPO and QSM, the range of the noise-factor ablation, the performance of using a GMM directly as the policy, and how multimodality is represented and why it matters, we add the following new experiments and evaluations:
1. Experiments with training and inference time
We add experiments (task: Ant-v3) comparing inference and training time under the JAX framework on an AMD Ryzen Threadripper 3960X 24-core CPU and an NVIDIA 3090 Ti GPU. We report the mean and standard deviation over 10 measurements. Inference time is the time required for the policy to generate actions; backward time is the time required for gradient propagation. (Note that DIPO's backward time consists of two parts, gradient propagation and the action gradient. The diffusion steps of DACER and DIPO are both 20, and DIPO uses 20 action-gradient steps.) A minimal timing sketch is given after the table.
Table 1. Timing results. Under the same hardware conditions, the inference and backward times of the policy functions of DACER, DIPO, and DSAC are compared on Ant-v3 for different batch sizes.
| Mode | Batch | DACER | DIPO | DSAC |
|---|---|---|---|---|
| Inference time | 1 | 0.472ms ± 0.006ms | 0.502ms ± 0.011ms | 0.188ms ± 0.027ms |
| Inference time | 100 | 0.968ms ± 0.004ms | 0.970ms ± 0.006ms | 0.190ms ± 0.020ms |
| Backward time | 1 | 1.283ms ± 0.036ms | 0.582ms ± 0.020ms (gradient propagation); 0.801ms ± 0.008ms (action gradient) | 0.571ms ± 0.038ms |
| Backward time | 100 | 2.428ms ± 0.028ms | 0.601ms ± 0.021ms (gradient propagation); 1.483ms ± 0.006ms (action gradient) | 0.602ms ± 0.015ms |
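The timing protocol above can be reproduced with a few lines of JAX; the following is a minimal sketch under the stated setup (mean and standard deviation over 10 measurements), not the actual benchmarking script. `policy_apply`, `params`, and the observation shapes are placeholders.

```python
import time
import jax
import jax.numpy as jnp
import numpy as np

def time_policy(policy_apply, params, obs, n_repeats=10):
    """Measure the wall-clock time of a jitted policy call.

    policy_apply, params, and obs are placeholders for the actual policy
    function, its parameters, and a batch of observations.
    """
    apply_jit = jax.jit(policy_apply)
    # Warm-up call so that JIT compilation is excluded from the measurement.
    jax.block_until_ready(apply_jit(params, obs))
    times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        # block_until_ready forces JAX's asynchronous dispatch to finish
        # before the timer stops.
        jax.block_until_ready(apply_jit(params, obs))
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))

# Usage (shapes are illustrative): 1-batch and 100-batch observations.
# mean_1, std_1 = time_policy(policy_apply, params, jnp.zeros((1, 111)))
# mean_100, std_100 = time_policy(policy_apply, params, jnp.zeros((100, 111)))
```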
2. Experiment evaluating exploration performance.
To assess the multi-style capability of DACER, we use the method proposed in [1], which analyzes the diversity of state trajectories by computing the information entropy over 1,100 independent simulation episodes. A simplified sketch of this computation is given after the table.
Table 2. Information entropy results. DACER achieves the highest entropy value on Humanoid-v3, Ant-v3, Hopper-v3, and Pusher-v2.
| Task | DACER | DSAC | SAC | DDPG | TD3 | PPO | TRPO |
|---|---|---|---|---|---|---|---|
| Humanoid-v3 | 16.743 | 16.711 | 16.668 | 16.693 | 16.284 | 14.699 | 14.721 |
| Ant-v3 | 16.821 | 16.716 | 16.683 | 16.609 | 15.376 | 16.307 | 16.339 |
| Hopper-v3 | 12.453 | 11.822 | 9.559 | 10.462 | 10.141 | 10.803 | 10.310 |
| Pusher-v2 | 10.722 | 9.719 | 10.216 | 9.471 | 8.286 | 9.714 | 9.632 |
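As a rough illustration of how such a trajectory-diversity entropy can be computed (the exact procedure is the one in [1]; the uniform per-dimension binning below is our simplifying assumption), the states visited across episodes can be discretized and the Shannon entropy of the resulting empirical distribution computed:

```python
import numpy as np

def trajectory_entropy(trajectories, n_bins=50):
    """Shannon entropy of the empirical state-visitation distribution.

    trajectories: list of (T_i, state_dim) arrays collected from independent
    evaluation episodes. The uniform per-dimension binning is a simplifying
    assumption; the discretization in [1] may differ.
    """
    states = np.concatenate(trajectories, axis=0)
    # Normalize each state dimension to [0, 1] before binning.
    lo, hi = states.min(axis=0), states.max(axis=0)
    normed = (states - lo) / np.maximum(hi - lo, 1e-8)
    # Map each state to a discrete cell index per dimension.
    cells = np.minimum((normed * n_bins).astype(int), n_bins - 1)
    # Empirical probability of each occupied cell.
    _, counts = np.unique(cells, axis=0, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log(probs)).sum())
```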
3. Average return over the final 10% of iterations, averaged over 5 seeds.
Table 3. Average final return, computed as the mean of the average return values observed in the final 10% of iteration steps per run, with an evaluation interval of 15,000 iterations. The highest mean value for each task is shown in bold. ± denotes the standard deviation over five runs.
| Task | DACER | DSAC | SAC | TD3 | DDPG | TRPO | PPO |
|---|---|---|---|---|---|---|---|
| Humanoid-v3 | **10896 ± 306** | 9630 ± 469 | 7515 ± 399 | 5305 ± 277 | 3558 ± 848 | 744 ± 333 | 5398 ± 2211 |
| Ant-v3 | **7584 ± 211** | 6643 ± 324 | 6208 ± 1412 | 5772 ± 386 | 3474 ± 784 | 5851 ± 537 | 5908 ± 247 |
| Halfcheetah-v3 | **16489 ± 276** | 16182 ± 779 | 15884 ± 355 | 7775 ± 3627 | 11710 ± 3108 | 4586 ± 1007 | 5503 ± 2084 |
| Walker2d-v3 | **6496 ± 134** | 6171 ± 301 | 5364 ± 817 | 4785 ± 104 | 2577 ± 266 | 4708 ± 732 | 3634 ± 573 |
| Inverteddoublependulum-v3 | **9360 ± 0** | **9360 ± 0** | **9360 ± 0** | 9311 ± 11 | 8983 ± 11 | 5852 ± 2335 | 9355 ± 2 |
| Hopper-v3 | **3679 ± 362** | 3111 ± 616 | 2147 ± 621 | 2909 ± 510 | 1850 ± 275 | 2805 ± 404 | 1932 ± 290 |
| Pusher-v2 | **-20 ± 0** | -21 ± 1 | -23 ± 2 | -25 ± 4 | -37 ± 2 | -26 ± 5 | -24 ± 1 |
| Swimmer-v3 | **146 ± 2** | 135 ± 6 | 135 ± 14 | 114 ± 26 | 140 ± 4 | 69 ± 38 | 129 ± 1 |
4. Revised captions of Figures 3-5.
"Ablation training curves for the Entropy Regulator mechanism. DAC stands for not using the Entropy Regulator. DACER's performance on Walker2d-v3 is far better than DAC."
"Ablation experiment curves for the Noise factor modulation mechanism. Adaptive tuning of the noise factor based on the estimated entropy achieved the best performance compared to fixing the noise factor or using the adaptive tuning method with initial, end values followed by a linear decay method."
"Ablation experiment curves for the different diffusion steps. The best performance was achieved with diffusion steps equal to 20, in addition to the instability of the training process when equal to 30."
Reference
[1] Xiao et al. "Multi-Style Distributional Soft Actor-Critic: Learning a Unified Policy for Diverse Control Behaviors," in IEEE TIV, 2024.
This paper proposes a novel method combining online actor-critic RL with diffusion models.
Citing the reviewer xVKC, "...The experiment results show that this method leads to improvements in learned policy performance, with additional studies showing the multimodality of learned policies and ablations on each component of the proposed method. Learning multimodal policies is of significance to the RL community...".
This method might be of interest to the research community, so this paper is recommended for acceptance to NeurIPS 2024 as a poster presentation.