Entropy-regularized Diffusion Policy with Q-Ensembles for Offline Reinforcement Learning
Abstract
Reviews and Discussion
This work introduces a new (fast) SDE-based sampling technique to derive actions from a diffusion-based policy.
Strengths
- Compares favorably against the most relevant benchmark (Diff-QL)
- Good ablation studies; this was helpful to understand the importance of each component.
- Strong connection to standard tools in SDEs from diffusion literature
Weaknesses
- I'm not sure how big a weakness this is, but there appears to be some disconnect between the theory and the experiments regarding discounting, i.e., infinite time horizons. Why do you introduce a discount factor but use a finite time horizon in all equations?
- High memory requirement: What is the memory cost (i.e. in VRAM) for an ensemble of 64 models?
- Inconsistent use of "max Q trick" from [27]. Can you explain why this is used only in a limited set of environments? What happens if it is removed? A performance comparison would be useful.
Experiments
- Given that this is a (mostly) experimental paper, it would be good to have more experiments, in particular some studies on the sensitivity of the LCB coefficient (is it OK for it to be equal across all envs? It seems like the magnitude of the Q-value variance used in the LCB should be MDP-dependent.)
- Use of diffusion steps: can you show any (simple) experiments illustrating the effect of various values? It seems implicitly dependent on the environment since it somehow controls the expressivity of the policy (please correct me if wrong)
- L153, choice of the loss-balancing coefficient (see above comments)
- In Figure 3, can the training time be extended? Especially regarding the decrease in performance for antmaze-med-play: if performance always decreases over training time, this is important to note. (On the flip side, it can be noted as a useful limitation for future work to improve on.)
- Using non-scalar comparisons would be beneficial to gain a better idea of the statistical comparison of all algorithms (cf. https://github.com/google-research/rliable; in case the data is still available, it should be straightforward to generate the plots).
Questions
- It's a bit hard to see the variance of the green line in Fig. 1. Perhaps extracting it / using IQM plots with error bars would be helpful. (I know it is a toy example, but it does a good job illustrating the utility.)
- may just be me, but I believe using index in RL terms is more standard than , similarly for vs. . Perhaps consider swapping this notation?
- on a similar notation front, the use of seems like a poor choice since starting at L76 it is very general (not yet discussing actions). Maybe switch to ?
- I'm curious why different gradient norms were required; do you have any insight here? Was there parameter/grad divergence in some cases?
- Missing reference to SUNRISE? (I know it is UCB instead of LCB, but perhaps still useful to comment on the relationship): https://arxiv.org/pdf/2007.04938
- In Algorithm 1, can you please include $Q^{\mathrm{LCB}}_\psi$ rather than $Q_\psi$ alone (if that is correct)? Otherwise it is confusing where the previously discussed value estimate comes in.
- Can you explain a bit more about the sense of "optimality" regarding the reverse time SDE?
- Use of Mish activation function seems a bit non-standard. How important is it?
- L356: can you give a rough time comparison of the diffusion policy vs. standard approaches?
- How does entropy regularization interact with the SDE sampling? Is MaxEnt required for diffusion to be useful?
- Can you increase linewidth in Fig 1a/c and Fig 3? Adding symbol/markers would be helpful too for accessibility
Minor typographical issues:
- L10: "is tractable and that it can..."
- L27: rephrase "work introduce"
- L95-100, maybe point to Fig 1 to see multi-modality of proposed approach
Limitations
Beyond the weaknesses mentioned above, I believe the limitations were discussed adequately.
We appreciate your constructive feedback and the opportunity to improve our work. Your insights are invaluable, and we look forward to incorporating these revisions to strengthen our submission.
Weaknesses
-
Why do you introduce a discount factor but in all equations use a finite time horizon?
Note that all considered environments are continuous tasks, and the return is computed over an infinite time horizon; hence, a discount factor is used. On the other hand, a finite time horizon is used for the SDE steps. Note that the superscript of an action denotes the SDE step, while the subscript denotes the RL step.
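To make the two time indices concrete, here is a minimal illustration (the shorthand $a_t^i$ for the action at RL step $t$ and SDE step $i$ is ours, not necessarily the paper's exact notation):

```latex
% Discounted return over an infinite RL horizon (hence the discount factor gamma):
J(\pi) \;=\; \mathbb{E}\Big[\textstyle\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big],
% while each individual action a_t is produced by a finite number T of reverse-SDE steps:
a_t^{T} \sim \mathcal{N}(0, I), \qquad
a_t^{T} \;\to\; a_t^{T-1} \;\to\; \cdots \;\to\; a_t^{0} \;=\; a_t .
```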
-
Memory requirement of an ensemble of 64 Q-networks.
For the Antmaze task, the model requires ~3GB and ~5.5GB of VRAM in the two settings, respectively. We will report more details in the revised manuscript.
-
The sensitivity of the LCB coefficient.
Additional experiments on the LCB coefficient are provided in General Response Figure 5 and Table 1. These results show that a coefficient of 4 performs best in both the antmaze-medium-diverse and antmaze-medium-play environments. The value of the LCB coefficient determines the amount of pessimism, and of course, it can be tuned for different datasets.
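For illustration, a minimal sketch of the LCB aggregation over a Q-ensemble as we understand it from the paper's description; the tensor layout, the name `rho` for the LCB coefficient, and the default value are our assumptions:

```python
import torch

def q_lcb(q_values: torch.Tensor, rho: float = 4.0) -> torch.Tensor:
    """Pessimistic lower-confidence-bound aggregation over a Q-ensemble.

    q_values: shape (num_critics, batch), one Q estimate per ensemble member.
    rho: LCB coefficient; larger values subtract more of the ensemble spread,
         i.e. more pessimism on out-of-distribution actions.
    """
    return q_values.mean(dim=0) - rho * q_values.std(dim=0)
```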
-
Use of the "max Q trick".
We follow Diffusion-QL in only using max Q-backup on the Antmaze tasks. An ablation study of max Q-backup on selected Antmaze tasks is provided in General Response Table 2. One can see that the max Q-backup trick improves the Antmaze results.
-
Experiments on various numbers of diffusion steps.
The experiment on the number of diffusion steps is shown in General Response Table 3. Five steps give the best overall performance. Moreover, as shown in General Response Figure 1, even with few sampling steps the diffusion model already achieves a good approximation of the original data distribution.
-
Choice of the loss-balancing coefficient.
We select its value by checking the most commonly used value for each task group. Compared to Diffusion-QL, which carefully tunes parameters for each environment, our settings are more general. Here, the coefficient mainly balances the two losses, and we did not explore different values in the paper.
-
Extended training time.
We found that extending the training time decreases performance on the Antmaze tasks for most offline RL methods, such as Diffusion-QL. The Q-ensemble technique can alleviate this problem, but training remains unstable and challenging. We are happy to note this and list it as future work.
-
Non-scalar comparisons.
We appreciate the suggestion to use non-scalar comparisons for a more comprehensive statistical analysis. We have starred the rliable project and will be happy to include non-scalar comparisons in the future.
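As a concrete starting point, a small numpy/scipy sketch of the IQM-with-bootstrap-CI aggregation that rliable popularized; the `(runs, tasks)` score layout and the number of bootstrap resamples are assumptions:

```python
import numpy as np
from scipy.stats import trim_mean

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: the mean of the middle 50% of all scores."""
    return trim_mean(scores, proportiontocut=0.25, axis=None)

def iqm_with_ci(scores: np.ndarray, n_boot: int = 2000, seed: int = 0):
    """Point estimate and 95% bootstrap CI of the IQM over a (runs, tasks) score matrix."""
    rng = np.random.default_rng(seed)
    n_runs = scores.shape[0]
    boot = [iqm(scores[rng.integers(0, n_runs, n_runs)]) for _ in range(n_boot)]
    return iqm(scores), np.percentile(boot, [2.5, 97.5])
```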
Questions
-
Figures improvement.
Thank you for the suggestions on figure improvements. Due to the page limitation of the rebuttal PDF, we will try to update the figures (e.g., use IQM plots with error bars, increase the linewidth, and add symbols/markers) directly in the revised manuscript.
-
Notation problems e.g. "" vs. "" and "" vs. "".
We will consider swapping the notations to align more closely with standard RL terminology, making the presentation clearer.
-
Why different gradient norms?
Compared to Diffusion-QL, which carefully tunes hyperparameters (e.g., the norm value) for each environment, our settings are more general: the values are taken close to the average values used in Diffusion-QL, and we use the same gradient norm value for the same type of tasks.
-
Missing Reference to SUNRISE.
Thank you for the suggestion. We will include a discussion on SUNRISE and its relationship to our method, particularly how it compares to using LCB.
-
$Q^{\mathrm{LCB}}_\psi$ in Algorithm 1.
Thanks for pointing this out; we will change it to "Update policy by (15) using $Q^{\mathrm{LCB}}_\psi$".
-
The sense of "optimality" regarding the reverse time SDE?
The path of the reverse-time SDE can be diverse and tortuous due to the stochasticity of the diffusion term, which requires many timesteps to reach the real data distribution. Our optimal sampling utilizes the conditional posterior trick (see Proposition 3.1) and can generate new samples in very few steps.
-
Mish activation function.
In this work, we did not focus on architecture tuning; our network simply follows Diffusion-QL with three dense layers and the Mish activation function.
-
Time comparison of diffusion policy vs standard approaches.
We have conducted additional experiments to compare the training and evaluation time of our method against a Gaussian-policy variant under different settings. The results are shown in General Response Table 4. While our diffusion policy requires more computational resources during training compared to standard Gaussian policies, the performance gains justify the additional time investment. The evaluation times remain comparable, ensuring practical applicability in real-world scenarios. These results provide a clearer understanding of the computational trade-offs and reinforce the robustness and effectiveness of our proposed method.
-
How does entropy regularization interact with the SDE sampling?
During training, we first sample an action from the SDE and then learn to maximize an approximation of its entropy to increase exploration of the action space in the offline setting. In this stage, the MaxEnt term is helpful for policy learning. At inference time, we directly sample actions from the SDE, and the entropy term does not affect sampling.
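Purely as a schematic of the training-time interaction described above (none of these names are the paper's API; `sample_with_entropy`, `bc_loss`, `lcb`, `eta`, and `alpha` are hypothetical placeholders):

```python
import torch

def policy_update(policy, critics, batch, alpha: float, eta: float, optim: torch.optim.Optimizer):
    """One schematic policy step: sample an action via the reverse SDE, estimate its entropy,
    and combine a behavior-cloning term, a pessimistic Q term, and the entropy bonus."""
    s = batch["state"]
    a, entropy_est = policy.sample_with_entropy(s)   # action from the SDE + entropy approximation
    q = critics.lcb(s, a)                            # pessimistic Q-ensemble (LCB) value
    loss = policy.bc_loss(batch) - eta * q.mean() - alpha * entropy_est.mean()
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()
```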
The paper presents Entropy-Regularized Diffusion Policy with Q-Ensembles for offline reinforcement learning. This method addresses Q-value overestimation on out-of-distribution (OOD) data by using a mean-reverting stochastic differential equation (SDE) to transform action distributions into a Gaussian form, combined with entropy regularization for enhanced exploration and Q-ensembles for pessimistic Q-value estimation. The approach achieves state-of-the-art performance across D4RL benchmarks especially on AntMaze.
Strengths
The paper introduces a method that combines entropy regularization with Q-ensembles within the framework of diffusion policies for offline RL, offering a novel solution to tackle Q-value overestimation and limited exploration of OOD samples. The use of a mean-reverting SDE to model action distributions is theoretically robust and aligns well with the goal of transforming actions into a tractable Gaussian form for efficient sampling. The authors provide detailed theoretical contributions, including the tractability of entropy regularization in diffusion policies and the benefits of using Q-ensembles for robust value estimation.
Weaknesses
-
Computational Cost: High computational resources are required, including extensive training epochs on high-capacity GPUs, which limits accessibility and applicability in real-world scenarios with constrained resources.
-
Hyperparameter Sensitivity: Although some experiments with different hyperparameters are provided, a more thorough exploration of hyperparameters like entropy temperature and Q-ensemble size is needed to further validate the method's robustness.
-
Generality: The evaluation is robust but focused on D4RL benchmarks. More experiments on diverse or real-world offline RL tasks would strengthen the claims about the method's generalizability.
Questions
-
Could you please give a more thorough explanation of why the choice of hyperparameters, especially the entropy temperature, plays a critical role in the performance? Given that the Adroit and Kitchen tasks are more complex than Gym and AntMaze, I am skeptical of the explanation based on the narrowness of human demonstrations.
-
What are the potential challenges or limitations in extending this method to real-world offline RL tasks outside the D4RL benchmarks?
-
Could the authors provide more intuitive explanations or visualizations to illustrate the workings of the mean-reverting SDE and its impact on action distribution sampling?
Limitations
The authors have identified the limitations of the proposed method: high computational cost and long inference time.
Thank you for your detailed review and for recognizing the strengths of our work. We appreciate your positive feedback on the novelty and theoretical robustness of our proposed method.
-
Computational cost.
We acknowledge that diffusion policies require longer training and inference times due to the multi-step sampling process. However, the increased time is manageable and justified by the significant performance gains. A detailed comparison is provided in General Response Table 4. Moreover, techniques such as SDE/ODE solvers and diffusion distillation have proven useful for accelerating sampling. We regard this as future work and will try to apply a faster version to real-world scenarios.
-
Hyperparameter sensitivity...exploration of hyperparameters like entropy temperature and Q-ensemble size is needed.
Please let us clarify that ablation studies on the entropy temperature and the Q-ensemble size are already provided in Table 3 and Table 4 of the paper. Moreover, we add two further ablation experiments on the LCB coefficient and the number of diffusion steps in General Response Table 1 and Table 3.
-
Challenges and limitations in real-world applications.
As discussed in the Conclusion section, the main challenges and limitations of our work are the action sampling time and the computational cost. The network is also quite simple (only three dense layers) for complex real-world applications. Our future work will investigate real-time policy distillation under time and compute constraints and explore more efficient network architectures to address real-world datasets and scenarios.
-
Could the authors provide more intuitive explanations or visualizations to illustrate the workings of the mean-reverting SDE and its impact on action distribution sampling.
Intuitively, the mean-reverting SDE has a forward process (action to noise) and a reverse process (noise to action). The forward process is used only in training, where it provides an analytically tractable solution for learning the score network. At inference, the reverse process (based on the trained score network) is used as the policy to generate actions conditioned on RL states. A rough visualization of the workings of the mean-reverting SDE is provided in General Response Figure 4. This will also be clarified in the revised manuscript.
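To illustrate the forward (action-to-noise) direction numerically, a tiny numpy sketch of the closed-form marginal under the assumptions $\mu = 0$ and $\sigma_t^2 = 2\theta_t$ discussed in the other responses (a standard OU-process fact, not code from the paper):

```python
import numpy as np

def forward_perturb(a0: np.ndarray, theta_bar_t: float, rng: np.random.Generator | None = None):
    """Sample a^t | a^0 ~ N(a^0 * exp(-theta_bar_t), 1 - exp(-2*theta_bar_t)),
    where theta_bar_t = integral of theta_s from 0 to t (cumulative mean reversion)."""
    rng = rng or np.random.default_rng(0)
    mean = a0 * np.exp(-theta_bar_t)
    std = np.sqrt(1.0 - np.exp(-2.0 * theta_bar_t))
    return mean + std * rng.standard_normal(a0.shape)

# As theta_bar_t grows, the marginal approaches a standard Gaussian: the action becomes pure noise.
```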
Thank you for your rebuttal. However, I believe there may have been a misunderstanding regarding my first question. In Table 1, your method underperforms compared to others (e.g., Diff-QL) in the Adroit and Kitchen environments. I question your explanation attributing this to the narrowness of human demonstrations. Wouldn't it be more convincing to support your statement by generating additional data for retraining the policy, rather than by focusing on tuning the entropy temperature? Adroit and Kitchen are generally considered more challenging than Gym and AntMaze, which raises further questions about this approach.
Additionally, Table 3 suggests that the auto-selection of the entropy temperature is more critical for performance in Adroit and Kitchen than in Gym and AntMaze. Could you elaborate on why this is the case?
We sincerely appreciate your timely feedback and follow-up questions.
- Why can't we add additional training data?
In offline RL, we typically avoid modifying datasets to ensure fair comparisons across different methods. While adding additional training data is indeed a potential solution for improving performance, in this study, we focused on making algorithmic adjustments to ensure consistency and fairness in evaluation. This approach allows us to directly compare the effectiveness of our method against others without introducing external variables. However, we acknowledge that augmenting data could be beneficial in real-world applications and plan to explore this in future work.
- The results in Table 1.
In Table 1, our method underperforms in the Adroit and Kitchen environments compared to Diffusion-QL, mainly due to the fixed entropy temperature of 0.01. This fixed value leads the agent to keep exploring the action space throughout the entire training process, even when encountering unseen states. While exploration is generally advantageous, it can be detrimental in environments with limited data variability such as Adroit and Kitchen. With sufficient data, the actor is encouraged to explore guided by accurate Q-value estimates; however, for unseen state-action pairs, such exploration may harm performance. Additionally, unlike in the AntMaze tasks, random actions are more likely to hurt performance in more complex environments like Kitchen. Overall, excessive exploration prevents the agent from effectively leveraging the strategies learned from human demonstrations, and random actions are more detrimental in Adroit and Kitchen, where precise control is essential. These factors contribute to the lower performance observed compared to Diffusion-QL in these tasks.
- Why tune the entropy temperature?
Auto-tuning the temperature is useful because it dynamically adjusts the balance between exploration and exploitation based on the data. Initially, the temperature is set to a non-zero value to encourage exploration of the action space. As training progresses, especially in environments like Adroit and Kitchen where precise control is crucial, the auto-tuning mechanism reduces it to near zero. Also, with more accurate Q-functions later in training, this shift towards exploitation helps the agent focus on optimal actions, improving performance.
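A minimal sketch of what a state-dependent, SAC-style temperature update could look like; `log_alpha_net` and the exact loss form are our assumptions, not the paper's implementation:

```python
import torch

def alpha_loss(log_alpha_net: torch.nn.Module, states: torch.Tensor,
               log_pi: torch.Tensor, target_entropy: float) -> torch.Tensor:
    """SAC-style temperature loss with a per-state alpha(s).

    target_entropy is typically the negative action dimension (as in SAC);
    log_pi is the (approximate) log-probability of the sampled actions.
    """
    log_alpha = log_alpha_net(states)                       # (batch, 1) per-state log-temperature
    # Only alpha receives gradients; the policy term is treated as a constant.
    return -(log_alpha * (log_pi + target_entropy).detach()).mean()
```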
Thanks for your rebuttal. You have addressed most of my concerns. I suggest adding part of these illustrations to the final version of the paper. Thank you.
Thank you for your positive feedback and for taking the time to review our rebuttal. We are pleased that our responses addressed most of your concerns and we will certainly include the illustrations in the final version of the paper to enhance clarity and completeness.
The paper proposes to use a reverse-time SDE as the policy in an actor-critic algorithm. To make it work, entropy regularization is added, for which an entropy approximation scheme is suggested. Furthermore, to improve stability, an ensemble of Q-networks is employed, and the pessimistic lower confidence bound (LCB), i.e., the ensemble mean minus a scaled ensemble standard deviation, is taken as the value.
Evaluations on D4RL show improved performance compared to baselines.
Strengths
Originality: good. The paper proposes a few novel ideas to make diffusion policy work in offline RL. Quality: good. Overall presentation and evaluation are good. Tested only on D4RL, but this is common in offline RL. Clarity: excellent. Significance: excellent. Potentially a new baseline on D4RL, especially AntMaze.
Weaknesses
- Influence of multi-modality of the policy not sufficiently explored. Is this in the end what makes it better compared to using a Gaussian? Or is it rather the LCB and Q-ensemble? How would SAC with Q-ensemble and LCB-value perform (i.e., not using the diffusion policy)?
- No comparison to other methods that use multi-modal policies, e.g., Gaussian mixture model policy (Wasserstein Gradient Flows for Optimizing Gaussian Mixture Policies) or energy-based policies (Maximum Entropy Reinforcement Learning via Energy-Based Normalizing Flow)
Questions
In the abstract, the authors write "we show that the entropy of such a policy is tractable". This sounds like it is analytically tractable. But in fact an approximation is used (Sec. 3.2). I would suggest to reformulate the statement in the abstract to avoid such misunderstanding.
Limitations
Computational limitation is briefly mentioned in the conclusion. It would be nice to have a more extended discussion of limitations and providing some numbers on the timings.
Thank you for your thoughtful review and for recognizing the novelty and potential impact of our work. We appreciate your positive feedback and are happy to provide our point-to-point response below:
-
Influence of multi-modality of the policy...SAC with Q-ensemble and LCB
In offline RL, the pre-collected datasets are often imbalanced and induce a complex multi-modal behavior policy rather than a Gaussian one. We provide a comparison in Table 2 of the paper, which shows that our model outperforms MSG (a Gaussian policy with Q-ensembles) on all AntMaze tasks, indicating that the multi-modal policy is important and improves over the Gaussian policy in offline RL. The ablation study in Table 4 of the paper also shows that increasing the ensemble size further improves performance. Moreover, unfortunately, we cannot directly compare our method with SAC on offline RL tasks, since SAC and its variants are online algorithms that need to interact with the environment.
-
Comparison to Other Multi-Modal Policies
Please allow us to clarify that we have compared against behavior cloning (BC), the decision transformer (DT), and Diffusion-QL, which are all non-Gaussian multi-modal policies. We appreciate the reviewer's suggestion and will be happy to implement the mentioned Gaussian-mixture and energy-based policies in future work.
-
Reformulation of the Abstract
Thank you for pointing out the unclear formulation in the abstract. We will revise the abstract to clarify that the entropy of the policy can be approximated in a tractable way.
-
Computational Limitations and Timings
We acknowledge that the computational time was only briefly mentioned. We have added experiments comparing training and evaluation time across different policy types, diffusion steps, and numbers of critics. The results are shown in General Response Table 4. These results indicate that while the diffusion policy requires longer training times than the Gaussian policy, the increase is manageable and justified by the significant performance gains. The evaluation times remain relatively comparable, suggesting that practical deployment of the trained models is feasible.
I thank the authors for their responses. They answer all of my questions.
The paper proposes that adding entropy regularization is beneficial in offline RL and that pessimistic Q-value estimation through ensemble methods can provide a better estimate of the Q-value. Figure 1 explicitly shows the benefit of the ensemble Q method. The method shows impressive performance on D4RL.
Strengths
The paper is well-organized and easy to follow. I found the pessimistic ensemble Q trick, which increases estimation accuracy (Figure 1), to be interesting. It is intriguing to see that adding entropy regularization to the agent can also benefit offline RL in some tasks.
Weaknesses
I have some trouble bridging some gaps in the mathematical proofs and found that some mathematical equations may not hold.
The empirical performance in Table 1 is not consistent with the training curve in Figure 3.
More detailed questions about these two aspects are in the questions section.
Questions
-
A typo in Eq. (5): there is an unnecessary right parenthesis.
-
I found that the mean-reverting SDE is a mathematical rewrite of the VP SDE, see [42]. The VP SDE has the form
$$ dx = -\tfrac{1}{2}\beta(t)\, x\, dt + \sqrt{\beta(t)}\, dw, $$
while the mean-reverting SDE has the form
$$ dx = -\theta_t\, x\, dt + \sqrt{2\theta_t}\, dw. $$
-
These two SDEs are the same in my mind. Then why is it called Mean-Reverting SDE instead of VP SDE in [42], and why is [42] not cited in this context? Can the authors point out any differences between them?
-
Optimal Sampling seems to use the same trick as DDIM [41], using the conditional posterior $q(x_{t-1} \mid x_t, x_0)$ for reverse sampling instead of $p_\theta(x_{t-1} \mid x_t)$ as in DDPM. If that's the case, I believe the authors should point this out and properly cite DDIM in the relevant context.
-
It is amazing to see that in Figure 2, optimal sampling achieves such great performance with so few steps. Can the authors also provide its performance on this toy task with even fewer steps? I am curious about the performance in that regime.
-
It is hard for me to prove Eq(14). Can the authors elaborate on more details about its proof?
-
Can the authors elaborate more on how Eq(15) is derived from Eq(12) and Eq(13)? I guess the last term in Eq(15) should be , not . (There is a typo in Eq(15). It should be not ).
-
The paper proposes using automatic entropy tuning as in SAC in line 216. In line 551, the authors mention that the entropy target is the dimension of the action space. I guess it should be the negative dimension of the action space (as in SAC)? For the alpha loss in Eq. (45), $\alpha$ depends on the state, while in SAC it doesn't. Why is the alpha loss designed this way? Is there any difference in making $\alpha$ depend on the state?
-
I found your method's performance in Antmaze-medium-play-v0, Antmaze-medium-diverse-v0, Antmaze-large-play-v0, and Antmaze-large-diverse-v0 in Table 1 to be inconsistent with the training curves in Figure 3. For example, the score of Antmaze-medium-diverse-v0 in the table is 91.6, while the training curve in Figure 3 shows a score of around 40. Why do they differ so much? The same phenomenon occurs in the other three Antmaze environments. Can the authors explain the mismatch? Can the authors also provide training curves for other environments?
-
In Table 1, how is the score recorded? Is it the final round score, online model selection, offline model selection, or moving average?
-
In Eq (34) and (35), should , , and be subscript or superscript? Why is subscript and superscript?
-
Can the authors elaborate on why Eq(43) holds?
-
In Table 5, the loss type is Likelihood or Noise. What are their mathematical forms?
[41] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
[42] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
Limitations
The paper has some gaps in the mathematical proofs. Additionally, more experimental details need to be added. I am more than willing to increase my score if my questions are well addressed by the authors.
Thank you for your thorough review and insightful feedback. Your comments have been invaluable in guiding our efforts to refine and clarify our work. Below, we address your main concerns in detail:
-
Mean-reverting SDE and VP SDE.
Our mean-reverting SDE is derived from the well-known Ornstein-Uhlenbeck (OU) process [1], which has the following form:
$$ dx = \theta_t(\mu - x)\, dt + \sigma_t\, dw. $$
As $t \to \infty$, its marginal distribution converges to a stationary Gaussian centered at the mean value $\mu$, which gives the informative name "mean-reverting". We assume there is no prior knowledge of the actions and thus set $\mu = 0$ to generate actions from standard Gaussian noise. Then, with $\sigma_t^2 = 2\theta_t$, the mean-reverting SDE has the same form as the VP SDE. However, in [42], no solution of the continuous-time SDE was given. The authors start from perturbing data with multiple noise scales and generalize this idea to an infinite number of noise scales, which makes the perturbed data distributions evolve according to an SDE. They keep using the solution of DDPM, while we use Itô's formula to solve the continuous SDE. Compared to the original VP SDE, our mean-reverting SDE is analytically tractable (Eq. (2)) and thus its score is easier to learn. More importantly, the solution of the mean-reverting SDE can be used for entropy approximation. We will clarify these points and cite [42] appropriately in the revised manuscript.
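For completeness, the standard closed-form marginal this refers to, written under the same assumptions ($\mu = 0$, $\sigma_t^2 = 2\theta_t$); this is textbook OU algebra rather than a quote of the paper's Eq. (2):

```latex
% Applying Ito's formula to e^{\bar\theta_t} x_t, with \bar\theta_t := \int_0^t \theta_s \, ds:
x_t \;=\; x_0\, e^{-\bar\theta_t} \;+\; \int_0^t e^{-(\bar\theta_t - \bar\theta_s)} \sqrt{2\theta_s}\; dw_s,
% so the transition kernel is an explicit Gaussian,
p(x_t \mid x_0) \;=\; \mathcal{N}\!\big(x_t;\; x_0\, e^{-\bar\theta_t},\; 1 - e^{-2\bar\theta_t}\big),
% which converges to \mathcal{N}(0, 1) as \bar\theta_t \to \infty.
```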
-
Optimal Sampling and DDIM.
Yes, our optimal sampling uses the same conditional posterior distribution as DDIM for reverse sampling, and we will cite it in Section 3.1 of the revised manuscript.
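For readers less familiar with the DDIM trick, a generic deterministic DDIM-style update in standard notation (cumulative signal rate $\bar\alpha_t$); this is the textbook form, not the paper's mean-reverting parameterization:

```python
def ddim_step(x_t, eps_pred, alpha_bar_t: float, alpha_bar_prev: float):
    """One deterministic DDIM update: estimate x_0 from the predicted noise, then jump
    directly to an earlier timestep through the conditional posterior mean (eta = 0)."""
    x0_pred = (x_t - (1.0 - alpha_bar_t) ** 0.5 * eps_pred) / alpha_bar_t ** 0.5
    return alpha_bar_prev ** 0.5 * x0_pred + (1.0 - alpha_bar_prev) ** 0.5 * eps_pred
```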
-
Performance of optimal sampling with fewer steps.
We provide new samples generated with fewer steps on the toy task in General Response Figure 1.
-
Proof for Eq.(14).
Eq. (14) is an application of Bayes' rule with multiple conditions.
Here we use the conditional form since the involved actions are computed from the diffusion sampling process, and the conditional posterior is a certain Gaussian (with known mean and variance; see Appendix A.4). We will clarify the notation and proof in the revised manuscript.
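The generic identity being invoked is Bayes' rule with an extra conditioning variable (placeholder variables, since the exact terms of Eq. (14) are not reproduced in this thread):

```latex
p(a \mid b, c) \;=\; \frac{p(b \mid a, c)\; p(a \mid c)}{p(b \mid c)} .
```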
-
Clarification on Eq.(15).
Substituting Eq. (13) into Eq. (12) gives the new policy objective, and its last term should indeed be as you suggest. However, we assume that the terminal state must be a standard Gaussian and thus add the corresponding condition, which can be obtained from Eq. (14); the results are not affected. (And thank you for pointing out the typo.)
-
Entropy target should be the negative dimension of the action space.
Yes, our code uses the negative dimension of the action space as the entropy target, the same as SAC. We will correct it in the paper.
-
The $\alpha$ depends on the state in Eq. (45), while in SAC it doesn't. Why is the alpha loss designed this way?
SAC is an online algorithm that interacts with the environment and collects new samples covering various states and actions. However, the offline dataset is pre-collected and may be imbalanced across different states. Thus, our idea for the alpha loss is to assign a different entropy weight to each state based on the available data.
-
Inconsistency between Table 1 and Figure 3.
We made a mistake in Figure 3 for Antmaze-medium-diverse-v0, where we used the wrong data source for the curve, resulting in a major inconsistency. The updated figure is provided in General Response Figure 2. The reason for the other, minor inconsistencies is that we report the mean of the best results in the tables while showing the average values at each step in the figures. Moreover, the error band is the default in seaborn.lineplot, which uses a 95% confidence interval.
-
Training curves for other environments?
We show some training curves in General Response Figure 3 and will also add other curves to the appendix in the revised manuscript.
-
How is the score recorded?
Because the behavior-cloning (BC) loss is not a suitable criterion for exploring OOD samples, we choose online evaluation to select the model. It is worth noting that Diffusion-QL has shown that the online-selected model performs similarly to the offline-selected model. This will be clarified in the revised manuscript.
-
Subscript or superscript in Eq. (34) and Eq. (35).
All actions should use the superscript to denote sampling steps. We have now fixed them in the manuscript.
-
Why Eq.(43) holds?
Eq. (43) holds when the actions are sequentially sampled from the diffusion process; the corresponding term is the conditional posterior sampling from Proposition 3.1. We will elaborate on the details in the revised manuscript.
-
Likelihood loss and Noise loss in Table 5.
The "Noise" loss is similar to the simplified loss in DDPM, as shown in Eq. (5). In contrast, the "Likelihood" loss is proposed in IR-SDE [2] and forces the model to learn the optimal reverse path. Generally, the Likelihood loss is more stable, while the Noise loss tends to generate more stochastic samples. This will be clarified in the revised manuscript.
-
Typos
Thank you for pointing out these typos and we will fix them in the revised manuscript.
[1] Exact numerical simulation of the Ornstein-Uhlenbeck process and its integral. Physical review E, 1996.
[2] Image restoration with mean-reverting stochastic differential equations. ICML, 2023.
Thank you for the detailed responses to my questions. I appreciate the effort put into addressing most of my theoretical concerns. It appears that the diffusion training and sampling schedule is derived from IR-SDE. I would suggest including more theoretical components in the revised manuscript to enhance its completeness.
However, a few questions regarding the experimental results remain unresolved:
1. The authors mention the use of online model selection to record performance, while the DQL results presented in the main table are based on offline selection. This discrepancy might lead to a mismatch in comparisons. It would be fairer to compare performance using offline model selection. I understand that the BC loss might not function optimally with this method, but do the authors have any results for offline-selection performance in Antmaze, Adroit, or Kitchen based on other criteria?
2. Regarding the training curve provided in the rebuttal, specifically Figure 2 in the global response, I noticed that the DQL performance in antmaze-medium-diverse appears low. While it is expected to be around 78 (offline) or 82 (online), the training curve consistently remains below 40. This may not be a significant issue, but I would appreciate it if the authors could provide an explanation for this observation.
3. Lastly, a minor question: is the parameter $\alpha$ implemented as a neural network?
Since my theoretical questions have been resolved, I have raised my score accordingly.
Thank you for your thorough review and your continued engagement with our work. We appreciate the time and effort you have invested. Below, we address your remaining concerns:
-
Online vs. Offline Model Selection
We acknowledge that a comparison using offline model selection is essential for ensuring fairness. The table below presents a comparison of our method with Diffusion-QL, including both online and offline results. Additionally, we include our method's performance under offline selection using the BC-loss criterion, selecting the step at which the change between consecutive steps is less than 4e-3. We will conduct further experiments to identify better offline model-selection criteria and to evaluate performance across the Antmaze, Adroit, and Kitchen environments. These results will be incorporated into the revised manuscript.
| AntMaze Tasks | Diffusion-QL (Offline) | Diffusion-QL (Online) | Ours (Offline) | Ours (Online) |
|---|---|---|---|---|
| antmaze-umaze-v0 | 93.4 | 96.0 | 99.0 | 100.0 |
| antmaze-umaze-diverse-v0 | 66.2 | 84.0 | 67.5 | 79.8 |
| antmaze-medium-play-v0 | 77.6 | 79.8 | 84.0 | 91.4 |
| antmaze-medium-diverse-v0 | 78.6 | 82.0 | 85.4 | 91.6 |
| antmaze-large-play-v0 | 46.6 | 49.0 | 72.6 | 81.2 |
| antmaze-large-diverse-v0 | 56.6 | 61.7 | 65.9 | 76.4 |
| Average | 69.6 | 75.4 | 79.2 | 86.7 |
-
Training Curve in Antmaze-Medium-Diverse
We obtained the results using the official GitHub code, where the normalized score reflects the online evaluation at each step. The error bars in the graph represent the standard error. We apologize for a minor error that remains in the updated graph; although we corrected the data source, we inadvertently used the 95% confidence interval for our method's results. As noted in the paper, the training of Diffusion-QL is inherently unstable, with performance occasionally dropping to zero. In addition, the best performance for each run occurs at different training steps. We also observed that in 2 out of 5 runs, the score remained at zero from the beginning to the end of training. These factors make the mean and standard error appear significantly worse than they would under more stable conditions. Using our online model selection approach, the average result we obtained for Diffusion-QL is 39.8 ± 43.3. The average over the 3 valid runs (excluding the two runs that remained at zero) is 66.3 ± 33.4.
-
Implementation of $\alpha$ as a neural network
Yes, $\alpha$ is implemented as a neural network with a single hidden layer of 32 units.
We would like to thank all reviewers for their detailed reviews and constructive comments. We have conducted additional experiments to address the raised concerns and further validate our approach. All figures can be found in the attached PDF file. Below, we summarize the key results and discussions related to these experiments.
Figure 1 (for Reviewer jdmN): The proposed optimal sampling with different sample steps. We added figures of data generation with fewer sampling steps for the toy task in Section 3.1. The results show that the optimal sampling strategy significantly outperforms the reverse-time SDE at all step counts, further demonstrating the efficiency and effectiveness of our method.
Figure 2 (for Reviewer jdmN): Corrected learning curves of Diffusion-QL and our method on selected AntMaze tasks. We updated the learning curves of our method and Diffusion-QL on selected AntMaze tasks, showing that our method consistently outperforms Diffusion-QL across various environments (the subfigure for the antmaze-medium-diverse-v0 environment in the draft used the wrong data; this is now fixed).
Figure 3 (for Reviewer jdmN): Additional learning curves of our method on different environments over 5 random seeds. We validate the robustness of our method with additional learning curves. However, due to the page limitation, we only provide the learning curves of Gym-medium-replay environments, and we will add the rest directly to the appendix of the revised manuscript.
Figure 4 (for Reviewer Jzch): Visualization of the workings of the mean-reverting SDE for action prediction. For a more intuitive explanation of our approach, Figure 4 outlines the forward and reverse processes of the mean-reverting SDE used for action prediction.
Figure 5 and Table 1 (for Reviewer T2t7): Ablation experiments of our method with different LCB coefficients. To explore the impact of the LCB coefficient, we add an experiment with coefficient values of 1, 2, and 4 on the AntMaze-medium environments. Figure 5 demonstrates that adjusting the LCB coefficient improves performance, particularly for higher values, which helps in managing the exploration-exploitation trade-off effectively. The numerical results are provided in Table 1 below.
Table 1: Experiment on LCB coefficients
| LCB Coefficient | 1 | 2 | 4 |
|---|---|---|---|
| Antmaze-medium-play-v0 | 82.4 ± 4.9 | 88.6 ± 1.5 | 91.6 ± 2.3 |
| Antmaze-medium-diverse-v0 | 74.6 ± 3.7 | 84.0 ± 7.8 | 91.4 ± 1.5 |
| Average | 78.5 | 86.3 | 91.5 |
Table 2 (for Reviewer T2t7): Ablation experiments with max Q-backup trick on AntMaze tasks. We conducted experiments with and without max Q-backup on AntMaze tasks in Table 2. The inclusion of max Q-backup significantly enhances performance, particularly in more complex environments (e.g., Antmaze-large).
Table 2: Experiment on "Max Q trick"
| Max Q-backup | True | False |
|---|---|---|
| Antmaze-medium-play-v0 | 91.6 ± 2.3 | 89.2 ± 2.9 |
| Antmaze-medium-diverse-v0 | 91.4 ± 1.5 | 87.6 ± 1.8 |
| Antmaze-large-play-v0 | 81.2 ± 3.0 | 22.3 ± 7.1 |
| Antmaze-large-diverse-v0 | 76.4 ± 2.1 | 26.5 ± 6.1 |
Table 3 (for Reviewer T2t7): Ablation experiments with different numbers of diffusion steps on selected tasks. We evaluated the impact of varying the number of diffusion steps across a range of tasks, including AntMaze, Gym, and Kitchen, in Table 3. Our findings indicate that while increasing the number of steps generally improves performance, five steps provide the best balance across tasks and between performance and computational time. Thus, we choose five diffusion steps for all tasks in the paper.
Table 3: Experiment on diffusion step
| Diffusion Step | 3 | 5 | 10 |
|---|---|---|---|
| Halfcheetah-medium-replay-v2 | 43.4 | 57.0 | 49.5 |
| Hopper-medium-replay-v2 | 39.4 | 102.7 | 101.7 |
| Walker2d-medium-replay-v2 | 51.2 | 94.2 | 98.1 |
| Antmaze-medium-play-v0 | 96.6 | 91.6 | 90.2 |
| Antmaze-medium-diverse-v0 | 95.8 | 91.4 | 83.8 |
| Antmaze-large-play-v0 | 67.6 | 81.2 | 63.2 |
| Antmaze-large-diverse-v0 | 81.0 | 76.4 | 70.0 |
| pen-human-v1 | 65.4 | 67.2 | 70.0 |
| pen-cloned-v1 | 67.3 | 66.3 | 68.4 |
| Kitchen-complete-v0 | 7.5 | 82.3 | 92.7 |
| Kitchen-partial-v0 | 10.9 | 60.3 | 66.3 |
| Kitchen-mixed-v0 | 4.8 | 60.2 | 68.0 |
| Average | 52.6 | 77.6 | 76.8 |
Table 4 (for all Reviewers): Computational time comparison. We include a detailed comparison of training and evaluation times for Gaussian/diffusion policies and Q-ensembles in Table 4 below. Increasing the number of critics from 2 to 64 has almost no influence on the evaluation time. The number of diffusion steps has more impact on both training and evaluation time, but this is a common issue for diffusion models. We regard sampling acceleration as future work and will try to address it with reliable SDE/ODE solvers or diffusion distillation techniques.
Table 4: Computational time comparison with different settings on Antmaze-medium-play-v0
| Policy | Diffusion Step | # Critics | Training Time (1 Epoch) | Eval Time (1k steps) |
|---|---|---|---|---|
| Gaussian | 1 | 2 | 5m 35s | 1s 450ms |
| Gaussian | 1 | 64 | 7m 20s | 1s 450ms |
| Diffusion | 5 | 2 | 9m 30s | 4s 800ms |
| Diffusion | 5 | 64 | 11m | 4s 800ms |
| Diffusion | 10 | 2 | 12m 23s | 8s |
| Diffusion | 10 | 64 | 13m 55s | 8s |
The reviewers and I are all in agreement that this paper is clearly written and provides a well-founded method with adequate empirical evidence showcasing its merits. While some concerns were raised in the reviews, the authors have addressed most of them during the rebuttal. As such, I am recommending acceptance.