PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; min 5, max 7, std 0.7)
Individual ratings: 6, 5, 7, 6
Confidence: 4.0
Correctness: 3.0 · Contribution: 2.8 · Presentation: 2.5
NeurIPS 2024

Diffusion-based Curriculum Reinforcement Learning

OpenReview | PDF
Submitted: 2024-05-14 · Updated: 2024-11-06
TL;DR

A novel diffusion-based curriculum reinforcement learning method.

Abstract

Curriculum Reinforcement Learning (CRL) is an approach to facilitate the learning process of agents by structuring tasks in a sequence of increasing complexity. Despite its potential, many existing CRL methods struggle to efficiently guide agents toward desired outcomes, particularly in the absence of domain knowledge. This paper introduces DiCuRL (Diffusion Curriculum Reinforcement Learning), a novel method that leverages conditional diffusion models to generate curriculum goals. To estimate how close an agent is to achieving its goal, our method uniquely incorporates a $Q$-function and a trainable reward function based on Adversarial Intrinsic Motivation within the diffusion model. Furthermore, it promotes exploration through the inherent noising and denoising mechanism present in the diffusion models and is environment-agnostic. This combination allows for the generation of challenging yet achievable goals, enabling agents to learn effectively without relying on domain knowledge. We demonstrate the effectiveness of DiCuRL in three different maze environments and two robotic manipulation tasks simulated in MuJoCo, where it outperforms or matches nine state-of-the-art CRL algorithms from the literature.
Keywords
curriculum reinforcement learning · reinforcement learning · diffusion models

Reviews and Discussion

Review (Rating: 6)

The paper presents an intuitive way to apply curriculum learning, using diffusion-based models to learn a goal distribution that interpolates between the state-visitation distribution and states with high value and high intrinsic reward. As a result, the curriculum generates goals that lie at the edge of the states with non-zero occupancy and have higher value/closeness to the target goal.

The technical details are mostly complete and seem sound upon initial reading; I did not delve into the proofs/derivations in the appendix. But the exposition of how we go from diffusion models to AIM, to visitation-count modelling, and to the newly proposed DiCuRL method is mostly clear.

Although there are multiple points of improvement, I think many practitioners will appreciate the authors' work.

Strengths

The technical details are quite clear upon initial reading, even if one is not familiar with AIM or diffusion models. I.e., the new method should be clear enough to reproduce from reading the paper.

Weaknesses

Major comments

  • The introduction lumps too much related work together, making it hard to find the actual point and criticism that the authors want to make about the current state of the field.
  • Before reading the background/technical details, motivation 1) for DiCuRL is unclear: how is noising/denoising uniquely helpful for exploration? Why can't another method (like a VAE or a GAN) do this by modelling the state-visitation distribution? Aren't we just choosing another, perhaps more powerful, generative method? After reading the full paper: I disagree that this is a sound motivation; any stochastic method over the state-visitation distribution could achieve this. I agree that modelling the state-visitation distribution is useful, as it allows learning of goals that the agent has seen and can reach.
  • 4.0, Line 222: it is not clear from the text what problem the authors are trying to solve through the graph construction and the optimization of the curriculum goal (Eq. 12). How is the 'optimal' curriculum goal even defined? Eq. 12 of course shows the objective, but why do we need this? How is the graph even constructed (meaning the edges); is it fully connected? Initial reading of this paragraph gives the impression of severe over-engineering of the goal sampler.
  • Figure 1 overlaps with Table 1 and contains too many overlapping lines to draw a conclusion. This must be improved for presentation. Reduce the number of unnecessary baselines and show these in the appendix.
  • The results section spends most of its time speculating why the baselines perform in a certain way but does not focus on the authors' method. Line 281 states that there is a difference between OUTPACE and DiCuRL; however, neither method statistically significantly outperforms the other. Too much of the experimental setup is moved to the appendix.
  • It is unclear from Figure 3 at what point during training this plot was made. As it stands, the baseline methods look arbitrarily bad compared to the authors' method. It is color-coded, but maybe add a colorbar to Figure 3 indicating the training episodes.

Technical comments

  • 3.3: Slight confusion on the reward $r^\pi_\phi$; it's good to mention that you're actually learning $f(s)$ and using this to compute $r$.
  • 4.0: The explanation of the mixing parameter $\bar{\alpha}_k$ is omitted. Briefly state it in the main text.
  • 4.0: The definition of $g_d$ is too hidden. I infer from Alg. 2 that this is supposed to represent the true goal distribution.
  • Results, Figure 1, Table 2: Why plot the standard deviations? Why not a non-parametric tolerance interval to get a sense of spread, or plot a confidence interval for the expected success rate?

Minor comments

  • Intro paragraph 1 should be split into separate paragraphs making distinct points, not a lump sum of information.
  • Intro paragraph 1, maybe make a distinction between hierarchical RL + curriculum RL for goal-generation. Even if HRL can implicitly generate curriculums, the motivation is often slightly different.
  • Direct reference to papers should be done with the author: 'Person et al., (year) showed ...', not '[1, 2] showed ...'. Or you could write, 'Other studies [1, 2, 3], investigated ...' or something similar.
  • Intro paragraph 2 is not a paragraph but 1 sentence.
  • Figure 3: since DiCuRL is mostly on par with OUTPACE, OUTPACE should also be included in the plot comparing curriculum goals.

Questions

  1. If the authors revise the current version and improve upon (most of) my critiques, I'd be willing to raise my score.
  2. Will the authors share code?

Limitations

The authors shortly discuss the limitations of their method, which I mostly agree with.

Author Response

W1: The introduction dumps too much related work together [...]

We will restructure the introduction to separate and clarify the related work.

W2: [...] Why can't another method (like a VAE, or GAN) do this through modelling the state-visitation distribution? [...]

We acknowledge that other methods could effectively model the state distribution. However, we are cautious not to speculate on which method would yield the most successful output without further comparative analysis, which may be an interesting line of future research.

We believe that the noising/denoising mechanism of the diffusion model is particularly beneficial for the following reasons:

  • Denoising (lines 5-7 in Alg. 1): Gaussian noise is incrementally reduced by the neural network according to a specific variance schedule. This process is inherently imperfect: due to the neural network's limitations in precisely matching the sampled noise, a small degree of randomness remains. We believe that this residual noise introduces a subtle variability in the curriculum goals, which aids exploration by generating slightly varied goals.
  • Noising (lines 8-9 in Alg. 1): Original data sampled from a distribution are intentionally corrupted with Gaussian noise based on the sampled timestep k. The “noised” data are then processed by the neural network, and the loss is calculated (see Eq. 10 of the paper). A minimal sketch of both steps is given below.

W3: [...] it is not clear from the text what problem the authors are trying to solve through the graph construction and the optimization of the curriculum goal [...]

We employ two different strategies for generating curriculum goals:

  • Bipartite Graph Optimization Strategy: In this strategy, as outlined in the paper and also utilized in baseline methods like HGG and OUTPACE, we sample a mini-batch $b$ from the replay buffer $B$. This batch contains many states from different timesteps, which we provide to the curriculum goal generator, i.e., the diffusion model. The diffusion model generates a distribution of curriculum goal candidates $G_c$, from which we select the optimal curriculum goal $g_c$ using bipartite graph optimization. The matching on this graph maximizes the cost function given in Eq. 12.

    • Graph Construction: The vertices of the bipartite graph can be divided into two disjoint sets $V_x$ and $V_y$, where $V_x$ consists of the generated curriculum goals and $V_y$ includes the desired goals sampled from the desired goal distribution. The graph is not fully connected: not every pair of distinct vertices is connected by an edge; rather, each vertex in $V_x$ is only connected to vertices in $V_y$.
    • Purpose of Eq. 12: The objective here is to select K curriculum goals that are most diverse and beneficial for training, ensuring that the goals selected cover a broad spectrum of possible scenarios. We then randomly sample a single $g_c$ from these K goals for each episode. (A simplified sketch of this selection is given after this list.)
  • Single State Strategy (see Sec. B of the supp. mat.): This method involves feeding only the state from the last timestep into the diffusion model, which then generates a single curriculum goal. The rationale behind using the last timestep is the assumption that the final state is closer to the desired goal. However, this does not always hold true as agents might not always progress linearly towards the goal, sometimes moving backwards or sideways. In other words, if the last state does not accurately represent progress towards the desired goal, the generated curriculum goals might not optimally guide the agent, potentially leading to decreased sample efficiency and slower learning rates. This effect is demonstrated in Fig. 4 in Section B of our supp. mat.

W4: Fig. 1 overlaps with table 1 and contains too many overlapping lines to draw a conclusion [...]

We will revise Fig. 1 by reducing the number of baseline methods shown and displaying only the baselines that match our proposed method. Additionally, we will relocate some of the comparisons to the supp. mat.

W5: [...] states that there is a difference between OUTPACE and DiCuRL, however, neither method statistically significantly outperforms the other [...]

We acknowledge that our analysis might have focused excessively on the baselines’ performances. We will revise the discussion accordingly and move part of the discussion from the Appendix to the main text (given the possibility of including an extra page in case of acceptance).

Concerning the comparison between OUTPACE and DiCuRL, we have conducted a statistical analysis using the Wilcoxon rank-sum test to compare the no. of timesteps needed by the two methods to achieve a success rate greater than 0.99 across five different seeds for training. Here are the detailed test results for three specific environments:

  • PointSpiralMaze: p=0.04
  • PointNMaze: p=0.44
  • PointUMaze: p=0.04

For PointSpiralMaze and PointUMaze, there is statistically significant evidence to reject the null hypothesis (p<0.05), suggesting that DiCuRL statistically outperforms OUTPACE in these environments. Conversely, for PointNMaze, this is not the case. We note however that with 5 samples this analysis may be limited.

W6: It is unclear from Fig. 3 at what point during training this plot was made [...]

We will add a color bar to Fig. 3 to indicate the timestep corresponding to each different color of the curriculum goals. You can refer to the new plot layout in Fig. 15 of the attached PDF, which shows the intermediate goals for OUTPACE.

Technical and minor comments

Due to space limitations, we cannot provide detailed replies. However, we acknowledge these comments and we'll do our best to handle them in the final version of the paper.

Q1: Could the authors revise the current version and improve upon (most) of my critiques, then I'd be willing to raise my score.

We did (and will do) our best to address all the comments.

Q2: Will the authors share code?

Of course! Please see the link reported at L264.

Comment

Thank you for replying to most of my comments. The additional plots are highly appreciated.

Not all of my concerns are fully resolved yet though. Could the authors still comment on major concerns below?


Major concerns

W3

Your answer clarifies some of my confusions around the goal optimization, and the results in appendix B also support the motivation. But this still remains a rough edge of the paper.

It was not clear from the main text that the baselines also utilize this method; this is important and must be mentioned and cited.

Remaining questions:

  • The objective of Eq. 12 is a sum of distances to an average goal. Could the authors comment on what $\bar{g}$ means, and whether this is a pitfall of the method? For example, in Figure 2, if one goal $g_1$ is on the upper side of the task and $g_2$ at the bottom, would $\bar{g}$ be inside a wall?
  • How does this objective transfer to tasks where an $l_2$ distance over states doesn't work (e.g., Atari)?
  • Can the term inside the square root of Eq. 12 become negative?
  • Is there prior work on other methods? Couldn't we, e.g., use random network distillation to score $w$ based on approximate uncertainty or state occupancy?
  • After line 12 in Algorithm 2, where is $g_c$ used? Is this the set in Line 16?
  • Line 12, Algorithm 2: shouldn't $g_c$ be a set $\mathcal{G}_c$?

Remaining minor remarks

W2

I see what you mean now on the advantage of diffusion methods for goal generation over e.g., GANs. I think the paper would greatly benefit from this discussion and the points you raise, including a short comment on how we observe this effect in figure 2.

W4

Thank you for addressing this, I saw that figure 4 in the appendix B also seems to have the plot-title cropped.

W5

I believe that doing statistical tests on the final performance of the tested methods slightly misses the point of my comment. My issue was that a majority of the discussion in Section 5 reads as speculative, whereas the results did not show a very strong difference.

Also, the use of p-values and statistical tests is not recommended in RL, since a) the sample sizes are always too small, b) there are too many confounders for the algorithms (implementation parity, hyperparameter settings, or even the hardware), and c) results are often biased in favor of the designer. See also: Patterson et al., 2023 https://arxiv.org/abs/2304.01315.

Comment

Thanks a lot for engaging in this discussion. We really appreciate your comments.

W3 [...] this is important and must be mentioned and cited.

We will clarify in the revised manuscript that the baselines (OUTPACE and HGG) also utilize the Minimum Cost Maximum Flow algorithm [a].

[a] Ahuja, R. K., et al. (1993) Network Flows: Theory, Algorithms, and Applications

  • [...] Could the authors comment on…
    $\bar{g}$ represents the mean of the generated curriculum goals $\mathcal{G}_c$. It is true that $\bar{g}$ could potentially fall inside a wall, as our algorithm does not specifically prevent this: we assume that our algorithm is environment-agnostic. However, we mitigate this issue through the use of the diffusion loss (Eq. 10) in our total loss formulation (Eq. 11). This approach ensures that our generated curriculum goals are representative of the state-visitation distribution. Since the agent may collide with maze walls (it cannot pass through them) during exploration, the state-visitation distribution typically does not include locations inside the walls. Consequently, the goals selected by minimizing the objective function are usually outside the walls.
    Despite these measures, the algorithm is not infallible. As illustrated in Figures 3a, 5a, and 6a, a small number of goals may still be generated inside the walls (in fact, the same problem also occurs in OUTPACE).

  • [...] How does this objective transfer to tasks [...]
    Indeed, the current objective function may not be suitable for tasks where a distance over states is not applicable, such as in Atari games. However, as detailed in Section B of our supp. mat., we have demonstrated that our algorithm is also effective without relying on this specific objective function.

  • [...] Can the term inside the square root of Eq. 12 become negative?
    In our implementation, we calculate the square of the difference, $(g_i - \bar{g})^2$, to ensure non-negativity (see L200-201 in our repo https://anonymous.4open.science/r/diffusioncurriculum/hgg/hgg.py). However, it appears we inadvertently omitted the exponent in the paper. We will correct this to accurately reflect the calculation as $(g_i - \bar{g})^2$ in Eq. 12.

  • [...] Is there prior work on other methods? [...]
    Indeed, Eq. 12 can be designed more effectively and represents a promising area for future research. As you suggested, integrating methods based on uncertainty or entropy is a viable approach. For instance, the baseline method OUTPACE effectively employs curriculum goal generation that targets states of uncertainty, as demonstrated in Figure 2 of the OUTPACE paper. We are open to exploring similar enhancements to our model in future work, incorporating such uncertainty- or state-occupancy-based scoring mechanisms to further refine its capabilities.

  • [...] After line 12 in Algorithm 2, where is $g_c$ used? [...]
    In our algorithm, $g_c$ is utilized in several key parts: it serves as input to the policy network (line 7) and the Q network (line 13), and it is also used in the computation of the AIM reward (line 14). Once $g_c$ is determined, it defines the goal that the agent aims to achieve. At the end of the episodes, we assess the algorithm's success rate by sampling a desired goal from the goal distribution $\mathcal{G}$ and observing how many of the test rollouts successfully reach the given desired goal.

  • [...] Line 12, Algorithm 2: shouldn't $g_c$ be a set $\mathcal{G}_c$?
    Indeed, at the end of the process, we select a single curriculum goal, referred to as $g_c$. More specifically, given the set $\mathcal{G}_c$ in Eq. 12, we select K curriculum goals that minimize the equation. We then randomly sample a single $g_c$ from these K goals for each episode. We will revise Line 12 in Algorithm 2 to more clearly reflect this process and ensure that the line is accurate and unambiguous.

W2 [...] I think the paper would greatly benefit [...]

We will certainly add this to the final version of the paper.

W4 [...] figure 4 in the appendix B [...]

We chose to crop the plot titles and included the necessary information in the captions under each subplot for clarity. We will adjust the layout in the final version of the paper.

W5 I believe that doing statistical tests [...]

Thanks a lot for pointing us to this reference. Honestly we were also a bit hesitant in showing the statistical analysis because of the small sample size, as we admitted in the previous reply. That’s why in the paper we reported the results in terms of curves and descriptive statistics on the number of timesteps needed to reach a given success rate, as also done in other papers in the field.

Concerning the way we discussed the results in the results section, we will do our best to revise the text to make it read less speculative, highlighting the similar performance of the methods whenever appropriate.

Comment

The authors' response to my additional questions clears up most of my confusion around the graph-optimization part of their goal generation.

I've raised my score from 5 to 6.

Comment

Thank you!

Review (Rating: 5)

This paper studies curriculum reinforcement learning (RL) in the context of multi-goal RL, which aims to generate a series of goals with increasing difficulty to facilitate policy learning. To this end, the paper proposes a framework that employs a conditional diffusion model that learns to generate a goal conditioned on the current state. The experiments in three maze navigation tasks show that the proposed method can reliably solve the tasks and performs comparably to existing methods. This work studies a meaningful problem and proposes a reasonable framework. Yet, I am concerned with the limited domain (navigation) and tasks (maze) used for evaluation, the significance of the results, and the limited applicability beyond multi-goal RL, etc. Therefore, I am slightly leaning toward rejecting this paper, but I am willing to adjust my score if the rebuttal addresses my concerns.

Strengths

Motivation and intuition

  • The motivation for studying curriculum learning for multi-goal RL is convincing.
  • Leveraging diffusion models to generate goals is reasonable.

Clarity

  • The overall writing is clear. The authors utilize figures well to illustrate the ideas.

Related work

  • The authors provide comprehensive descriptions of existing works in curriculum RL.

Experimental results

  • The experimental results show that the proposed method performs comparably to existing methods.

Reproducibility

  • The code is provided, which helps understand the details of the proposed framework.

Weaknesses

Clarity

  • The first paragraph of the introduction is unnecessarily long, making it very difficult to follow.
  • While the related work section describes several existing works in detail, it fails to clearly differentiate these works from the proposed method.

Limited to goal-conditioned RL

  • The proposed method is limited to multi-goal RL, which requires a given goal. However, in many real-world applications, specifying a goal could be difficult or even impossible, making the proposed method inapplicable. I feel it is entirely possible to extend the proposed method to the general RL setup, where only the current state is given. This would greatly increase the applicability of the proposed method.

Evaluation is limited to the Maze navigation

  • The proposed method was only compared to existing methods in the Maze navigation tasks, where goals are represented as coordinates. It would be a lot more convincing if the evaluation was also conducted in other domains, such as robot arm manipulation, locomotion, and games. Additionally, evaluating in grid-world navigation tasks can add value to the paper by exploring discrete state and action spaces.

Significance of the results

  • According to Figure 1, I am not entirely convinced that the proposed method performs significantly better than the baselines. Also, the plotting scheme makes it difficult to interpret when many curves overlap.

Related work

  • The related work section focuses on existing works in curriculum RL yet fails to discuss many works that use diffusion models for RL or imitation learning, including but not limited to
    • "Learning Universal Policies via Text-Guided Video Generation"
    • "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion"
    • "Learning to Act from Actionless Video through Dense Correspondences"
    • "Goal-conditioned imitation learning using score-based diffusion policies"
    • "Diffusion model-augmented behavioral cloning"
    • "Imitating human behaviour with diffusion models"

Algorithm 2

  • While Algorithm 2 is titled RL Training, Lines 15-21 are for evaluation/testing, which is a bit confusing.

Minor errors

  • L282: It seems that a non-break newline is used here, which gives no space between this paragraph and the next paragraph starting from Line 283.

Questions

See above

Limitations

Yes

Author Response

W1: The first paragraph of the introduction is unnecessarily long [...]

We will restructure the introduction to ensure it is more concise and better organized, which we believe will significantly improve the clarity and readability of the paper.

W2: While the related work section describes several existing works in detail, it fails to differentiate these works from the proposed method exactly.

We address this comment under W6.

W3: [...] I feel it is entirely possible to extend the proposed method to the general RL setup, where only the current state is given. This will greatly increase the applicability of the proposed method.

We recognize that the need to specify a goal can restrict the applicability of our approach in real-world scenarios where such goals are not readily available or definable.

Indeed, extending our method to a general RL setup, where the system operates solely based on the current state without explicit goal definitions, is a valuable direction for future research. Inspired by your suggestion and valuable references, we are considering an approach similar to that described in [a]: in this approach, a desired goal is defined using text encoding, and then a planner generates a series of future frames that illustrate the actions. Control actions are then derived from this generated video, enabling the agent to navigate without the need for specifically predefined goals. Additionally, we could adapt our policy (possibly our Q function using diffusion models), as demonstrated in [b]. Alternatively, we might explore learning directly from the environment using techniques such as RGB video, as proposed in [c], or through point clouds.

W4: [...] It would be a lot more convincing if the evaluation was also conducted in other domains, such as robot arm manipulation, locomotion, and games [...]

Given the time constraints, we have carried out additional experiments on two robot manipulation tasks, although it would be entirely possible to apply our method to other domains, such as locomotion and games, as well as to discrete state and action spaces. Please see our General comment and its attached PDF (Fig. 16). These experiments hopefully demonstrate the applicability of our method to tasks that are quite different from maze navigation.

W5: According to Fig. 1, I am not entirely convinced that the proposed method performs significantly better than the baselines. Also, the plotting scheme makes it difficult to interpret when many curves overlap.

We will provide clearer plots in our supp. mat. by creating different subplots and plotting together groups of 2 to 3 baseline methods. We hope that comparing only 2 or 3 baseline methods in each subplot will make the comparison clearer and easier to understand.

Concerning the comparison between OUTPACE and DiCuRL, we have conducted a statistical analysis using the Wilcoxon rank-sum test to compare the no. of timesteps needed by the two methods to achieve a success rate greater than 0.99 across five different seeds for training. Here are the detailed test results for three specific environments:

  • PointSpiralMaze: p=0.04
  • PointNMaze: p=0.44
  • PointUMaze: p=0.04

For PointSpiralMaze and PointUMaze, there is statistically significant evidence to reject the null hypothesis (p<0.05), suggesting that DiCuRL statistically outperforms OUTPACE in these environments. Conversely, for PointNMaze, this is not the case. We note however that with 5 samples this analysis may be limited.

W6: The related work section focuses on existing works in curriculum RL yet fails to discuss many works that use diffusion models for RL or imitation learning, including but not limited to [...]

We acknowledge our oversight and appreciate your detailed suggestions. We will thoroughly review these studies and will update the related work section to discuss them extensively, highlighting how they relate to our research.

W7: While Algorithm 2 is titled RL Training, Lines 15-21 are for evaluation/testing, which is a bit confusing.

Thank you for pointing this out. We will update our algorithm title to "Algorithm 2: RL Training and Testing”.

W8: L282: It seems that a non-break newline is used here, which gives no space between this paragraph and the next paragraph starting from Line 283.

Thank you for noticing this layout issue. We will fix it in the revised version of the paper.

[a] Du, Y., et al. (2024), Learning universal policies via text-guided video generation

[b] Chi, C., et al. (2023), Diffusion policy: Visuomotor policy learning via action diffusion

[c] Ko, P-C, et al. (2023), Learning to act from actionless videos through dense correspondences

Comment

Thank you for the rebuttal with additional experiments, which address some of my concerns, including fixing Algorithm 2, and the significance of the results.

Clarity & minor errors: Given that NeurIPS does not allow updating submissions during the rebuttal period, it is difficult for me to assume that the authors would completely fix the issues. I would still say the paper in its current form is not ready for publication.

Limited to goal-conditioned RL: I appreciate the discussion of how the proposed method can apply beyond the goal-conditioned RL scenario. However, without seeing the results, it is not convincing.

Related work: While the authors made a promise to discuss how their work differs from the references I provided, I am unsure how this will go without seeing the actual revised paper, which, again, unfortunately, is not possible during the NeurIPS rebuttal.

Evaluation is limited to the Maze navigation: I really appreciate the additional results of FetchPush and FetchPickAndPlace. However, I still believe the evaluation is limited. As I suggested, including navigation tasks with discrete state and action spaces, locomotion tasks, and image-based games would significantly strengthen the contributions of this work.

I have mixed feelings about this work. On the one hand, I like the idea of leveraging diffusion models to generate goals for multi-goal RL; on the other hand, I hope to see an improved version of this work with clear writing, detailed discussion of related works, and evaluations in diverse domains and beyond goal-conditioned RL, which could present significant contributions to the community. Hence, I recommend rejecting this paper in its current form, and I encourage the authors to improve this submission and give it another shot if it eventually gets rejected. That said, I won't fight to reject this paper if my fellow reviewers decide to champion it.

Comment

Thank you for the rebuttal with additional experiments, which address some of my concerns, including fixing Algorithm 2, and the significance of the results.

Thanks for acknowledging our rebuttal and for pointing out the significance of the results.

Clarity & minor errors: Given that NeurIPS does not allow updating submissions during the rebuttal period, it is difficult for me to assume that the authors would completely fix the issues. I would still say the paper in its current form is not ready for publication.

As far as we can see, the point of the NeurIPS rebuttal process is to engage in a scientific discussion with peers to receive suggestions on how papers can be improved to meet the quality standards of the conference. From the conference website: “Authors may not submit revisions of their paper or supplemental material, but may post their responses as a discussion in OpenReview. This is to reduce the burden on authors to have to revise their paper in a rush during the short rebuttal period.” The entire spirit of the rebuttal process is based on the idea that authors can revise an accepted paper right after the rebuttal period (not in a rush). After all, if only papers that are already good as they stand were to be accepted, what would be the point of the rebuttal phase?

Given this, as said we honestly promised that we will fix those issues, and we will do it. It would be against our professional ethics not to do that.

Limited to goal-conditioned RL: I appreciate the discussion of how the proposed method can apply beyond the goal-conditioned RL scenario. However, without seeing the results, it is not convincing.

We limited our experimentation to goal-conditioned scenarios to align our experimental setup with that of our baselines, and compare against state-of-the-art results. Testing our method on tasks beyond goal-conditioned scenarios has never been in the scope of this paper, but as we said this can certainly be an interesting direction for future research for the whole community.

Related work: While the authors made a promise to discuss how their work differs from the references I provided, I am unsure how this will go without seeing the actual revised paper, which, again, unfortunately, is not possible during the NeurIPS rebuttal.

We are a bit puzzled also by this comment, to be honest. We, as authors, are obviously devoted to maintaining a highly ethical scientific conduct, as we have always done. As said in our previous reply, we will analyze these works and include them in our discussion in the revised version of the paper (as a matter of fact, we have started working on it, please see below our reply on the evaluation).

Evaluation is limited to the Maze navigation: I really appreciate the additional results of FetchPush and FetchPickAndPlace. However, I still believe the evaluation is limited. As I suggested, including navigation tasks with discrete state and action spaces, locomotion tasks, and image-based games would significantly strengthen the contributions of this work.

We politely disagree on this. Our evaluation setup is actually well aligned with most of the works from the state of the art in the field, including the papers introducing our baselines and the papers suggested by the reviewer in the previous comment.

Considering the papers introducing the baselines we used in our experiments:

  • ACL [18] → 3 tasks (synthetic language modelling on text generated by n-gram models, repeat copy, and the bAbI tasks)
  • GoalGAN [14] → only Ant Maze tasks
  • HGG [19] → 4 robot manipulation and 4 hand manipulation tasks
  • ALP-GMM [17] → BipedalWalker
  • VDS [16] → 4 Robot manipulation tasks, 9 hand manipulation tasks , 3 maze tasks
  • CURROT [23] → Bipedal Walker and Maze tasks
  • GRADIENT [13] → FetchPush and Maze tasks
  • OUTPACE [12] → 3 maze with points agent, 1 maze with Ant, 2 robotic tasks

Concerning the papers suggested by the reviewer:

  • Learning Universal Policies via Text-Guided Video Generation → combinatorial robot planning tasks (real robotic system as well)
  • Diffusion Policy: Visuomotor Policy Learning via Action Diffusion → 5 Robotic manipulation tasks (real robotic system as well)
  • Learning to Act from Actionless Video through Dense Correspondences → 6 different robot manipulations tasks + 4 different tasks + 3 real world tasks
  • Goal-conditioned imitation learning using score-based diffusion policies → 3 different robotic tasks (no real world task)
  • Diffusion Model-Augmented Behavioral Cloning → 1 Maze, 1 FetchPick, 1 Hand, 1 Cheetah, 1 Walker, 1 AntReach
  • Imitating human behaviour with diffusion models → CLAW environment, Kitchen environments, CSGO

I have mixed feelings about this work [...]

We sincerely appreciate your engagement in the rebuttal, although of course we would hope for a score improvement. We are aware that the paper can be improved and as said we will do our best to do that if the paper were to be accepted.

Comment

Thank you for the further clarification.

Clarity: The clarity of this paper is not ready for publication, and I cannot simply count on the author's promise and accept this paper. To be clear, I do trust that the authors will do their best to improve the clarity; however, I am still unsure if the revised paper will be ready. I believe this paper needs significant reorganization and revision, not just fixing a few sentences, so I have to see it to believe it.

Related work: I understand that the authors promised to discuss the relevant works I provided. However, it's not just about discussing them. What really matters is how the authors discuss them and differentiate their work from these works. Again, without seeing the actual writing, I just cannot say my concern is addressed.

Limited evaluation: In short, listing existing methods that do not sufficiently evaluate their methods, at least in my opinion, does not alleviate my concern. I am not asking the authors to evaluate their method in complex real-world tasks. Setting up and evaluating their method in the simulated tasks in different domains should be totally reasonable.

In sum, I stand by my evaluation and recommend rejecting this paper in its current form.

In my opinion, the rebuttal is for clarifying what reviewers misunderstand, not for the authors to make promises and urge reviewers to take them.

Comment

Dear Reviewer,

Thanks again for the engaging discussion. Please note that we have thoroughly revised the Introduction and the Related Work sections (see the comments we posted under the General comment section).

Specifically, in the Introduction we clearly highlighted the novel aspects of our algorithm (see the Contributions paragraph), and reorganized the overall text to make it more effective (splitting the first paragraph in a more coherent way). In the Related Work section, according to your suggestions, we included the references you kindly provided and added a brief paragraph highlighting the limitations of current works and the distinctive elements of our paper.

We do hope that the revised versions of Section 1 and Section 2 address your concerns regarding Clarity and Related work. We would really appreciate it if you could acknowledge this reply.

Best regards,

The authors

Comment

Thank you for the revised introduction and the related work, which are easier to follow while containing sufficient details. I will increase my score to 5.

Comment

Thank you!

Review (Rating: 7)

This work presents a novel diffusion model-based curriculum learning approach, called DiCURL, for multi-goal reinforcement learning, namely goal-conditioned RL. The proposed conditional diffusion model leverages a Q-function and a learned reward function based on the Adversarial Intrinsic Motivation principle to incentivize goals that are reachable yet challenging to an RL agent. The paper evaluates DiCURL against state-of-the-art curriculum learning approaches in maze environments with differing maps. In PointUMaze and PointNMaze, DiCURL matches or slightly outperforms OUTPACE, which seems to be the best-performing method in these maze environments. In the most challenging map, PointSpiralMaze, DiCURL outperforms OUTPACE, while the rest of the methods fail to yield an optimal policy at the end of the training.

Strengths

  • The related work section is extensive in terms of content and covers most of the recent advances in automatic curriculum learning for RL. The background and methodology sections are also detailed, and the problem setting and the proposed approach are explained clearly.

  • The proposed curriculum learning approach is novel as it employs a conditional diffusion model. The idea of leveraging a Q-function and a learned intrinsic reward function to select achievable but challenging goals is intuitive, as well.

  • Table 1 highlights the advantages of DiCURL, and the introduction section also supports this table.

  • The curricula generated by DiCURL in Figures 2 and 3 (as well as the ones in the appendix) illustrate how DiCuRL yields optimal policies and outperforms some existing methods in evaluated environments.

Weaknesses

  • The introduction section should be improved in terms of writing. Content-wise, it is informative but also too dense. Some of the paragraphs are either too long or too short. Restructuring this section and making it more to the point would improve the readers' experience immensely.

  • OUTPACE is the second best-performing automatic curriculum learning method in the evaluated environments. However, the paper does not demonstrate the curricula generated by OUTPACE, unlike the curricula of GRADIENT and HGG in Figure 3, which do not perform as well.

  • All environments (point maze domain in MuJoCo with different maps) in the empirical validation section have the same dynamics, low-dimensional state, and action spaces. Although DiCuRL's advantages seem apparent as the map gets more complex, the empirical validation is insufficient to conclude that DiCuRL can outperform state-of-the-art methods in various goal-conditioned domains.

  • The roles of loss components related to the Q-function and AIM reward function sound intuitive, yet they are explained briefly. I suggest the authors run an ablation study to highlight their separate contributions.

Questions

  • How do Q and AIM rewards differ in a goal-conditioned environment that provides a (sparse) reward for reaching the goal? Could you please give me an illustrative example to highlight how including both in the loss function of the diffusion model is better?

  • What is g_d that initializes g_c in Algorithm 2?

  • What do colors in figures illustrating curricula stand for specifically?

Limitations

I don't see any explicit limitations regarding the proposed approach and the problem setting of interest other than those discussed by the authors in the conclusion section.

Author Response

W1: The introduction section should be improved in terms of writing [...]

We agree with the reviewer's suggestions: we will restructure the introduction to ensure it is more concise and better organized.

W2: [...], the paper does not demonstrate the curricula generated by OUTPACE [...]

In the attached PDF, we show the generated curriculum goals of OUTPACE for all maze environments (Fig. 15). Additionally, we have added a color bar to indicate which colors of the curriculum goals correspond to which timesteps. These figures will be included in the final version of the paper. Furthermore, we will update Fig. 3 in the paper to include the color bars.

W3: [...] the empirical validation is insufficient to conclude that DiCuRL can outperform state-of-the-art [...]

We have additionally tested our approach on two robotic manipulation tasks. Please refer to our General comment and the attached PDF (Fig. 16) for more details.

W4: The roles of loss components related to the Q-function and AIM reward [... ]

We have conducted an ablation study to explore the individual contributions of the Q function and the AIM reward when integrated with the diffusion loss $L_d$, as outlined in Eq. 11 in the paper. The success rate results from this study are presented in Fig. 13a in the attached PDF.

Concerning the roles of the Q function and AIM reward, here’s our intuitive explanation:

  • Loss $L_d$: Minimizing this component helps us accurately capture the state distribution. It ensures that our generated curriculum goals are representative of the state-visitation distribution.

  • Q-function and AIM reward: The Q-function predicts the cumulative reward starting from a state, following the policy, while the AIM reward estimates how close an agent is to achieving its goal. We integrate both terms in the loss function by inverting their signs because our objective is to maximize Q and the AIM reward. By doing this, the diffusion model focuses on generating goals that not only minimize $L_d$ but also maximize the expected Q and AIM rewards.

This approach ensures that our generated curriculum goals are neither overly simplistic nor excessively challenging and progress towards the desired goal.

We have provided a detailed analysis with various visualizations for the PointUMaze environment in Sec. D in our supp. mat. If further details are required, we can provide a similar analysis for the PointSpiralMaze environment as well.

Q1: How do Q and AIM rewards differ in a goal-conditioned environment [...]

To demonstrate the effects of using either the AIM reward or the $Q_\phi$ function in conjunction with the $L_d$ component in Eq. 11 in the paper, we have provided illustrative examples in the attached PDF. In particular, Fig. 13c and 13d display the generated curriculum goals using $L_d$ with only $Q_\phi$ and $L_d$ with only the AIM reward, respectively. The generated curriculum goals reveal that omitting the AIM reward results in suboptimal performance, whereas the absence of the $Q_\phi$ function leads to the agent's inability to accomplish the task.

Additionally, we have implemented our method within a sparse reward, goal-conditioned RL framework across two different robotic manipulation tasks. We compared our method with HGG [a] and HER [b], as detailed in Fig. 16 in the attached PDF.

In this setting, the curriculum goals are integrated into our policy and the Q network (see line 7 in Algorithm 2 for the policy and line 10 in Algorithm 1 for the Q function). We utilize the AIM reward both for training the RL algorithms and for generating curriculum goals. This reward function, also a feature of OUTPACE, is a trainable neural network parametrized by $\varphi$ that is initialized randomly and trained simultaneously with the RL algorithms. It is specifically trained to minimize the Wasserstein distance between the state-visitation and the desired goal distribution, as detailed in Section 3.3.

The distinction between training RL algorithms using the sparse (binary) reward and the AIM reward is significant. For instance, consider an agent at position (x, y) with a goal at (m, n). In a sparse reward setting, such as that used in the HER methodology, the reward is calculated based on the Euclidean distance between the agent's position and the goal. If this distance is greater than a predefined threshold (set to 0.05 m in Gymnasium-Robotics), the agent receives a reward of 0; otherwise, it receives a reward of 1. In other words, if the agent reaches a given goal within the threshold distance, it receives a positive reward, and otherwise a non-positive one. In contrast, both our method and OUTPACE utilize a neural network-based reward—the AIM reward—which provides a continuous reward value. To illustrate this concept, we have included a 2D color map of the AIM reward function for the PointUMaze environment, showing how the reward evolves during different training episodes. These visualizations can be found in Fig. 8a, 10a, and 12a of our supp. mat. Additionally, the changes in the AIM reward for the PointSpiralMaze environment are illustrated in Fig. 13b in the attached PDF.

Q2: What is g_d that initializes g_c in Algorithm 2?

$g_d$ is the desired goal, and we initialize $g_c$ with $g_d$. At the beginning of training, our diffusion algorithm has not yet been run, so we initially provide the agent with $g_d$. Then, our curriculum goal generation algorithm generates curriculum goals for the agent.

Q3: What do colors in figures illustrating curricula stand for specifically?

We will add a color bar to Fig. 3 to illustrate the corresponding timesteps for each different color of the curriculum goals. You can refer to the new plot in Fig. 15 of the attached PDF, which shows the intermediate goals for OUTPACE.

[a] Andrychowicz, M., et al. (2017), Hindsight experience replay

[b] Ren, Z., et al. (2019), Exploration via hindsight goal generation

Comment

Thank you for responding to all of my comments. My concerns are addressed to a large extent. I believe the new results and ablation studies provided in the global response improve the validity of the proposed approach. Of course, the rest of the baselines should be evaluated in the new environments for the final version.

This paper showcases a nice implementation of diffusion for curriculum learning, an important novelty. Although the results do not clearly demonstrate a superiority over the existing state-of-the-art, and the introduction section is not ready for the final version, I will raise my score from 5 to 6.

If the authors pinpoint the changes in the introduction section, I will raise my score again.

Comment

Dear FHis,

Please find the revised version of the Introduction under the General comment section. We hope that the intro is now more effective and easy to read (we do believe it has improved indeed). Any further feedback is welcome of course.

Best regards,

The authors

Comment

I appreciate the effort put into improving the introduction section. I'm raising my score to 7.

Comment

Thanks a lot! We really appreciate that.

Review (Rating: 6)

This work introduces DiCuRL, a novel approach that uses diffusion models to generate curriculum goals for reinforcement learning agents. The method trains a model to capture the distribution of visited states, focusing on those with higher Q-values and intrinsic motivation rewards (i.e., AIM rewards). This approach aims to generate goals at an appropriate difficulty level while guiding the curriculum closer to the desired final goal. DiCuRL employs the Minimum Cost Maximum Flow algorithm to solve a bipartite matching problem to select curriculum goals.

Strengths

  • Strong empirical evaluation against competitors (Fig. 1)
  • The paper is information-dense but reasonably well-written. It helps with the comprehension of the proposed ideas

Weaknesses

  • The approach is quite complicated and possibly unnecessarily so. I'd like to emphasize that I did not find any faults with the proposed method. It's just that I do not see how it will scale to more challenging, realistic environments.
  • They missed citing a rich literature on exploration and curriculum RL. For example, see papers [1-5].
  • The reward function for the Maze envs is not provided. Is this a dense- or sparse-reward env? Note that a dense reward would not be a justifiable choice in this case.

References

  1. Riedmiller, M., Hafner, R., Lampe, T., Neunert, M., Degrave, J., Wiele, T., Mnih, V., Heess, N., and Springenberg, J. T. (2018). Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR.
  2. Hertweck, T., Riedmiller, M., Bloesch, M., Springenberg, J. T., Siegel, N., Wulfmeier, M., Hafner, R., and Heess, N. (2020). Simple sensor intentions for exploration. arXiv preprint arXiv:2005.07541.
  3. Nair, A. V., Pong, V., Dalal, M., Bahl, S., Lin, S., and Levine, S. (2018). Visual reinforcement learning with imagined goals. Advances in neural information processing systems, 31.
  4. Korenkevych, D., Mahmood, A. R., Vasan, G., and Bergstra, J. (2019). Autoregressive policies for continuous control deep reinforcement learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 2754–2762.
  5. Narvekar, S., Peng, B., Leonetti, M., Sinapov, J., Taylor, M. E., & Stone, P. (2020). Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181), 1-50.

Questions

  • In Fig. 3, what do the colours represent? Please be more elaborate. It is not clear at all at the moment
  • In the appendix, it's mentioned that "The agent starts each episode from an initial state of [0, 0]." In RL environments, environmental resets can implicitly help exploration [1]. How would DiCuRL + fixed start state fare against SAC only + random start states?
  • How does SAC only perform in the comparisons in Fig. 1?
  • How important is the AIM reward? It is a bit weird to sum the Q value and one-step intrinsic motivation reward. This results in different scales/magnitudes of values, which is why the authors needed to tune the coefficients.
  • To ask the previous question differently, can the AIM reward be substituted with simpler intrinsic motivation rewards like RND [2] or TD-error?
  • It seems SAC + HER would be a lot simpler to use computationally and algorithmically. How does DiCuRL compare against SAC + HER?

References

  1. Vasan, G., Wang, Y., Shahriar, F., Bergstra, J., Jagersand, M., & Mahmood, A. R. (2024). Revisiting Constant Negative Rewards for Goal-Reaching Tasks in Robot Learning. arXiv preprint arXiv:2407.00324.
  2. Burda, Y., Edwards, H., Storkey, A., & Klimov, O. (2018). Exploration by random network distillation. arXiv preprint arXiv:1810.12894.

Limitations

Please suggest how this work can be extended to challenging environments with larger state-action spaces.

Author Response

W1: The approach is quite complicated [...]

We carried out additional experiments on two robot manipulation tasks, please see the General comment and its attached PDF (Fig. 16) for more details.

W2:They missed citing a rich literature [...]

We will carefully review these papers and integrate them appropriately into the related work section to ensure a comprehensive discussion of existing research.

W3:The reward function for the Maze envs is not provided [...]

We are using a trainable reward function both for training the RL algorithm and for generating curriculum goals. The trainable reward function, see Eq. 4, is learned by minimizing the Wasserstein distance between state visitation and the desired goal distribution. Given that we are working with maze environments, using Euclidean distance would not be appropriate for a reward function. Following the same reward approach as OUTPACE, we use the AIM reward function trained simultaneously with the RL algorithms. As demonstrated in the supp. mat. and Fig. 13b in the attached PDF, the reward function converges to the desired goal as training progresses. Therefore, we integrated the reward function into our curriculum goal generation mechanism, as it is useful in Eq. 11 and has the potential to guide the agent toward the desired goal. In summary, we do not use a sparse or pre-defined dense reward equation. Instead, we utilize a trainable reward function which is randomly initialized and specifically trained to minimize the Wasserstein distance between state-visitation and the desired goal distribution.

Q1: In Fig. 3, what do the colours represent?

We will add a color bar to Fig. 3, which illustrates the corresponding timesteps for each different color of the curriculum goals. You can refer to our new plot in Fig. 15 in the attached PDF to have a preview of the modified figure.

Q2: [...] How would DiCuRL + fixed start state fare against SAC only + random start states?

Q3: How does SAC only perform in the comparisons in Fig. 1?

To address Q2 and Q3 jointly, we removed the curriculum goal generation part from the code and provided only the final desired goal to the agents in the maze tasks, which corresponds to training the agent solely with SAC. We obtained the first set of results by starting the agent in a fixed initial state at $(0,0)$. For the second set of results, we sampled the initial position of the agent uniformly at random in the environment. To avoid starting the agent inside the walls of the maze, we performed an infeasibility check: if the sampled initial state was inside the walls, we continued sampling until a feasible initial state was found. We compared our DiCuRL approach with both SAC + fixed initial state and SAC + random initial state across all maze environments. The success rate is shown in Fig. 14 of the attached PDF. We observe that including curriculum goals significantly improves the success rate in simpler tasks such as PointUMaze, helping the agent achieve the task successfully across different training seeds. Without the curriculum (SAC + fixed initial position), the agent struggles to achieve complex tasks consistently, resulting in high variance in performance. With SAC + random initial state, the agent often reaches the desired goal, at least in some scenarios. This success can be attributed to trial and error (without requiring curriculum goals), because the random starting positions help the agent avoid becoming stuck at maze walls, thus enhancing its ability to navigate the environment effectively.

Q4: How important is the AIM reward? [...]

You are absolutely right; this is why we need to use different coefficients. To illustrate the effects of the Q and AIM reward functions, we conducted an ablation study. Please see our General comment and its attached PDF (Fig. 13a) for more details.

Q5: [...] can the AIM reward be substituted with simpler intrinsic motivation rewards [...]

Yes, we believe that both RND and TD-error can potentially be substituted with the AIM reward. Indeed, in the AIM reward paper [a], the authors compare this approach with RND and show that RND performs worse compared to the AIM reward in certain tasks. Based on this evidence, we preferred to use the AIM reward for our method.

Q6: [...] How does DiCuRL compare against SAC + HER?

Given the timeframe we had, we couldn’t test SAC+HER on the maze tasks. Instead, to demonstrate that our approach can be extended to robotic manipulation and sparse reward tasks, and compare it to HER, we used the official repository of [b], where the authors compare their approach with HER in a sparse reward setting. As detailed in the General comment, we implemented DiCuRL using HER settings in sparse reward robotic manipulation tasks. Please note that we used DDPG instead of SAC, which is another off-policy RL algorithm. All methods, including ours, use sparse rewards for training DDPG. However, since our method is based on the AIM reward, we only use the AIM reward for generating curriculum goals. You can find the comparison in Fig. 16 of the attached PDF.

Limitations: Please suggest how this work can be extended to challenging environments with larger state-action spaces.

To demonstrate that our proposed method can be extended to more complex environments, such as robotic manipulation tasks, we compared our approach with HGG and HER. The success rates are shown in Fig. 16 of the attached PDF. We will be happy to include these additional experiments, along with a brief discussion of the new results, in the final version of our paper.

[a] Durugkar, I., et al. (2021), Adversarial intrinsic motivation for reinforcement learning

[b] Ren, Z., et al. (2019), Exploration via hindsight goal generation

Comment

Dear c6He,

Given that the author-reviewer discussion period will close soon, we would be really grateful if you could acknowledge our response and let us know if we addressed all of your concerns. Of course, we are happy to provide more details if you need them. We hope we can engage in an active discussion with you as we are doing with the other reviewers.

Thanks a lot.

The authors.

Comment

I would like to thank the authors for answering my questions. I apologize for the delay due to travel. I am particularly pleased with the robot experiments in simulation and the new plots with a colour map, which reaffirm my rating of 6.

Thank you for the additional explanation in W3 and Q1. It clarified some of my misconceptions regarding the paper. I also like the rewritten introduction better, especially since it doesn't read like a monograph anymore.

Thank you for the engaging response. I’m looking forward to seeing them in the final manuscript.

Comment

Thanks a lot for acknowledging our replies and for appreciating the new experiments.

Comment

Dear Reviewer,

Please note that we have revised the Related Work section as well, including the references you suggested (see the comments we posted under the General comment section).

Best regards,

The authors

Author Response

General Comment

We sincerely thank all reviewers for the time and effort devoted to reviewing our manuscript. To address the key points raised, we have provided detailed responses to each reviewer. All responses are organized into questions and weaknesses. For example, Q1 refers to the first question, while W1 to the first weakness. Where needed, we also included a reply concerning the limitations.

For some responses, we have included additional results, which are available in the attached PDF (Fig. 13-16) included in this General comment.

We have been diligently working on improving the paper on several fronts, addressing all comments to the best of our capacity. We hope that the reviewers and chairs will appreciate our efforts and we wish to engage in a fruitful discussion during the rebuttal period.

Below, we summarize the main changes made:

  • We conducted an ablation study to investigate the impact of the AIM reward function and the Q function in generating curriculum goals with our method (a schematic sketch of how these two terms enter the goal scoring is given after this list). For that, we omitted, separately, the reward function and the Q function from Eq. 11 in the paper, and plotted the success rate (with three different seeds) in Fig. 13a for the most challenging maze environment, PointSpiralMaze. The results indicate that the agent performs worse without the AIM reward function and fails to achieve the task without the Q function. The curriculum goals generated without the reward function and without the Q function are shown in Fig. 13c and 13d, respectively. Fig. 13b, instead, illustrates the AIM reward values across different training episodes, arranged clockwise. Specifically, the first row and first column in Fig. 13b represent the reward values at the very beginning of training. As training progresses, the reward values shift towards the left corner of the maze environment (1st row, 2nd column). In the middle of training, the reward values are concentrated around the left corner of the maze environment (2nd row, 2nd column), and, by the end of training, the reward values converge to the desired goal area (2nd row, 1st column). This progression explains why, without the Q function, the generated curriculum goals do not guide the agent effectively but are instead distributed around the corner points shown in Fig. 13d. We have also illustrated the behavior of the AIM reward function across different training episodes in our supp. mat. for the PointUMaze environment.
  • Additionally, we examined the impact of SAC with a fixed initial state [0,0] and SAC with a random initial state. To do that, we removed the curriculum goal generation mechanism, assigned the desired goal directly, and then trained the agent using either SAC with a fixed initial state [0,0] or SAC with a random initial state. For the random initial state, we sampled the initial positions uniformly at random in the environment. To avoid starting the agent inside the maze walls, we performed an infeasibility check, resampling the initial state until it was feasible. We compared our approach, using three different seeds, with both the fixed initial state + SAC and the random initial state + SAC across all maze environments. The success rates are shown in Fig. 14 in the attached PDF. Note that the success rate for the PointUMaze environment in Fig. 14a is shown up to timestep $10^5$, whereas it was shown up to $10^6$ in the paper.
  • We also displayed the curriculum goals generated by the baseline method OUTPACE in Fig. 15, with a color bar indicating the corresponding timesteps.
  • To demonstrate the applicability of our method to different tasks, particularly robot manipulation tasks, we implemented our approach using the official Hindsight Goal Generation (HGG) repository. We converted the HGG code from TensorFlow to PyTorch to integrate it with our diffusion model, which is based on PyTorch. We selected two robot manipulation tasks, FetchPush and FetchPickAndPlace, and increased the environment difficulty by expanding the desired goal area. This is shown in Fig. 16c and 16d, where the yellow area indicates the object sampling region and the blue area indicates the desired goal sampling region. The action space is four-dimensional: three dimensions represent the Cartesian displacement of the end effector, and the last dimension controls the opening and closing of the gripper. The state space is 25-dimensional, including the end-effector position, the position and rotation of the object, the linear and angular velocities of the object, and the left and right gripper velocities. More detailed information regarding the action space and observation space of these robotic tasks can be found in the Gymnasium library documentation. For these additional experiments, we compared our method with HGG and HER, all trained with the DDPG algorithm to ensure alignment with the baselines, using five different seeds. The results are shown in Fig. 16a and Fig. 16b, for FetchPush and FetchPickAndPlace respectively. Note that in this setting, all RL algorithms (including ours) use a binary reward (i.e., a sparse reward). However, since our curriculum goal generation algorithm is based on the AIM reward, we implemented the AIM reward function solely to generate curriculum goals, while still training the DDPG algorithm with the sparse reward and using the diffusion model to generate the curriculum goals.
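As a schematic reference for the ablation above, the two terms removed from Eq. 11 can be pictured as follows. The callables `q_net` and `aim_reward` and the coefficients `lambda_q` and `lambda_r` are placeholders for this sketch; the exact form of Eq. 11 and the way it guides the diffusion model are given in the paper.

```python
def curriculum_goal_score(goal, state, q_net, aim_reward, lambda_q=1.0, lambda_r=1.0):
    """Schematic scoring of a candidate curriculum goal (not the exact Eq. 11).

    A candidate goal is preferred when it is still achievable from the current
    state (high Q value) and lies closer to the desired goal (high AIM reward).
    The ablation in Fig. 13 corresponds to setting lambda_r = 0 (no AIM reward term)
    or lambda_q = 0 (no Q term).
    """
    q_term = q_net(state, goal)   # achievability of the candidate goal
    r_term = aim_reward(goal)     # proximity to the desired goal distribution
    return lambda_q * q_term + lambda_r * r_term
```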

NOTE: Reviewer 7Mgp asked about GANs. We did compare our method with GOAL-GAN, where a goal generator proposes goal regions and a goal discriminator is trained to evaluate whether a goal is at the right level of difficulty for the policy. Although GOAL-GAN, unlike our approach, does not take the target distribution into account, we still consider it a pertinent baseline.

Reproducibility

We have created a new (anonymous) repository containing the code for the additional experiments:

https://anonymous.4open.science/r/HER_diffusion-EB4F

The original (anonymous) codebase is available at (see L264 in the paper):

https://anonymous.4open.science/r/diffusioncurriculum/.

Comment

Dear Reviewers and Chairs,

Please find below our revised version of the introduction (we post this under the General comment because changes to the introduction were requested concurrently by FHis and CrFc). The introduction has been restructured to be more effective: we removed the bullet points at the end and merged them into the Contributions paragraph, and we balanced the length of all paragraphs. Note that we are also working on the related work section to include a critical analysis of the papers suggested by c6He and CrFc, discussing the differences w.r.t. our proposed method.


1 Introduction

Reinforcement learning (RL) is a computational method that allows an agent to discover optimal actions through trial and error by receiving rewards and adapting its strategy to maximize cumulative rewards. Deep RL, which integrates deep neural networks (NNs) with RL, is an effective way to solve large-dimensional decision-making problems, such as learning to play video games [1, 2], chess [3], Go [4], and robot manipulation tasks [5, 6, 7, 8]. One of the main advantages of deep RL is that it can tackle difficult search problems where the expected behaviors and rewards are often sparsely observed. The drawback, however, is that it typically needs to thoroughly explore the state space, which can be costly especially when the dimensionality of this space grows.

Some methods, such as reward shaping [9], can mitigate the burden of exploration, but they require domain knowledge and prior task inspection, which limits their applicability. Alternative strategies have been proposed to enhance the exploration efficiency in a domain-agnostic way, such as prioritizing replay sampling [8, 10, 11] or generating intermediate goals [12, 13, 14, 15, 16, 17, 18, 19]. This latter approach, known as Curriculum Reinforcement Learning (CRL), focuses on designing a suitable curriculum to guide the agent gradually toward the desired goal.

Various approaches have been proposed for the generation of curriculum goals. Some methods focus on interpolation between a source task distribution and a target task distribution [20, 21, 22, 17]. However, these methods often rely on assumptions that may not hold in complex RL environments, such as specific parameterization of distributions, hence ignoring the manifold structure in space. Other approaches adopt optimal transport [13, 23], but they are typically applied in less challenging exploration scenarios. Curriculum generation based on uncertainty awareness has also been explored, but such methods often struggle with identifying uncertain areas as the goal space expands [15, 24, 12].

Some research minimizes the distance between generated curriculum and desired outcome distributions using Euclidean distance, although this approach can be problematic in certain environments [19, 25]. Other methods incorporate graph-based planning, but require an explicit specification of obstacles [26, 27]. Lastly, approaches based on generative AI models have been proposed. For instance, [14] uses GANs to generate tasks of intermediate difficulty, but it relies on arbitrary thresholds. Alternatively, diffusion models have been applied in offline RL settings [28, 29, 30].

Despite these advancements, existing CRL approaches still struggle to generate suitable intermediate goals, particularly in complex environments with significant exploration challenges. To address this limitation, in this paper we propose DICURL (Diffusion Curriculum Reinforcement Learning). Our method leverages conditional diffusion models to dynamically generate curriculum goals, guiding agents towards desired goals while simultaneously considering the $Q$-function and a trainable reward function based on Adversarial Intrinsic Motivation (AIM) [31].

Contributions Unlike previous offline RL approaches [28, 29, 30] that train and use diffusion models for planning or policy generation relying on pre-existing data, DICURL facilitates online learning, enabling agents to learn effectively without requiring domain-specific knowledge. This is achieved by three key elements. (1) The diffusion model captures the distribution of visited states and facilitates exploration through its inherent noising and denoising mechanism. (2) As the $Q$-function predicts the cumulative reward starting from a state and a given goal while following a policy, we can determine feasible goals by maximizing the $Q$-function, ensuring that the generated goals are challenging yet achievable for the agent. (3) The AIM reward function estimates the agent's proximity to the desired goal and allows us to progressively shift the curriculum towards the desired goal.

We compare our proposed approach with nine state-of-the-art CRL baselines in three different maze environments and two robotic manipulation tasks simulated in MuJoCo [32]. Our results show that DICURL surpasses or performs on par with the state-of-the-art CRL algorithms.

Comment

Dear Reviewers and Chairs,

Please find below our revised version of the Related Work section (we post this under the General comment because changes to this section were requested concurrently by c6He and CrFc). We reorganized this section into three parts: the first related to CRL (which now includes the references suggested by c6He), the second related to diffusion models (as suggested by CrFc), and the last highlighting the limitations of current works and the distinctive aspects of our method.

We do hope we have addressed the remaining concerns regarding the clarity and quality of the text. Any further suggestions are welcome.


2 Related Work

Curriculum Reinforcement Learning CRL [33] algorithms generally adjust the sequence of learning experiences to improve the agent’s performance or accelerate training. These algorithms focus on formulating intermediate goals that progressively guide the agent toward the desired goal, and have been successfully applied to various tasks, mainly in the field of robot manipulation [34, 35, 36, 37].

Hindsight Experience Replay (HER) [38] tackles the challenge of sparse reward RL tasks by employing hindsight goals, considering the achieved goals as pseudo-goals, and substituting them for the desired goal. However, HER struggles to solve tasks when the desired goals are far from the initial position. Hindsight Goal Generation (HGG) [19] addresses the inefficiency issue inherent in HER by generating hindsight goals through maximizing a value function and minimizing the Wasserstein distance between the achieved goal and the desired goal distribution.

CURROT [23] and GRADIENT [13] both employ optimal transport for the generation of intermediate goals. CURROT formulates CRL as a constrained optimization problem and uses the Wasserstein distance to measure the distance between distributions. Conversely, GRADIENT introduces task-dependent contextual distance metrics and can manage non-parametric distributions in both continuous and discrete context settings; moreover, it directly interprets the interpolation as the geodesic from the source to the target distribution.

GOAL-GAN [14] generates intermediate goals using a Generative Adversarial Network (GAN) [39], without considering the target distribution. A goal generator is used to propose goal regions, and a goal discriminator is trained to evaluate if a goal is at the right level of difficulty for the current policy. The specification of goal regions is done using an indicator reward function, and policies are conditioned on the goal as well as the state, similarly to a universal value function approximator [40].

PLR [15] uses selective sampling to prioritize instances with higher estimated learning potential for future revisits during training. Learning potential is estimated using TD-Errors, resulting in the creation of a more challenging curriculum. VSD [16] estimates the epistemic uncertainty of the value function and selects goals based on this uncertainty measure. The value function confidently assigns high values to easily achievable goals and low values to overly challenging ones. ACL [18] maximizes the learning progress by considering two main measures: the rate of improvement in prediction accuracy and the rate of increase in network complexity. This signal acts as an indicator of the current rate of improvement of the learner. The ALP-GMM [17] fits Gaussian Mixture Models (GMM) using an Absolute Learning Progress (ALP) score, which is defined as the absolute difference in rewards between the current episode and the previous episodes. The teacher generates curriculum goals by sampling environments to maximize the student's ALP, which is modeled by the GMM.

Finally, OUTPACE [12] employs a trainable intrinsic reward mechanism, known as Adversarial Intrinsic Motivation (AIM) [31] (the same used in our method), which is designed to minimize the Wasserstein distance between the state visitation distribution and the goal distribution. This function increases along the optimal goal-reaching trajectory. For curriculum goal generation, OUTPACE uses Conditional Normalized Maximum Likelihood (CNML) to classify state success labels based on their association with visited states, out-of-distribution samples, or the desired goal distribution. The method also prioritizes uncertain and temporally distant goals using meta-learning-based uncertainty quantification [41] and Wasserstein-distance-based temporal distance approximation.

(continues in the next comment)

Comment

(continues from previous comment)

Diffusion Models for Reinforcement Learning UniPi [42] leverages diffusion models to generate a video as a planner, conditioned on an initial image frame and a text description of a current goal. Subsequently, a task-specific policy is employed to infer action sequences from the generated video using an inverse dynamic model. AVDC [43] constructs a video-based robot policy by synthesizing a video that renders the desired task execution and directly regresses actions from the synthesized video without requiring any action labels or inverse dynamic model. It takes RGBD observations and a textual goal description as inputs, synthesizes a video of the imagined task execution using a diffusion model, and estimates the optical flow between adjacent frames in the video. Then, using the optical flow and depth information, it computes robot commands.

Diffusion Policy [44] uses a diffusion model to learn a policy through a conditional denoising diffusion process. BESO [45] adopts an imitation learning approach that learns a goal-specified policy without any rewards from an offline dataset. DBC [46] uses a diffusion model to learn state-action pairs sampled from an expert demonstration dataset and increases generalization using the joint probability of the state-action pairs. Finally, Diffusion BC [47] uses a diffusion model to imitate human behavior and capture the full distribution of observed actions on robot control tasks and 3D gaming environments.

Limitations of current works and distinctive aspects of DICURL The aforementioned studies typically require offline data for training. Both [42] and [43] employ diffusion models to synthesize videos for rendering the desired task execution, and actions are then inferred from such videos. Studies such as [44, 45, 46, 47] also focus on learning policies from offline datasets. Despite these efforts, reliance on inadequate demonstration data can lead to suboptimal performance [44]. Distinct from these approaches, our method does not rely on prior expert data or any pre-collected datasets. As an off-policy RL method, DICURL instead collects data through interaction with the environment.



This concludes the revised version of the Related Work section.
You can find below the list of added references ([33-37] suggested by c6He; [42-47] suggested by CrFc):

[33] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. Journal of Machine Learning Research, 21(181):1–50, 2020. (NOTE: this paper was already cited in the original version of our manuscript, although it referred to the preprint version; now it refers to the JMLR version.)

[34] Martin Riedmiller, Roland Hafner, Thomas Lampe, Michael Neunert, Jonas Degrave, Tom Wiele, Vlad Mnih, Nicolas Heess, and Jost Tobias Springenberg. Learning by playing solving sparse reward tasks from scratch. In International conference on machine learning, pages 4344–4353. PMLR, 2018.

[35] Tim Hertweck, Martin Riedmiller, Michael Bloesch, Jost Tobias Springenberg, Noah Siegel, Markus Wulfmeier, Roland Hafner, and Nicolas Heess. Simple sensor intentions for exploration. arXiv preprint arXiv:2005.07541, 2020.

[36] Ashvin V Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. Advances in neural information processing systems, 31, 2018.

[37] Dmytro Korenkevych, A Rupam Mahmood, Gautham Vasan, and James Bergstra. Autoregressive policies for continuous control deep reinforcement learning. arXiv preprint arXiv:1903.11524, 2019.

[42] Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36, 2024.

[43] Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576, 2023.

[44] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.

[45] Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal-conditioned imitation learning using score-based diffusion policies. arXiv preprint arXiv:2304.02532, 2023.

[46] Hsiang-Chun Wang, Shang-Fu Chen, Ming-Hao Hsu, Chun-Mao Lai, and Shao-Hua Sun. Diffusion model-augmented behavioral cloning. arXiv preprint arXiv:2302.13335, 2023.

[47] Tim Pearce, Tabish Rashid, Anssi Kanervisto, Dave Bignell, Mingfei Sun, Raluca Georgescu, Sergio Valcarcel Macua, Shan Zheng Tan, Ida Momennejad, Katja Hofmann, et al. Imitating human behaviour with diffusion models. arXiv preprint arXiv:2301.10677, 2023.

Comment

Dear reviewers,

We sincerely thank you for the fruitful and engaging discussion and for always maintaining a positive attitude during our interactions.

We believe that the manuscript has improved a lot thanks to your constructive feedback.

Best regards,

The authors

Final Decision

This paper has received unanimously positive reviews, with some Reviewers raising their scores during the discussion period. The engagement between Authors and Reviewers has been quite good, which speaks in favor of the quality of the assessment of this work. In particular, Reviewers have appreciated the novelty of the proposed method, the clarity of the presentation, and the empirical evaluation.

I encourage the Authors to incorporate Reviewers' feedback in the final version.