6.2

/10

Poster5 位审稿人

最低5最高7标准差0.7

3.4

置信度

正确性3.0

贡献度2.8

表达2.8

NeurIPS 2024

Exploring the Edges of Latent State Clusters for Goal-Conditioned Reinforcement Learning

Yuanlin Duan,Guofeng Cui,He Zhu

OpenReview PDF

提交: 2024-05-16更新: 2025-01-16

摘要

Exploring unknown environments efficiently is a fundamental challenge in unsupervised goal-conditioned reinforcement learning. While selecting exploratory goals at the frontier of previously explored states is an effective strategy, the policy during training may still have limited capability of reaching rare goals on the frontier, resulting in reduced exploratory behavior. We propose "Cluster Edge Exploration" (CE$^2$), a new goal-directed exploration algorithm that when choosing goals in sparsely explored areas of the state space gives priority to goal states that remain accessible to the agent. The key idea is clustering to group states that are easily reachable from one another by the current policy under training in a latent space, and traversing to states holding significant exploration potential on the boundary of these clusters before doing exploratory behavior. In challenging robotics environments including navigating a maze with a multi-legged ant robot, manipulating objects with a robot arm on a cluttered tabletop, and rotating objects in the palm of an anthropomorphic robotic hand, CE$^2$ demonstrates superior efficiency in exploration compared to baseline methods and ablations.

关键词

Reinforcement LearningGoal-Conditioned Reinforcement LearningModel-Based Reinforcement LearningExploration StrategiesWorld Models

评审与讨论

审稿意见

评分: 6置信度: 32024-07-13

This paper considers goal selection in goal-conditioned reinforcement learning. The core idea of this paper is to group states with small temporal distance into clusters, and then select goals that are on the clutter boundaries. In addition, the method gives priority to goal states that are accessible to the agent. Once the agent reaches the goal, a go-explore style exploration can be performed to potentially add new visited states to the clusters. The authors test their system across 5 different domains, and show improved performance over previous methods.

优点

This paper is well written and easy to understand.
The idea of learning state representations that can help group state and identify easy-to-reach states is novel and exciting.
The experiment is thorough and the results are promising.

缺点

It is worth pointing out that the method is only evaluated in relatively simple domains, with the maximum state dimensions being 29.
See the question section.

问题

Figure 9 seems confusing. It seems to imply that CE2 would bias the exploration direction towards task completion, which shouldn’t be the case considering that the training phase of CE2 is task agnostic.
For previous methods like MEGA or PEG, couldn’t you still use the temporal distance to select “next goal to explore”? Why is the clustering necessary here?
Is there a more rigorous way of understanding why “less explored regions are naturally adjacent to these boundaries”? This seems intuitive in the original state space but I’m not sure if it carries over to the latent state space.

局限性

The limitation is briefly discussed in the conclusion section.

作者回复

2024-08-07

We appreciate your insightful feedback and constructive comments!

R1. State Dimensionality in the Experiment Benchmarks

The maximum state dimensions in our test suite exceed 29. Block rotation and pen rotation involve an anthropomorphic robotic hand with 24 joints. In total, the action space has 20 dimensions for absolute angular positions of the actuated joints and the observation space has 61 dimensions for information about the robot’s joint and block/pen states.

R2. Figure 9 seems confusing. It seems to imply that CE2 would bias the exploration direction towards task completion, which shouldn’t be the case considering that the training phase of CE2 is task agnostic.

We apologize for the confusion. Figure 9 was intended to visualize environment exploration in CE $^2$ -G. This paper presents two training algorithms where CE $^2$ (Algorithm 2, Line 212) is for unsupervised exploration in unknown environments and CE $^2$ -G (Algorithm 3, Line 237) assumes the test goal distribution is available to the agent at training time. CE $^2$ -G progressively expands the scope of exploration around the possible trajectories leading to the environment goals. We will correct the image label in Figure 9.

R3. Is there a more rigorous way of understanding why “less explored regions are naturally adjacent to these boundaries”? This seems intuitive in the original state space but I’m not sure if it carries over to the latent state space.

Less explored regions are located near the boundaries of latent state clusters due to the way we construct the latent space. Our loss function for training the latent space ( $\mathcal{L}_{dt}$ in Line 156, Equation 2) ensures that states easily reachable from one another in the real environment (determined by a learned temporal distance network $D_t$ in Line 120, Equation 1) are also close in proximity within the latent space. By enforcing the latent space to express the temporal distance between different states, $CE^2$ enables both efficient state clustering and frontier state identification.

R4. For previous methods like MEGA or PEG, couldn't you still use the temporal distance to select “next goal to explore”? Why is the clustering necessary here?

Similar to the approach taken by MEGA, which sets frontier goals in low-density regions within the replay buffer, we could simply select goals that are furthest from the initial states, as determined by the learned temporal distance network $D_t$ (Line 120, Equation 1). However, this strategy can reduce exploratory behavior because the policy during training may still have limited capability in reaching rare goals. Instead, CE $^2$ selects the next goal to explore at the edges of latent state clusters, providing two benefits: (1) less explored regions are adjacent to these boundaries, and (2) given the easy accessibility between states within each cluster by the training policy, the agent's capability extends to reaching states even at the cluster boundaries. In other words, clustering enables CE $^2$ to precisely identify the frontier of previously explored states. For example, as visualized in Fig. 11 in the appendix for the Ant Maze environment, CE $^2$ enhances exploration efficiency by consistently setting exploratory goals within the current policy's capabilities. In contrast, MEGA and PEG often set goals that are unlikely to be reachable by the current agent.

We will update the paper to integrate all the discussions above.

2024-08-08

I thank the authors for the clarifications. I have no further questions.

审稿意见

评分: 7置信度: 32024-07-19

The paper introduces a cluster edge exploration (CE2) algorithm, which is implementing the “Go-Explore” principle in a – to the best of my knowledge – novel manner. Key idea is to use clustering of the state space latents – go to one of these clusters and then explore from there. As a main result, exploration improves and, as a consequence, success rates rise.

优点

The idea builds on prior work based on go-explore, is charmingly simple, and yields positive effects as expected.

The model outperforms previous methods on standard tasks.

缺点

The evaluations are rather restricted to some standard artificial benchmark tasks.

The temporal distance network seems tedious to train additionally.

The algorithm in the end just introduces yet another method to explore the edge of the search space.

The clustering algorithm seems detached from the latent learning algorithm – it does not structure the latent state in any way.

问题

Eq. 2 is not quite clear – particularly I wonder if here is not a unit problem as the first distance measure in latent state is representational while the second one depends on the movement distance estimates. Isn’t this severely restricting?

Eq. 4 could you elaborate slightly? How are p and q densities represented / determined?

Where does the learned value exploration function come from?

What about environment, where sub-areas are hard to get to but do not offer themselves suitably for any further exploration.

局限性

An integration / automatic abstraction via shielded / gated latents would be much more appealing.

The evaluations are restricted to a standard test suite that does not really require deep reward propagation.

作者回复

2024-08-07

We appreciate your insightful feedback and constructive comments!

R1. The temporal distance network seems tedious to train additionally

The temporal distance network can be trained efficiently using supervised learning from replay buffer trajectories to predict action steps from the current state to a goal state. Temporal distance networks are commonly used to create reward functions in goal-conditioned reinforcement learning e.g. [1,2]. We follow these prior works to implement this approach.

[1] Discovering and Achieving Goals via World Models. NeurIPS 2021.

[2] Planning Goals for Exploration. ICLR 2023.

R2. The algorithm just introduces yet another method to explore the edge of the search space

Our algorithm, CE $^2$ , tackles the core challenge in the Go-Explore mechanism: how to select an exploration-inducing goal command $g$ and effectively guide the agent to $g$ ? Previous approaches, such as MEGA (Pitis et al., 2020), set exploratory goals at rarely visited regions of the state space. However, in these approaches, the policies under training may have limited capability of reaching the chosen rare goals, leading to less effective exploration. Our contribution is a novel goal selection algorithm that prioritizes goal states in sparsely explored areas of the state space, provided they remain accessible to the agent. This is the key factor in why CE $^2$ outperforms the MEGA and PEG (Hu et al., 2023) baselines in our benchmark suite in Fig. 3. As visualized in Fig. 11 in the appendix for the Ant Maze environment, CE $^2$ enhances exploration efficiency by consistently setting exploratory goals within the current policy's capabilities. In contrast, MEGA and PEG often set goals that are unlikely to be reachable by the current agent.

R3. The clustering algorithm seems detached from the latent learning algorithm – it does not structure the latent state in any way.

While our clustering algorithm does not directly structure the latent space, it requires the latent space to be organized in a specific manner to be effective. In other words, the latent space learning algorithm is a key prerequisite for the latent state clustering algorithm. Specifically, our latent space learning algorithm structures the latent space such that states easily reachable from one another in the real environment (as determined by the learned temporal distance network $D_t$ in Line 120, Equation 1) are also close together in the latent space. The clustering algorithm leverages this structure property to ensure that the latent state cluster boundaries align with the frontier of previously explored states. As such, CE $^2$ can efficiently generate exploratory goals at the frontier at training time.

R4. Eq. 2 is not quite clear – particularly I wonder if here is not a unit problem as the first distance measure in latent state is representational while the second one depends on the movement distance estimates

In the implementation, to address potential scale issues, we normalize the movement distance estimate, used in the second component of Eq. 2, by dividing it by the trajectory length to ensure it is smaller than 1, ensuring it falls within a range similar to the first component. We will clarify it in the paper.

R5. Eq. 4 could you elaborate slightly? How are p and q densities represented / determined?

Eq. 4 is used to optimize the Gaussian Mixture Models (GMMs) iteratively on sampled batches from the replay buffer. $p$ and $q$ are represented as Gaussian distributions within the GMMs. $q(c \vert \Psi(s))$ is the postior distribution over $c$ (the clusters) given an encoded state $\Psi(s)$ . $\log p(\Psi(s) \vert c)$ is the distribution donating the probability of the encoded state $\Psi(s)$ in cluster $c$ . $p(c)$ is the prior distribution of the weight of clusters in GMMs. For each round of optimization, we increase the probability of the sampled batches in GMMs by updating the weight of each cluster $c$ in GMMs and the mean and variance of them.

R6. Where does the learned value exploration function come from?

The learned exploration value function (used in Equation 7 for determining the exploration potential of a chosen goal state) is introduced in Line 109. This value function is used to guide the training of the exploration policy in our Go-Explore mechanism and is updated per learning iteration in our model-based GCRL framework (Line 12 of Algorithm 2). It encourages exploration by leveraging the Plan2Explore (Sekar et al. (2020)) disagreement objective, which motivates the agent to explore states in less familiar areas of the environment that the world models haven't adequately learned (intuitively such states often induce discrepancies among an ensemble of world models).

R7. What about environment, where sub-areas are hard to get to but do not offer themselves suitably for any further exploration

We indeed considered such environments. We tested CE $^2$ in Point Maze and Ant Maze (Fig. 2) which contain dead ends from which exploration is doomed to fail. As explained in Equations 6 and 7 (Line 200), CE $^2$ selects goal states from latent state cluster boundaries with the highest exploration potential. We use the learned exploration value function (see R6) to evaluate the exploration potential of a goal state. As exploration progresses, the exploration value of states at a dead end decreases. Consequently, CE $^2$ can then select other goal states on the cluster boundaries that have not yet been explored well to escape the dead end.

R8. The evaluations are restricted to a standard test suite that does not really require deep reward propagation.

Our benchmarks include tasks with horizons of up to 500 steps (Ant Maze and Walker). The environments are also high-dimensional; for instance, Pen Rotation and Block Rotation have 61 observation space dimensions and 20 action space dimensions.

We will revise the paper to incorporate all the above discussions.

评论- Discussion

2024-08-14

Dear Reviewer WfSq,

We kindly inquire whether the information provided in our rebuttal adequately addresses your concerns. If you have any further questions or issues, we would be grateful for the opportunity to address them. Thank you once again for your valuable feedback and constructive comments!

Best regards,

The Authors

2024-08-14

Thank you for your careful responses and elaborations on your method.

I would further encourage to discuss approaches that more suitably structure the latent state space, but also see the merit of the paper to explore latent space-oriented exploration even without any further inductive bias to shape the latent space itself.

The results clearly show highly competitive performance, and the technique applied, as it is rather simple, is definitely universally useful.

Thus, I raise my score to accept.

审稿意见

评分: 5置信度: 42024-07-22

This paper develops an approach for frontier exploration in the context of model based reinforcement learning. The key idea of the paper (inspired by prior work Go-Explore) is to cluster a group reachable states in the latent space and keep track of the current frontier, such that new goals can be sampled close to the frontier. Experiments in simulated control benchmarks show that the proposed approach ism ore effective in solving hard exploration tasks compared to a few related baselines.

优点

The paper is well motivated and targets an important problem in reinforcement learning that has been studied for many years - that of effective exploration.

Under the goal-conditioned RL setting, the proposed modification to a prior work Go-Explore is novel to the best of my knowledge, and is intuitively sound in terms of choosing goals that are near the frontier and yet "accessible" with a high probability.

The paper is well presented, with sufficient background on prior works, and appropriate details on the algorithm and simulation environments used for experiments.

The experiments show better results than baselines on some challenging exploration tasks like in-hand manipulation and point maze navigation. The ablation experiments are good and provide into different stages of the approach like goal sampling, reaching the farthest reachable state, and exploration heuristic beyond that.

缺点

One of the main weaknesses of the paper is that the delta in terms of core algorithmic contribution beyond go-explore is limited. The paper pitches the core contribution as an exploration strategy but couples it with model-based RL in the instantiation, but the exploration algorithm on its own is a small modification to Go-Explore.

A small change to an existing approach (Go-Explore) leading to massive improvements in tasks would be interesting, but the paper is not convincing in showing this. One reason for that is Go-Explore was shown to perform well in some very challenging scenarios (like Montezuma's Revenge) and the proposed approach doesn't perform direct comparisons on those environments.

Combining model-based RL with exploration is a bit confusing. In many environments, a hard exploration problem needs to be solved (because of sparse rewards and long horizon tasks but learning a model of the world is an even harder task than exploring the environment to learn a policy - it seems the proposed approach will be very limiting in such scenarios)

问题

Please refer to the weaknesses above.

The paper pitches the core contribution as an exploration strategy but couples it with model-based RL in the instantiation, but the exploration algorithm on its own is a small modification to Go-Explore. Is there a reason why this instantiation is necessary?
Can the authors comment on the choice of completely orthogonal environments to those in Go-Explore (which is a direct baseline) ?
How feasible is the approach for deployment in real world control systems where building a model of the world may be much harder than learning a policy to solve a task?

局限性

Please refer to the weaknesses and questions above, and address limitations of the approach.

作者回复

2024-08-07

We appreciate your insightful feedback and constructive comments!

R1. One of the main weaknesses of the paper is that the delta in terms of core algorithmic contribution beyond go-explore is limited.

R2. Can the authors comment on the choice of completely orthogonal environments to those in Go-Explore (which is a direct baseline) ?

As discussed in R1, the core challenge in the Go-Explore mechanism lies in selecting goal states that effectively trigger further exploration upon being reached. However, the original Go-Explore method (Ecoffet et al. (2019)) does not prescribe a general goal selection method, instead opting for a hand-engineered novelty bonus for each task (e.g. task-specific pseudo-count tables). CE $^2$ is more related to recent instantiations of Go-Explore that automatically selects exploration-inducing goals in less-visited areas of the state space to broaden the range of reachable states, e.g. MEGA and PEG. Therefore, we compare our method with these tools instead of Ecoffet et al. (2019) in environments where these tools are applicable, to evaluate the strength of our goal selection method.

R3. The paper pitches the core contribution as an exploration strategy but couples it with model-based RL in the instantiation. Is there a reason why this instantiation is necessary?

Our method, CE $^2$ , follows the Go-Explore mechanism by executing an exploration policy (defined in Line 109) upon reaching a chosen goal state to expand the range of reachable states. A primary reason for implementing CE $^2$ with model-based RL is to effectively train the exploration policy. We use intrinsic explorer rewards to encourage the exploration policy to reach novel states where predictions among an ensemble of world models diverge, based on the Plan2Explore disagreement objective (Sekar et al., 2020). The exploration policy is optimized purely from imagined trajectories. In our experience, this strategy results in stronger exploration capabilities compared to epsilon exploration policies, which randomly "fill in" sparse regions of the achieved goal space. For example, our model-based MEGA baseline uses the same exploration policy as ours and demonstrates improvement over the original MEGA implementation, which uses random epsilon exploration, in the benchmark suites we consider. We adopt this model-based approach for all the baselines to ensure the exploration policy is not the bottleneck in our experiments.

R4. How feasible is the approach for deployment in real world control systems where building a model of the world may be much harder than learning a policy to solve a task?

The exploration strategy in CE $^2$ can be integrated with any model-based RL algorithm and applied to any tasks where model-based RL is applicable. Here, we specifically focus on the exploration problem in the unsupervised goal-conditioned reinforcement learning (GCRL) setting. During training, there are no predefined tasks or goals. A successful agent should be able to navigate to a wide range of previously unknown goal states upon receiving goal commands only revealed at test time. Intuitively, learning a policy in this setting is as challenging as learning a world model due to the vast range of possible goals. We believe this is a practical setting for deploying an agent into an unknown environment - the agent must learn about the environment and identify the feasible tasks it can perform without prior specifications. That being said, a model-free version of CE $^2$ should be conceptually simpler e.g., training the exploration policy with intrinsic rewards from an ensemble of goal-conditioned value functions instead of world models. Incorporating task-specific information, such as demonstrations and background knowledge, to guide CE $^2$ 's exploration towards user-preferred areas would enhance its practicality, particularly for real-world robotics applications. We leave these extensions for future work.

We will incorporate the discussions from R1 to R4 into the paper.

评论- thanks for the response. my main concerns remain and so will not fight for acceptance

2024-08-13

Dear authors,

Thanks for the rebuttal response. The explanations to my questions are helpful, and I would definitely encourage providing the additional context regarding goal selection being the key proposed novel contribution in the revised paper. However my main concerns regarding comparison with Go-Explore more directly, limited algorithmic contributions, and empirical evidence for applicability to model-free scenarios remain. In particular, I think a more direct comparison to Go-Explore is necessary because "the strength of our goal selection method" is something subjective in the larger context of the exploration strategy i.e. it may be the case that Go-Explore's 'heuristic' goal-selection strategy is good enough for tough exploration problems and the proposed approach is actually not scalable to these challenging scenarios - without empirical comparisons, we simple don't know! (I understand that experimental comparisons might be beyond the scope of the short rebuttal window) Apart from my concerns, I do not see any major flaws in the algorithm/experiments, and as such I am not recommending reject, but will not fight for acceptance if any other reviewers have major concerns.

评论- Rationale for Not Directly Comparing with Go-Explore (Ecoffet et al. (2019))

2024-08-13

Dear Reviewer 7wGc,

Thank you so much for your comments. We will provide additional context in the revised paper, emphasizing that goal selection is the key novel contribution.

We want to clarify an important reason for not comparing with Go-Explore (Ecoffet et al. (2019)). Unlike CE $^2$ , Go-Explore does not prescribe a general approach for goal selection to induce exploration; it relies on a hand-designed pseudocount metric. Creating these task-specific pseudocount tables requires significant domain knowledge, which CE $^2$ does not assume. For instance, in the Go-Explore implementation applied to robotics, the domain knowledge representation must be derived from the internal state of the MuJoCo simulator, such as the 3D positions of the robot's gripper (discretized in voxels with sides of length 0.5 meters), whether the robot is currently touching (with a single grip) or grasping (touching with both grips) the object, and whether the object is in the target location. Designing the Boolean predicates requires sophisticated, case-specific code, and these predicates are not generalizable to the diverse tasks in our benchmarks, such as the anthropomorphic robotic hand with 24 joints in the Block Rotation and Pen Rotation tasks.

Directly comparing CE $^2$ , a general, automatic algorithm for goal selection, with Go-Explore—given its reliance on extensive, case-specific domain knowledge—would not be a fair comparison. Instead, we have compared CE $^2$ with more recent versions of Go-Explore, such as MEGA (Pitis et al. (2020)) and PEG (Hu et al. (2023)), which automatically select exploration-inducing goals. These tools, like ours, built on top of Go-Explore, do not assume access to domain knowledge. We believe this comparison better highlights the effectiveness of our goal selection method in unknown robotics environments where domain knowledge is not available (as outlined in Section 2, our focus is on unsupervised exploration in unknown environments, where no predefined task information is provided during the exploration stage).

That said, we are happy to include a comparison with Go-Explore in the appendix of the revised paper to showcase how the CE $^2$ agent learns to explore the environment compared to an agent that uses domain knowledge.

We hope this response clarifies the reviewer's concerns. We look forward to the follow-up discussion and are happy to address any further comments or questions.

审稿意见

评分: 6置信度: 42024-07-26

This paper presents a method called "Cluster Edge Exploration" (CE2) to perform goal selection in goal-conditioned reinforcement learning and enable efficient exploration. Concretely, the method builds on the Go-Explore principle, which learns separate policies for exploration and goal-reaching.

The main idea is to learn a latent embedding space parameterized by Gaussian Mixture Models (GMMs) in the Dreamer framework. Selecting a goal is a two-step process. First, the method samples points from the GMM and only keeps a low probability collection. These points approximate the boundary of the existing explored states. Next, the algorithm uses the goal-conditioned policy to perform an imaginary rollout towards each goal candidate. That goal is selected if the rollout ends at a state with the highest exploration value. In addition, when the test goal distribution is known, the CE2-G variant learns the GMM only with trajectory data from test goals. This further provides an inductive bias to sample goals near the test distribution.

The paper presents extensive empirical studies in navigation, locomotion, and manipulation. Quantitatively, both CE2 and CE2-G outperform various baselines in their respective settings. Through visualizations of the sampled goals, CE2 bears the advantage that it samples points that are both near the frontier of the existing data and feasible to reach.

优点

The method's main idea is sound: sampling goals that are feasible but underexplored makes sense.
The authors did a good job explaining the algorithm and training objectives. They also made the distinction between EC2 and EC2-G clear.
The experiments are done in a good span of different domains. This shows that the exploration strategy is not tailored only for a certain application, such as navigation.
Visualizations of the goals picked by the CE2 and relevant baselines are insightful.

缺点

It seems that some of the tasks haven't been trained to convergence. This makes it harder to draw definitive conclusions on the method's sample efficiency or final performance.
The method experiments appear to be done in state-based environments only. Since Dreamer is a strong algorithm for vision-based control, I'm curious how this approach performs with image observations. Would additional challenges arise for feature learning?

问题

Equation 7 performs imaginary rollouts toward the goal candidates. How is $T$ chosen here?
In the ablation studies, the relative performance of CE2 and CE2-noPEG seems stochastic. Why would the inclusion of exploration value estimate hurt performance? Does this mean that the exploration value function is under-trained?

局限性

Yes, there's a dedicated section on page 15.

作者回复

2024-08-07

We appreciate your insightful feedback and constructive comments!

R1. It seems that some of the tasks haven't been trained to convergence. This makes it harder to draw definitive conclusions on the method's sample efficiency or final performance.

We thank the reviewer for the suggestion. In the global response, we have provided updated training results over extended environment interaction steps for our benchmarks in Fig. 1 of the attached PDF. Our method continues to outperform the baselines regarding both sample efficiency and final task success rates.

R2. The method experiments appear to be done in state-based environments only. Since Dreamer is a strong algorithm for vision-based control, I'm curious how this approach performs with image observations. Would additional challenges arise for feature learning?

Our method, CE $^2$ , has so far only been evaluated in state space. A promising extension would be to handle image observations by optimizing goals in a compact latent space. This would likely require only minor adjustments to our code, as CE $^2$ (built upon Dreamer) already learns a latent space from state observations for goal command selection. We plan to explore this extension in future work.

R3. Equation 7 performs imaginary rollouts toward the goal candidates. How is $T$ chosen here?

In our implementation, we set $T$ to half of the maximum episode length for all environments. The time limits for both the Go and Explore phases during real environment exploration are also set to this value. We will clarify this in the paper.

R4. In the ablation studies, the relative performance of CE2 and CE2-noPEG seems stochastic. Why would the inclusion of exploration value estimate hurt performance? Does this mean that the exploration value function is under-trained?

Block Rotation is the only environment where CE $^2$ -noPEG outperforms CE $^2$ . In CE $^2$ , our method selects goals by identifying states with the highest exploration value estimate sampled from the latent state cluster boundaries to initiate our Go-Explore procedure. Since we consider unsupervised exploration, the test goal distribution is not available to the CE $^2$ agent during training. In the Block Rotation environment, the CE $^2$ agent often pursues states where the block falls from the palm, due to their "high" exploration potential determined by the exploration policy value functions. In contrast, the CE $^2$ -noPEG agent explores the state space more evenly, gaining more in-hand manipulation skills, which is crucial for achieving the block-rotation goals revealed at test time. CE $^2$ outperforms CE $^2$ -noPEG in maze navigation environments (e.g., PointMaze and AntMaze) because it can use the exploration value estimates to escape the dead ends in mazes. We will incorporate this discussion into the paper.

2024-08-14

I appreciate the authors for their response and for performing extended experiments. The discussion on CE2 vs CE2-noPEG is insightful, especially on how sometimes unsafe states can lead to high exploration potentials.

My questions have been answered, and I will keep my good rating.

审稿意见

评分: 7置信度: 32024-07-31

The authors propose a method for model-based exploration based on exploring reachable trajectories. The expand upon the Dreamer algorithm by optimizing the encoder to encode a notion of distance between state and optimizing for a likely reachable goal when training in the learned world model. The method is evaluated on a wide range of tasks, including dexterous manipulation tasks, outperforming other baselines in final performance.

优点

The paper is clear and the algorithm well-motivated. The strong experimental performance in the complex tasks is promising and they provide nice visualizations of their algorithm versus others and over time.

缺点

One thing I would have like to see is some analysis of the additional computation time required to optimize for the goal states versus other methods. If it is prohibitively expensive, then the wall time could still be much longer than other methods.

问题

Why do you think the method is not able to achieve 100% on pen rotation?

局限性

Yes, on computational costs and if it can be applied in model-free settings

作者回复

2024-08-07

We appreciate your insightful feedback and constructive comments!

R1. One thing I would have like to see is some analysis of the additional computation time required to optimize for the goal states versus other methods. If it is prohibitively expensive, then the wall time could still be much longer than other methods.

We compared the computation time needed to optimize goal states for launching the Go-Explore procedure among our tool CE $^2$ and the baseline methods MEGA and PEG in the 3-Block Stacking environment. The average wall clock time are recorded in the table below:

Method	Seconds / Episode
CE $^2$	0.56
PEG	0.53
MEGA	0.47

The results indicate that there is little difference in speed between methods. CE $^2$ adds only a minimal overhead in generating candidate goal states compared to the other goal selection baselines. We will include this discussion and the complete results for each environment in the revised paper.

R2. Why do you think the method is not able to achieve 100% on pen rotation?

Pen Rotation is particularly challenging due to the pen's thin structure, which requires precise control to prevent it from dropping. We intended to convey that this is the most difficult benchmark (with 61 observation space dimensions and 20 action space dimensions) in our test suite. We will clarify this in the revised paper.

评论- Rebuttal Response

2024-08-13

I thank the authors for taking the time to answer my questions!

作者回复

2024-08-07

We sincerely appreciate the valuable feedback and suggestions from the reviewers. This global rebuttal includes a PDF file with updated training results and ablation study findings over extended training steps for our benchmarks (suggested by Reviewer NgsZ). Our method consistently outperforms the baselines in both sample efficiency and final performance.

We will address each reviewer's concerns in the individual review responses.

最终决定Accept (poster)

2024-09-25

The paper Builds on Go-Explore principle and Dreamer framework using a two-step goal selection process. It introduces Cluster Edge Exploration (CE2) for goal-conditioned reinforcement learning, All reviewers agree that the paper is written clearly and provides an elegant and sound algorithmic idea. Further, strong experiments are provided across various domains. Initial concerns were on limited novelty (wrt Go-Explore), thoroughness of running the experiments, and importantly limitations to state based environments. Further additional computational overhead from temporal distance network training is also pointed out. Authors point out that the method is extensible to image based environments. We recommend that the authors consider performing initial experiments to that claim and also add related work in the space of image based exploration to the papers. In particular possible papers could be: Lexa, LEAF, ExPLORe

However, the paper presents a technically interesting idea with sufficient evaluations. All reviewers lean accept, and the Metra-reviewer concurs. The authors are advised to review the final comments, and update the manuscript accordingly.