Learning World Models for Unconstrained Goal Navigation
Abstract
Reviews and Discussion
The authors aim to improve learning in environments with sparse rewards by leveraging world models; this is achieved by proposing a novel exploration algorithm that as a result allows to create a richer buffer of experiences that are more appropriate for learning an accurate world model.
Strengths
The paper is well organized and generally easy to follow. Tackling environments with sparse rewards is also important in RL.
Weaknesses
- in line 123 it is stated that the representation might be inaccurate when moving backwards; it would be useful to elaborate on why moving backwards needs to be taken into account, and not only forward
- for completeness and better interpretability of the results, it would be interesting to see the world models that result as representations of the different states
Some minor issues:
- figures are referred to both as "Fig." and "Figure" (e.g., line 117)
- model-based RL is written both as "model-based RL" (line 115) and "Model-based RL" (line 117)
Questions
- in figure 2 and section 3.1 the advantages of the bidirectional replay buffer are described. From the figure it is possible to see that with a bidirectional buffer there is a much wider range of possible trajectories covered. However, if an optimal trajectory can be found with a normal buffer, as illustrated in the figure, is there a need to learn a more complex space if the previous is enough to solve the task optimally?
- how reasonable is it to find subgoals based on the FPS algorithm? Just because there is a lot of variation in actions, does it mean that these states correspond to different subgoals?
- how do the authors define what a reasonable number of subgoals over an episode is?
- finding subgoals is a popular topic in reinforcement learning; it is stated that the current approach is based on the fact that key actions are needed at certain stages; is this easily generalizable? It sounds like this can depend a lot on the environment and might not be very accurate in many cases
Limitations
N/A
We appreciate your insightful feedback and constructive comments!
R1. In line 123 it is stated that the representation might be inaccurate when moving backwards; it would be useful to elaborate on why moving backwards needs to be taken into account, and not only forward
We will revise the paper to improve the clarity. In model-based RL, a control policy is trained over imagined rollouts generated by a learned world model. A major issue is that the world model may generate hallucinated trajectories, leading to a significant discrepancy between the policy’s behavior in the model and in the real environment. Retraining the world model with new forward rollout samples from the real environment does not necessarily prevent these hallucinations. MUN addresses this by learning a world model that characterizes state transitions between any states within the replay buffer, thereby minimizing hallucinations if the replay buffer is representative. Tables 1 and 2 in the global response quantitatively measure the world model quality of MUN and GC-Dreamer, which only learns from forward rollout samples, across our benchmarks. The world models trained by MUN show a much smaller generalization gap to the real environment compared to GC-Dreamer. Consequently, MUN can effectively leverage these world models to train control policies that generalize well to the real environment.
R2. In figure 2 and section 3.1 the advantages of the bidirectional replay buffer are described. From the figure it is possible to see that with a bidirectional buffer there is a much wider range of possible trajectories covered. However, if an optimal trajectory can be found with a normal buffer, as illustrated in the figure, is there a need to learn a more complex space if the previous is enough to solve the task optimally?
In model-based RL, the replay buffer (visualized in Figure 2) is not used directly to train a policy. Instead, it is used to train a world model, and the policy is trained over imagined rollouts generated by this model for sample efficiency. Consequently, having a few optimal trajectories in the replay buffer does not guarantee that the policy trained on the world model is optimal or generalizes well to the real environment. MUN enriches the replay buffer with state transitions by moving backward along recorded trajectories or across different trajectories. This approach trains high-quality world models that generalize well to the real environment (see R1 for more details) and uses these world models to learn optimal control policies.
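To make the distinction concrete, here is a minimal sketch of the generic model-based RL loop described above (our own illustrative Python pseudocode, not the authors' implementation; all component names are placeholders). The key point for this question is that the policy update in step 3 consumes only imagined rollouts, so optimal trajectories in the replay buffer help the policy only indirectly, through the world model they induce.

```python
def train_model_based_agent(env, policy, world_model, replay_buffer,
                            collect_episode, num_iters=1000):
    """Generic model-based RL loop; all components are duck-typed placeholders."""
    for _ in range(num_iters):
        # 1) Collect real-environment experience. MUN enriches this step by
        #    directing the agent between sampled subgoal states, which adds
        #    backward and cross-trajectory transitions to the buffer.
        replay_buffer.add(collect_episode(env, policy))

        # 2) Fit the world model to transitions stored in the buffer.
        world_model.update(replay_buffer.sample_batch())

        # 3) Train the policy purely on imagined rollouts generated by the
        #    model; the policy never consumes buffer trajectories directly.
        policy.update(world_model.imagine_rollouts(policy))
```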
R3. For completeness and better interpretability of the results, it would be interesting to see the world models that result as representations of the different states.
Please refer to the PDF in the global rebuttal. Figures 1(h) and 1(i) illustrate both the imagined and real environment trajectories for 3-Block Stacking and Pen Rotation, starting from the same initial state. Among the baselines, MUN demonstrates the smallest compound model error. (See R1 and Tables 1 and 2 in the global rebuttal for more quantitative comparisons).
R4. How reasonable is it to find subgoals based on the FPS algorithm? Just because there is a lot of variation in actions, does it mean that these states correspond to different subgoals? Finding subgoals is a popular topic in reinforcement learning; it is stated that the current approach is based on the fact that key actions are needed at certain stages; is this easily generalizable?
The heuristic we use for subgoal generation is inspired by and most applicable to object manipulation and navigation tasks. We will clarify this limitation in the revised paper. In navigation tasks, changing the robot's direction inherently requires distinct actions compared to simply moving the robot in a single direction. Similarly, in manipulation tasks, actions such as moving the gripper to contact an object, pushing the object, and moving the gripper away upon goal reaching are all distinct. States resulting from the most distinct actions are likely subgoals, representing significant changes in events.
R5. How do the authors define what a reasonable number of subgoals over an episode is?
We define the number of subgoals N_s within an episode to be 2. In our experience, setting N_s greater than 2 does not improve training performance. Consider the task of block stacking with 3 potential subgoals found by DAD in Figure 5: (B) moving the grippers over a block, (C) closing the grippers to hold the block, and (D) moving the block to its goal region. Learning how a policy transitions from (B) to (C) and then from (C) to (D) is sufficient to solve the task; it is unnecessary to specifically learn the full transition from (B) to (C) to (D) within one episode. Based on our experience, setting N_s to 3 slows down the training performance compared to our default value of N_s = 2. Please refer to the global response (the MUN-NS-3 ablation) for the quantitative comparisons. We will incorporate this discussion into the paper.
The paper introduces MUN, a novel goal-directed exploration algorithm for model-based reinforcement learning. MUN focuses on improving the quality of world models by learning to predict state transitions between any pair of states in the replay buffer. This is achieved by training the world model on a bidirectional replay buffer and identifying key subgoal states. By doing so, MUN enhances the generalizability of the learned world model and leads to more effective policy learning.
Key Contributions:
Bidirectional World Model: Trains the world model to predict state transitions both forward and backward along trajectories, improving generalization.
Key Subgoal Identification: Proposes a method called DAD to identify key subgoals, which are crucial for complex tasks.
Improved Exploration: Demonstrates superior performance compared to baseline methods in various challenging environments.
Overall, the paper presents a promising approach to improving the effectiveness of model-based reinforcement learning. Addressing the mentioned limitations and exploring potential extensions could further enhance the contribution of this work.
Strengths
Novel Approach: Addresses the limitations of traditional world models by focusing on bidirectional learning and key subgoal identification.
Strong Empirical Results: Shows significant improvement over baseline methods in various complex environments.
Clear Explanation: Provides a detailed description of the proposed method and its components.
The codebase is shared and supplementary provides detailed report on experiments and environments.
Weaknesses
Limited Baselines: While the paper compares to several baselines, a more comprehensive comparison with other state-of-the-art model-based RL methods could strengthen the claims. Real-world environments, both indoor and outdoor, could also have been explored.
Hyperparameter Sensitivity: The paper does not discuss the sensitivity of the method to hyperparameter tuning.
Lack of Theoretical Analysis: There is no theoretical analysis of the proposed method's convergence or optimality guarantees.
Questions
How does the performance of MUN scale with the complexity of the environment (e.g., larger state and action spaces)?
Have the authors explored different methods for identifying key subgoals, and how do they compare to DAD?
What is the computational overhead of training a bidirectional world model compared to a traditional unidirectional model?
How does the choice of the world model architecture (RSSM) impact the performance of MUN?
Can the authors provide a more detailed analysis of the ablation study comparing MUN with MUN-noDAD?
Limitations
The paper has Section B of the appendix to handle this. No negative societal impact for this work.
We appreciate your insightful feedback and constructive comments!
R1. Limited Baselines
We appreciate the reviewer’s suggestion. Our world model training strategy is compatible with any modern model-based RL framework. In this paper, we integrate it with the Dreamer framework due to Dreamer’s efficiency in optimizing policies directly in the latent space. We plan to explore integration with other model-based RL frameworks, such as MBPO and PlaNet, in a future extension of this work.
R2. Hyperparameter Sensitivity
We implement MUN on top of the authors’ code from DreamerV2 (Hafner et al., 2021) and LEXA (Mendonca et al., 2021). We used the default hyperparameters for training the world model, policies, value functions, and temporal reward functions. In addition, MUN has three key hyperparameters: (1) the number of key subgoals to sample from the replay buffer using the DAD algorithm, (2) the number of subgoals N_s to sequentially explore during each training episode, and (3) the maximum number of timesteps allocated for navigating to a specific subgoal. MUN is not sensitive to the number of key subgoals, and we set it to 20 for the most challenging benchmarks, such as 3-Block Stacking, Block Rotation, and Pen Rotation, to provide sufficient candidate subgoals. For N_s, we found that setting it to 3 slows down the training performance compared to our default value of N_s = 2; see the global response (the MUN-NS-3 ablation) for quantitative comparisons. For the per-subgoal timestep budget, we simply set it to the maximum episode length divided by N_s. More details can be found in Table 2 of the appendix.
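For reference, the three hyperparameters and the values stated above can be summarized in a small configuration sketch (Python; the key names and the episode length are our own illustrative choices, not identifiers from the released code):

```python
MAX_EPISODE_LENGTH = 200      # illustrative value; the actual length is task-dependent
N_SUBGOALS_PER_EPISODE = 2    # N_s, MUN's default

mun_hyperparams = {
    "num_key_subgoals": 20,   # sampled by DAD; MUN is not sensitive to this value
    "num_subgoals_per_episode": N_SUBGOALS_PER_EPISODE,
    "timesteps_per_subgoal": MAX_EPISODE_LENGTH // N_SUBGOALS_PER_EPISODE,
}
```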
R3. What is the computational overhead of training a bidirectional world model compared to a traditional unidirectional model?
MUN-noDAD periodically directs the agent to reach goal states sampled either from the replay buffer or from the environment's goal distribution, while standard model-based RL typically samples only from the latter. This goal state setting difference in MUN does not add computational overhead compared to standard model-based RL.
MUN introduces additional overhead compared to MUN-noDAD, as the DAD algorithm employs the Farthest Point Sampling (FPS) technique for identifying key subgoals from a batch of samples in the replay buffer. FPS is computationally efficient, with a time complexity of O(nk), where n is the batch size and k is the number of key subgoals to discover (a constant).
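To make the complexity claim concrete, below is a minimal farthest point sampling sketch over a batch of action vectors (our own illustrative implementation, not the authors' code). Each of the k selection rounds scans the n candidates once, hence the O(nk) cost; in DAD, the states that triggered the selected actions would then serve as candidate key subgoals.

```python
import numpy as np

def farthest_point_sampling(actions, k):
    """Select k mutually distant action vectors from a batch of shape (n, action_dim)."""
    n = len(actions)
    selected = [np.random.randint(n)]                   # start from a random point
    # distance from every action to the nearest selected action so far
    dist = np.linalg.norm(actions - actions[selected[0]], axis=1)
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                      # farthest from the current selection
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(actions - actions[idx], axis=1))
    return selected
```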
R4. How does the performance of MUN scale with the complexity of the environment (e.g., larger state and action spaces)?
As mentioned in R3, MUN does not add significant overhead to standard model-based RL baselines such as Dreamer. MUN scales in accordance with Dreamer's handling of environmental complexity. The FPS algorithm, used to discover key subgoal states in DAD, operates in action spaces. Since the dimensionality of action spaces is typically smaller than that of state spaces, DAD remains scalable to environments with high-dimensional state spaces.
R5. Have the authors explored different methods for identifying key subgoals, and how do they compare to DAD?
We considered applying DAD to discover key subgoals in the state spaces instead of the action spaces. Please refer to our global response (the MUN-KeyObs ablation) for details.
R6. How does the choice of the world model architecture (RSSM) impact the performance of MUN?
MUN improves the quality of a world model through its enriched replay buffers and a novel goal-sampling strategy. It is not tied to a specific world model architecture and can be applied to any model-based RL framework. We will clarify this in the paper.
R7. Can the authors provide a more detailed analysis of the ablation study comparing MUN with MUN-noDAD?
The ablation MUN-noDAD replaces the goal sampling strategy DAD (Algorithm 2) with a simpler method that chooses goal states at fixed time intervals in trajectories sampled from the replay buffer. This ablation investigates the importance of identifying key subgoal states. It seeks to determine whether training world models from state transitions between these key subgoal states in MUN is essential, or if using any states from the replay buffer would suffice.

As shown in Fig. 4, MUN outperforms MUN-noDAD on all of our benchmarks, particularly in the challenging 3-block stacking and high-dimensional block rotation environments. Fig. 7 (page 20) depicts the subgoals identified by the DAD algorithm during the training process for these benchmarks. For 3-block stacking, DAD identifies critical subgoals including block grasping, lifting, horizontal movement, vertical movement, and gripper release. For block rotation, DAD identifies crucial subgoals for finger movements that enable rotation. These results demonstrate that DAD can discover the high-level task structure, which is important for world modeling.

Tables 1 and 2 in the global response quantitatively measure the world model one-step and compound errors of MUN and MUN-noDAD. The world models trained by MUN exhibit a smaller generalization gap to the real environment compared to those trained by MUN-noDAD. Consequently, MUN can effectively utilize these higher-quality world models to train control policies that generalize better to the real environment.

We evaluated MUN and MUN-noDAD on tasks not encountered during training to assess the model generalizability, as discussed in Sections 4.5 and F.5 of the appendix. These novel tasks feature different initial state and final goal configurations compared to the training tasks. MUN achieved an average success rate improvement of 16.5% over MUN-noDAD on these unseen tasks, as shown in Tables 3, 4, and 5 in the appendix. We will incorporate this discussion into the paper.
Thanks for providing the rebuttal and I have gone through all the comments and reviews. I also ask the authors to address the concerns raised by Reviewer Rfj2, mainly more details of the method and the philosophy beyond the success shown. I am maintaining the positive rating.
We sincerely thank the reviewer for their support and constructive feedback on our paper.
We will revise the paper to enhance the clarity of our main learning algorithm, Algorithm 1. (1) We will introduce clear markers to differentiate between the steps conducted in the world model and those in the real environment. Specifically, we will use a marker to indicate that from Line 4 to Line 18, the MUN strategy (enriching the replay buffer with state transitions by moving backward along recorded trajectories or across different trajectories) is applied to explore the environment and collect trajectories for training the world model. Then, at Line 19, we will mark where the model is exploited to train a policy using imagined trajectories. (2) Upon careful review, we find Algorithm 1 to be self-contained and would greatly appreciate Reviewer Rfj2’s further input regarding any concerns about undefined variables. Nonetheless, we will revise the presentation by adding additional comments within the algorithm and the accompanying text to clarify the purpose of each variable and how they relate to the definitions provided in Section 3: Problem Setup.
We will also improve our explanation of the philosophy behind MUN's effective environment exploration strategy. MUN is motivated by a key challenge in world model learning—the potential generation of hallucinated trajectories, which can create a significant discrepancy between the policy’s behavior in the model and in the real environment. We will clarify in Section 3.1 that MUN addresses this by learning a world model that characterizes state transitions between any subgoal states within the replay buffer, thereby reducing hallucinations. We will use Tables 1 and 2 in the global response (as well as visualized trajectories and model error curves throughout the training steps) to quantitatively compare the quality of world models trained by MUN with those trained by the baselines that rely solely on forward rollout samples. This comparison demonstrates that MUN's effectiveness stems from its significantly smaller generalization gap to the real environment compared to the baselines. We will emphasize that MUN effectively leverages these higher-quality world models to train policies that generalize better to the real environment. We will clarify the concerns regarding the assumptions about the symmetry of the problem setup and also elaborate on how MUN can be applied to non-prehensile manipulation tasks, as outlined in the author responses.
This paper introduces a novel goal-directed exploration algorithm called MUN to address the challenges of efficient exploration in long-horizon, sparse-reward environments within the context of goal-conditioned reinforcement learning (GCRL). The key insight is that improving the generalizability of learned world models can significantly enhance the agent's capacity to explore and navigate in the real environment. MUN focuses on training world models that can accurately characterize state transitions between arbitrary subgoal states, whether by retracing along recorded trajectories or transitioning between states on separate trajectories. Additionally, the paper presents a practical strategy called DAD (Distinct Action Discovery) to identify pivotal subgoal states that represent critical milestones in completing complex tasks. By training world models and policies for seamless transitions between these key subgoals, MUN is able to learn generalizable goal-conditioned policies that can adapt to novel test scenarios.
Strengths
- This paper is well-written with thorough experimental analysis.
- The DAD method for identifying pivotal subgoal states is a practical and effective strategy for enhancing policy generalization.
Weaknesses
- MUN does not seem to outperform the baselines significantly, as seen in Figure 5.
- DAD as an acronym conflicts with another work in the same literature, "Dynamic-Aware Discovery of Skills" (Sharma et al., 2019).
- The distinction from the Go-Explore approach is not super clear. A pictorial explanation may help a lot. The "Comparison with Go-Explore" paragraph is a bit dense. Given that N_s is always set to 2 in the paper, it's not clear how big of a distinction this makes.
- Ablations are lacking besides MUN-noDAD; there are many moving pieces in the algorithm, and it would be nice to see their effects on the final method performance.
- The paper uses the phrasing "as the quality of the world model improves" many times, but it's not clear what that means in this paper; no prediction quality of the dynamics model was studied.
Questions
Questions and concerns are stated above.
Limitations
Yes.
We appreciate your insightful feedback and constructive comments!
R1. MUN does not seem to outperform the baselines significantly as seen in Figure 5 (Figure 4?).
Figure 4 might give a misleading impression because we blend the baselines and ablations in the same plot. In this figure, the method with performance closest to MUN is our ablation, MUN-noDAD. We will revise the paper to separate the comparisons between the baselines and the ablations to more clearly demonstrate the performance advantages of MUN.
The final performance in terms of task success rates of MUN and the baselines (evaluated over 100 seeds) is summarized below:
| Env/Method | MUN | PEG-G | MEGA-G | GC-Dreamer |
|---|---|---|---|---|
| Ant Maze | 97% | 80% | 87% | 76% |
| Point Maze | 100% | 90% | 100% | 22% |
| Walker | 91% | 79% | 83% | 72% |
| 3-Block Stacking | 98% | 61% | 39% | 70% |
| Rotate Block | 86% | 70% | 71% | 62% |
| Rotate Pen | 72% | 60% | 45% | 49% |
It can be seen that MUN outperforms the baselines by a large margin in challenging environments such as 3-Block Stacking, Block Rotation, and Pen Rotation (with 61 state dimensions and 20 action dimensions).
R2. DAD as an acronym conflicts with another work in the same literature, "Dynamic-Aware Discovery of Skills" (Sharma et al., 2019)
Thanks for pointing this out. We will rename DAD in the revised paper.
R3. The distinction from Go-Explore approach is not super clear.
A Go-Explore agent in training is directed to traverse to a goal state with high exploration potential, sampled from the replay buffer. Once at this goal state, the agent switches to a separate exploration policy to further explore the environment, which is trained to maximize an intrinsic exploration reward. In contrast, a MUN agent eliminates the need for a separate exploration policy and intrinsic exploration rewards. It is directed to sequentially traverse two goal states sampled from the replay buffer to explore the environment during each training episode. The trajectories collected from this exploration are used to train a world model, helping to close the generalization gap between the policy’s behavior in the real environment and the world model. This enables the MUN agent to effectively exploit the model for learning a policy that generalizes well to the real environment. We will visualize these key differences in the revised paper.
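As a rough illustration of this difference, the following sketch shows one MUN training episode as we describe it above (illustrative Python pseudocode, not the released implementation; a classic Gym-style env.step interface is assumed). Note that the same goal-conditioned policy is used throughout, with no separate exploration policy or intrinsic reward.

```python
def mun_training_episode(env, goal_policy, replay_buffer, sample_subgoals,
                         timesteps_per_subgoal):
    """Sequentially pursue two subgoals sampled from the replay buffer."""
    trajectory, done = [], False
    obs = env.reset()
    for goal in sample_subgoals(replay_buffer, k=2):    # e.g. key subgoals found by DAD
        for _ in range(timesteps_per_subgoal):
            action = goal_policy.act(obs, goal)
            next_obs, reward, done, info = env.step(action)
            trajectory.append((obs, action, next_obs, goal))
            obs = next_obs
            if done:
                break
        if done:
            break
    # The collected (possibly backward / cross-trajectory) transitions are
    # added to the buffer and later used to train the world model.
    replay_buffer.add(trajectory)
    return trajectory
```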
R4. Ablations are lacking besides MUN-noDAD; there are many moving pieces in the algorithm, and it would be nice to see their effects on the final method performance.
We appreciate the reviewer's great suggestion. We included additional ablation studies in the global response (the MUN-KeyObs and MUN-NS-3 ablations), which will be incorporated into the paper.
R5. The paper uses the phrasing "as the quality of the world model improves" many times, but it's not clear what that means in this paper; no prediction quality of the dynamics model was studied.
This is a great point! We included a comparison of one-step model prediction errors and compound model errors between MUN and the baselines in the global response Tables 1 and 2, which will be incorporated into the paper.
After rebuttal the score was changed from reject to weak accept.
The paper proposes to use the experiences stored in an RL replay buffer differently when training a world model. The change is to attempt to not only use the experience in a "forward" direction but also a "backward" and "across traces" manner. The hope is that this would result in a world model capable of better predicting the behavior of the real world when generalizing to novel trajectories. A secondary addition is a heuristic to select exploration goals which relies on the assumption that large state space differences correspond to meaningful subgoals.
Strengths
- Improving the generalization capability of the world model is a sensible approach
Weaknesses
- Strong implied assumptions about symmetry of the problem setup not considered or discussed.
- Assumptions of the proposed solutions not discussed.
- Details of the method remain at a very high level
The core idea of the paper is to improve the world model by using replay buffer trajectories backwards or transitioning between states on different trajectories. If these transitions are correct, they will improve the quality of the world model as it has more data diversity. However, the paper provides no discussion or insight into how these transitions can be verified to be correct. The proposed approach can work, however, only for specific kinds of problem setups where the transitions are symmetric. For tasks such as cube stacking with pose-based states this idea is reasonable. For many other tasks that include real physical dynamics, such as driving, pouring liquids, or non-prehensile manipulation, it is unlikely that this approach would produce good results. This limitation has to be clearly addressed and explored in the paper. It doesn't mean that the idea is invalid, but without a thorough discussion of it the paper is incomplete.
The subgoal discovery heuristic similarly makes the assumption that distant points in the state space correspond to crucial key states for a task. While the experiments show that it can be beneficial, there is no theoretical or intuitive evidence provided that this should work in general. Whether this heuristic is beneficial will heavily depend on the design or embedding of the state space. While one can follow the basic idea behind the heuristic, more rigorous discussion and theoretical work is required on this front.
While the high-level ideas of the paper are easy to understand and follow, the detailed information is lacking or not clear enough. While the algorithms are okay, some of the variables used within them, especially in Algorithm 1, are not described, and it is not entirely clear which steps are done in the world model and which in the real/simulated environment.
The experiments show how the proposed approach can outperform the selected baselines in certain tasks. Despite this, there are several aspects that I would have expected the experiments to validate. The first is an evaluation of the world model. The main idea and contribution of the paper is a mechanism to build a better world model, yet the quality of the world model is never properly evaluated. Figure 6 attempts this to some extent on the task most amenable of all to the implied assumptions. As such, a thorough evaluation of the quality of the world model itself would be needed as part of the discussion and evaluation of the method's assumptions and limitations. Another is the evaluation of the exploration goal selection mechanism. While the experiments compare with methods utilizing approaches in the spirit of Go-Explore, there is no ablation of the proposed system using that mechanism. As such, it is impossible to assess what, if any, benefit that strategy has, as the results are confounded with the modified world model. An interesting aspect the plots in Figure 4 do not show is the performance achievable by different methods upon convergence. While faster learning is always good, it is also interesting to know what performance difference can be achieved at convergence. The evaluation of the subgoal discovery policy is unconvincing. While the figures show images of interesting steps, it is unclear that they are meaningful for the policy learning. A rigorous quantitative evaluation of this would be needed to gain insights into the heuristic.
Questions
- How does the proposed subgoal heuristic compare to more complex approaches such as the ones mentioned in the paper?
- Would it be possible to automatically detect whether the proposed backwards / sideways trace playback for world model learning is applicable to the given problem, such that the approach can be used in scenarios where its assumptions are met?
Limitations
The paper does not contain a limitations section in the body of the paper, and the limitation section in the appendix does not truly address the assumptions and limitations stemming from them outlined in this review.
We appreciate your insightful feedback and constructive comments!
R1. The subgoal discovery heuristic assumes that distant points in the state space correspond to crucial key states. Whether this heuristic is beneficial will depend on the design or embedding of the state space
MUN does not directly find key subgoal states from the state space and, therefore, does not depend on the embedding of the state space for key subgoal discovery. The DAD strategy in MUN discovers actions that differ significantly along trajectories in the replay buffer and extracts the corresponding states triggering these actions as potential key subgoal states. Our strategy is based on the observation that the agent often performs distinct actions to trigger key subgoal states, e.g., closing the grippers to grasp objects. See R2 for more examples.
R2. Strong assumptions about symmetry of the problem setup not considered. For many other tasks that include real physical dynamics, such as driving, pouring liquids, or non-prehensile manipulation it is unlikely that this approach would produce good results
MUN is effective for non-prehensile manipulation tasks. To demonstrate this, we applied MUN to the Gymnasium Fetch Slide task, where a manipulator slides a puck to a target on a slippery table. This environment has asymmetric state transitions: when the puck is slid outside the robot's workspace, the manipulator cannot reach the puck's position to slide it backward due to physical constraints. The training performance in terms of task success rates is depicted in Fig. 1(g) in the global response. MUN significantly outperforms the goal-conditioned Dreamer baseline. Upon examination, we found that MUN with the DAD strategy can discover key subgoals for this task, such as contacting the puck, placing the manipulator at different angles around the puck, and stabilizing the manipulator upon reaching the goal (these key states result from distinct actions). MUN enables learning state transitions between these key subgoals to discover the high-level task structure. It learns a generalizable world model that handles sliding the puck between any positions within the workspace and predicts low probabilities for infeasible transitions from puck positions outside the workspace. In particular, it enables the agent to hit the puck multiple times if it is within its workspace, thereby improving task success rates. That said, the current goal selection mechanism in MUN lacks a mechanism to filter out infeasible goals from a current state, potentially affecting sample efficiency. Please refer to R3 for a possible solution. We will clarify this limitation in the revised paper.
R3. Would it be possible to automatically detect whether the proposed backwards / sideways trace playback for world model learning is applicable for the given problem?
Yes, we can track the learning progress of reaching a goal state from a given state by monitoring changes in the expected discounted returns (value functions) in the learned world models across training episodes. We could then sample goal states based on their learning progress in the goal sampling step at Line 10 of Algorithm 1. If progress towards a goal stagnates, due to factors like asymmetric state transitions, we can reduce its selection probability. We plan to explore this approach in future work.
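A possible instantiation of this idea is sketched below (our own illustrative formulation of the proposed future direction, not an implemented component of MUN): each candidate goal keeps a short history of its value estimates, and goals whose estimates have stagnated are sampled with lower probability.

```python
import numpy as np

def goal_sampling_weights(value_history, eps=1e-6):
    """value_history maps each candidate goal id to a list of value estimates
    (expected discounted returns under the world model) recorded over recent
    training episodes. Goals with little change receive lower probability."""
    goals = list(value_history.keys())
    progress = np.array([abs(v[-1] - v[0]) for v in value_history.values()],
                        dtype=float)
    weights = progress + eps        # keep every goal selectable with small probability
    return goals, weights / weights.sum()
```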
R4. Some of the variables used within Algorithm 1 are not described and it is not clear which steps are done in the world model and which in the real environment.
Line 19 in Algorithm 1 is the only step we perform simulation in a world model for policy training. The other steps occur in the real environment. We will improve the clarity of Algorithm 1.
R5. Evaluation of the world model
Please see Tables 1 and 2 in the global response.
R6. Figure 4 does not show the performance upon convergence
We have extended the training steps. Fig. 1(a)-(f) in the global response shows that MUN outperforms the baselines in both final performance and sample efficiency.
R7. No ablation of the exploration goal selection mechanism
Please see the additional ablations (MUN-KeyObs and MUN-Ns-3) in the global rebuttal.
R8. How does the proposed subgoal heuristic compare to more complex approaches such as the ones mentioned in the paper?
The prior work (Paul et al. 2019) learns subgoals from expert trajectories, which we do not assume are available. Please refer to our global response (the MUN-KeyObs ablation) for the quantitative comparison with (Zhang et al. 2021) which directly discovers key subgoals scattered across the state space.
R9. The evaluation of subgoal discovery is unconvincing. It is unclear whether the subgoals are meaningful for policy learning.
To evaluate DAD (the subgoal discovery Algorithm 2), we designed the MUN-noDAD ablation to assess how the subgoals improve policy training. MUN-noDAD replaces DAD with a simpler method that selects goal states at fixed time intervals in trajectories from the replay buffer. As shown in Fig. 4, MUN outperforms MUN-noDAD across all benchmarks, especially in the challenging 3-block stacking and high-dimensional block rotation environments. Tables 1 and 2 in the global response illustrate that the world models trained by MUN have a smaller generalization gap to real environments compared to those trained by MUN-noDAD. Consequently, MUN's superior world models enable more effective training of control policies that generalize better to real-world scenarios. We further evaluated MUN and MUN-noDAD on tasks not encountered during training to assess the model generalizability, as discussed in Sections 4.5 and F.5 of the appendix. These novel tasks feature different initial state and final goal configurations compared to the training tasks. MUN achieved an average success rate improvement of 16.5% over MUN-noDAD on these unseen tasks, as shown in Tables 3, 4, and 5 in the appendix. We will revise the paper to clarify the comparison with MUN-noDAD.
The replies and additional results provided by the authors are appreciated. Especially the world model quality evaluation and the ablation of the key state selection are welcome. I still have reservations about the general applicability of the proposed method due to not fully explored implied problem structures, and I wish the paper could show the boundaries of these so that the community could know when the proposed work is likely to be beneficial and in which scenarios it may fail to perform as expected. However, due to the additional information and ablations that provide some insight into these limits, I will raise my assessment to a weak accept.
One interesting observation is that the state space clustering key selection seems to be worse than the default approach, at least in one instance.
Dear Reviewer Rfj2,
We kindly ask if the information presented in our rebuttals sufficiently addresses your concerns. Should you have any additional questions or concerns, we would greatly appreciate the opportunity to address them before the discussion period concludes. Thank you once again for your insightful feedback and constructive comments!
Best regards,
The Authors
Thank you for considering our rebuttal and the newly presented experimental results. We will integrate the evaluation of the world model quality, the ablation study on key state selection, and the results from the new non-prehensile manipulation task into the paper. Additionally, we will strengthen the discussion on problem structures and the limitations of our approach as outlined in the rebuttals (e.g., emphasizing that MUN is primarily designed to address complex object manipulation and navigation tasks). We appreciate your advice for improving the quality of the paper.
We greatly appreciate the valuable feedback and suggestions provided by the reviewers! We will begin by addressing the concerns raised by the majority of the reviewers in the global rebuttal. We will address the concerns of each reviewer in the individual review responses.
1 Quantitative Measurement of World Model Prediction Quality (Reviewer Rfj2, BNZB, 24DF)
Table 1 shows the single-step prediction error of learned world models. We randomly sample state transition tuples within the replay buffers from all of our baselines (MUN, MUN-noDAD, GC-Dreamer, MEGA-G, and PEG-G) to form a validation dataset. Table 1 reports the mean squared error on this dataset.
| Environment | MUN | MUN-noDAD | PEG-G | MEGA-G | GC-Dreamer |
|---|---|---|---|---|---|
| Ant Maze | 1.6740 | 1.9751 | 2.1154 | 2.2416 | 2.9666 |
| Point Maze | 0.0013 | 0.0013 | 0.0014 | 0.0011 | 0.0032 |
| Walker | 0.8165 | 0.9971 | 1.4759 | 1.2353 | 2.1824 |
| 3-Block Stacking | 0.0070 | 0.0071 | 0.0476 | 0.0853 | 0.0392 |
| Rotate Block | 1.0570 | 1.5609 | 1.7753 | 1.9433 | 2.3723 |
| Rotate Pen | 0.6708 | 1.1999 | 1.9622 | 2.8598 | 1.8359 |
Table 2 shows the compounding error (multistep prediction error) of the learned world models, evaluated by generating simulated trajectories of the same length as the real ones. More specifically, assume a real trajectory of length $H$ is denoted as $\tau = (s_0, a_0, s_1, a_1, \dots, s_H)$. For a learned model $\hat{f}$, we set $\hat{s}_0 = s_0$ and generate a forward rollout $\hat{s}_1, \dots, \hat{s}_H$ where $\hat{s}_t = \hat{f}(\hat{s}_{t-1}, a_{t-1})$ for $t \ge 1$. The corresponding compounding error of $\hat{f}$ is then defined as $\frac{1}{H}\sum_{t=1}^{H} \lVert \hat{s}_t - s_t \rVert^2$. We set $H$ to be the maximum number of timesteps in our environments.
| Environment | MUN | MUN-noDAD | PEG-G | MEGA-G | GC-Dreamer |
|---|---|---|---|---|---|
| Ant Maze | 18.83 | 22.42 | 29.42 | 23.69 | 40.36 |
| Point Maze | 5.59 | 5.43 | 6.32 | 4.74 | 9.57 |
| Walker | 13.03 | 16.72 | 26.54 | 21.21 | 39.72 |
| 3-Block Stacking | 0.45 | 0.55 | 0.70 | 0.95 | 0.94 |
| Rotate Block | 11.55 | 12.86 | 14.38 | 14.13 | 15.06 |
| Rotate Pen | 4.63 | 6.10 | 7.40 | 9.85 | 9.36 |
In Tables 1 and 2, we used the final world models trained by all methods after the same number of environment interaction steps. These results provide a quantitative comparison of the world model prediction quality between MUN and the baselines across our benchmarks. The world models trained by MUN show a much smaller generalization gap to the real environment compared to goal-conditioned Dreamer (and the other baselines). Consequently, MUN can effectively leverage these world models to train control policies that generalize well to the real environment. This explains the superior task success rates of MUN compared to the baselines in our experiment.
We will revise the paper to include this result and plot the model error curve throughout the training steps.
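For concreteness, a minimal sketch of how the two metrics above could be computed is shown below (our own illustrative code; `model.predict(s, a)` is an assumed one-step prediction interface, not the authors' API).

```python
import numpy as np

def one_step_error(model, transitions):
    """Mean squared one-step prediction error over sampled (s, a, s_next) tuples (Table 1)."""
    errors = [np.mean((model.predict(s, a) - s_next) ** 2)
              for s, a, s_next in transitions]
    return float(np.mean(errors))

def compounding_error(model, states, actions):
    """Multistep (compounding) error over one real trajectory (Table 2): roll the
    model forward from the real initial state with the real action sequence and
    average the squared deviation from the real states."""
    s_hat, errors = states[0], []
    for t, a in enumerate(actions):
        s_hat = model.predict(s_hat, a)
        errors.append(np.mean((s_hat - states[t + 1]) ** 2))
    return float(np.mean(errors))
```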
2 Ablation Study for the Exploration Goal Selection Mechanism (Reviewer Rfj2, BNZB, aZux, 24DF)
We conducted the following ablation studies to investigate MUN's exploration goal selection mechanism:
- I. Number of Goal States (N_s): MUN sequentially traverses N_s goal states sampled from the replay buffer to explore the environment during each training episode. We introduced an ablation, MUN-Ns-3, that sets N_s = 3. The results in Fig. 1(d), (e), and (f) of the attached PDF show that setting N_s greater than 2 slows down the training performance, supporting our claim in the paper that it suffices to set N_s = 2.
- II. Key Subgoal Discovery: MUN discovers key subgoals for exploration as states in the replay buffer resulting from distinct actions in the action space. We provided an ablation, MUN-KeyObs, that discovers key subgoals directly from the state space as centroids of clusters of states in the replay buffer, following the strategy in [1]. The performance of this ablation, as shown in Fig. 1(d), (e), and (f) of the attached PDF, does not match MUN's, highlighting that discovering key subgoals in the action space is both simpler and more effective.
We will incorporate these ablation studies into our paper.
[1] Zhang, L., Yang, G., and Stadie, B. World Model as a Graph: Learning Latent Landmarks for Planning. ICML 2021.
In our rebuttal, we forgot to include the sizes of the validation datasets used to generate Tables 1 and 2.
For Table 1, we randomly sampled 10,000 state transition tuples from the replay buffers, ensuring an equal number of samples from each baseline (MUN, MUN-noDAD, GC-Dreamer, MEGA-G, and PEG-G), to form the validation dataset for measuring one-step model prediction errors.
For Table 2, we evaluated the compounding (multistep) prediction error of the learned world models by generating 500 trajectories for each benchmark, simulated on both the models and the real environments.
The paper introduces a goal-directed exploration algorithm for model-based reinforcement learning, designed to improve learning in sparse-reward, long-horizon environments. The approach learns world models by using experiences stored in the replay buffer not just in a forward direction but also backwards and across traces. This bidirectional training helps the world model better predict state transitions, even between arbitrary subgoal states. Another contribution is a heuristic that discovers novel actions and uses them to identify key subgoal states that are critical for completing complex tasks. By training on these subgoals, MUN improves exploration efficiency and leads to more effective goal-conditioned policies in novel scenarios.
Reviewers were generally positive about the approach, the novelty and the clarity of writing. However, there were initially some concerns about the paper, and the authors did a significant amount of work in the rebuttal phase. This included additional ablation studies, evaluating the quality of the learned model in isolation, as well as additional experiments on domains lacking the kinds of "symmetries" one might suspect the method requires (e.g. reversible dynamics). These results give further evidence to the strength and generality of the approach, although there are still open questions about under what conditions the method does (or does not) work. In particular, there seem to be many implicit assumptions that the conceptual ideas rely on, and without any theoretical foundation, it is difficult to understand when the method will succeed.
Nonetheless, I think that the additional experiments conducted during the rebuttal sufficiently address the concerns of the reviewers and therefore recommend acceptance. Authors are strongly encouraged to incorporate these results into the final version of the paper, as well as to add additional explanations to improve clarity (particularly around Algorithm 1).