ICLR 2024 (withdrawn)
Average rating: 4.3 / 10 (3 reviewers; ratings 3, 5, 5; min 3, max 5, std 0.9)
Average confidence: 4.0

MuDreamer: Learning Predictive World Models without Reconstruction

Submitted: 2023-09-16 · Updated: 2024-03-26
TL;DR

We present MuDreamer, a reinforcement learning agent that builds upon the DreamerV3 algorithm by learning a predictive world model without the need for reconstructing input signals.

Abstract

Keywords
Model-Based Reinforcement Learning · Dreamer · Reconstruction-Free

Reviews and Discussion

Review (Rating: 3)

This paper combines the methods from DreamerV3 and MuZero, and presents a new model named MuDreamer for visual reinforcement learning. The key contribution is to introduce a new world model architecture that involves the prediction of environment rewards, value functions, continuation flags, and inverse dynamics. The proposed model showcases a comparable performance to DreamerV3 in multiple domains, including DeepMind Control Suite and the Atari100k benchmark.

Strengths

  1. This paper is well-organized and easy to follow.
  2. The proposed model is extensively evaluated on widely used visual control benchmarks. It also provides comprehensive ablation studies to explore the effectiveness of each model component.
  3. The model achieves performance comparable to DreamerV3 and, as claimed by the authors, is more efficient in training time.

Weaknesses

  1. As stated by the authors, 'MuDreamer solves tasks without the need for a reconstruction loss.' However, this seems to be in contrast with the loss function described in Eq. (3), which still involves optimizing the image decoder with a reconstruction loss. If my understanding is accurate, the distinction from DreamerV3 lies in the fact that the gradient from the reconstruction loss doesn't back-propagate to the dynamics module. In light of this, I recommend that the authors consider revising the paper's title.
  2. The proposed model offers limited novelty when compared to DreamerV3. The introduction of inverse dynamics and continuation prediction loss is not a novel contribution in the field of model-based RL.
  3. While MuDreamer trains faster than DreamerV3, the difference in training time is relatively modest (4 hours vs. 4 hours and 20 minutes).
  4. The proposed model outperforms DreamerV3 in only 3 out of 26 games on Atari100k, which may not be sufficient to establish its overall effectiveness.

Questions

My main concerns are about the technical novelty and the experimental results. Please see my comments above.

Comment

We would like to thank the reviewer for the feedback.

Please find below our answer to your comments.

  1. About the reconstruction loss.

The reconstruction loss in Eq. (3) is optional and is only used to visualize the latent dynamics imagined by the model. As explained in the paragraph above Eq. (3) in Section 4.1, a stop-gradient operation sg(.) prevents the reconstruction loss gradients from being back-propagated to the rest of the network: "We optionally learn a decoder network to reconstruct the sequence of observations while using the stop gradient operator sg(.) to prevent the gradients from being back-propagated to other network parameters." The sg(.) operation is also shown in Eq. (3). Hence, the reconstruction loss is not required, as the title states.
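To make the role of sg(.) concrete, here is a minimal PyTorch sketch of an optional, detached decoder head; the module and variable names are hypothetical and this only illustrates the stop-gradient idea, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class DetachedDecoder(nn.Module):
    """Optional decoder trained only for visualization: gradients from the
    reconstruction loss never reach the world model (cf. sg(.) in Eq. (3))."""

    def __init__(self, latent_dim: int, obs_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ELU(),
            nn.Linear(256, obs_dim),
        )

    def loss(self, latent: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        # detach() plays the role of the stop-gradient operator sg(.):
        # only the decoder parameters receive gradients from this loss.
        recon = self.net(latent.detach())
        return ((recon - obs) ** 2).mean()
```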

  2. Paper novelty.

Inverse dynamics and continuation prediction losses are not claimed as contributions of our paper. We propose to learn hidden representations in a self-supervised manner using an action predictor network and a value predictor network. The action predictor predicts the sequence of actions leading to the observed environment change, while the value predictor predicts the discounted sum of future rewards of the sampled trajectory. We also show that batch normalization in the representation network is required to prevent the model latent state from collapsing to invariant features, where the standard deviation of the model categorical state logits converges to zero.
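As a rough illustration of these two heads, the sketch below shows how an action predictor and a value predictor could be attached to the latent states; the shapes, the concatenation of consecutive latents, and the plain cross-entropy/MSE losses are assumptions made for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PredictiveHeads(nn.Module):
    """Action and value heads that shape the latent space without reconstruction."""

    def __init__(self, latent_dim: int, num_actions: int):
        super().__init__()
        # Predicts the action a_{t-1} that led to the observed transition,
        # assumed here to condition on two consecutive latent states.
        self.action_head = nn.Linear(2 * latent_dim, num_actions)
        # Predicts the discounted sum of future rewards (value target).
        self.value_head = nn.Linear(latent_dim, 1)

    def losses(self, prev_latent, latent, prev_action, value_target):
        action_logits = self.action_head(torch.cat([prev_latent, latent], dim=-1))
        action_loss = F.cross_entropy(action_logits, prev_action)
        value_loss = F.mse_loss(self.value_head(latent).squeeze(-1), value_target)
        return action_loss, value_loss
```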

  3. Training time.

The difference in training time is indeed relatively modest. The number of parameters is also relatively small, and MuDreamer requires less than 4 GB of GPU memory during training. However, we anticipate MuDreamer to have a more significant impact on training time and memory requirements when scaling to larger encoder/decoder sizes.

Review (Rating: 5)

This paper introduces MuDreamer, an enhanced version of the DreamerV3 algorithm. MuDreamer eliminates the requirement of reconstructing input signals by learning a predictive world model that predicts the environment value function and previously selected actions. The importance of batch normalization in preventing learning collapse is highlighted, and the impact of KL balancing on convergence speed and learning stability is examined.

Strengths

The motivation is sound, as this paper combines the strengths of DreamerV3 and MuZero to tackle tasks from image inputs with both continuous and discrete action spaces, without the need for input signal reconstruction.

Weaknesses

  1. Value predictors in this research are inspired by MuZero, and the inclusion of action prediction is a common practice in various model-based approaches. As a result, the novelty may be relatively constrained.

  2. The comparison is unfair as it only considers DreamerV3. It would be more equitable to include more model-based methods for comparison, such as DreamerPro [1] and Denoised MDPs [2]. DreamerPro, in particular, is a highly relevant method within the domain of reconstruction-free model-based reinforcement learning.

[1] DreamerPro: Reconstruction-Free Model-Based Reinforcement Learning with Prototypical Representations. ICML 2022.

[2] Denoised MDPs: Learning World Models Better Than the World Itself. ICML 2022.

  3. The experimental results are not satisfactory, as it appears that DreamerV3 performs better. I am aware that MuDreamer has fewer parameters, but it is important to analyze specifically where the differences lie. Please provide an analysis and remove the corresponding parts from DreamerV3 to assess the performance. Additionally, since the authors were inspired by MuZero in several aspects, it would be beneficial to compare with this approach as well.

  4. Why is batch normalization used instead of other normalization techniques such as layer normalization? Can layer normalization achieve similar effects?

  5. Many explanations are not sufficiently in-depth. For example, in KL balancing, why does using a slight regularization of the representations toward the prior with β_rep = 0.05 solve both of these issues?

Questions

Please refer to the weaknesses.

Comment

We would like to thank the reviewer for the insightful feedback.

Please find below our answer to your questions.

  1. Value predictors in this research are inspired by MuZero, and the inclusion of action prediction is a common practice in various model-based approaches. As a result, the novelty may be relatively constrained.

Our value predictor network is inspired by the MuZero value function. However, we did not find similar works using an action predictor network to predict the sequence of preceding actions leading to the observed environment change. If such works exist, could you share with us papers using a similar method in model-based RL?

  2. The experimental results are not satisfactory, as it appears that DreamerV3 performs better. I am aware that MuDreamer has fewer parameters, but it is important to analyze specifically where the differences lie. Please provide an analysis and remove the corresponding parts from DreamerV3 to assess the performance. Additionally, since the authors were inspired by MuZero in several aspects, it would be beneficial to compare with this approach as well.

We evaluated MuDreamer on Atari100k in order to test the algorithm in another domain. While MuDreamer successfully solves DeepMind Control tasks, we observed that the agent was not able to achieve the same performance on Atari. This is especially the case for 'boxer', where MuDreamer fails to achieve good scores, significantly penalizing the final human-normalized score. However, MuDreamer outperforms DreamerV3 on a subset of the games. This is why we categorized these early results as promising and plan to identify the reason for these failures in future work. One option would be to replace the value predictor's lambda returns with n-step returns. Since the sampled batch includes trajectories generated with different policies, computing returns using a limited or dynamic number of TD steps, depending on the training step at which the trajectory was generated, may provide better-quality targets for representation learning.
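For reference, here is a generic sketch of the n-step return targets mentioned above; this is a standard formulation under assumed per-step discounts (gamma times continuation flag), not code from the paper:

```python
import torch

def n_step_returns(rewards, values, discounts, n: int) -> torch.Tensor:
    """Generic n-step return targets for a trajectory of length T.
    rewards, values, discounts: tensors of shape [T]; values are bootstrap
    estimates V(s_t); discounts are assumed to be gamma * continuation."""
    T = rewards.shape[0]
    returns = torch.zeros(T)
    for t in range(T):
        g, ret = 1.0, 0.0
        for k in range(n):
            if t + k >= T:
                break
            ret += g * rewards[t + k]
            g *= discounts[t + k]
        # Bootstrap with the value estimate n steps ahead (or the last one
        # available when the trajectory is shorter than n steps).
        bootstrap_idx = min(t + n, T - 1)
        returns[t] = ret + g * values[bootstrap_idx]
    return returns
```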

  3. Why is batch normalization used instead of other normalization techniques such as layer normalization? Can layer normalization achieve similar effects?

DreamerV3 uses layer normalization to stabilize learning and train larger models. In our case, Figure 7 and Table 3 show that using layer normalization creates instabilities on some of the tasks. For some tasks, such as 'Walker Walk', layer normalization can be used instead of batch normalization, but it often creates instabilities where the model categorical state logits become constant, with the standard deviation averaged over all features near zero. This is a known issue in self-supervised learning for images and can be solved by using batch normalization to regularize the output features.
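To make the collapse symptom concrete, the sketch below shows an illustrative diagnostic (standard deviation of the categorical state logits over the batch, averaged over features) together with the normalization swap discussed above; the helper names and MLP layout are hypothetical:

```python
import torch
import torch.nn as nn

def logit_collapse_metric(logits: torch.Tensor) -> float:
    """logits: [batch, num_latents, num_classes]. Returns the std of the
    logits over the batch, averaged over all features; a value near zero
    indicates the categorical state has collapsed to invariant features."""
    return logits.std(dim=0).mean().item()

def make_representation_head(hidden: int, out: int, use_batch_norm: bool) -> nn.Module:
    # Assumed MLP head producing categorical state logits; swapping
    # BatchNorm1d for LayerNorm is the change discussed above.
    norm = nn.BatchNorm1d(hidden) if use_batch_norm else nn.LayerNorm(hidden)
    return nn.Sequential(
        nn.Linear(hidden, hidden), norm, nn.ELU(), nn.Linear(hidden, out)
    )
```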

  4. Many explanations are not sufficiently in-depth. For example, in KL balancing, why does using a slight regularization of the representations toward the prior with β_rep = 0.05 solve both of these issues?

The default regularization of β_rep = 0.1 was too high to efficiently learn hidden representations. Using a β_rep of 0.2, we observed that unnecessary information such as the environment floor and the agent shadow in DeepMind Control tasks was sometimes reconstructed as monochrome surfaces without the original details. Following BLAST, removing the representation loss L_rep by setting β_rep to zero solved this problem. However, this resulted in instabilities, with the prior latents sometimes diverging. We suppose this creates difficulties for the prior to predict posterior representations, since we do not use a slow-moving teacher encoder network. The representation network and dynamics predictor also do not share a common architecture, whereas methods like SimSiam use a single predictor network on top of the learned features. Using a smaller regularization of the representations toward the prior with β_rep = 0.05 was sufficient to keep posterior and prior latents close to each other without impacting the learning of hidden representations.
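For reference, a simplified sketch of DreamerV3-style KL balancing with the weights discussed above; the distributions are reduced to plain categoricals, free-bits clipping is omitted, and β_dyn = 0.5 is assumed to keep its DreamerV3 default:

```python
import torch
from torch.distributions import Categorical, kl_divergence

def kl_balancing_loss(post_logits, prior_logits, beta_dyn=0.5, beta_rep=0.05):
    """post_logits, prior_logits: [batch, num_latents, num_classes].
    L_dyn trains the prior toward the (detached) posterior; L_rep lightly
    regularizes the posterior toward the (detached) prior."""
    post, prior = Categorical(logits=post_logits), Categorical(logits=prior_logits)
    post_sg = Categorical(logits=post_logits.detach())
    prior_sg = Categorical(logits=prior_logits.detach())
    l_dyn = kl_divergence(post_sg, prior).sum(-1).mean()
    l_rep = kl_divergence(post, prior_sg).sum(-1).mean()
    return beta_dyn * l_dyn + beta_rep * l_rep
```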

Review (Rating: 5)

This paper explores 4 modifications to DreamerV3 and presents promising results on the Control Suite and more limited results on Atari100k.

They assess 4 changes: removing the observation reconstruction loss, adding a previous-action prediction head, replacing LayerNorm with batch normalization, and changing the weights of the L_dyn and L_rep losses.

Overall, this is a clear and well-executed piece of work, but it has quite limited scope and the results (although promising) aren’t clearly demonstrating a strong benefit over DreamerV3 or EfficientZero.

Strengths

  1. The paper is clear, presents the scope, problem it wants to tackle and related work well.
  2. The presentation of the method is clear and the modifications are easy to follow (although quite directly inspired by the Dreamer papers).
  3. Results are complete and well presented, with good coverage of Control Suite experiments as well as Atari100k. Baseline choices are good.
  4. The ablation study in Section 5.2 is once again clean and well executed.

Weaknesses

  1. The similarities between MuDreamer and DreamerV3 are potentially too strong to make this work significant enough in this state. The paper looks like yet another version of Dreamer, with the exact same math and text dangerously close to a copy, with only a few extra ablations and modifications.
  2. Results aren’t as clear-cut as I’d like. There have been a lot of MBRL papers in recent years which explored many combinations of losses, models, actors, but it is quite hard to find which components really matter.
    1. The Control Suite results are slightly better, but not groundbreakingly so
    2. The Atari100k results aren't that competitive, especially compared to EfficientZero, which would be the clear baseline if one takes the strict stance of not learning to reconstruct observations.
    3. Figures 5 and 6 in the Appendix demonstrate this well, where all curves are fairly similar and do not show a strong enough signal for me.
  3. Despite the removal of the observation reconstruction loss, Figure 2 and others indicate that image reconstruction is still done nearly perfectly, which runs counter to the original motivation. It is unclear why that is the case, but it does feel like the model is not as different from DreamerV3 in what the latent space captures as it could be.

Questions

  1. Why is Figure 2 so good at reconstructing the observation?
    1. One would have expected the model to only capture what mattered for the task if the assumptions from the abstract/introduction were true?
    2. Are there games where you have examples of “Dreamer failing to perceive crucial elements”, which MuDreamer does capture?
    3. As it stands, it is unclear to me that the latent space is any different or more abstract than DreamerV3's.
  2. It would have been interesting to point to specific games where this effect should arise, and make a clear comparison between DreamerV3, MuDreamer and EfficientZero.
    1. For example, having a good score on Frostbite seemed interesting (as it does contain quite a lot of hard details to model well), but looking in the Appendix Figure 6, this seems to be more about 1 seed of DreamerV3 doing badly…
  3. Did you explore using the Action predictor network directly for acting, instead of having another Actor network?
Comment

We would like to thank the reviewer for the insightful feedback.

Please find below our answer to your questions.

  1. Why is Figure 2 so good at reconstructing the observation?

Depending on the environment, some details of the observation, such as the background, do not change or remain very similar across frames. These 'invariant' details do not need to be encoded in the model latent state and can simply be learned by the decoder; they are usually learned in the early phase of training. On the other hand, you can see that details related to the agent body in Figure 2 are slightly blurrier than the original. These details progressively become sharper along training, until they are very close to the original image.

1.1 One would have expected the model to only capture what mattered for the task if the assumptions from the abstract/introduction were true?

Moving details that are not necessary for the agent are sometimes ignored and not reconstructed as in the original. For instance, when setting β_rep to 0.1 or 0.2, or when using fewer than 8 discrete latents for the model hidden state, we observed that unnecessary information such as the environment floor and the agent shadow in DeepMind Control was sometimes reconstructed as monochrome surfaces without the original details. A loss of performance was also observed. The model dynamics loss does not distinguish between relevant and irrelevant information. However, the value and action prediction losses require encoding the necessary information in the latent space, discarding the extra information if the model hidden state does not have enough capacity.

1.2 Are there games where you have examples of “Dreamer failing to perceive crucial elements”, which MuDreamer does capture?

Dreamer sometimes fails to perceive small elements like the ball in the 'BallInCup' task in DeepMind Control. These details sometimes disappear and reappear at a different position during the imagination phase, which may harm behavior learning. In the case of BallInCup, this resulted in instabilities, with training phases where the agent suddenly did not perceive the ball and was unable to correctly solve the task. We did not observe this kind of behavior for MuDreamer during training.

  2. It would have been interesting to point to specific games where this effect should arise, and make a clear comparison between DreamerV3, MuDreamer and EfficientZero.

DreamerV3's performance is particularly impacted when visual distractions are present in the observation. In order to study the effect of visual distractions on DreamerV3 and MuDreamer, we experimented with the natural background setting, where the DeepMind Control task backgrounds are replaced with natural videos. More details can be found in Section I of the appendix of the revised paper. Figure 10 compares MuDreamer with DreamerV3 under the natural background setting: MuDreamer successfully learns a policy while DreamerV3 fails on every task. Figure 11 shows the decoder reconstructions of observations by DreamerV3 and MuDreamer for the Walker Run and Finger Spin tasks. MuDreamer correctly reconstructs the agent body with a monochrome or blurry background, while DreamerV3 focuses on the background details, discarding the agent body and necessary information.

  3. Did you explore using the Action predictor network directly for acting, instead of having another Actor network?

The action predictor learns to predict the action that led to the observed environment change, so it cannot be used directly for planning. One option would be to generate the desired next model state and predict the action leading to that state. However, this seems to be a more complex problem to solve compared to optimizing the actor and value networks.

Comment

We uploaded a revision of the original paper including two main modifications.

  1. Updated ablation results using 3 different seeds!

  2. An additional Section I, 'NATURAL BACKGROUND SETTING', in the appendix (pages 22-23), to illustrate our responses to the reviewers.