PaperHub
6.0 / 10
Rejected · 4 reviewers
Ratings: 6, 6, 6, 6 (min 6, max 6, std 0.0)
Confidence: 4.0
Soundness: 2.8
Contribution: 2.3
Presentation: 3.0
ICLR 2025

Knowledge Retention in Continual Model-Based Reinforcement Learning

OpenReview · PDF
Submitted: 2024-09-28 · Updated: 2025-02-05

Abstract

We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.
Keywords
Continual learning · Model-based reinforcement learning · World model · Catastrophic forgetting

Reviews and Discussion

Review
6

The authors introduced a new method for model-based reinforcement learning that mitigates the catastrophic forgetting problem for the world model. The authors operate in the unique setting where data from previous tasks cannot be carried over to the new task, which differs from some classic settings. The method involves using synthetic data replay and an intrinsic reward to revisit previous areas, adding that data to the training data of the current task for the world model. Experimental results show superior performance compared to baseline methods.

Strengths

  • The method is intuitive and easy to understand
  • The paper is well written and has good illustrations to enable clear understanding

Weaknesses

  • I believe the baselines need an additional one: Continual TDMPC + exploration. Since DRAGO uses intrinsic rewards, it would be fair to also add a strong intrinsic-reward method like RND or curiosity, to see whether classic exploration methods are enough for these settings and whether the generative replay + your specific intrinsic reward design are superior and fitting for the problem setting
  • (minor) The last two sentences in the conclusion are repetitive
  • The experiment domains are not very diverse; only 3 domains are explored. For example, we are not sure how the method performs in POMDP settings such as image observations
  • Only TDMPC is used; it would be useful to try other MBRL methods like Dreamer to see the effects of the learning algorithm
  • It is not entirely obvious that a better transition model translates to better learner performance. Why is your method better than the others? Is it that your policy has better exploration during training, or does a better world model equal better performance?

Questions

  • For the "dreamed" state-action pairs, are they from the generative model decoder where you sample z to produce (s, a), then you use the frozen old world model to produce s'?
  • For the "reviewer" policy, what's the proportion of time each acts during online training/data collection? Is it half and half? Do the baselines use the same number of steps for training/data collection as your learner steps + reviewer steps, to ensure fair comparisons?
Comment

“For the "dreamed" state-action pairs, are they from the generative model decoder where you sample z to produce (s, a), then you use the frozen old world model to produce s'?”

A: Yes, that is correct. The "dreamed" state-action pairs are generated by sampling latent variables from the prior distribution and passing them through the generative model decoder to produce synthetic state-action pairs. We then use the frozen old world model to predict the next state.
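
For readers who want the mechanics spelled out, a minimal sketch of this generation step is shown below, assuming a standard VAE decoder over state-action pairs and a frozen copy of the previous dynamics model (the names are illustrative, not our actual identifiers):

```python
import torch

def dream_transitions(vae_decoder, old_world_model, num_samples, latent_dim):
    """Generate synthetic (s, a, s') tuples for rehearsal.

    Assumed interfaces (not the released code): vae_decoder maps a latent
    z ~ N(0, I) to a (state, action) pair, and the frozen old world model
    maps (s, a) to a predicted next state.
    """
    with torch.no_grad():
        z = torch.randn(num_samples, latent_dim)   # sample latents from the prior
        s, a = vae_decoder(z)                      # decode synthetic state-action pairs
        s_next = old_world_model(s, a)             # label next state with the frozen model
    return s, a, s_next
```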

“For the "reviewer" policy, what's the proportion of time each acts during online training/data collection? Is it half and half? Do the baselines use the same number of steps for training/data collection as your learner steps + reviewer steps, to ensure fair comparisons?”

A: Thank you for the question. The proportion varies for different tasks. In general, to ensure fair comparisons, we use the same number of steps for continual training of the different algorithms and ensure they all converge to similar near-optimal performance. We have included additional details about this in Appendix C, Table 3, showing that DRAGO and Continual TDMPC converge to similar episode returns during the continual training process before evaluating transfer performance.

[1] Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. Chua et al. NeurIPS 2018.

Comment

We thank the reviewer for their positive feedback and thoughtful comments. We would like to address the reviewer’s concerns below.

“I believe the baselines needs an additional one, Continual TDMPC+exploration. Since the DRAGO used intrinsic rewards, it would be fair to also add a strong intrinsic reward method like RND or curiosity, to see if classic exploration methods are enough for these settings and if the generative replay+your specific intrinsic reward design are superior and fitting for the problem setting”

A: Thank you for the suggestion! We have included new results on four tasks compared with Continual TDMPC + exploration, as the reviewer suggested, in Appendix D of the updated draft. As shown in Table 4, DRAGO still outperforms this new continual MBRL baseline in all four tested tasks. We should note that while this is a reasonable baseline, the comparison is a little unfair to our method, as DRAGO can also be combined with any exploration method in a straightforward way. Specifically, while we have a separate reviewer model that aims to maximize our proposed intrinsic reward, our learner model, which aims to solve each specific task, can also directly add any intrinsic reward method like curiosity to encourage exploration, which does not conflict with the intrinsic reward of the separate reviewer.

“Last two sentences in the conclusion are repeating”

A: Thank you. We have fixed it in the updated draft.

“Only TDMPC is used, would be useful to try other MBRL methods like dreamer to see the effects of the learning algorithm”

A: Due to time constraints, we were unable to make Dreamer work on our tested benchmarks. Instead, we were able to successfully combine DRAGO with another popular model-based RL method, PETS [1] (over 1500 citations), and have included the preliminary results in Appendix D of the updated draft. As shown in Table 6, DRAGO-PETS outperforms Continual PETS in 3 out of 4 tested tasks in MiniGrid, demonstrating that DRAGO is a useful continual MBRL technique that could potentially extend many other existing MBRL algorithms.

”The experiments domains are not very diverse, only 3 domains explored. For example, we are not sure how the method performs in POMDP settings like image observations”

A: Thank you for the suggestion. We would first like to point out that the domains we tested on (MiniGrid & DeepMind Control Suite) are two of the most popular RL benchmarks, have been evaluated in a large number of prior papers, and have been shown to be quite challenging environments. The tasks we designed for evaluating transfer performance are even more challenging on DMC, as they require the agent to learn to transition from one locomotion mode (jump, run, etc.) to another (run forward, run backward). While we could not test the method on image-observation settings due to time constraints, we think this is an exciting and challenging problem for future work, as we mentioned at the end of Section 3.1, especially since we can replace the VAE with diffusion models that are capable of generating high-quality image data.

“It is not entirely obvious that a better transition model translates to a better learner performance. How come your method is better compared to other? is it your policy during training has a better exploration or better world model equals better performance?”

A: We would like to clarify that only the world model (dynamics) and the generative model are reused on new tasks. The policy, reward, and value functions are all discarded after training finishes on each task and are reinitialized for each new task. So compared to learning from scratch, as everything else remains the same, the only factor that leads to the change in performance on new tasks is the world model used for initialization at the beginning of each new task. We observed in our experiments that DRAGO learns a better world model (as shown in Figure 4) and achieves better transfer performance (as shown in Table 1). Thus we conclude that DRAGO learns a better transition model, which translates to better learner performance.
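
To illustrate what is retained across the task sequence, here is a minimal sketch of the per-task protocol described above (our own illustrative names, not the actual implementation); only the world model and the generative model survive the task boundary:

```python
def continual_training(tasks, make_reward_model, make_value_model, make_policy,
                       world_model, generative_model, train_on_task):
    """Hypothetical outer loop: only world_model and generative_model persist
    across tasks; everything task-specific is rebuilt from scratch."""
    for task in tasks:
        # Task-specific components are randomly reinitialized for every task.
        reward_model = make_reward_model()
        value_model = make_value_model()
        policy = make_policy()

        # train_on_task trains all components on the current task, but we
        # only keep the updated world model and generative model afterwards.
        world_model, generative_model = train_on_task(
            task, world_model, generative_model,
            reward_model, value_model, policy,
        )
        # reward_model, value_model, and policy are discarded here.
    return world_model, generative_model
```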

Review
6

This paper proposes DRAGO as a new method for continual model-based reinforcement learning. Unlike previous research, the author considers the continual reinforcement learning problem in which only the prior transition models are available instead of the collected data. To address the catastrophic forgetting problem occurring in this setting, DRAGO adopts two key components: synthetic experience rehearsal and regaining memories through exploration. A continual training strategy for the generative model and an intrinsic reward mechanism are introduced in the two components, respectively. The empirical results show that DRAGO outperforms the existing methods on several challenging continual learning benchmarks.

Strengths

  1. The paper is well organized and written, and the key idea is delivered clearly and easy to follow.
  2. The problem setting in this paper, in which there is no access to previously collected data, is interesting and worth discussing. I haven't seen many papers discussing this issue.
  3. The experiments conducted in this paper are comprehensive and reproducible.

Weaknesses

The motivations and analysis of the method could be explained further. (see the questions)

Questions

  1. The continual learning for the generative model seems similar to some knowledge distillation tasks, but it's not included in the related works. I would appreciate it if the author could discuss it more and compare the similarities and differences.
  2. At the beginning of each new task, the reward model, value model, and policy are all randomly initialized. I wonder if it's possible that similar methods could be used to utilize previously learned models/policies to facilitate the training, just like the generative model.
  3. I'm not sure what's the main purpose of introducing another reviewer model. According to eq. (7), the calculation of intrinsic reward does not require an additional model. In such cases, what if the intrinsic reward is calculated and added directly to the total reward of the learner? I would appreciate it if the author could explain it more or provide experimental results using a single model.
  4. The intrinsic reward defined in eq. (7) has a side effect that the agent will be penalized for visiting states that are less familiar to the previous model. Will it affect the final performance? If so, how did you solve that problem?
Comment

"The intrinsic reward defined in eq. (7) has a side effect that the agent will be penalized for visiting states that are less familiar to the previous model. Will it affect the final performance? If so, how did you solve that problem?"

A: We appreciate your observation. The intrinsic reward is designed to encourage the reviewer policy to:

  • Visit states well-understood by the previous model (positive reinforcement).
  • Avoid states already well-predicted by the current model (to focus on areas needing reinforcement).

This mechanism is intentional and serves to guide the reviewer policy's exploration toward states that help reinforce and connect knowledge from previous tasks.
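
To make the intended signal concrete, the following is a minimal sketch of one such intrinsic reward, assuming it is the gap between the current and previous models' one-step prediction errors (an assumed form for illustration; the exact expression is Eq. (7) in the paper and may differ, e.g., in how errors are normalized):

```python
import torch
import torch.nn.functional as F

def reviewer_intrinsic_reward(s, a, s_next, prev_model, curr_model):
    """Hypothetical intrinsic reward for the reviewer policy.

    Higher when the frozen *previous* world model predicts the transition well
    (revisit old, well-understood regions), lower when the *current* model
    already predicts it well (those regions need no reinforcement).
    """
    with torch.no_grad():
        err_prev = F.mse_loss(prev_model(s, a), s_next, reduction="none").mean(-1)
        err_curr = F.mse_loss(curr_model(s, a), s_next, reduction="none").mean(-1)
    # Small err_prev (familiar to the old model) raises the reward;
    # small err_curr (already familiar to the current model) lowers it.
    return err_curr - err_prev
```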

Importantly, this penalization does not negatively affect the final performance on the current task because:

  • The learner policy is not influenced by the intrinsic reward and focuses solely on maximizing the external reward.
  • The reviewer policy operates separately, collecting data that enhances the completeness of the world model without interfering with the learner's objective.

By combining data from both policies, the agent builds a more comprehensive world model, which ultimately benefits performance across tasks. The penalization in the intrinsic reward guides the reviewer to efficiently allocate exploration efforts, without hindering the agent's ability to solve the current task.

But related to the reviewer’s (reviewer of this paper, not the reviewer in our method) next question, if we directly mix this intrinsic reward with the learner’s environment reward, it’s very possible that the learner’s training will be hindered as penalizing for visiting states that are less familiar to the previous model can make it hard for the agent to reach the goal defined by the individual tasks.

“I'm not sure what's the main purpose of introducing another reviewer model. According to eq. (7), the calculation of intrinsic reward does not require an additional model. In such cases, what if the intrinsic reward is calculated and added directly to the total reward of the learner? I would appreciate it if the author could explain it more or provide experimental results using a single model.”

A: Thank you for the question. As we partially explained in the answer to the previous question, directly adding our intrinsic reward to the external reward and training only a single learner model may make it hard for the agent to complete the original task goal. If we only have one agent model (one policy), the intrinsic reward can have two side effects: (1) it discourages the agent from visiting places it is already familiar with, which hinders it from finding the optimal solution to the task, and (2) it encourages the agent to visit places the previous model is familiar with, which could be completely irrelevant to solving the current task.

We have added new results in the updated version of our paper. In Table 5, Appendix D, we show that if we add the intrinsic reward directly to the total reward of the learner, we see a large drop in performance on the continual training tasks, and this performance gap grows larger and larger as the agent encounters more tasks, since it is encouraged to visit more and more possibly irrelevant states.

By having a separate reviewer policy that maximizes the intrinsic reward, we decouple the objectives. The learner policy focuses on maximizing the external reward to solve the current task effectively, while the reviewer policy explores states that help in retaining knowledge and connecting different regions of the state space. This separation allows both policies to operate without hindering each other's performance.

Comment

We thank the reviewer for their positive feedback and thoughtful questions. We are pleased that you found our paper well-organized and the problem setting interesting. We address your questions below.

“The continual learning for the generative model seems similar to some knowledge distillation tasks, but it's not included in the related works. I would appreciate it if the author could discuss it more and compare the similarities and differences.”

A: Thank you for your suggestion. We have revised our paper to include some knowledge distillation works. In general, knowledge distillation shares some similarity with our method as it involves transferring knowledge from a larger, pre-trained model (teacher) to a smaller model (student), which can be used to mitigate the forgetting problem.

The key difference lies in what is being distilled and how. Instead of distilling the output distributions or representations from a teacher model into a student model, our approach distills the data-generating process itself. Specifically, we train a generative model (e.g., a VAE) to capture the distribution of state-action pairs from previous tasks. We then use this generative model, along with the frozen old world model, to generate synthetic transitions (s, a, s'). This allows us to rehearse experiences from prior tasks without retaining the actual data, which is crucial in our setting where previous data cannot be stored. By continually updating the generative model with both current task data and synthetic data from previous models, we prevent forgetting within the generative model itself. This approach is tailored to the needs of continual model-based reinforcement learning, where preserving the ability to generate representative experiences from all tasks is essential.
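
For concreteness, a minimal sketch of one such rehearsal update is shown below, under our own naming and with assumed interfaces (a VAE over concatenated state-action vectors and a frozen copy of the previous world model); the paper's Eqs. 5-6 define the actual objectives:

```python
import torch
import torch.nn.functional as F

def rehearsal_step(vae, old_vae, world_model, old_world_model,
                   real_batch, model_opt, vae_opt, num_synth, latent_dim):
    """One hypothetical update mixing real and synthetic data.

    Assumed interfaces (not the released code): old_vae.decode(z) returns a
    synthetic (state, action) pair, and vae.elbo_loss(x) is the VAE objective.
    """
    s_real, a_real, s_next_real = real_batch

    with torch.no_grad():
        # Synthetic experiences from the previous generative model, labeled
        # by the frozen previous world model (cf. the second term of Eq. 5).
        z = torch.randn(num_synth, latent_dim)
        s_syn, a_syn = old_vae.decode(z)
        s_next_syn = old_world_model(s_syn, a_syn)

    # Update the current dynamics model on real + synthetic transitions.
    s = torch.cat([s_real, s_syn])
    a = torch.cat([a_real, a_syn])
    s_next = torch.cat([s_next_real, s_next_syn])
    model_loss = F.mse_loss(world_model(s, a), s_next)
    model_opt.zero_grad()
    model_loss.backward()
    model_opt.step()

    # Update the generative model on the same mixture so it keeps covering
    # both current and past tasks (cf. Eq. 6).
    vae_loss = vae.elbo_loss(torch.cat([s, a], dim=-1))
    vae_opt.zero_grad()
    vae_loss.backward()
    vae_opt.step()
```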

“At the beginning of each new task, the reward model, value model, and policy are all randomly initialized. I wonder if it's possible that similar methods could be used to utilize previously learned models/policies to facilitate the training, just like the generative model.”

A: In our paper, we focus on a setting where tasks share the same dynamics but have entirely different reward functions and objectives. In such cases, transferring knowledge from previous reward models, value functions, or policies may not be straightforward or beneficial. Reusing the world model across tasks is feasible because the environment's dynamics remain consistent. However, reusing reward/value functions or policies requires the agent to distinguish between tasks, introducing partial observability. The observations may not contain sufficient information to identify the current task, necessitating additional inputs like task IDs or goal descriptions. This can complicate the learning process and may not generalize well to new tasks, especially when tasks are diverse or numerous. Moreover, initializing reward and value models with previous parameters could lead to negative transfer if the tasks are dissimilar, potentially hindering learning on the new task. Our approach opts for random initialization to ensure that the agent can learn task-specific reward structures without interference from previous tasks. That said, in scenarios where tasks are related or share similarities, leveraging previous reward and value models could be advantageous. This is an interesting avenue for future work. Our current focus is on learning a comprehensive world model without access to past data, which we believe is a crucial foundation for continual learning in model-based RL.

Comment

Thanks for your insightful rebuttal and addressing my concerns. Since 7 is not an available score option, I will keep my score but increase my confidence.

Review
6

This paper proposes two techniques to enhance the continual learning capability of the TDMPC MBRL agent: Synthetic Experience Rehearsal and Regaining Memories Through Exploration.

In Synthetic Experience Rehearsal, the authors propose to learn a separate generative model (via VAE) for the distribution of state-action pairs (s, a), such that the synthetic pairs (ŝ, â) can augment the prior tasks' data for re-training, mitigating forgetting of the dynamics model. To reduce forgetting of the generative model itself, the authors use the generative model at the end of the previous task to synthesize data and mix it with the current task's real data for training.

Regaining Memories Through Exploration uses an intrinsic reward defined by the accuracies of the current dynamics model and the previous one to encourage exploration. The hope is that the agent could explore pathways (in the state-action space) in the real environment that connect different tasks seen so far in order to build a cohesive understanding of the world.

The proposed method is tested on simulation environments tailored to a continual setting, and is shown to perform better than vanilla TDMPC (either re-training or continual training) and an EWC TDMPC baseline.

Strengths

  1. The motivation is clear, with a nice illustration in Figures 1 and 2 showing the need for continual MBRL and the intuition of the proposed method.
  2. The presentation of the method is easy to follow.
  3. The empirical results look promising.

Weaknesses

W1. System complexity. To enable continual MBRL, the authors propose two techniques, synthetic data and exploration. The former needs to build an additional generative model for the state-action pair, and the latter even requires training an extra reviewer (a full set of reward model, value model, and policy). But this is okay as long as during the continual learning process, the computation/memory does not grow with the number of tasks.

W2. Potential mode collapse. In the second term of Eq. 5, the dynamics model's training target (the next state) is labeled by the older version of itself, on synthetic data (ŝ, â) sampled from the generative model. It is likely that such a process would lead to mode collapse, especially when the real data D_i covers a very different region from the synthetic data, in which case little correction is possible.

W3. Unclear experimental setting & results. Taking MiniGrid in Figure 3 as an example, I expect the continual learning should start from room 1, and at some time the initial state distribution and reward function both switch to those of room 2 and then rooms 3 & 4. However, from the results in Figure 5, it is hard to tell when the task switching happens. Besides, there is no clear sign of forgetting in the results for the baselines (forgetting should result in a drop immediately after task switching), making the results confusing to me.

Questions

Q1. Referring to W1, will you train a reviewer per task? Will the reviewers trained on prior tasks be discarded?

Q2. Regarding inference, eq8 requires a reward function and a value function for planning. Since the agent does not see the test tasks, I assume eq8 uses a known reward function to plan? Also, which value function did you use?

Q3. Section 5.1 introduces related work about MBRL, and concludes with "none of these methods investigate the continual learning setting for MBRL". However, at the end of section 5.2, the authors mentioned a few MBRL works for the continual setting. This is a bit contradicting, and perhaps the authors could reorganize to make it clearer.

Besides, to my knowledge, some related works are missing. For example, [1] learns the world model online for Dyna-style MBRL, which is closely related to yours, since both [1] and yours are trying to learn the world model while minimizing forgetting, such that it can benefit agent learning in the long run (without storing prior data). [2] also specifically studies forgetting in MBRL, and applies their method to a multi-task continual setting. Both the single-task online setting and the multi-task continual setting are reasonable to study the forgetting problem, hence [1,2] may be relevant to this work.

[1] Liu Z, Du C, Lee WS, Lin M. Locality Sensitive Sparse Encoding for Learning World Models Online. ICLR 2024.

[2] Huang Y, Xie K, Bharadhwaj H, Shkurti F. Continual model-based reinforcement learning with hypernetworks. ICRA 2021.

Comment

“Unclear experimental setting & results. Taking MiniGrid in Figure 3 as an example, I expect the continual learning should start from room 1, and at some time the initial state distribution and reward function both switch to those of room 2 and then rooms 3 & 4. However, from the results in Figure 5, it is hard to tell when the task switching happens. Besides, there is no clear sign of forgetting in the results for the baselines (forgetting should result in a drop immediately after task switching), making the results confusing to me.”

A: We are sorry that our experimental setting was not clear. We included Figure 3 in the experiment section to show the continual training and testing setting of the experiments. Marked in blue, for MiniGrid, the continual learning does follow the process described by the reviewer: starting from room 1, after a certain number of episodes, once the learner converges, we switch to room 2, with a different initial state distribution as well as reward function, and then to rooms 3 & 4. However, as there is very little overlap between these tasks (all entirely new rooms), we do not expect positive transfer between them. To evaluate whether our method enables the agent to incrementally learn a better world model as it is exposed to more parts of the world dynamics, we test it on holdout tasks at different stages of the continual learning process. The holdout tasks are marked in red in Figure 3, and their order in the figure also indicates when we evaluated the agent on them. For example, we evaluated the agent on the task Room1to2 after continually training on rooms 1 and 2. Room1to2 requires the agent to start in room 1 and move to the goal position in room 2; if the agent can maintain the knowledge learned in both previous tasks, it will reach better performance, as demonstrated in Figure 5 and Table 1. The other holdout (red) tasks are similarly designed to test whether the agent can maintain the knowledge learned from previous tasks and incrementally learn a more and more complete world model. These holdout tasks are not part of the training sequence but are used to assess the agent's ability to retain and integrate knowledge from previous tasks.

”Section 5.1 introduces related work about MBRL, and concludes with "none of these methods investigate the continual learning setting for MBRL". However, at the end of section 5.2, the authors mentioned a few MBRL works for the continual setting. This is a bit contradicting, and perhaps the authors could reorganize to make it clearer. Besides, to my knowledge, some related works are missing.”

A: Thank you for the suggestions! We have revised the paper and included the additional related papers as you suggested. The related work section is organized as: Section 5.1 is mainly for MBRL prior works in general and we mentioned “none of these methods investigate the continual learning setting for MBRL” for the MBRL-only methods introduced in this section. Then later in Section 5.2, we introduced continual learning related work, and then also introduced continual learning + MBRL related works in the end.

Comment

Thanks for your reply! Most of my questions were answered.

  1. Yet I am still not convinced about the evaluation. I understand you have a sequence of training tasks, and some holdout test tasks. When you train, you maintain a world model and a generative model across tasks, but discard old reward function, value function, etc., and randomly re-initialize them. My question is, when you evaluate, what do you use for the holdout test task (what R and Q in eq.8)? It sounds like your method needs to learn the reward function and value function on test tasks until they can be used for eval?

  2. Besides, the following question is not fully addressed:

However, from the results in Figure 5, it is hard to tell when the task switching happens. Besides, there is no clear sign of forgetting in the results for the baselines (forgetting should result in a drop immediately after task switching), making the results confusing to me.

I would suggest that you could draw some dashed vertical lines to indicate the task switching of the training tasks. Could you elaborate more on why "there is no clear sign of forgetting in the results for the baselines (forgetting should result in a drop immediately after task switching)"?

  1. Regarding the mode collapse.

To mitigate this risk, we continually train the generative model using a combination of synthetic data (from the previous generative model) and real data from the current task, as outlined in Equation 6.

If I understand correctly, you only keep one previous generative model (not all)? That means, you expect this model could capture the distribution of all prior tasks, relying on training on its own generated data. The real data of the current task could help a bit, but I still believe the generative model may forget (thus collapse due to training it on its own generated data) when the number of tasks goes large. I am happy to hear more thoughts from the authors on this point.

Comment

We thank the reviewer for their thoughtful comments. We address your concerns below.

“System complexity. To enable continual MBRL, the authors propose two techniques, synthetic data and exploration. The former needs to build an additional generative model for the state-action pair, and the latter even requires training an extra reviewer (a full set of reward model, value model, and policy). But this is okay as long as during the continual learning process, the computation/memory does not grow with the number of tasks.” “ Referring to W1, will you train a reviewer per task? Will the reviewers trained on prior tasks be discarded?”

A: Thank you for raising the concern about system complexity. We would like to clarify that while we do train a reviewer (consisting of a reward model, value model, and policy) during each task, we discard the reviewer after completing the task. We do not retain reviewers from prior tasks. Only the generative model and the world model are retained and continually learned, as described in Section 3.1. This approach ensures that the computation and memory requirements do not grow with the number of tasks, keeping the system complexity manageable throughout the continual learning process.

“Regarding inference, eq8 requires a reward function and a value function for planning. Since the agent does not see the test tasks, I assume eq8 uses a known reward function to plan? Also, which value function did you use?”

A: Thank you for this question. In our method, for each new test task, we do not use pretrained reward or value functions from previous tasks. Instead, we randomly initialize the reward function and value function and train them from scratch on the new task. We reuse only the generative model and the world model learned from prior tasks. Our focus is on the continual learning of the world model, which captures the shared dynamics across tasks. The reward and value functions are task-specific and are trained anew for each task, ensuring that our method does not rely on prior knowledge of the test tasks' rewards.

“Potential mode collapse. In eq5 part two, the dynamics model's training target (the next state) is labeled by the older version of itself, on synthetic data sampled from the generative model. It is likely that such a process would lead to mode collapse, especially when the real data covers a very different region from the synthetic data, in which case little correction is possible.”

A: We appreciate your concern regarding potential mode collapse when training the dynamics model using synthetic data generated from the generative model. To mitigate this risk, we continually train the generative model using a combination of synthetic data (from the previous generative model) and real data from the current task, as outlined in Equation 6. This strategy helps the generative model capture diverse modes corresponding to different tasks and prevents it from collapsing to a narrow distribution. In our experiments, we did not observe mode collapse. For example, in the MiniGrid domain (Figure 4), even though the real data from the second room covers a different region from the synthetic data of the first room, our method successfully learns a general world model that accurately predicts dynamics in both rooms (see the bottom row for post-task2). This demonstrates that our approach effectively maintains diversity in the generative model and avoids mode collapse, even when the data distributions change significantly between tasks.

Comment

Thank you for the suggestions! We have revised the limitation section and experiment section in the updated draft to make these points clear.

We agree with the reviewer that the evaluation is focused on transfer, which we emphasized several times in our experiment section by calling them “transfer tasks” and “transfer performance”. However, we believe that the key to the performance boost of these transfer tasks is knowledge retention, and that’s why we design these transfer tasks. This is because the agent needs to find a way to retain the knowledge it learned from previous tasks to accelerate the learning on these transfer tasks. It is supported by our empirical observations: naive continual MBRL agents (e.g., Continual TDMPC) fail to demonstrate such retention, as shown in the first row of visualizations in Figure 4, where Continual TDMPC only remembers the dynamics of the room corresponding to the current task, even though we initialize the world model using the one learned from previous rooms. This inability to retain knowledge underscores the primary motivation of our paper, as explained in the introduction and visualized in Figure 1. We would like to argue that knowledge transfer can happen because of many other factors, but knowledge retention is the most important one for the evaluation settings in our paper.

Please let us know if you have more questions.

Comment

Thanks for your reply and the revision! I agree that forgetting and transfer are closely related, and non-forgetting (retention in the paper's keyword) helps transfer. Although evaluating transfer is an indirect way to measure forgetting, I think it suffices for this paper's setting. I am happy to raise my score to 6.

Comment

Thank you very much for your helpful feedback and for increasing your score!

Comment

Thank you for the quick response! We are glad that most of your questions have been addressed.

Yet I am still not convinced about the evaluation. I understand you have a sequence of training tasks, and some holdout test tasks. When you train, you maintain a world model and a generative model across tasks, but discard old reward function, value function, etc., and randomly re-initialize them. My question is, when you evaluate, what do you use for the holdout test task (what R and Q in eq.8)? It sounds like your method needs to learn the reward function and value function on test tasks until they can be used for eval?

I would suggest that you could draw some dashed vertical lines to indicate the task switching of the training tasks. Could you elaborate more on why "there is no clear sign of forgetting in the results for the baselines (forgetting should result in a drop immediately after task switching)"?

A: Thank you for the questions. We realized that we had not explained our evaluation setting well and decided to answer the first two questions together. We would like to clarify that we evaluate at different stages of the continual training process, and each time the agent has just finished training on one of the training tasks. For example, as shown in Figure 3, we evaluate the agent on jump2run after it finishes training on Cheetah jump. The evaluation process on this new task follows basically the same setting as the training tasks - we just load the world model and train the agent through online interaction with the new task. So for the holdout test tasks, we still reinitialize R and Q and train them from scratch as part of the TDMPC training process. This mirrors the training process of the original TDMPC algorithm, with the key difference being that the world model is pre-trained through the continual learning process.

That being said, it is not that we pretrain R and Q until they can be used - training them is part of the model-based RL algorithm and thus part of our evaluation process. We evaluate the agent's performance throughout the training process on the test tasks, and R and Q are used no matter how good they are. Importantly,

  • Figure 5 presents the agent's performance during training on the holdout test tasks, not during the sequential training of the initial tasks.
  • Each plot corresponds to a single test task, and the agent's performance is tracked as it learns that task from scratch, using the retained world model.
  • Within the training of each test task, there are no task switches, so there are no vertical lines indicating task transitions.

Since the plots show the agent's performance during the learning of a new test task, any forgetting of previous tasks by the world model manifests as slower learning or lower performance on the test task. In this setting, we are not trying to evaluate the magnitude of forgetting by observing a drop in performance after task switching; instead, we directly measure whether the agent can benefit from previous tasks to accelerate learning on the new test tasks, as they are designed such that they are easier to solve if the agent can maintain the knowledge learned from previous tasks. And as we observed in Figure 5 and Table 1, naive continual TDMPC performs poorly and sometimes worse than training from scratch on those test tasks, indicating poor backward transfer (forgetting) from previous tasks.
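
A compact sketch of this evaluation protocol may make it clearer (illustrative function names only, not our released code):

```python
def evaluate_transfer(test_task, pretrained_world_model, make_reward_model,
                      make_value_model, make_policy, tdmpc_train, num_env_steps):
    """Hypothetical evaluation on a holdout task: only the world model is reused.

    R, Q, and the policy are reinitialized and learned online on the test task;
    the reported curves track return throughout this training run (Figure 5).
    """
    reward_model = make_reward_model()   # R in Eq. (8), trained from scratch
    value_model = make_value_model()     # Q in Eq. (8), trained from scratch
    policy = make_policy()

    returns = tdmpc_train(
        test_task,
        world_model=pretrained_world_model,  # retained from continual training
        reward_model=reward_model,
        value_model=value_model,
        policy=policy,
        steps=num_env_steps,
    )
    return returns  # per-episode returns during learning on the test task
```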

Please let us know if you have more questions!

Regarding the mode collapse.If I understand correctly, you only keep one previous generative model (not all)? That means, you expect this model could capture the distribution of all prior tasks, relying on training on its own generated data. The real data of the current task could help a bit, but I still believe the generative model may forget (thus collapse due to training it on its own generated data) when the number of tasks goes large. I am happy to hear more thoughts from the authors on this point.

A: Yes, we only keep one previous generative model. We understand your concern that the generative model might become less effective as the number of tasks grows, since its only source of old knowledge is its own outputs. While this is indeed a limitation of DRAGO (and, more broadly, a problem for continually learning generative models), we do think it can be mitigated to some extent by mixing in real-world data - note that this real data can also come from the data collected by our reviewer, which connects back to the previous tasks. In our experiments with a moderate number of tasks (e.g., four rooms in MiniGrid), we did not observe significant mode collapse. As shown in Figure 4, even when the agent is training on the fourth task, the world model still remembers the dynamics of the first room very well.

Thank you for mentioning this potential issue! We have included this limitation discussion in the appendix.

Comment

Hi, thank you for your reply.

It's nice to see you have added discussions on the method's limitations, which is important for future research to work on. One more comment would be: the tasks tested in the paper are not complex, and there are only a small number of tasks, so the proposed method could work well. Developing continual generative models would be much more challenging, but also more rewarding towards the goal of a real continual agent.

I now understand your evaluation protocol, and feel your setting is interesting and different from the normal settings I've seen (where they do not train on test tasks). Such setting makes me feel that the evaluation is more focused on transfer, but not forgetting. However, forgetting seems to be the paper's main topic (the title is like "knowledge retention" instead of "knowledge transfer"). Since Figure 5 only measures the performance of test tasks while it's being trained, it's more like standard single-task RL with better initialization. Perhaps you could rewrite some parts of the paper to make the setting clearer. E.g., near this sentence "We evaluate the agent’s continual learning performance on test tasks that requires the combination of knowledge from more than one previously learned tasks.", you may emphasize that you still train on these test tasks, etc., and for Figure 5, state explicitly that the curves only measure for the period that the test task is being trained, instead of the entire training.

One last point. If you'd like to claim the method's superiority on combating forgetting (or knowledge retention), I think you may need to continuously eval performance of previously seen tasks along the entire training. I believe such results will be much more convincing to support a non-forgetting world model claim.

Review
6

This paper addresses continual model-based reinforcement learning where an agent faces a sequence of tasks that share state/action spaces and dynamics yet differ in terms of rewards and initial states. The agent is assumed not to have access to data collected from previous tasks and, hence, utilizes a learned world model to adapt to such nonstationarity. The authors claim that naive model-based RL methods struggle with catastrophic forgetting, a phenomenon where the agent loses knowledge acquired in earlier tasks. To address this, the authors develop DRAGO, combining two components called 'Synthetic Experience Rehearsal' and 'Regaining Memories Through Exploration', respectively. The former utilizes a generative model to learn from synthetic experiences resembling the agent's past experiences in earlier tasks. Through an intrinsic reward function, the latter incentivizes the agent to explore states similar to those seen in the previous task while penalizing for visiting states that its current learned dynamics model is familiar with. The authors evaluate DRAGO on three domains, where as the task changes, the dynamics remain the same, yet there is a small overlap between the parts of the state space where the agent spends time. The empirical results indicate that DRAGO can retain previous knowledge and use it to transfer it to new tasks. In contrast, baselines either fail to do so or achieve similar performance but inconsistently.

Strengths

  • The motivation behind this work and the intuition behind why naive methods fail are clearly explained with the support of visualizations. Section 3 is generally well-written, but it may be missing some details about how a generative model operates in this setting.

  • The experimental set-up makes sense and is clear about the properties of the environments, e.g., overlap between visited states in different tasks.

  • Qualitative and quantitative results demonstrate how DRAGO is more advantageous than the baselines in the studied domains. The ablation study and the experiments on few-shot transfer provide more details about DRAGO's performance.

  • Related work is very detailed.

Weaknesses

  • Although the nonstationary nature of the environment comes from the change in rewards and initial states, DRAGO focuses on continually learning a world model. If I understand it correctly, the agent needs access to reward and value models associated with a task at hand at inference time, and these models are relearned for every new task. This is possibly why the other methods can perform similarly in continuous control benchmarks (middle and bottom rows in Figure 5), where there is sufficient overlap between visited states in different tasks.

  • If the agent knows when there is a new task, continually learning reward and value models would be possible, possibly by using unique IDs for each task. Could you please discuss the potential advantages and drawbacks of continually learning reward and value models as well?

  • Four sentences start with 'And'. I also see multiple instances of orphan words, i.e., single words left alone at the end of a paragraph.

  • EWC is a nice continual learning baseline, yet I believe more recent approaches should be evaluated in the experiments. The related work mentions some toward the end of the related paragraph, yet they are not evaluated in the experiments.

  • An ablation study over different prediction horizons for inference would help understand how well the world model adapts and how long it can last.

Questions

  • Is there a reason why ORSO couldn't address changing dynamics, too?

  • During inference, does DRAGO use learned reward and value models associated with the test task while utilizing the most recent world model? For example, in Walker back2run, does DRAGO use the reward and value models it learned during training for Walker run and use the world model it obtained after Walker backward? Have you mentioned this anywhere in Section 3, as this indicates you need more than the world model?

Comment

“EWC is a nice continual learning baseline, yet I believe more recent approaches should be evaluated in the experiments. The related work mentions some toward the end of the related paragraph, yet they are not evaluated in the experiments.”

A: We appreciate this suggestion. In our paper, we consider a setting where the agent does not have access to data from previous tasks during training on a new task. To the best of our knowledge, our work is the first to address continual model-based RL under this constraint. The continual MBRL methods mentioned in the related work typically assume access to previous data or task-specific parameters, which is incompatible with our setting. Nevertheless, we recognize the importance of comparing to more recent baselines. We added another baseline: Continual TDMPC + Curiosity, in Appendix D of the updated draft, where we add the curiosity-based intrinsic reward to the continual TDMPC policy to increase exploration. As shown in Table 4, DRAGO still outperforms this baseline across all evaluated tasks. We should note that, while this is a reasonable baseline, the comparison is a little unfair for our method as DRAGO can also be combined with any exploration method in a straightforward way. Specifically, while we have a separate reviewer model that aims to maximize our proposed intrinsic reward, our learner model that aims to solve each specific task can also directly add any intrinsic reward method like curiosity to encourage exploration.

“An ablation study over different prediction horizons for inference would help understand how well the world model adapts and how long it can last.”

A: Thank you for the insightful suggestion. We have included an ablation study in Appendix D (Figure 7, fourth subfigure) of the updated paper, comparing DRAGO's performance with different prediction horizons on the Cheetah jump2back task. The results indicate that DRAGO performs better with longer horizons (15 and 25), suggesting that our world model effectively adapts and maintains accuracy over extended planning sequences. This demonstrates the robustness of the learned world model and its utility in long-horizon planning.

“Is there a reason why ORSO couldn't address changing dynamics, too?”

A: Thank you for your question. We believe there may be a misunderstanding, as we are not aware of a method named ORSO in the context of our paper or the related literature on continual model-based reinforcement learning. It is possible that this was a typographical error or a reference to a method we discussed under a different name. Could you please clarify or provide a reference to the method you are referring to? We would be happy to address your question once we have more information.

[1] Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models. Chua et al. NeurIPS 2018.

[2] Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. Rakelly et al. ICML 2019.

Comment

We want to thank the reviewer for their positive comments and insightful feedback. We believe we can resolve the reviewer’s concerns.

“During inference, does DRAGO use learned reward and value models associated with the test task while utilizing the most recent world model? For example, in Walker back2run, does DRAGO use the reward and value models it learned during training for Walker run and use the world model it obtained after Walker backward? Have you mentioned this anywhere in Section 3, as this indicates you need more than the world model?”

A: Thank you for this important question. No, during inference on new tasks, DRAGO does not reuse learned reward and value models from previous tasks. For each new test task, we randomly initialize the reward and value models and reuse only the world model (dynamics). Our focus is on continually learning a comprehensive world model that captures the shared dynamics across tasks, which can then be leveraged to solve new tasks efficiently. While some model-based RL methods involve learning reward and value functions (e.g., TDMPC), many others, such as PETS [1] and general Model Predictive Control (MPC), focus solely on learning the dynamics without learning task-specific reward/value functions. In real-world scenarios, tasks often share the same dynamics but have different reward functions, making continual learning of reward and value functions challenging and less generalizable. We apologize for any confusion, and we have clarified this point in Section 3 of the updated paper.

“If the agent knows when there is a new task, continually learning reward and value models would be possible, possibly by using unique IDs for each task. Could you please discuss the potential advantages and drawbacks of continually learning reward and value models as well?”

A: Thank you for raising this point. Continually learning reward and value models is indeed possible when task boundaries are known. Here are some potential advantages and drawbacks:

Advantages:

  • Knowledge Transfer: If tasks are closely related and share similar reward structures, continually learning reward and value models could facilitate knowledge transfer, potentially improving learning efficiency on new tasks.
  • Unified Policy Learning: Conditioning the reward and value functions on task-specific identifiers or latent context variables, as in meta-RL approaches [2], can enable the agent to generalize across tasks.

Drawbacks:

  • Scalability Issues: Assigning unique IDs or conditioning on task-specific variables may not scale well with a large number of tasks, especially in complex environments.
  • Interference and Forgetting: Continually updating reward and value models across tasks with different objectives can lead to interference, potentially degrading performance on earlier tasks.
  • Partial Observability: Varying reward functions introduce partial observability challenges, as the agent may not be able to infer the task identity solely from observations, complicating the learning process. In our work, we focus on continually learning the world model (dynamics) because it captures the underlying structure of the environment that is common across tasks, regardless of the specific rewards. This approach allows for greater generality and applicability to a wide range of tasks.

In Appendix D of the updated paper, we included new experimental results (Figure 7, third subfigure) on the Cheetah jump2run domain. These results demonstrate that if we also reuse the policy, value, and reward models from the Cheetah run task, our method achieves even better performance. However, this requires prior knowledge about the similarity of reward functions between tasks, which may not be feasible or scalable in general settings.

“Four sentences start with 'And'. I also see multiple instances of orphan words, i.e., single words left alone at the end of a paragraph.”

A: Thank you for the suggestions! We have revised the paper’s writing based on your comments.

Comment

Thank you for your detailed rebuttal. Your clarifications, new experiments, and ablation studies thoroughly addressed my concerns. I will keep my score as is but increase my confidence in my rating, as I believe this paper is above the acceptance threshold.

About ORSO, that's a typo on my side, and I apologize for that. I meant DRAGO; hence the question should have been:

Is there a reason why DRAGO couldn't address changing dynamics, too?

Comment

Thank you for your clarification, and we are glad your concerns are addressed! For our paper, we chose environments with static dynamics to focus on memory retention without the added complexity of changing dynamics. This allows us to better evaluate the transfer performance of the learned world model through performance on fine-tuning tasks that share the same dynamics. A simple extension could enable DRAGO to address changing dynamics as well - e.g., using recurrent networks or a transformer as the model to handle the partial observability, or including a context variable as in some meta-RL works - but this is not the focus of our paper and we leave it for future work.

Comment

We appreciate all the reviewers' careful and valuable comments. In the updated manuscript, we fixed some writing issues and included several new experimental results:

  • Comparison with an additional baseline: Continual TDMPC + curiosity
  • Combining DRAGO with PETS, which is another popular Model-based RL method, and empirically comparing with Continual PETS
  • Ablation study of the effect of planning horizon
  • Ablation study of using one learner (add the intrinsic reward to the total reward) without reviewer
AC Meta-Review

The authors propose an approach for continual model based reinforcement learning, assuming no access to data from past tasks. The review scores were borderline and so after discussing with the SAC I conducted a review myself in order to cast the deciding call.

Overall, the paper focuses on an overly contrived problem, mentioning in bold several times that storing data is almost impossible in the "real world". This is certainly not the case, and the authors do not cite any real world studies to back this claim. Indeed, the authors cite works from the past few years (e.g. Lampinen 2021) that use memory from previous tasks, as well as Kessler 2023 which specifically uses replay from previous tasks for a continual model based setting. Neither of these types of approaches are used as baselines. Indeed, I would also expect a simple "upper bound" baseline of how well the method would perform with all data, to get an idea of how much of this is captured with the current approach.

The method itself introduces several ideas, packaged as one method. Some of these are neat, for instance updating the generative model with new data. That being said, the problem setting being overly contrived/constrained + lack of relevant baselines (we cannot simply compare against EWC 8 years after it came out) means the method is seeking a nail for a hammer.

Additional Comments on Reviewer Discussion

The discussion was largely focused on clarification issues, reviewers remain on the borderline.

Final Decision

Reject