The Role of Forgetting in Fine-Tuning Reinforcement Learning Models
Abstract
Reviews and Discussion
This work investigates catastrophic forgetting when fine-tuning pre-trained reinforcement learning (RL) policies on subsequent tasks sequentially in a stationary environment under data distribution shift. It first shows how fine-tuned policies deteriorate in performance on previous tasks. Then, the paper identifies two conditions under which forgetting occurs, namely the state coverage gap and the imperfect cloning gap. Experimentally, the work further shows how existing knowledge retention methods like elastic weight consolidation (EWC) mitigate forgetting during the fine-tuning process.
Strengths
- This is an important research problem for both the understanding of deep RL training and potential practical deployments. We have seen extensive studies on fine-tuning of supervised learning. The same aspect in RL is relatively less studied. As deep RL moves towards large-scale pretraining, understanding the best practices of fine-tuning with downstream tasks is crucial.
- The paper shows strong empirical analysis in understanding the problem, accompanied by extensive experimental results. I find the identification of the two conditions to be informative to researchers of this subfield.
- The paper in general is clearly written with key results elaborately explained.
- Experimental results are comprehensively displayed. I particularly find figure 4 to be intuitive and helpful in visualizing the forgetting phenomenon.
Weaknesses
- The choice of benchmarked knowledge retention algorithms, although somewhat representative of existing methods, does not quite match state-of-the-art approaches. Newer methods like [1], if added, could strengthen the conclusions of the paper.
- It is unclear to me how this setting differs from continual/lifelong RL.
- [Minor] the term ‘realistic RL algorithms’ is confusing
[1] Ben-Iwhiwhu, E., Nath, S., Pilly, P. K., Kolouri, S., & Soltoggio, A. (2022). Lifelong reinforcement learning with modulating masks. arXiv preprint arXiv:2212.11110.
Questions
- How is a non-stationary environment different from data shifts in a stationary environment? Is it not the same underlying data shift problem?
- What if we pretrain ‘CLOSE’ states first instead? Do we see better forward transfer?
- Can the authors provide their views on why pre-trained models (counterintuitively) do not seem to exhibit any signs of positive transfer? Existing methods do seem insufficient for RL to leverage pretraining
- Why is EWC missing in some of the subsequent experiments?
We thank the Reviewer for the kind and insightful comments about our work. During the rebuttal period we performed additional experiments and improved the manuscript. In the general response we list the introduced changes and clarify the differences from the CL setup. We hope this is sufficient; please let us know if you have any more questions or comments.
How is a non-stationary environment different from data shifts in a stationary environment? Is it not the same underlying data shift problem?
At a high level, the problem we study can be framed as a ‘data shift’ problem. The difference from the standard online RL setup is the ‘direction of changes’: the training experiences FAR states only after seeing CLOSE ones. In the pre-training/fine-tuning scenario considered in this work, it might be otherwise, and, as we argue (see the General response), this will become more likely as researchers progress with building foundation models in RL. In this work, we highlight that this reversed order may cause challenges. We hope this answers your question; please let us know if you have any suggestions or further questions. This point is quite challenging to describe in an unambiguous way.
Newer methods like [1], if added, can strengthen the conclusions of the paper. Why is EWC missing in some of the subsequent experiments?
Since our main goal is to show and analyze the forgetting problem in fine-tuning RL models, we decided to use relatively simple methods to avoid multiple overlapping effects. We discovered behavioral cloning to be better and more robust than L2 or EWC, so we decided to use it. For the same reason, we do not test more sophisticated continual learning methods on our benchmark. We believe that incorporating more sophisticated CL methods into the fine-tuning problem is an important next step, which we now mention in the Limitations section.
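For concreteness, a minimal sketch of what such BC regularization can look like is given below. This is an illustrative PyTorch example, not the implementation used in the paper: the network sizes, the coefficient `bc_coef`, and the stubbed RL loss are assumptions made only to keep the snippet self-contained.

```python
# Minimal sketch of BC-regularized fine-tuning (illustrative, not the paper's exact code).
# The RL objective is stubbed out; the point is the extra cross-entropy term that keeps
# the fine-tuned policy close to the pre-trained one on states stored before fine-tuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions, bc_coef = 8, 4, 1.0  # hypothetical sizes and coefficient

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=3e-4)

# Buffer of (state, action) pairs collected with the pre-trained policy on CLOSE states.
bc_states = torch.randn(256, obs_dim)
bc_actions = torch.randint(0, n_actions, (256,))

def rl_loss(batch_states):
    """Placeholder for the on-policy RL loss (e.g., PPO) on the downstream task."""
    logits = policy(batch_states)
    return -logits.log_softmax(-1).mean()  # dummy objective, stands in for the real one

for step in range(100):
    new_states = torch.randn(64, obs_dim)          # states gathered online (dummy here)
    idx = torch.randint(0, len(bc_states), (64,))  # sample a batch from the BC buffer
    bc_logits = policy(bc_states[idx])
    loss = rl_loss(new_states) + bc_coef * F.cross_entropy(bc_logits, bc_actions[idx])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```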
What if we pretrain ‘CLOSE’ states first instead? Do we see better forward transfer?
We thank the Reviewer for suggesting this interesting ablation study. To test it, we ran additional experiments in the robotic domain; see Appendix D, paragraph “Other sequences”. In short, we do not observe forgetting and there is no significant gap between vanilla fine-tuning and knowledge retention methods. This is further evidence that forgetting caused by the state visitation gap is at the heart of the transfer problem.
Can the authors provide their views on why pre-trained models (counterintuitively) do not seem to exhibit any signs of positive transfer? Existing methods do seem insufficient for RL to leverage pretraining
Pre-trained models exhibit some positive transfer, which, however, varies substantially unless forgetting is under control. Our work is one of the first steps towards understanding this issue and building its taxonomy (see also [2] for an interesting benchmark of transfer in CL). Additionally, in the experiment suggested above by the Reviewer, we show that without the state visitation gap we see significant transfer even without any knowledge retention methods. At the same time, we agree with the Reviewer that further research is needed in order to use pre-trained knowledge more efficiently.
[Minor] the term ‘realistic RL algorithms’ is confusing
We agree that the term is not precise and we removed it in the revised version of the paper. Thanks for pointing this out.
[1] Ben-Iwhiwhu, E., Nath, S., Pilly, P. K., Kolouri, S., & Soltoggio, A. (2022). Lifelong reinforcement learning with modulating masks. arXiv preprint arXiv:2212.11110.
[2] Wolczyk et al, Disentangling Transfer in Continual Reinforcement Learning, NeurIPS’22.
Following your suggestion, we have conducted additional experiments incorporating EWC in Montezuma’s Revenge. While we observed that EWC provides some benefit in mitigating forgetting, we found that behavioral cloning remains a more effective and robust approach for addressing the forgetting problem in our specific context. The results and a detailed analysis of these experiments are now included in our revised manuscript (Figures 5, 6 and 7). We believe this comparison adds valuable insight to the discussion of different methods for addressing forgetting in fine-tuning RL models.
Thank you very much for addressing my questions together with the additional experiments and discussion. My rating of this work remains to be positive. It is an informative and detailed contribution to this research area.
This is an experimental paper that studies the forgetting issue in fine-tuning pre-trained models with RL. The paper focuses on two special cases of the problem: the state coverage gap and the imperfect cloning gap. To study the two problems respectively, the paper compares several existing methods in Meta-World, Montezuma's Revenge, and NetHack. Results show that RL with behavior cloning on the pre-training dataset outperforms other methods, maintaining the pre-trained capabilities better during RL.
Strengths
- Forgetting of previously learned skills is a problem worth studying in RL.
- The paper refines this problem into two cases and conducts appropriate experimental evaluation.
Weaknesses
- As an experimental paper studying forgetting, it lacks evaluation of many related methods. The paper only evaluates two kinds of methods: parameter regularization and behavior cloning. However, there exist many other methods addressing the forgetting issue in the literature on continual RL and fine-tuning with RL, such as using offline RL over previous data [1], adding a KL-divergence loss to the pre-trained policy on the online data [2], and many methods for sharing representations and structures [3].
- The experimental results do not provide different insights into the two problems. All the results demonstrate that Finetuning+BC outperforms other methods and that vanilla fine-tuning suffers from forgetting on the FAR states. Beyond that, the results lack further analysis of the two problems and do not reflect the significance of dividing the forgetting problem into the two types.
- The paper has no novel contributions in methods and techniques. It also does not provide insights into how to better address forgetting in future work.
[1] Modular Lifelong Reinforcement Learning via Neural Composition.
[2] Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos.
[3] Towards continual reinforcement learning: A review and perspectives.
Questions
- Can the experimental results provide different insights into the two problems? In addition to these two problems, does the problem of forgetting include other cases?
- Based on the experimental results, are there any insights into improving the existing methods or further addressing the forgetting problem?
We thank the Reviewer for their insightful review and useful comments. We note that in the revised version we provide multiple clarifications and new experiments, listed in the general answer. We would be happy to address more questions and suggestions.
As an experimental paper studying forgetting, it lacks evaluation on many related methods
We thank the Reviewer for this suggestion. Since our main goal is to show and analyze the forgetting problem in fine-tuning RL models, as we discuss in the General response, we decided to use relatively simple tools to mitigate it. We discovered behavioral cloning to be a better-performing and more robust approach than L2 or EWC, so we decided to use it. For the same reason, we do not test more sophisticated continual learning methods on our benchmark, even though we believe they might also perform well. However, we agree with the Reviewer that testing other approaches would be a valuable direction for further studies.
We thank the reviewer for providing valuable references. Authors of [1] use batch RL in order to maintain performance on the previous tasks. We take advantage of the fact that SAC is an off-policy method that has been a basis for multiple offline RL algorithms [4, 5] and we are currently running experiments where we keep the data from the previous tasks in the buffer. We will update the experiments once they are done. At the same time, we would like to point out that this approach would not be viable in the case of NetHack, where we pre-train the model using behavioral cloning and as such we do not have a critic network to clone.
VPT [2] includes a different type of distillation, but the authors do not specify exactly what data they use, e.g., whether the trajectories are sampled from the student or the teacher policy. In order to better understand this problem, we perform further studies of two variations of distillation: (a) where the trajectories are obtained from the student (known as kickstarting in the literature [6]), and (b) where the targets are obtained from the expert dataset that the pre-trained policy aimed to clone, rather than from the pre-trained policy itself. Surprisingly, we find that both of these approaches perform worse than our standard strategy. The full results and analysis are shown in Appendix F, paragraph “Different approaches to distillation”.
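To make the distinction between the two distillation variants concrete, the sketch below contrasts them. It is illustrative only: the toy linear policies, the discrete action space, and the dummy batches are assumptions, not the code used in our experiments.

```python
# Illustrative contrast between the two distillation variants discussed above.
# Variant (a) regularizes towards the frozen teacher on student-visited states;
# variant (b) clones actions from the original expert dataset the teacher was trained on.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, n_actions = 8, 4
student = nn.Linear(obs_dim, n_actions)
teacher = nn.Linear(obs_dim, n_actions)          # stands in for the frozen pre-trained policy
for p in teacher.parameters():
    p.requires_grad_(False)

def kickstarting_loss(student_states):
    # (a) KL(teacher || student) on states sampled by the current (student) policy.
    t_logp = teacher(student_states).log_softmax(-1)
    s_logp = student(student_states).log_softmax(-1)
    return F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")

def expert_bc_loss(expert_states, expert_actions):
    # (b) cross-entropy on the expert dataset that the pre-trained policy aimed to clone.
    return F.cross_entropy(student(expert_states), expert_actions)

student_states = torch.randn(32, obs_dim)        # stand-in for online rollouts
expert_states = torch.randn(32, obs_dim)         # stand-in for the expert dataset
expert_actions = torch.randint(0, n_actions, (32,))
print(kickstarting_loss(student_states).item(), expert_bc_loss(expert_states, expert_actions).item())
```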
Finally, although we believe methods relying on sharing representations and structures [3] could bring substantial improvements to the studied problem, to the best of our knowledge most approaches in this family require clear task separation. In the RL fine-tuning setting it might not be trivial to partition the state space into FAR and CLOSE states, and as such applying these methods would be difficult. However, we believe finding efficient ways to apply such approaches is an important future research direction. We agree with the Reviewer that incorporating a wider variety of CL methods into the fine-tuning problem would be useful, and we add this point to the Limitations section in the revised manuscript.
Can the experimental results provide different insights into the two problems?
We use the two problems to highlight natural circumstances in which forgetting of pre-trained capabilities occurs. That is, when the state spaces of the upstream and downstream task overlap but are not identical [7, 8, 9], and when the model is pretrained on offline data and finetuned online [10, 11, 12].
There are subtle differences between these two settings. In particular, with the imperfect cloning gap, it is far more difficult to assess in which states we should minimize forgetting. As such, one should be careful about where to apply distillation techniques, as we show through new experiments in Appendix F.
In addition to these two problems, does the problem of forgetting include other cases?
Another natural case is a related set of tasks that share the same state space but differ in their reward functions.
Based on the experimental results, are there any insights in improving the existing methods or further addressing the forgetting problem?
An important conclusion (see Appendix F) is that it is probably beneficial to use regularization techniques whose ‘strength’ depends on the state and the phase of training. Yes, in the future work section we provide multiple open research directions.
[4] Conservative q-learning for offline reinforcement learning
[5] Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble
[6] Kickstarting Deep Reinforcement Learning
[7] Probing Transfer in Deep Reinforcement Learning without Task Engineering
[8] Progressive neural networks.
[9] Actor-mimic: Deep multitask and transfer reinforcement learning
[10] Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
[11] Accelerating Online Reinforcement Learning with Offline Datasets
[12] Adaptive Policy Learning for Offline-to-Online Reinforcement Learning
Following your suggestion, we implemented and evaluated an additional method on Continual World: Episodic Memory (EM), in which we populate SAC’s replay buffer with trajectories from the pre-training tasks. We observe that EM works slightly better than EWC, but worse than behavioral cloning.
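For clarity, a minimal sketch of the EM idea is shown below. The 50/50 mixing ratio, the transition layout, and the buffer sizes are illustrative assumptions rather than the exact setup used in our experiments.

```python
# Minimal sketch of episodic memory: keep a frozen set of pre-training transitions and
# mix them into every batch sampled for the SAC update, alongside online transitions.
import random
from collections import deque

def make_transition(i):
    # (state, action, reward, next_state, done) with dummy contents
    return ([float(i)] * 4, 0, 0.0, [float(i + 1)] * 4, False)

episodic_memory = [make_transition(i) for i in range(1000)]   # frozen pre-training data
online_buffer = deque(maxlen=100_000)                         # SAC's usual replay buffer

def sample_batch(batch_size=256, memory_fraction=0.5):
    """Mix online transitions with episodic-memory transitions in every update."""
    n_mem = int(batch_size * memory_fraction)
    batch = random.sample(episodic_memory, n_mem)
    if online_buffer:
        batch += random.choices(list(online_buffer), k=batch_size - n_mem)
    return batch

# During fine-tuning, newly collected transitions go only into the online buffer:
for i in range(500):
    online_buffer.append(make_transition(i))
batch = sample_batch()
print(len(batch))  # 256, half of it drawn from the pre-training tasks
```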
Additionally, we ran experiments with EWC on Montezuma’s Revenge. The results show that EWC provides improvements, but is not as good as behavioral cloning. Details of the experiments are presented in Figure 5.
We believe that, overall, the new experiments and analyses considerably strengthen our paper. We thus gently ask you to consider raising the score or pointing out any remaining deficiencies.
This paper examines fine-tuning of pretrained RL agents in a single environment. Two problematic mechanisms are identified. A state coverage gap occurs when the agent is pretrained on a part of the state space but, in the fine-tuning phase, has to first learn a policy on a different part. The policy on the first part of the state space is then lost during fine-tuning and must be relearned. The second, the imperfect cloning gap, occurs when the agent is pretrained through imitation learning. As the policy is fine-tuned, the performance on states later in trajectories also degrades. Behaviour cloning on states from the first task and other forgetting mitigation techniques are shown to resolve these issues. A variety of environments are considered, including toy tasks, Meta-World and NetHack, to demonstrate the problem and the utility of the solutions.
Strengths
- There are extensive experiments on a variety of environments. The sequence of Meta-World tasks was an interesting custom addition.
- The identified problem could be relevant in a variety of practical settings. The imperfect cloning gap seems particularly applicable, since we may often want to start with imitation learning from previous policies if possible.
- There was sufficient detail in the text to understand the experiments, and the figures were clear in general.
Weaknesses
- The clarity of certain sections could be improved with more details. For example, for the initial toy example, it would help if the main text explained the motivation behind the MDP's design a little more: why that particular choice of transitions and rewards was made. Also, is not described in the main text and it looks like subfigures b) and c) are interchanged for this example.
- The main proposed solution, behaviour cloning from the pretrained policy, seems somewhat limited. Behaviour cloning is inherently limited by the quality of the pretrained policy. The experiments show that it is possible to retain the pretrained policy's performance but not exceed it.
- While novelty is difficult to judge, it seems like the identified problematic phenomena are facets of catastrophic forgetting, i.e. the idea that neural networks will forget on certain parts of the input space after being trained on others. In this view, it is not too surprising that pretraining on one part of the state space will lead to a deterioration of performance in another. I can appreciate that there is value in demonstrating this in an RL setting though.
Questions
- Are the benefits of pretraining purely from learning a good policy? Are there benefits due to the representations learned in the pretraining phase?
- Have you experimented with the agent only learning a decent, but not great, policy on the FAR tasks? Could we expect to surpass the performance of the pretrained policy? Using behaviour cloning would seem to be limited by the pretrained policy.
- Have you considered off-policy methods? It seems like there may be an advantage, since these methods could simply keep around samples (or trajectories) from previous tasks in the replay buffer to learn from, without having to necessarily imitate the previous behaviours.
- In the robotic sequence task, have you tried pretraining on the second and third tasks? Are there any learning benefits for the 4th task if you do so?
- In Montezuma's Revenge, how far is the agent able to reach without pretraining? Does it get past room 7 consistently? It would be nice to see the overall learning curves of the agent that has been pretrained vs. the one that has not.
- NetHack levels are generated procedurally. How is the sequence of levels chosen for these experiments? When the agent is reinitialized, is it to a fixed level with the same seed?
- For the NetHack experiment, Fig. 6, how come the learning curve for fine-tuning only matches that of the original agent after 2 billion steps? It looks like, if training continued further, the pretrained agent would even do worse.
- The Sokoban results (Fig. 7) seem to be fairly poor for both agents since they can only fill less than 1.5 pits on average, not close to a solution. Is this to be expected?
Minor points:
- I would consider moving some more of the results from the appendix to the main text since it looks like there's still space remaining.
- In Fig. 4, I would consider changing the text "pre-trained optimal policy" to "pre-trained expert policy" since we don't necessarily have the optimal policy in those environments.
In the robotic sequence task, have you tried pretraining on the second and third tasks? Are there any learning benefits for the 4th task if you do so?
We appreciate the suggestion of this interesting experiment. We conducted such a study and present the results in Appendix D, paragraph “Other sequences”, i.e. we pre-train the model on the second and third goal from the downstream environment. The relative performance of all methods is the same as in our original ordering; however, we observe that EWC almost matches the score of BC. We note that in terms of asymptotic performance on the 4th task, knowledge retention methods perform much better than vanilla fine-tuning, which in turn is better than a model trained from scratch.
In Montezuma's Revenge, how far is the agent able to reach without pretraining? Does it get past room 7 consistently? It would be nice to see the overall learning curves of the agent that has been pretrained vs. the one that has not.
The agent is able to reach FAR states without pre-training, however only in the later stages of training. We include a visualization of the time spent in different rooms across training in Appendix E. It takes the agent trained from scratch more than 3x longer to start entering further rooms compared with the fine-tuned one.
Nethack levels are generated procedurally. How are the sequence of levels chosen for these experiments? When the agent is reinitialized, is it to a fixed level with the same seed?
In terms of training, we run the experiments on the standard NLE environment without any modifications: the levels in each episode are generated randomly. The FAR states in this scenario are not any specific levels but rather relate to challenges in the later parts of the game. In this vein, dungeon level N in Figure 7 denotes the states encountered after passing the first N-1 levels. These are stochastic; in practice, we collect a sample of 200 such states. We thank the Reviewer for this comment and we modified the manuscript to include a further explanation of these issues in Section 4 as well as Appendix B.2.
For the Nethack experiment, Fig.6, how come the learning curve for finetuning only matches that of the original agent after 2 billion steps? It looks like, if training continued further, the pretrained agent would even do worse.
Fine-tuning (solid green line) starts with the performance of the pre-trained agent (dashed blue horizontal line). After 2B steps an agent trained from scratch (solid orange line) matches the performance of the fine-tuned agent around the reward value of 2500. Our preliminary results suggest that the performance of these two plateaus at that level. This is consistent with our main takeaway: knowledge transfer from a pre-trained model is greatly hindered due to forgetting of capabilities on the ‘FAR’ states.
The Sokoban results (fig.7) seem to be fairly poor for both agents since they can only fill less than 1.5 pits on average---not close to a solution. Is this to be expected?
Yes, this is expected. The original Sokoban (i.e., moving boulders) is itself a complex game, not to mention that the NetHack agent also needs to fight monsters. In Appendix F, we added a more detailed explanation of this matter.
In Fig. 4, I would consider changing the text "pre-trained optimal policy" to "pre-trained expert policy" since we don't necessarily have the optimal policy in those environments.
Thank you for the suggestion, we corrected it in the revised version of the manuscript.
[1] Neyshabur, B., Sedghi, H., & Zhang, C. (2020). What is being transferred in transfer learning?. Advances in neural information processing systems, 33, 512-523.
[2] Kornblith, S., Chen, T., Lee, H., & Norouzi, M. (2021). Why do better loss functions lead to less transferable features?. Advances in Neural Information Processing Systems, 34, 28648-28662.
We thank the Reviewer for a meticulous review and the many useful suggestions. We are grateful for them and we are happy to accept more.
As to the novelty, we indeed focus on the RL setting and we consider the value of our work to lay in the conceptualization of the influence of forgetting on transfer. Please see the general answer for more details. We also note that many new experiments have been added in the revised version.
The clarity of certain sections could be improved with more details.
Thank you for bringing up the issue of clarity of the MDP examples. We have made improvements in the revised version. We also enhanced the clarity of the paper in other sections. All changes are highlighted in blue. Please let us know if you have any more suggestions.
Are the benefits of pretraining purely from learning a good policy? Are there benefits due to the representations learned in the pretraining phase?
We thank the Reviewer for these questions; the issue of disentangling the impact of the policy from the impact of the representation is an interesting one. Although there are several studies on the impact of representation learning on transfer in supervised learning [1, 2], the same question in RL remains relatively understudied. We ran new experiments to better understand this problem. Please see Appendix D, paragraph “Impact of representation vs policy on transfer”, where we describe experiments with resetting the last layer of the networks before starting the training. As such, the policy at the beginning is random even on the tasks known from pre-training, but has features relevant to solving the downstream tasks. The results suggest that in fine-tuning the benefits come from transferring both the policy and the representations, and for EWC in particular the representation transfer is more prominent.
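As an illustration, the sketch below shows the kind of head reset used in this ablation. The architecture, layer sizes, and names are placeholders; the only essential step is re-initializing the final layer while keeping the pre-trained trunk.

```python
# Sketch of the "reset the policy head" ablation: keep the pre-trained trunk (features)
# and re-initialize only the final layer, so the initial policy is random while the
# learned representation is still transferred.
import torch.nn as nn

obs_dim, n_actions = 8, 4
trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU())
head = nn.Linear(64, n_actions)
policy = nn.Sequential(trunk, head)

# ... pre-trained weights would be loaded into `policy` here ...

def reset_layer(layer: nn.Linear) -> None:
    """Re-initialize a linear layer in place using PyTorch's default initialization."""
    layer.reset_parameters()

reset_layer(head)  # representation transferred, policy head starts from scratch
```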
Have you experimented with the agent only learning a decent, but not great, policy on the FAR tasks? Could we expect to surpass the performance of the pretrained policy? Using behaviour cloning would seem to be limited by the pretrained policy.
We agree this is a relevant concern. However, in our experimental results, in many cases fine-tuning with knowledge retention outperforms the level of the pre-trained model.
For example, on NetHack, the average return of a pre-trained model is 1000, which is outperformed by the BC-regularized fine-tuning agent by a factor of 3.5x, with the same pattern holding for deeper levels (aka FAR states); see Figure 6. On Montezuma’s Revenge, we observe that in terms of completing rooms present in pre-training, PPO-BC manages to outperform the pre-trained model in some cases, see Appendix D, Figure 19. In the robotic manipulation tasks, fine-tuning does not outperform the pre-trained policies since the latter are close to optimal.
We speculate that outperforming the pre-trained policy is possible because regularization is a ‘soft bias’, which can be controlled with hyperparameters (e.g., the KL coefficient), expert data, or the policy network size. As such, this bias can be overridden during the fine-tuning optimization. Although we found it rather easy to tune, there are also rare negative cases (we detail one of them in Appendix F, paragraph “Different approaches to distillation”). How to efficiently control this 'soft bias' throughout training is an interesting research question, and we included it in the Limitations section.
Have you considered off-policy methods? It seems like there may be an advantage since these methods could simply keep around samples (or trajectories) from previous tasks in the replay buffer to learn from without having to necessarily imitate the previous behaviours.
We agree with the Reviewer that off-policy methods seem well-suited to deal with the forgetting problem. However, on-policy approaches are often better adjusted to settings that require processing a high number of trajectories to facilitate learning: Montezuma's Revenge and NetHack are examples of such environments. As such, we focus on knowledge retention methods that can be used in a wider variety of problems. At the same time, we agree that this is an interesting research direction and we are currently conducting such experiments on the RoboticSequence benchmark with Soft Actor-Critic. We will update the manuscript once they are ready.
We are happy to announce new results.
Following your suggestion, we implemented and evaluated an additional method on Continual World: Episodic Memory (EM), in which we populate SAC’s replay buffer with trajectories from the pre-training tasks. We observe that EM works slightly better than EWC, but worse than behavioral cloning.
Additionally, we ran experiments with EWC on Montezuma’s Revenge. The results show that EWC provides improvements, but is not as good as behavioral cloning. Details of the experiments are presented in Figure 5.
Please let us know whether these experiments and comments address your concerns; if they do, we kindly ask you to consider raising the score.
Additionally, in the latest revision, we include experiments with EWC on NetHack and we ran an experiment to answer the Reviewer’s question about what would happen if we kept training a randomly initialized model (APPO) for a larger number of steps. Indeed, results presented in Appendix F, paragraph “Training from scratch with more steps”, show that a model trained from scratch manages to outperform pure fine-tuning. However, after some time the training stagnates and in the end falls significantly short of fine-tuning with additional knowledge retention (e.g. APPO-BC).
Thank you for the clarifications and additions to the paper. I am willing to revise my score upwards.
Summary:
This paper studies fine-tuning in RL, and specifically the issue of forgetting and potential mitigation strategies. They demonstrate in several settings (simulated robot manipulation, Montezuma's Revenge and NetHack) that if a policy is pretrained on some part of the state space which is far from the initial state distribution, the knowledge is often forgotten and there are little to no improvements over training from scratch. They furthermore investigate different knowledge retention strategies (such as L2 penalties between pretrained and fine-tuned policy weights, possibly weighted by fisher information, as well as simple BC regularization on the pretraining data). They find that BC regularization helps the most, and can help prevent forgetting the behaviors encoded in the pretrained policy.
Strengths
- The paper's main takeaway message, that adding BC regularization helps avoid forgetting previous behaviors during fine-tuning, is well supported by the experiments. This is demonstrated in three environments, including continuous control (Meta-World), a pixel-based Atari game (Montezuma's Revenge), and a procedurally generated, long-horizon game with complex dynamics (NetHack).
- The paper does a nice job with its analysis and visualizations illustrating the forgetting behavior.
Weaknesses
- The main takeaway, which is essentially that co-training on the old tasks prevents forgetting when learning a new task, is pretty unsurprising and has been demonstrated before in previous works in continual learning both for the supervised case and the RL case. It's not clear what the contribution of this work adds.
- An obvious downside of co-training on previous tasks is that the memory requirement increases linearly with the number of tasks and the computation increases quadratically - this is not adequately discussed.
- It would have been nice to include results for L2 and EWC on Montezuma's Revenge and NetHack.
Questions
Some suggestions on the writing:
- In the intro, it would be helpful to give a bit more detail on the "knowledge retention techniques" used to mitigate the forgetting problems. Currently, the reader does not have much idea of the methodological aspects going into the paper.
- Example in the 2-state MDP: the notation here is confusing. Both symbols are used before being defined. Please add the definitions in the main text.
We would like to thank the Reviewer for the kind comments about the paper’s experimental value, and for the interesting suggestions. We address them below. Additionally, we conducted new experiments, please see the general response for a full list.
The main takeaway, which is essentially that co-training on the old tasks prevents forgetting when learning a new task, is pretty unsurprising and has been demonstrated before in previous works in continual learning both for the supervised case and the RL case. It's not clear what the contribution of this work adds.
We respectfully disagree with the Reviewer regarding the value of the main takeaway. As detailed in the general answer, we consider the conceptualization of 'forgetting of pre-trained capabilities' to be non-obvious and important. We study a new perspective. More precisely, we are not aware of any work comprehensively showing that forgetting is detrimental to transfer in RL fine-tuning scenarios. Please let us know if you know any; we would be happy to include and discuss them.
An obvious downside of co-training on previous tasks is that the memory requirement increases linearly with the number of tasks and the computation increases quadratically - this is not adequately discussed.
We thank the Reviewer for raising this important issue. To address this we:
- include additional discussion in the Limitations section,
- provide additional experiments in Appendix D, which indicate that as few as 100 samples are sufficient to observe a significant improvement over fine-tuning. We typically use 10,000 transitions sampled out of 5M steps for training, which amounts to roughly 3.5MB of memory. We argue that, compared to current hardware capabilities, this is negligible.
It would have been nice to include results for the L2 and EWC on Montezuma and NetHack.
In the robotic experiments, we discovered that behavioral cloning performs better and is more robust than L2 or EWC, and we anticipated the same for the other environments. However, we consider designing and benchmarking other methods to be a very important direction for future work, which is indicated in the Limitations section.
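For reference, a minimal sketch of the parameter-regularization baselines (L2 and EWC) mentioned here is given below. The Fisher-information estimate is stubbed with random values and the coefficients are placeholders; this is meant only to illustrate the form of the penalties, not our exact implementation.

```python
# Sketch of the parameter-regularization baselines: an L2 penalty pulls fine-tuned
# weights toward the pre-trained ones; EWC additionally weights each parameter by a
# (here randomly stubbed) Fisher-information estimate.
import torch
import torch.nn as nn

policy = nn.Linear(8, 4)
pretrained = {n: p.detach().clone() for n, p in policy.named_parameters()}
fisher = {n: torch.rand_like(p) for n, p in policy.named_parameters()}  # stub estimate

def l2_penalty(coef=1.0):
    return coef * sum(((p - pretrained[n]) ** 2).sum() for n, p in policy.named_parameters())

def ewc_penalty(coef=1.0):
    return coef * sum((fisher[n] * (p - pretrained[n]) ** 2).sum()
                      for n, p in policy.named_parameters())

# Either penalty would be added to the RL loss during fine-tuning,
# e.g. `loss = rl_loss + ewc_penalty(0.1)`.
print(float(l2_penalty()), float(ewc_penalty()))
```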
In the intro, it would be helpful to give a bit more details on the "knowledge retention techniques". Example in 2-state MDP: the notation here is confusing.
We thank the Reviewer for these suggestions. In the revised version of the manuscript we added a short introduction to knowledge retention techniques and we clarified the notation in the MDP examples. Please let us know if you have any further comments.
We are happy to announce new results.
Following your suggestion, we ran additional experiments with EWC on Montezuma’s Revenge. Our findings indicate that while EWC provides improvements, it still falls short of the performance achieved by behavioral cloning. Detailed results and analyses of these experiments are presented in Figure 5 of the revised manuscript.
Thank you once again for your constructive feedback, which has improved our work. We believe that our paper has improved considerably during the rebuttal period. We thus gently ask you to consider raising the score, or to provide us with further questions.
We would like to thank the Reviewers for their extensive feedback. We are happy to hear that the Reviewers praise the paper's practical importance (#dWac05, #bshN31, #5ZRC30), the scope of the experimental results (#dWac05, #BuFP04, #5ZRC30), as well as our visualizations (#BuFP04, #dWac05, #5ZRC30).
At the same time, the reviewers raised many critical points, interesting insights, and suggestions. We are also grateful for these! As a result, we have introduced numerous changes to the manuscript marked in the revised pdf.
Scope and contribution
We start by clarifying that our contribution is conceptualizing and measuring how transfer from a pre-trained model can be severely limited by forgetting (dubbed forgetting of pre-trained capabilities). This effort has the following features:
- It is non-obvious. A common practice when fine-tuning from a pre-trained model is to train without additional restrictions, possibly motivated by a view that “plasticity” will promote transfer [1, 2, 3]. We show that in the RL context such a view is not necessarily well-founded: there exists a non-trivial trade-off between transfer and forgetting, and, crucially, control of the latter is needed to retain knowledge useful for transfer.
- It is important. The rise of strong foundation models coupled with RL's infamous hunger for computation and data, drives the contemporary, and most likely future, RL research towards training protocols that fine-tune pre-trained models. Consequently, studies, such as ours, that highlight the potential pitfalls of such a procedure and provide ways of addressing them will have an increasingly important impact on the field.
- It is a fundamental research direction within the RL field, separate from continual learning (CL). We concentrate on fine-tuning of pre-trained models in RL and we do not aim to make a contribution to CL, see the next paragraph for more details.
To the best of our knowledge, this comprehensive approach examined in our paper has not been properly explored in the existing literature.
Relation to continual learning (CL). To add to the above remark and make the distinction clear, let us observe that typically in CL one trains the agent on n environments/tasks in sequence: first on task 1, then on task 2, and so on. In our setup, we train the agent on one environment, where each trajectory can comprise all n tasks. Using Montezuma’s Revenge as an example, CL would train the agent on the first room (e.g., for 1M steps), followed by training on the next room (e.g., for 1M steps), and so on. In our setup, we train the agent to solve the entire game, i.e., to solve all the rooms needed to finish the game in one episode. As such, we show that forgetting is harmful even when the downstream task is stationary.
We acknowledge that our presentation has been unclear in the original version of the paper. To address this, we rewrite the Introduction and Related Work sections to explicitly incorporate the points mentioned above. The changes in the revised version are highlighted in blue.
New experiments
We thank the reviewers for their insightful suggestions that led us to conduct the following additional experiments:
- We conduct studies of different distillation methods in NetHack, see Appendix F. We find that even subtle differences can lead to significant performance loss.
- We examine how representations and the policy impact transfer by performing fine-tuning with a reset policy head, see Appendix D.
- We check different pre-training schemes on Montezuma’s Revenge to verify the robustness of our conclusions and we conduct an analysis of room visitation, see Appendix E.
- We perform experiments with different sequences in the robotic manipulation benchmark, see Appendix D.
- We check the impact of the memory size on the results of BC, see Appendix D.
- We check the distribution of Sokoban scores to better understand the behavior of the pre-trained model, see Appendix F.
[1] Improving language understanding by generative pre-training
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
[3] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
We would like to again thank the Reviewers for their comments and suggestions that led to numerous improvements to the paper. Since the last revision, we ran additional experiments and we updated the manuscript to include them.
For clarity, changes introduced in today’s revision (21 Nov) are highlighted in green, while changes introduced in the previous revision (15 Nov) are still highlighted in blue.
We ran the following experiments:
- We verified the performance of the episodic memory approach on Continual World.
- We evaluated EWC on Montezuma’s Revenge
- We evaluated EWC on NetHack
- We ran experiments on NetHack with a higher number of training steps (4B instead of 2B).
As the discussion period draws to an end, we would like to gently ask the Reviewers to let us know if they have any further questions after the rebuttal and to consider raising the scores in light of the introduced improvements. We greatly value your feedback.
The submitted paper investigates two potential pitfalls for fine-tuning RL models, which the authors refer to as "state coverage gap" and "imperfect cloning gap". In both cases, forgetting of the pre-trained capabilities can happen, resulting in inefficient learning. Several solutions for counteracting forgetting coming from continual learning and behavioral cloning are investigated as remedies to forgetting and shown to alleviate the issue.
Strengths: The addressed problem (transfer of pre-trained capabilities in RL) is important and the submitted paper does a good job of characterizing two cases in which naive fine-tuning strategies can fail, even in very simple settings. Furthermore, the paper illustrates that applying techniques from continual learning or BC regularization can effectively enable fine-tuning. The experimentation is in general very thorough.
Weaknesses: The obtained results are not really unexpected (I saw the authors' note saying that they are, but I don't agree and consider their argument to hold only if one directly considers supervised/unsupervised transfer learning and ignores findings from continual learning). The solutions used to mitigate the forgetting of capabilities are all existing techniques.
The paper received mixed reviews (2x reject, 1x borderline accept, 1x accept). Unfortunately, not all reviewers reacted to the authors' responses but I tried to consider whether the rebuttal would have resolved these reviewers' concerns. While the rebuttal certainly did so to some extent (adding new baselines and other new results), I think it didn't overcome concerns regarding the significance of the paper in the light of existing work. Thus, while the paper is a good read, it lacks significance and novelty and I am thus recommending rejection of the paper.
Why not a higher score
While the experimental evaluation in the paper is thorough and the considered problem relevant, the obtained insights are not very deep, mainly to be expected and can be well explained and resolved using existing methodology.
Why not a lower score
N/A
Reject