PaperHub

Overall rating: 6.7/10 (Spotlight; 3 reviewers; min 6, max 7, std 0.5)
Individual ratings: 6, 7, 7
Confidence: 3.7 · Correctness: 2.7 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
plasticity loss, reinforcement learning, regularization, continual learning, optimization

Reviews and Discussion

Review 1
Rating: 6

The article presents an extensive empirical analysis of plasticity loss in on-policy reinforcement learning (RL), focusing on Proximal Policy Optimization (PPO). The main findings are that plasticity loss is also present in on-policy RL and that “regenerative” methods that regularize network parameters work well in this setting. The empirical evaluation covers the ProcGen and ALE benchmarks. Additional environment distribution shifts were introduced to study the root cause of plasticity loss thoroughly. Eight previously introduced methods for counteracting plasticity loss in supervised learning and off-policy RL are examined, namely L2 Norm, LayerNorm, CReLU activations, Regenerative Regularization, Resetting final layer, Shrink+Perturb, Plasticity Injection, and ReDo. The work also presents and analyzes metrics previously linked to plasticity loss, such as weight magnitude, number of dead neurons, and gradient norm.
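For readers less familiar with these diagnostics, below is a minimal sketch of how such metrics could be computed for a PyTorch network; the dead-unit criterion and function names are illustrative assumptions, not necessarily the authors' implementation.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def plasticity_diagnostics(model: nn.Module, hidden_acts: torch.Tensor):
    """Illustrative versions of the metrics discussed above.

    `hidden_acts` is assumed to be a (batch, units) tensor of post-ReLU
    activations collected from one hidden layer over a batch of inputs.
    """
    params = list(model.parameters())

    # Average L2 norm of the parameters ("weight magnitude").
    weight_magnitude = sum(p.norm() for p in params) / len(params)

    # A unit is counted as "dead" if it never activates on the batch.
    dead_fraction = (hidden_acts.max(dim=0).values <= 0).float().mean()

    # Global gradient norm (assumes a backward pass has already populated .grad).
    grad_norm = torch.norm(
        torch.stack([p.grad.norm() for p in params if p.grad is not None])
    )
    return weight_magnitude.item(), dead_fraction.item(), grad_norm.item()
```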

Strengths

A strong point of the work is the extensive evaluation setup for on-policy RL, and the claims are well supported by experiments. The paper also interestingly indicates that the warm-start problem might be partially mitigated by “regenerative” methods that target weight magnitude growth. This has the downstream effect of reducing the number of dead units in the network, a quantity previously connected to plasticity loss.
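For illustration, one common instantiation of such a regenerative method regularizes the weights toward their values at initialization, which indirectly limits weight-magnitude growth; the sketch below is a hedged example of that idea (coefficient and names are placeholders, not necessarily the paper's exact formulation).

```python
import torch
import torch.nn as nn

def regenerative_penalty(model: nn.Module, init_params: list, coef: float = 1e-3) -> torch.Tensor:
    """Illustrative regenerative regularization: penalize drift from the initial weights."""
    return coef * sum(
        ((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), init_params)
    )

# Typical usage (placeholder names): snapshot the weights once at initialization,
# then add the penalty to the PPO loss at every update.
#   init_params = [p.detach().clone() for p in agent.parameters()]
#   loss = ppo_loss + regenerative_penalty(agent, init_params)
```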

Weaknesses

While the experimental setup is comprehensive, the findings could benefit from deeper analysis. The article does not present new insights into plasticity loss nor propose novel methods to mitigate it, primarily noting that the problem also exists in on-policy RL. A more thorough exploration of the interactions between methods and their impact on plasticity would significantly enhance the work. In particular, it would be beneficial to understand the differences in plasticity loss dynamics between off-policy and on-policy methods.

Additionally, some of the figures are overly cluttered, making them difficult to interpret quickly. To improve clarity, consider summarizing the results from Figures 4 and 5 into a single scalar per method and moving the detailed figures to the appendix. I also suggest using the rliable [1] library for more effective results aggregation.

[1] https://github.com/google-research/rliable
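As a rough illustration of the kind of aggregation I have in mind, a minimal sketch with placeholder method names and random scores (not the paper's data) might look like the following.

```python
import numpy as np
from rliable import library as rly, metrics, plot_utils

# scores[method] has shape (num_runs, num_tasks); entries are normalized returns.
rng = np.random.default_rng(0)
scores = {
    "PPO": rng.random((5, 16)),
    "PPO + LayerNorm": rng.random((5, 16)),
}

# Interquartile mean (IQM) with bootstrapped confidence intervals.
iqm = lambda x: np.array([metrics.aggregate_iqm(x)])
point_est, interval_est = rly.get_interval_estimates(scores, iqm, reps=2000)

plot_utils.plot_interval_estimates(
    point_est,
    interval_est,
    metric_names=["IQM"],
    algorithms=list(scores),
    xlabel="Normalized score",
)
```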

Questions

Regarding deeper insights:

  1. In off-policy methods, it has been shown that the critic network mainly loses plasticity. Can the authors comment on their on-policy experiments through this lens? Specifically, is resetting only the actor's or only the critic's head more or less beneficial? What role does the common backbone play in this problem? If you separate the actor and critic networks, which can itself impact overall performance [1], will the conclusions remain similar?

  2. What happens if we increase the number of epochs? If we combine this with the examined methods, how will it impact the final performance? Does increasing the number of epochs hurt performance due to more and more outdated samples, or is it primarily due to plasticity loss?

[1] Andrychowicz, M., Raichuk, A., Stańczyk, P., Orsini, M., Girgin, S., Marinier, R., ... & Bachem, O. (2020). What matters in on-policy reinforcement learning? A large-scale empirical study. arXiv preprint arXiv:2006.05990.

Limitations

yes

Author Response

Thank you for taking the time to review our paper and provide helpful feedback.

Our intention with this work was to focus on the on-policy setting, given that the majority of work on plasticity loss has focused exclusively on the off-policy setting. We chose not to include off-policy experiments because we believed the literature already contains ample evidence characterizing the problem, along with various solutions. We agree that a direct comparison between the two within a single codebase may have added value, but we chose instead to allocate resources towards extensive experiments specifically in the on-policy setting.

We agree that the information in Figures 4 and 5 could be presented in a more interpretable way. Our goal is to let readers see the full trends for each method, but given the number of methods and task conditions, these are challenging to represent succinctly. Thank you for pointing out the rliable library. We have used it to generate a potential replacement for Figure 4, which you can find here. If it seems like an improvement, we will also replace Figure 5 using the same layout and move the current figures to the appendix.

In our experiments involving resetting the “final layer”, we reset both the critic and policy heads. Preliminary experiments (not reported in the paper) suggested that resetting both was superior to resetting either individually. We believe this differs from the off-policy setting due to the critical role the policy plays in the collection of samples. Whereas off-policy learning has a stable buffer of data to draw on, in the on-policy setting the data distribution used for training is much more sensitive to the policy. Resetting the policy head increases overall policy entropy, ensuring that the training data distribution remains diverse enough to prevent policy collapse. That said, the majority of methods considered here act on the entire network rather than just the final layers, so we believe our results are largely robust to this choice. Relatedly, because we were considering discrete-action policies, we utilized a shared encoder network, as this is the standard setup in the PPO algorithm.
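For concreteness, a minimal sketch of what resetting both output heads of a shared-encoder actor-critic network could look like is given below; the attribute names and initialization scheme are illustrative, not our exact implementation.

```python
import torch.nn as nn

def reset_output_heads(agent: nn.Module) -> None:
    """Re-initialize the policy and value heads while leaving the shared encoder intact.

    Assumes the agent exposes `policy_head` and `value_head` as nn.Linear layers;
    this mirrors the "reset final layer" intervention described above.
    """
    for head in (agent.policy_head, agent.value_head):
        nn.init.orthogonal_(head.weight, gain=0.01)
        nn.init.zeros_(head.bias)
```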

The total number of epochs was chosen such that convergence would always be reached within a single round of the experiment. As such, running experiments longer did not result in measurably different behavior than what has been discussed in this paper, and we did not find policy collapse or degradation within a single round. We believe that this suggests that the effects we observe in our experiments are indeed the result of plasticity loss. We will state this clearly in the camera ready.

Comment

Thank you for responding to my comments and taking into account my suggestions regarding the charts. I believe that the new form of the chart facilitates comparison. Still, I also understand the authors' intentions regarding the original version, so I leave the decision regarding the form of Figure 5 to the authors. If they decide to keep the original version, I suggest making the new chart available to the reader in the appendix. I am sure the article fills the gap regarding plasticity loss in on-policy algorithms in a very reliable way. Although I think it would be very interesting to draw common conclusions for on- and off-policy methods, I understand that this is beyond the scope of this article.

Review 2
Rating: 7

This work studies the loss of plasticity phenomenon in the on-policy continual deep RL setting, where previous work has focused on studying and identifying mitigation strategies for the off-policy RL or supervised learning settings. They conduct experiments over a variety of settings (gridworld, CoinRun, and Montezuma’s revenge) over different variants of environment distribution shift, demonstrating that loss of plasticity still occurs in the on-policy regime. They perform a further analysis of the correlations in both train and test performance with various quantities studied in previous works, across several mitigation strategies previously presented. Based on this, they provide several hypotheses for what properties are needed for successful intervention at addressing plasticity loss as well as ensuring good generalization performance.

Strengths

  • Paper is overall well-written and particularly introduces the loss of plasticity phenomenon/warm start problem well. Empirical investigations of loss of plasticity for continual RL are important for the community.
  • Experiments are comprehensive and presented clearly, including domains of varied complexity. The authors took care in implementing several intervention methods that have arisen previously in the literature, and report correlation results comprehensively.
  • There are new insights about previously successful intervention methods not working in their setting, such as concatenated ReLUs and plasticity injection. From their correlation results, they provide connections between their most successful methods (regularization-based) to the greatest predictor of plasticity loss (weight magnitude, and surprisingly not gradient magnitude or magnitude of weight change).
  • There are interesting next directions exploring mechanisms of plasticity loss and using these metrics as an indicator for what is occurring in the optimization landscape for continual reinforcement learning.

Weaknesses

  • The graphs are difficult to read, given the number of interventions and the colors chosen. It is difficult to see, e.g., how much combining soft shrink + perturb with LN improves upon doing only one or the other.
  • In Appendix D, it might be useful to highlight the significantly improved intervention methods for each environment and shift condition, and some written insight about the table results.

Minor typos:

  • Line 24: arrive -> arrives
  • Line 142: withing -> within

Questions

  • What is the reason for adding LN to soft shrink and perturb, versus the original shrink and perturb? Shrink and perturb seems to be doing better than the soft variant for most of your results, and I would guess that applying the procedure after each step of gradient descent instead of at longer intervals could hurt performance.
  • Do you have any insights why gradient magnitude/weight change magnitude did not have a significant correlation with plasticity or generalization in your setting, in contrast to previous work?
  • Your Montezuma's Revenge experiments only analyze two intervention methods. How should I interpret the reward numbers in Figure 6? Do you have similar quantities (weight magnitude, gradient norm, dead units, etc.) for these methods?

Limitations

Authors acknowledge their correlation plots do not attribute causality. Experiments are comprehensive in testing intervention methods except Montezuma's Revenge, which the authors address.

Author Response

Thank you for your valuable feedback on our paper.

Both you and another reviewer have brought up the readability of our graphs. We are working to improve this for the camera-ready version. Reviewer NjbE suggested utilizing the “rliable” library to generate plots. We took their suggestion and generated a potential new version of Figure 4 using the library, which you can see here. Please let us know if it improves readability, and if so we will replace both Figure 4 and 5 with this new layout and move the current figures to the appendix.

Thank you for pointing out the typographical errors. We have addressed them in our current draft of the manuscript.

The issue with the shrink-perturb method is that it is performed only occasionally, rather than at every step of SGD. This is fine in the original context in which it was introduced, where the problem setting enjoys a clear indicator of when it should be applied. In non-contrived RL settings, however, the environment might not provide this information; an implementation of shrink-perturb would require us either to set the intervention cadence as a hyperparameter or to use additional machinery to detect appropriate points for application. We refer to methods that are applied at every step of SGD (like soft shrink-perturb) as continuous, and we argue that this property is desirable in realistic RL settings.
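As a hedged illustration of this "continuous" variant, the sketch below applies a small shrink and perturbation after every optimizer step; the coefficients and names are placeholders, not our reported hyperparameters.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def soft_shrink_perturb(model: nn.Module, shrink: float = 1e-4, noise_std: float = 1e-5) -> None:
    """Illustrative soft shrink-and-perturb: after every SGD/Adam step, shrink each
    parameter slightly toward zero and add a small amount of Gaussian noise."""
    for p in model.parameters():
        p.mul_(1.0 - shrink)
        p.add_(noise_std * torch.randn_like(p))

# Typical placement inside the PPO update loop (placeholder names):
#   loss.backward(); optimizer.step(); soft_shrink_perturb(agent)
```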

It is worth clarifying our results concerning predictive measures of plasticity loss. We find that weight magnitude and dead unit count, when considered independently, are the two metrics significantly predictive of plasticity loss. Although gradient magnitude is not predictive by itself, it becomes predictive as part of a Generalized Linear Model (GLM) that uses all the measures as predictors. This indicates that gradient magnitude accounts for additional variance in plasticity loss not captured by weight magnitude, and so may contribute to some aspect of the phenomenon. In contrast, dead unit count is no longer predictive in the GLM due to its high correlation with weight magnitude, which is the more fundamental variable (because dead unit count is a function of weight magnitude).
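To illustrate the style of analysis (with synthetic placeholder data, assuming statsmodels, not our actual measurements), a minimal sketch of fitting such a GLM could look like the following; collinear predictors such as dead unit count lose significance when included jointly.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n_runs = 40  # placeholder: one row per training run

# Synthetic stand-ins for the logged diagnostics.
weight_magnitude = rng.normal(size=n_runs)
dead_unit_count = weight_magnitude + 0.3 * rng.normal(size=n_runs)  # strongly correlated with weight magnitude
grad_magnitude = rng.normal(size=n_runs)
plasticity_loss = 0.8 * weight_magnitude + 0.2 * grad_magnitude + 0.1 * rng.normal(size=n_runs)

X = sm.add_constant(np.column_stack([weight_magnitude, dead_unit_count, grad_magnitude]))
glm = sm.GLM(plasticity_loss, X, family=sm.families.Gaussian()).fit()
print(glm.summary())  # jointly fitted coefficients; collinear predictors become unstable
```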

We believe our results deviate from previous findings largely because of the breadth of settings we considered. Rather than studying only a single task, type of distribution shift, and model architecture, which might provide only narrow insight into the plasticity loss phenomenon, we explore several. Of course, our study also deviates from prior work in that we consider only on-policy agents, and the underlying dynamics of plasticity loss in the two settings may differ.

We considered Montezuma's Revenge to be a more "natural" RL task, presented to show a use for what we have learned from the environments studied earlier in the paper, which were chosen specifically to probe the plasticity loss pathology. As such, we unfortunately did not log the same diagnostic information for Montezuma's Revenge as we did for earlier tasks. Fitting these agents takes more time than is afforded by the NeurIPS discussion period, but we will collect this information for the camera-ready copy.

The reward on the y-axis of Figure 6 is the agent's score in the game, reflecting how many objects it has collected, new rooms it has entered, and enemies it has vanquished. Our code for this is only a small modification to a popular PyTorch RND implementation.

Comment

Thank you for responding to my comments and questions. I personally find the new figure easier to parse. The investigation across several tasks and distribution shifts is certainly interesting, and I would be curious to see if there's a connection between different shifts and the changing underlying loss landscape, as the authors have mentioned as future work.

Review 3
Rating: 7

This paper studies the problem of plasticity loss in on-policy deep RL. The study is quite wide as it covers many environments, types of non-stationarities, and solution methods. The first main result is the demonstration of plasticity loss in various settings. The second main result is the analysis of existing methods in the problems. The results show that regularization methods effectively mitigate plasticity loss in this setting.

Strengths

This is the first work to extensively study plasticity loss in on-policy RL. Although some prior work has shown plasticity loss in on-policy RL, this study is much more extensive. The paper is generally well-written. The paper claims to be an extensive study of plasticity loss in on-policy learning, and it supports that claim by evaluating a wide range of methods in a wide range of environments. There are some minor weaknesses, but the overall paper is good and would be a good contribution to the plasticity loss literature. I recommend accepting the paper.

Weaknesses

The statistical significance of the results is unclear. How many runs were performed for all experiments? 5? And what do the shaded regions represent in the figures? All the figure captions should contain details about the number of runs and shaded regions.

Questions

What values of β were used for Adam? Dohare et al. found that equal βs in PPO (particularly when used with ReLUs) significantly improve the performance when evaluated over a long time horizon, as may be the case in your experiments.

Dohare et al. Overcoming policy collapse in Deep RL. EWRL 2023.

Limitations

The authors adequately discuss the limitations.

Author Response

We appreciate you taking the time to read and review our paper.

We used five replicates per experiment unless indicated otherwise, and the shaded regions of the graphs correspond to standard error. We will revise all figure captions to make both of these points explicit.

We used a learning rate of 5e-4 for all experiments, as displayed in Appendix C. We further use the default PyTorch values for Adam's β1 and β2, and we will update the section to reflect this detail.
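For reference, a minimal sketch of the corresponding optimizer construction is below (with a placeholder network; the betas shown are PyTorch's defaults, and equal betas as in Dohare et al. would be passed explicitly).

```python
import torch
import torch.nn as nn

agent = nn.Linear(4, 2)  # placeholder network standing in for the PPO agent

# Default PyTorch Adam betas, with the learning rate reported in Appendix C.
optimizer = torch.optim.Adam(agent.parameters(), lr=5e-4, betas=(0.9, 0.999))

# Equal betas, as studied by Dohare et al., would instead look like e.g.:
# optimizer = torch.optim.Adam(agent.parameters(), lr=5e-4, betas=(0.9, 0.9))
```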

We agree that it is possible that using the “Non-stationary Adam” method from Dohare et al. would improve performance on the tasks considered here. That said, we do not find evidence of policy degradation within rounds (see for example the upper-left panel of Figure 2, in contrast to Figure 1 of Dohare et al.). Note that we chose the number of training epochs to reflect the asymptotic behavior of the policy: longer-running experiments, which were used to select this parameter, did not show evidence of policy collapse when the task/environment distribution is held fixed. Regardless, we will include a discussion of “Non-stationary Adam” in our related work.

Comment

Thank you for your response. It answers all my questions. I'm happy to suggest accepting this paper.

Author Response

To all reviewers:

We thank the reviewers for their time and insights. Two reviewers have suggested clarity improvements to Figures 4 and 5. We have updated Figure 4 using the rliable library suggested by Reviewer NjbE (https://ibb.co/HxJcVmh). If reviewers agree that it is an improvement, we will also update Figure 5 and similar Appendix figures using the same layout.

We respond to individual concerns below.

Final Decision

This paper performs an extensive empirical analysis of plasticity loss in on-policy reinforcement learning methods. All reviewers praised the provided analysis and the contributions of the paper. Everyone agrees the paper should be accepted and I am recommending the paper's acceptance.