PaperHub
Overall score: 6.6 / 10
Poster · ICML 2025
4 reviewers, ratings: 3, 5, 3, 3 (min 3, max 5, std 0.9)

Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn

OpenReview · PDF
Submitted: 2025-01-14 · Updated: 2025-07-24
TL;DR

Reducing churn prevents the rank decrease of NTK, thus mitigates plasticity loss and improves continual RL.

Abstract

Keywords
Plasticity, Continual Learning, Reinforcement Learning, Generalization

Reviews and Discussion

Review
Rating: 3

A recent line of research has highlighted a problem where standard deep-learning methods gradually lose plasticity (i.e., the ability to learn new things) in continual learning settings (Lyle et al., 2022; Dohare et al., 2024). This paper examines plasticity loss in deep continual reinforcement learning (RL) through the lens of churn—network output variability on out-of-batch data caused by mini-batch training (Schaul et al., 2022; Tang & Berseth, 2024). The authors identify correlations between plasticity loss and increased churn by studying the rank collapse of the Neural Tangent Kernel (NTK) matrix $N_\theta$, defined as the matrix of gradient dot products between all data points (Lyle et al., 2024): $N_\theta(i, j) = \nabla_\theta f_\theta(x_i)^\top \nabla_\theta f_\theta(x_j)$ for data points $x_i, x_j$. They empirically demonstrate that a rank collapse in the NTK matrix is a symptom of disrupted learning dynamics and hence leads to poor performance. To address this, they propose Continual Churn Approximated Reduction (C-CHAIN), a method to mitigate plasticity loss in settings where tasks switch every $N$ steps. They validate their approach with empirical results on four OpenAI Gym environments and 16 ProcGen tasks.
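For concreteness, the diagnostic described above can be sketched as follows (illustrative only, not the paper's code): build the empirical NTK matrix from per-example gradients and take its approximate rank; the 99% spectral-energy threshold used here is one common convention and may differ from the paper's exact choice.

```python
import torch
import torch.nn as nn

def empirical_ntk(f: nn.Module, xs: torch.Tensor) -> torch.Tensor:
    """N_theta(i, j) = grad_theta f(x_i)^T grad_theta f(x_j), built from per-example gradients."""
    grads = []
    for x in xs:
        out = f(x.unsqueeze(0)).sum()               # scalar output assumed for simplicity
        g = torch.autograd.grad(out, list(f.parameters()))
        grads.append(torch.cat([p.reshape(-1) for p in g]))
    G = torch.stack(grads)                          # shape: |B| x |theta|
    return G @ G.T                                  # shape: |B| x |B|

def approximate_rank(N: torch.Tensor, energy: float = 0.99) -> int:
    """Smallest k such that the top-k singular values capture `energy` of the total mass."""
    s = torch.linalg.svdvals(N)
    cum = torch.cumsum(s, dim=0) / s.sum()
    return int((cum < energy).sum().item()) + 1

# toy usage
net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
batch = torch.randn(32, 8)
print(approximate_rank(empirical_ntk(net, batch)))
```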

References

  1. Lyle, C., Rowland, M., and Dabney, W. Understanding and preventing capacity loss in reinforcement learning. In ICLR, 2022.
  2. Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
  3. Schaul, T., Barreto, A., Quan, J., and Ostrovski, G. The phenomenon of policy churn. arXiv preprint, arXiv:2206.00730, 2022.
  4. Tang, H. and Berseth, G. Improving deep reinforcement learning by reducing the chain effect of value and policy churn. In NeurIPS, 2024.
  5. Lyle, C., Zheng, Z., Khetarpal, K., van Hasselt, H., Pascanu, R., Martens, J., & Dabney, W. (2024). Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762.

Questions for Authors

  1. Why aren’t continual backprop (Dohare et al. 2024) and ReDo (Sokar et al. 2023) included as baselines?
  2. Is this method specific to PPO? Could you include a small experiment with SAC or another algorithm to demonstrate its generality? I believe a small experiment like this would significantly strengthen the case for the method’s generality.
  3. How does the method perform in continuous action spaces? It seems all experiments here use discrete action spaces.
  4. $B_{ref}$ and $B_{train}$ are sampled from the buffer $D$. Does the size of $B_{ref}$ have an impact on C-CHAIN's performance?
  5. PPO is known to suffer from policy collapse (Dohare et al. 2024). Does this also occur in off-policy algorithms like SAC or TD3? Would using a large replay buffer help in this setting? Intuitively, it seems like it would mitigate catastrophic forgetting rather than plasticity loss.
  6. Should the buffer be flushed after each task switch? In other words, does the agent need to be aware of task switches?
  7. Out of curiosity, have the authors considered the streaming RL setting (Elsayed, Vasan & Mahmood 2024)? Is it possible to extend this method to the streaming RL setting?
  9. Is layer norm + L2 regularization effective in maintaining plasticity (Lyle et al. 2024)?
  9. In Fig. 4, the approximate rank appears to be slowly decreasing for C-CHAIN over time. Is this the case? If so, is there an explanation for this behavior?

References

  1. Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
  2. Sokar, G., Agarwal, R., Castro, P. S., and Evci, U. The dormant neuron phenomenon in deep reinforcement learning. In ICML, pp. 32145–32168, 2023.
  3. Elsayed, M., Vasan, G., & Mahmood, A. R. (2024). Streaming Deep Reinforcement Learning Finally Works. arXiv preprint arXiv:2410.14606.
  4. Lyle, C., Zheng, Z., Khetarpal, K., van Hasselt, H., Pascanu, R., Martens, J., & Dabney, W. (2024). Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762.

Claims and Evidence

Here, I present some claims from the paper verbatim and explain why I agree or disagree with them, while also questioning their soundness.

We demonstrate the connection between plasticity loss and increased churn, and show the pathological learning dynamics this connection induces.

Yes, they demonstrate this connection for a specific algorithm—PPO—in a particular continual learning setting where tasks switch every few million steps. They do this by showcasing rank collapse in the neural tangent kernel matrix with prolonged training over a sequence of tasks.

We unbox the efficacy of reducing churn in continual RL by identifying a gradient decorrelation effect and a step-size adjustment effect.

Honestly, I am unsure what "unboxing the efficacy of reducing churn" means in this context. I would appreciate it if the authors could clarify this in plain language. In addition, I’d like further clarification on the step-size adjustment effect. How does it differ from momentum-based optimizers like Adam or RMSprop, which also implicitly affect the step size?

We propose C-CHAIN and demonstrate it effectively mitigates the loss of plasticity and outperforms prior methods in a range of continual RL settings.

This is supported by their experiments. However, I have concerns about the experiments and analyses, which I will discuss later.

We demonstrate that under the continual changes in the data distribution and objective function, the agent gradually loses the rank information of its NTK matrix, leading to highly correlated gradients and eventually the exacerbation of churn.

The authors seem to be making an important point here, but it is not entirely clear to me. What exactly do they mean by highly correlated gradients? How does their method prevent this from occurring? Is it by leveraging the gradient information of the reference batch $B_{ref}$ in their churn regularizer loss?

Methods and Evaluation Criteria

The proposed method C-CHAIN is well-suited for the continual learning setting in consideration, especially one where tasks switch every N timesteps. I'm not particularly convinced by some of the design choices, but I want to assure the authors that I'd keep an open mind and re-evaluate my reviews based on their rebuttal response.

  1. On the use of domain randomization in Gym classic control environments:

For Gym Control, we use four environments: CartPole-v1, Acrobot-v1, LunarLander-v2 and MountainCar-v0. For each environment, a task sequence $T$ is built by chaining $k$ instances of the environment with a unique Gaussian observation noise $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$ sampled once for each.

Isn't this domain randomization (Tobin et al. 2017), but with the parameters changed every few million timesteps? Is this only to induce non-stationarity in the environment? Adding noise arbitrarily to the observations seems a little contrived (see the illustrative sketch after this list). Have you considered more natural tasks, such as varying factors like gravity, surface slipperiness, or even the terrain in MuJoCo? For inspiration, you can refer to robotics works such as Kumar et al. (2021).

  2. In several environments, including Fruitbot, Jumper, and Plunder, the standard errors overlap significantly with those of the second-best reported method. Is this due to the use of only 6 seeds? Are the authors confident that the method would remain superior with more seeds?

  3. Patterson et al. (2024) suggest the following: "In general, we advocate that you do not report standard errors. They are like a low-confidence confidence interval, and it is more sensible to decide on the confidence interval you want to report." Could the authors clarify why they chose to report standard errors, especially given that only 6 seeds are used? Would it be possible to use a different, more appropriate metric, such as a bootstrap confidence interval?
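Returning to point 1 above: for concreteness, the per-task Gaussian observation offset described in the quoted setup could be implemented roughly as in the sketch below. This is illustrative only; the wrapper name, the use of gymnasium, and the sigma value are assumptions, not the paper's code.

```python
import numpy as np
import gymnasium as gym

class FixedObservationNoise(gym.ObservationWrapper):
    """Adds a Gaussian offset, sampled once per task instance, to every observation."""
    def __init__(self, env: gym.Env, sigma: float, seed: int):
        super().__init__(env)
        rng = np.random.default_rng(seed)
        self.eps = rng.normal(0.0, sigma, size=env.observation_space.shape).astype(np.float32)

    def observation(self, obs):
        return obs + self.eps

# A continual task sequence: k noisy instances of the same base environment.
def make_task_sequence(env_id: str = "CartPole-v1", k: int = 8, sigma: float = 0.3):
    return [FixedObservationNoise(gym.make(env_id), sigma, seed=i) for i in range(k)]
```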

References

  1. Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017, September). Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 23-30). IEEE.
  2. Kumar, A., Fu, Z., Pathak, D., & Malik, J. (2021). RMA: Rapid Motor Adaptation for Legged Robots. Robotics: Science and Systems XVII.
  3. Patterson, A., Neumann, S., White, M., & White, A. (2024). Empirical design in reinforcement learning. Journal of Machine Learning Research, 25(318), 1-63.

Theoretical Claims

N/A

Experimental Design and Analysis

  1. Question on C-CHAIN for Continual Supervised Learning

Besides, Permuted-MNIST and RandomLabel-MNIST are simple, where the agent can find a good solution near the initialization.

  • This is a strange argument—it's as if the authors are saying, "Our method is too strong, which is why it failed..."
  • Continual Mountain Car seems even simpler than Permuted-MNIST to me. Could the authors propose an alternative plausible explanation for why their method may not be performing well in this case?
  2. In Fig. 3, for Mountain Car, why does C-CHAIN (in blue) exhibit a flat curve in the first phase? What changes in the later stages to allow successful learning?

  3. It's unclear to me why some methods were not included in the experimental comparison, especially when they are cited as references. For example, Dohare et al. (2024) proposed continual backprop, and Lyle et al. (2024) suggest that using layer normalization and L2 regularization is effective for maintaining plasticity. Why were these methods not considered? This is a significant concern for me. I would be willing to re-evaluate my score if the authors provide additional empirical comparisons with these two methods.

References

  1. Dohare, S., Hernandez-Garcia, J. F., Lan, Q., Rahman, P., Mahmood, A. R., and Sutton, R. S. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, 2024.
  2. Lyle, C., Zheng, Z., Khetarpal, K., van Hasselt, H., Pascanu, R., Martens, J., & Dabney, W. (2024). Disentangling the causes of plasticity loss in neural networks. arXiv preprint arXiv:2402.18762.

Supplementary Material

I looked at the learning curves in Appendix C and the list of hyper-parameters in Appendix A.

Relation to Existing Literature

The paper examines the loss of plasticity in deep neural networks through the lens of churn. While loss of plasticity has been demonstrated in several works, including Lyle et al. (2022), Sokar et al. (2023), and Dohare et al. (2024), its cause is not well understood. The authors make a novel contribution by studying this phenomenon from the perspective of churn and provide interesting insights.

Essential References Not Discussed

The authors overlook some relevant references on step-size adaptation. I recommend considering the following citations.

Step-size adaptation

  • Sutton, R. S. (1992, July). Adapting bias by gradient descent: An incremental version of delta-bar-delta. In AAAI (Vol. 92, pp. 171-176).
  • Dabney, W., & Barto, A. (2012). Adaptive step-size for online temporal difference learning. AAAI Conference on Artificial Intelligence (pp. 872-878).
  • Martens, J., & Grosse, R. (2015). Optimizing neural networks with kronecker-factored approximate curvature. International Conference on Machine Learning (pp. 2408-2417).
  • Elsayed, M., Vasan, G., & Mahmood, A. R. (2024). Streaming Deep Reinforcement Learning Finally Works. arXiv preprint arXiv:2410.14606.

Other Strengths and Weaknesses

Strengths

  • The introduction was well-written, making it easy to understand the problem, the solution, and the authors' contributions.
  • The visualizations and analysis using the NTK are interesting and insightful.

Weaknesses

  • I’ve listed the weaknesses in the previous sections. I’m hopeful the authors will address at least some of these in the rebuttal.
  • The figures could be clearer. For example, in Fig. 4, consider placing the legend outside the figure, as its current position occludes large portions of the figure.
  • C-CHAIN is tested only with PPO. It's unclear whether it is necessary or beneficial for other algorithms.

Other Comments or Suggestions

Reinforcement learning (RL), when coupled with non-linear function approximators, suffers from optimization challenges due to the non-stationarity of the data and the learning objectives, i.e. the deadly triad

  • Why mention the deadly triad here? What is off-policy about your learning problem? I believe the introductory line doesn’t need to reference the deadly triad. Perhaps the authors mention it implicitly to acknowledge PPO’s clip surrogate loss?

I like the motivation and approach of the paper and believe it could be made much stronger than it is in its current form. If the authors provide additional experimental evidence and address my concerns, I would be happy to increase my score.

Ethics Review Concerns

N/A

Author Response

On other possible baseline methods, e.g., L2 regularization, LayerNorm (Lyle et al., 2024), continual backprop (Dohare et al., 2024), and ReDo (Sokar et al., 2023)

In addition to the existing empirical evidence for LayerNorm, L2 regularization, and ReDo, we implemented and ran LayerNorm, ReDo, and AdamRel in our continual ProcGen experiments.

[Existing Evidence] The comparison between LayerNorm and TRAC (i.e., our major baseline and codebase) in continual Gym control tasks can be found in Appendix D, Figure 15 of the TRAC paper (Muppidi et al., NeurIPS 2024). Their results show TRAC outperforms LayerNorm (i.e., LayerNorm Adam) in their Figure 15. For L2 regularization, Lyle et al. (2024) mention "We found that batch normalization and L2 regularization interfere with learning" in their Section 4.2, RL evaluation. In our work, we ran L2 regularization and found it performed very similarly to (slightly worse than) L2 Init (i.e., Regenerative Regularization).

In another paper (Juliani and Ash, NeurIPS 2024), Figure 5 compares ReDo, LayerNorm, and other related methods with L2 Init (i.e., Regen Reg in the figure) and L2 norm in ProcGen CoinRun. The results show that L2 Init performs comparably with ReDo and LayerNorm and slightly outperforms L2 norm (which is consistent with our findings as mentioned above).

Continual backpropagation is similar to ReDo but uses a different recycling metric, and in Appendix C.4, Figure 25 of the ReDo paper, the two achieve similar results. Therefore, we only consider ReDo here.

[Additional Experiments] Please refer to the response to Reviewer cU5H and Reviewer mKWp.

Questions on the buffer

[On buffer flush and awareness of task switch]

In our method, i.e., C-CHAIN PPO, both $B_{ref}$ and $B_{train}$ are sampled from the online interaction data collected by the policy in the current iteration (e.g., every 2048 interactions). Thus, it does not matter for C-CHAIN PPO whether the buffer is flushed after each task switch, and the agent does not need to be aware of task switches.

[The impact of the size of $B_{ref}$]

We ran experiments with different batch sizes for $B_{ref}$ on continual Gym control tasks. We found that using a 2x, 4x, or 8x batch size for $B_{ref}$ sometimes improved learning performance, but not consistently. To some degree, increasing the batch size of $B_{ref}$ acts similarly to increasing the regularization coefficient, as both reduce more churn. Thus, to keep the hyperparameter burden low, we did not search for the best batch size for $B_{ref}$.
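For intuition, a churn-reduction regularizer in this spirit can be sketched as regularizing the policy's outputs on the independently sampled reference batch toward the outputs of a recent parameter snapshot, and adding the penalty to the PPO loss computed on the training batch. This is a hedged sketch only; the snapshotting scheme, MSE distance, and coefficient name are assumptions, not the exact C-CHAIN implementation.

```python
import torch
import torch.nn.functional as F

def churn_penalty(policy: torch.nn.Module, policy_prev: torch.nn.Module,
                  ref_obs: torch.Tensor) -> torch.Tensor:
    """Penalize how much the current policy's outputs on B_ref have drifted from
    the outputs of a recent parameter snapshot (a simple proxy for churn)."""
    with torch.no_grad():
        target = policy_prev(ref_obs)          # outputs before the recent update(s)
    return F.mse_loss(policy(ref_obs), target)

# Sketch of usage inside a PPO epoch (names here are illustrative):
#   policy_prev = copy.deepcopy(policy)        # snapshot before this round of updates
#   for b_train, b_ref in minibatches:
#       loss = ppo_loss(policy, b_train) + lam_reg * churn_penalty(policy, policy_prev, b_ref["obs"])
#       optimizer.zero_grad(); loss.backward(); optimizer.step()
```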

Questions on concrete experimental results in continual RL

[Explanation for the slow decrease of the approximate rank for C-CHAIN in Fig. 4]

For C-CHAIN, the approximate rank decreases very slowly, from around 85 to 80 over 10M steps (in contrast, the approximate rank decreases from about 75 to 30 for vanilla PPO). We think this is reasonable, as the agent continually consumes plasticity to learn new tasks and our method is not perfect.

[The performance in MountainCar]

We think MountainCar is a bit more difficult than the other three Gym control tasks, as it needs more exploration, and TRAC fails almost completely on it (note that MountainCar is not included in the TRAC paper). One possible reason for C-CHAIN's low early-stage performance might also be the exploration requirement of MountainCar: reducing churn can also be viewed as preventing exploratory behavior learned by the policy in some states from generalizing to similar states. Therefore, C-CHAIN learns slowly but improves steadily, while vanilla PPO learns more quickly but collapses.

[The performance on Fruitbot, Jumper, and Plunder]

According to the official ProcGen documentation, tasks like Leaper and Jumper are collection-style tasks with sparse rewards; Leaper in particular is an episodic-reward environment.

To gain more understanding, we provide the average scores of the max/mean/min curves for Leaper, Fruitbot, Plunder, and Jumper below. We can observe that C-CHAIN improves the average Max scores and the average Mean scores over PPO Vanilla (by improving Max, or by reducing failures when Max is similar), while it does not fully address the (near-)zero performance caused by the limited exploration ability of the PPO base agent.

| Method (Max / Mean / Min) | Leaper | Fruitbot | Plunder |
| --- | --- | --- | --- |
| PPO (Oracle) | 2.020 / 0.350 / 0.000 | 6.398 / 1.613 / -0.079 | 11.858 / 7.113 / 2.731 |
| PPO (Vanilla) | 2.006 / 0.347 / 0.000 | 5.882 / 1.549 / -0.068 | 8.304 / 4.801 / 1.944 |
| C-CHAIN | 4.004 / 0.668 / 0.000 | 6.153 / 1.689 / -0.043 | 16.916 / 10.340 / 4.949 |
| TRAC | 2.002 / 0.334 / 0.000 | 3.624 / 0.855 / -0.034 | 17.104 / 9.719 / 2.724 |

Other remaining questions

Due to the space limitation, we will provide them during the discussion stage.

Reviewer Comment

I thank the authors for their detailed response, which adequately addressed my questions and concerns. I also appreciate the additional results provided. Accordingly, I have increased my score.

Author Comment

We sincerely appreciate the reviewer’s very careful and constructive comments. We are glad that our responses addressed the reviewer’s concerns.


More experimental results

Aside from three additional baselines and the Reliable Metrics, we provided more results as suggested by Reviewer mKWp:

  • We added three continual learning settings in DMC, (1) Walker-stand -> Walker-walk -> Walker-run, (2) Quadruped-walk -> Quadruped-run -> Quadruped-walk, (3) Dog-stand -> Dog-walk -> Dog-run -> Dog-trot. The results show that C-CHAIN PPO outperforms PPO across the three continual DMC settings (12 seeds).
  • We added a continual learning setting in MinAtar: Space_Invaders -> Asterix -> Seaquest. The results show that C-CHAIN DoubleDQN outperforms DoubleDQN (12 seeds).

For concrete scores, please refer to our response to Reviewer mKWp or cU5H.


As mentioned, we were not able to post all the responses due to the space limit. We asked the AC to relay our remaining responses. In case the reviewer has not seen them, we provide them here:

Questions on the expression of claims

[The meaning of “highly correlated gradients” and how C-CHAIN prevents this from occurring] “Highly correlated gradients” refers to the case where, for two input data points $x_i, x_j$, the NTK between them, $\nabla_{\theta} f_{\theta}(x_i)^{\top} \nabla_{\theta} f_{\theta}(x_j)$ (which reflects both the norms of the gradients and their cosine similarity), has a high absolute value.

The gradient correlation is suppressed by reducing the churn on the reference batch caused by the gradient update on the training batch, which has the effect of suppressing the off-diagonal entries of the NTK matrix, as described by Equation 11.
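For illustration, a standard first-order expansion (consistent with the definitions quoted earlier; the paper's Equations 4, 5, and 11 may differ in details) shows how churn on a reference point is driven by the off-diagonal NTK entries:

```latex
\theta' = \theta - \eta \sum_{x \in B_{train}} \nabla_\theta f_\theta(x)\,\frac{\partial \ell}{\partial f}(x),
\qquad
f_{\theta'}(\bar{x}) - f_\theta(\bar{x})
\;\approx\; \nabla_\theta f_\theta(\bar{x})^\top (\theta' - \theta)
\;=\; -\eta \sum_{x \in B_{train}} N_\theta(\bar{x}, x)\,\frac{\partial \ell}{\partial f}(x),
\quad \bar{x} \in B_{ref}.
```

Hence a penalty that keeps $f_{\theta'}(\bar{x}) - f_\theta(\bar{x})$ small implicitly discourages large $N_\theta(\bar{x}, x)$ entries, i.e., highly correlated gradients across the two batches.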

[On the "unboxing the efficacy of reducing churn"] We meant to express that we decomposed the effect of C-CHAIN into two parts as presented in Section 4.3. We will rephrase this sentence and use a plain word.

[Further clarification on the step-size adjustment effect] Compared with momentum-based optimizers, the step-size adjustment effect we present in this work differs in two respects:

  • Momentum-based optimizers use the first- and second-order moments of the historical gradients (which are temporally correlated) to adjust the step size, whereas the step-size adjustment effect of reducing churn uses the projected gradient of an independently sampled reference batch (i.e., off-training-batch information).
  • Adam and RMSProp also change the gradient direction, whereas the step-size adjustment effect of reducing churn alone only changes the scale of the gradient, since it shares the same (or the reverse) direction; the direction is changed by the first effect of reducing churn, i.e., term 1 in Equation 10.

[On the “deadly triad”] We agree with the reviewer’s comment. Our aim was to introduce non-stationarity (a problem feature in both RL and continual learning). We will rephrase the sentence and remove the reference to the deadly triad.

Question on C-CHAIN for Continual Supervised Learning

We provided two possible explanations in Sec. 5.3.

First, a significant difference to note is that RL suffers from the chain of churn, formally characterized in (Tang and Berseth, 2024), which stems from the iterative nature of policy improvement and policy evaluation (i.e., GPI), while SL does not, as it learns from static labels. This means that the exacerbation of churn on the policy side and on the value side can further (negatively) affect each other. This explains to some extent why C-CHAIN is not as effective for CSL.

Second, there is a connection between the performance of L2 Init, WC and C-CHAIN from our CRL experiments to CSL experiments:

  • L2 Init and WC perform better in continual Gym control but perform relatively worse in ProcGen. Both L2 Init and WC prevent the agent from going too far away from its initialization. Thus, we assume that this is a possible explanation for their performance in Gym control and ProcGen.
  • Similarly, the good performance of L2 Init and WC could indicate that the agent finds a good solution near initialization.

On domain randomization

We adopted the continual Gym control setting from TRAC paper (Muppidi et al., 2024). We agree this can be viewed as domain randomization in general.

One thing to note is that a major source of non-stationarity in continual learning is the change in the input data distribution. As a representative experimental setting, Permuted-MNIST (Goodfellow et al., 2013; Dohare et al., 2024) applies fixed random permutations to input pixels. The continual Gym setting used in the TRAC paper is built on the same logic of constructing input-distribution non-stationarity.


We believe these additional results and responses can further strengthen our work. Since this is our last opportunity to respond, we sincerely hope the reviewer could re-evaluate our work accordingly.

Review
Rating: 5

The paper investigates the loss of plasticity issue in the continual deep reinforcement learning setting from the lens of churn ("undesirable generalization", empirical excess out-of-training-batch variation). They claim that

  1. loss of plasticity and high churn are connected: the decrease of the NTK matrix rank indicates high churn and had previously been shown to correlate with plasticity loss.
  2. addressing churn has a positive effect on the decorrelation of the NTK, and serves as an implicit adaptive step size
  3. their proposed C-CHAIN effectively addresses churn in the continual RL setting, shows the benefits of addressing churn on the NTK empirically, and validates the connection between churn and plasticity.

Update after rebuttal

I have read the other reviews and the comments from the authors, and I maintain the score of my review. The authors have agreed to discuss the two additional papers I cited in the references section and to make the presentation changes to the figure illustrating the connection between the NTK and churn, and they have added new baselines and experiments.

Questions for Authors

No additional questions.

Claims and Evidence

All claims are backed by enough theoretical or empirical justification.

  1. Connection between churn and plasticity through the NTK: The connection between churn and the NTK is clearly presented and backed by a theoretical derivation. It is also shown empirically in the experiments. The connection between the NTK and plasticity relies heavily on the results from Lyle et al. (2024), which has not yet been accepted at a peer-reviewed venue; this can be a limitation. However, the paper provides evidence of low-rank NTK matrices in the presence of plasticity loss, so the connection does hold independently of the motivation from Lyle et al. (2024).

  2. The effects of reducing churn on the NTK and the optimization: This is well justified theoretically in section 4.

  3. The proposed C-CHAIN algorithm: The algorithm is competitive with recently proposed methods and empirically shows the benefits of using churn on the NTK.

Methods and Evaluation Criteria

Overall, the choice of environments and baselines is good.

The ProcGen environments have been used in several previous papers on plasticity and are a valid benchmark for this paper. To my knowledge, the Gym environments have not been used in work on plasticity before, but the non-stationary version introduced by the paper seems to provide a good testbed to validate their claims.

The use of mean performance is a valid evaluation metric for this work.

The baselines considered give a good understanding of the performance of the proposed C-CHAIN.

Theoretical Claims

I checked all the derivations in the main paper.

In Eq. 5, there does not seem to be a term corresponding to the data distribution (there was an expectation in Eq. 4). I think it can easily be folded into $S_x$. The authors could clarify this or state that, for illustrative purposes, they assume a uniform distribution.

Experimental Design and Analysis

There is enough statistical significance, and the hyperparameter optimization seems reasonable.

The analysis of the results does not make claims beyond what the results show, and the provided analysis backs up the claims made.

Showing the limitations of C-CHAIN in the supervised learning setting is appreciated.

Supplementary Material

I did not check the supplementary material.

Relation to Existing Literature

Plasticity is an important problem in continual learning. This work proposes a solution that targets the dynamics of the NTK matrix. The paper discusses other work addressing plasticity loss with other methods; however, addressing the NTK explicitly has broader connections to optimizing neural networks, and these broader connections are not discussed.

Essential References Not Discussed

The essential literature has been discussed.

Nevertheless, the paper would benefit the community more by referencing/discussing the following works to give a broader presentation of the methods addressing plasticity loss in PPO, with which the main results of the paper have been presented. I believe [1] has been discussed in the paper, although [2] and [3], which came earlier, have not.

  1. Ellis, Benjamin, et al. "Adam on Local Time: Addressing Nonstationarity in RL with Relative Adam Timesteps."
  2. Moalla, Skander, et al. "No Representation, No Trust: Connecting Representation, Collapse, and Trust Issues in PPO."
  3. Juliani, Arthur, and Jordan Ash. "A Study of Plasticity Loss in On-Policy Deep Reinforcement Learning."

All in Advances in Neural Information Processing Systems 37 (2024): 113884-113910.

Other Strengths and Weaknesses

Everything is mentioned in the previous sections.

Other Comments or Suggestions

Figure 1 was critical for me to understand the equations relating the NTK to churn, since one is a symmetric matrix while the other contrasts two different batches. I was confused by Eq. 5 until I saw the figure. I think the paper would be much easier to grasp if the figure were introduced together with Eq. 5.

Author Response

We sincerely appreciate the reviewer’s valuable comments and recognition. Our response aims to add more discussion and clarification on the points mentioned in the inspiring comments.

In addition, we provide experimental results for AdamRel [1], ReDo, and LayerNorm, along with Reliable Metrics (Agarwal et al., NeurIPS 2021). We also provide two additional continual RL experimental setups in DMC.

On the data distribution regarding the expectation in Eq. 4 and $S_x$

We thank the reviewer for pointing this out. Yes, $S_x$ denotes the random variable of sampling $B_{train}$, which should be defined along with a sampling distribution. For illustrative purposes, using a uniform distribution is sufficient. In future work, we plan to incorporate the concrete form of the sampling distribution (intuitively, it should depend on factors such as on-policy vs. off-policy learning, the exploration policy, and the experience replay strategy) into the analysis. We will clarify this point as suggested.

On the mentioned related [1], [2], [3]

[Additional Discussions]

Regarding [2] and [3]: the Proximal Feature Optimization (PFO) regularization proposed in [2] is closely related to DR3 (Kumar et al., ICLR 2022) and CHAIN (Tang and Berseth, NeurIPS 2024). This is because the feature difference can be viewed as the gradient difference when the network is treated as a linear approximation, i.e., $\pi(s) = \phi(s) W$ as in DR3; from another view, regularizing the feature difference should share overlapping effects with regularizing the network-output difference as in CHAIN.

For [3], we found that the experimental results for ProcGen CoinRun in Figure 5 of their paper provide a useful reference for the learning performance of related methods not included in our submission. For example, ReDo and LayerNorm perform comparably with Regen Reg (i.e., the L2 Init baseline adopted in our paper).

[Additional Experimental Results]

We added AdamRel [1], ReDo, and LayerNorm to our experimental comparison on the 16 ProcGen tasks. The aggregate comparison is shown below. We find that ReDo and LayerNorm perform comparably with L2 Init, which is largely consistent with the evidence in existing papers.

| Method | PPO Oracle | PPO Vanilla | TRAC | C-CHAIN | LayerNorm | ReDo | AdamRel | Weight Clipping | L2 Init |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProcGen Agg. | 70.789 | 55.049 | 77.289 | 101.792 | 75.164 | 61.180 | 61.443 | 57.092 | 72.961 |

Moreover, we provide an aggregate evaluation using Reliable Metrics (Agarwal et al., NeurIPS 2021). The evaluation uses Mean, Median, Interquartile Mean (IQM), and Optimality Gap, with 95% confidence intervals and 50,000 bootstrap replications (as recommended by the official Reliable Metrics implementation). It aggregates over 9 methods and 6 seeds for each of 16 ProcGen tasks, i.e., 864 runs of 10M steps each.
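For readers who want to reproduce this kind of aggregation, the open-source rliable library can be used roughly as in the sketch below; the score normalization and dictionary layout are placeholder assumptions, not the exact pipeline used here.

```python
import numpy as np
from rliable import library as rly
from rliable import metrics

# scores[method] has shape (num_runs, num_tasks), e.g. (6 seeds, 16 ProcGen tasks),
# with scores already normalized (e.g. relative to PPO Oracle) -- an assumption here.
scores = {m: np.random.rand(6, 16) for m in ["PPO (Vanilla)", "TRAC", "C-CHAIN"]}

def aggregate(x):
    return np.array([
        metrics.aggregate_median(x),
        metrics.aggregate_iqm(x),
        metrics.aggregate_mean(x),
        metrics.aggregate_optimality_gap(x),
    ])

point_estimates, interval_estimates = rly.get_interval_estimates(
    scores, aggregate, reps=50000)  # stratified bootstrap, 95% CIs by default
print(point_estimates["C-CHAIN"])   # [median, IQM, mean, optimality gap]
```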

The results are shown in the table below (the corresponding plot will be added to the revision). We can observe that C-CHAIN performs best on all four metrics and outperforms the second-best method with no overlap of confidence intervals for Median and IQM, and only a minor overlap for Mean.

| Method | Median | IQM | Mean | Optimality Gap (lower is better) |
| --- | --- | --- | --- | --- |
| PPO (Oracle) | 1.000 (0.960, 1.036) | 0.969 (0.933, 1.010) | 1.000 (0.908, 1.130) | 0.128 (0.098, 0.155) |
| PPO (Vanilla) | 0.835 (0.728, 0.925) | 0.781 (0.722, 0.841) | 0.811 (0.704, 0.949) | 0.300 (0.252, 0.348) |
| TRAC | 0.977 (0.894, 1.130) | 0.999 (0.887, 1.103) | 1.017 (0.891, 1.166) | 0.255 (0.205, 0.305) |
| C-CHAIN | 1.452 (1.287, 1.522) | 1.388 (1.298, 1.472) | 1.461 (1.291, 1.719) | 0.086 (0.058, 0.110) |
| Weight Clipping | 0.889 (0.759, 0.948) | 0.822 (0.752, 0.889) | 0.859 (0.744, 0.996) | 0.275 (0.223, 0.328) |
| L2 Init | 1.029 (0.981, 1.103) | 1.035 (0.987, 1.092) | 1.116 (1.016, 1.237) | 0.098 (0.067, 0.130) |
| LN | 1.057 (0.996, 1.121) | 1.018 (0.952, 1.086) | 1.131 (0.976, 1.311) | 0.146 (0.107, 0.187) |
| ReDO | 0.945 (0.815, 0.974) | 0.862 (0.821, 0.904) | 0.912 (0.810, 1.039) | 0.209 (0.169, 0.249) |
| AdamRel | 0.905 (0.804, 0.957) | 0.860 (0.792, 0.925) | 0.899 (0.781, 1.037) | 0.250 (0.199, 0.301) |

On other advice

[The position of Figure 1] We are glad that Figure 1 helps in understanding the NTK equations. We will move Figure 1 close to Equations 2 and 5, with additional explanation, to smooth the introduction of the NTK equations.

[The broader connection to optimizing neural networks] We appreciate the reviewer’s inspiring comments. We will discuss these broader connections in our revision.

Reviewer Comment

After looking at the other reviews and the comments from the authors, I acknowledge some of the limitations mentioned by the other reviewers, but I maintain the score of my review.

Author Comment

We sincerely appreciate the reviewer’s constructive feedback and positive support!


Aside from the three additional baseline methods and the Reliable Metrics mentioned above, we provide additional experiments as suggested by Reviewer mKWp:

  • We added three continual learning settings in DMC, (1) Walker-stand -> Walker-walk -> Walker-run, (2) Quadruped-walk -> Quadruped-run -> Quadruped-walk, (3) Dog-stand -> Dog-walk -> Dog-run -> Dog-trot. The results show that C-CHAIN PPO outperforms PPO across the three continual DMC settings.
  • We added a continual learning setting in MinAtar: Space_Invaders -> Asterix -> Seaquest. The results show that C-CHAIN DoubleDQN outperforms DoubleDQN (we applied C-CHAIN to the value network learning, with no modification to the hyperparameters of DoubleDQN).

The mean scores and standard errors across 12 seeds are shown below.

| Method / Task | Walker / Stand-Walk-Run | Quadruped / Walk-Run-Walk | Dog / Stand-Walk-Run-Trot |
| --- | --- | --- | --- |
| PPO (Oracle) | 395.971 ± 8.116 | 234.529 ± 19.193 | 137.080 ± 3.133 |
| PPO (Vanilla) | 305.199 ± 18.519 | 250.153 ± 26.385 | 129.744 ± 5.914 |
| C-CHAIN | 472.828 ± 17.865 | 314.510 ± 44.321 | 174.098 ± 9.169 |

| Method / Task | Space_Invaders -> Asterix -> Seaquest |
| --- | --- |
| DoubleDQN (Vanilla) | 22.044 ± 0.733 |
| C-CHAIN | 29.513 ± 0.682 |

We will also include these results in our paper to strengthen our experiments further and provide a potentially useful reference for future study.

Review
Rating: 3

The manuscript investigates the loss of plasticity in continual reinforcement learning (CRL) from the perspective of churn. The authors establish a connection between plasticity loss and churn through the Neural Tangent Kernel (NTK) framework, demonstrating that churn exacerbation correlates with the rank decrease of the NTK matrix. To address this, the authors propose the Continual Churn Approximated Reduction (C-CHAIN) method, which reduces churn during training, mitigating plasticity loss. Empirical results on OpenAI Gym Control and ProcGen benchmarks show that C-CHAIN outperforms baseline methods in most environments. The manuscript also includes theoretical analyses, experimental evaluations, and discussions on the broader implications of churn reduction in continual learning.

Questions for Authors

Can the authors provide a more precise mathematical formulation of the connection between churn, NTK rank, and plasticity loss? How does this connection specifically apply to reinforcement learning?

Are the theoretical insights on churn reduction in CRL applicable to supervised continual learning? If so, can the authors provide evidence or discussion to support this claim?

How does C-CHAIN address the trade-off between plasticity and stability in continual learning? Does it mitigate catastrophic forgetting while improving plasticity?

Can the authors evaluate C-CHAIN on task sequences composed of diverse environments, similar to those used in "Loss of Plasticity in Continual Deep Reinforcement Learning" (2023)?

Claims and Evidence

The manuscript provides empirical evidence supporting its claims, particularly in demonstrating the efficacy of C-CHAIN in mitigating plasticity loss and improving performance in CRL tasks. However, the connection between plasticity loss, churn, and the NTK matrix rank is not established with sufficient mathematical rigor. For example, while Section 4.2 discusses the interplay between these factors in CRL, the arguments lack precise mathematical formulations. Additionally, the manuscript does not clearly delineate whether the theoretical insights are specific to reinforcement learning or applicable to other continual learning paradigms.

Methods and Evaluation Criteria

The proposed C-CHAIN method is well-motivated and aligns with the problem of mitigating plasticity loss in CRL. However, the experiments focus on environments with relatively small task differences, limiting the generalizability of the findings. A broader evaluation across task sequences with greater variability would strengthen the conclusions.

Theoretical Claims

The manuscript provides theoretical insights into the relationship between churn, NTK rank, and plasticity loss. However, the theoretical claims are not fully substantiated. For instance, the discussion in Section 4.2 is somewhat vague, and additional mathematical clarity—such as explicit equations or proofs—would enhance the credibility of the theoretical framework.

Experimental Design and Analysis

The experimental design is generally sound, with a focus on evaluating C-CHAIN on 20 environments in Gym Control and ProcGen benchmarks. The results demonstrate the method's effectiveness in most environments, particularly those with dynamic task difficulties. However, the experiments do not include task sequences composed of diverse environments, which would better illustrate the method's ability to handle significant plasticity loss. The authors should also clarify the rationale behind the choice of baseline methods.

Supplementary Material

The supplementary material is comprehensive, providing implementation details, additional experimental results, and NTK analyses. However, the manuscript does not adequately discuss certain discrepancies observed in the supplementary figures, such as the divergent results in specific environments (e.g. Leaper in Figure 11). Addressing these inconsistencies would strengthen the overall presentation.

Relation to Existing Literature

The manuscript is well-positioned within the broader literature on continual reinforcement learning and plasticity loss. It builds upon prior work on nonstationarity, catastrophic forgetting, and the loss of plasticity in neural networks. The discussion of related work is thorough.

Essential References Not Discussed

The manuscript adequately cites most key references in the field.

Other Strengths and Weaknesses

Strengths:

The manuscript explores an important problem in CRL, providing both theoretical insights and practical solutions. The proposed method is novel and demonstrates strong empirical performance in most tested environments. The experimental evaluation is extensive, covering a wide range of environments and including analysis studies.

Weaknesses:

The theoretical analysis lacks mathematical rigor, particularly in establishing the connection between churn, NTK rank, and plasticity loss in CRL. The experimental results are limited to environments with relatively small task differences, reducing the generalizability of the findings. Certain figures, such as Figure 1, lack clear explanations, which detracts from the clarity of the manuscript.

Other Comments or Suggestions

Improve the clarity of Figure 1 by providing a detailed explanation of its purpose and implications.

Introduce a direct metric for quantifying plasticity loss to better evaluate the effectiveness of C-CHAIN.

Expand the experimental evaluation to include task sequences with greater variability and diversity.

Clarify whether the theoretical insights are specific to CRL or applicable to other continual learning paradigms.

Author Response

We sincerely appreciate the reviewer’s valuable comments and the recognition of our method and experiments. Our response aims to address these aspects in detail.

On the experimental setups, task difference, and additional CRL setups

[Clarification on environment choice and task difference]

We followed the experimental setting in the TRAC paper (Muppidi et al., NeurIPS 2024) and took it as a major baseline. For a more comprehensive comparison, we added MountainCar (where TRAC fails almost completely) to the continual Gym control suite, and we extended the 4 ProcGen tasks used in TRAC to all 16 tasks provided by the ProcGen suite.

These tasks are representative CRL scenarios with semantically related tasks, and they are challenging for the vanilla PPO agent in the sense that clear degradation occurs as learning goes on (for most of them).

[Additional experimental setups in DMC]

As suggested by the reviewer, instead of chaining totally different Atari games (Abbas et al., 2023), we additionally established two continual RL settings in DeepMind Control (DMC) suite:

  • Continual Walker: chain Walker-stand, Walker-walk, Walker-run
  • Continual Quadruped: chain Quadruped-walk, Quadruped-run, Quadruped-walk (repeat because only two Quadruped tasks are available in DMC).

Similarly, we compare PPO Oracle, PPO Vanilla, and C-CHAIN in these settings. We run 1M steps for each task, i.e., 3M in total for each setting. The results are averaged over 12 random seeds for each configuration, as shown below (the corresponding learning curves will be added to our revision).

| Method / Task | Walker / Stand-Walk-Run | Quadruped / Walk-Run-Walk |
| --- | --- | --- |
| PPO (Oracle) | 395.971 ± 8.116 | 234.529 ± 19.193 |
| PPO (Vanilla) | 305.199 ± 18.519 | 250.153 ± 26.385 |
| C-CHAIN | 472.828 ± 17.865 | 314.510 ± 44.321 |

We can observe that C-CHAIN improves PPO in continual DMC. From the learning curves (to be added in the revision), we find that PPO learns more slowly, with a decreasing slope, in the second and third tasks, while C-CHAIN learns faster and achieves higher scores.

On the choice of baseline methods

[Existing evidence in prior work]

Regarding the choice of other baseline methods, as we mentioned in Lines 315-325, TRAC outperforms Concatenated ReLU (Abbas et al., 2023), EWC (Schwarz et al., 2018), and Modulating Masks (Nath et al., 2023) in its paper, as well as LayerNorm (Lyle et al., 2024) (i.e., LayerNorm Adam) in Figure 15 of the TRAC paper. In addition, more related methods (e.g., ReDo) are compared with L2 Init (i.e., Regen Reg) in ProcGen CoinRun in Figure 5 of (Juliani and Ash, NeurIPS 2024). The results show that L2 Init performs comparably with ReDo and LayerNorm.

[Additional results for three more baselines and reliable aggregate evaluation] Please refer to the response to Reviewer cU5H, due to the space limitation.

On a more precise mathematical formulation of the connection between churn, NTK rank, and plasticity loss

In this paper, we focus more on using formal expressions to describe, analyze, and extract intuitive insights. To give a thorough and rigorous theory, we would need (at least) definitions and assumptions covering three aspects:

  • [RL-oriented definition of plasticity] To our knowledge, the existing formal definition of plasticity (i.e., target-fitting capacity, Definition 1 in Lyle et al., 2022) is based on a general family of objective functions, which is too vague and broad for RL. This is why a direct plasticity-loss metric was not used in our paper; instead, we used the rank loss of the NTK, since it is the best empirical practice in the literature (Lyle et al., 2024). One possible RL-oriented definition could be built upon the Value Improvement Path (Definition 1, Dabney et al., 2021).
  • [Proper assumptions on deep RL] A rigorous theoretical analysis of the deep RL learning process is challenging, as it requires concrete assumptions on the network structure, optimization, and objective function. Theoretical analysis methods and results under practical assumptions are still lacking in deep RL.
  • [Proper assumptions on the continual learning setting] In the continual RL literature, the task sequence and switch scheme are usually designed manually for experimentation; few formal definitions or assumptions about the task distribution and task switching have been established so far.

Due to the lack of the theoretical foundations above, providing a thorough and rigorous theory is out of the scope of this paper. However, as suggested, we will update the discussion in Section 4 to provide more formal support and additional discussion of the connections between churn, the NTK, and plasticity.

On the results for the tasks like Leaper

Please refer to the response to Reviewer 65iv.

Other remaining questions

Due to the space limitation, we will provide them during the discussion stage.

Reviewer Comment

Thank you for the detailed responses and the additional experimental results. Given that CRL and plasticity loss are still emerging areas of research, I understand the challenges in providing a comprehensive and rigorous theoretical framework, as well as direct quantifiable metrics for plasticity loss.

The tasks in the manuscript with semantic relationships are indeed challenging for standard RL agents. However, incorporating task sequences composed of completely different Atari games would more convincingly demonstrate the robustness of the proposed approach. This would provide a clearer picture of C-CHAIN's effectiveness across a broader spectrum of task variability. This can also serve as a direction for future improvement.

Regarding the updates in Section 4, could you please provide a brief overview of the additional discussions and formal support you plan to include?

I look forward to further discussions on the remaining questions, particularly on the broader applicability of the theoretical insights and the trade-off between plasticity and stability in continual learning.

Author Comment

We sincerely appreciate the reviewer’s very careful and constructive comments. We will provide additional experimental results and the responses to the remaining questions below.


Additional experiment on Continual MinAtar

As suggested by the reviewer, we built a continual learning setting based on MinAtar (Young and Tian, 2019), which has the same game features as Atari but allows faster training/evaluation. We chained three tasks: Space_Invaders -> Asterix -> Seaquest, by padding the observation space to $[10, 10, 10]$. We ran 1.5M steps for each task, i.e., 4.5M in total.
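For illustration, the padding step could look like the sketch below (not the exact wrapper used; the channel count in the example is arbitrary): MinAtar observations are 10x10 grids with a game-dependent number of channels, so zero-padding the channel dimension gives all three games a shared [10, 10, 10] observation shape.

```python
import numpy as np

def pad_channels(obs: np.ndarray, target_channels: int = 10) -> np.ndarray:
    """Zero-pad a (10, 10, c) MinAtar observation to (10, 10, target_channels)."""
    h, w, c = obs.shape
    padded = np.zeros((h, w, target_channels), dtype=obs.dtype)
    padded[:, :, :c] = obs
    return padded

# e.g. an observation with 6 channels becomes (10, 10, 10)
print(pad_channels(np.zeros((10, 10, 6), dtype=np.float32)).shape)
```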

We used DoubleDQN as the base agent, and applied C-CHAIN to the training of the value network, with no modification to the hyperparameters of DoubleDQN. The results over 12 seeds are shown below.

| Method / Task | Space_Invaders -> Asterix -> Seaquest |
| --- | --- |
| DoubleDQN (Vanilla) | 22.044 ± 0.733 |
| C-CHAIN | 29.513 ± 0.682 |

The results show that C-CHAIN also improves the continual learning performance upon DoubleDQN in a sequence of totally different tasks. We will include these results in our paper to further strengthen our experiments.

Additional experiment on Continual DMC-Dog

In addition to the two continual DMC settings, we provide the results for Dog-stand -> Dog-walk -> Dog-run -> Dog-trot. Similarly, we ran 1M steps for each task, i.e., 4M in total for Continual Dog, with 12 seeds. The results show C-CHAIN PPO also outperforms PPO in continual Dog.

| Method / Task | Dog / Stand-Walk-Run-Trot |
| --- | --- |
| PPO (Oracle) | 137.080 ± 3.133 |
| PPO (Vanilla) | 129.744 ± 5.914 |
| C-CHAIN | 174.098 ± 9.169 |

Regarding the updates in Section 4

In addition to our discussion on the missing theoretical foundations, we will consider adding additional formal analysis based on the theoretical tool in recent linear approximation transfer theory (Springer et al., 2025; Gidel et al., 2019).

The brief plan is to extend the (one-shot) transfer setting to continual learning (i.e., continual transfer), where the tasks can be assumed to differ mainly at the singular values under the SVD viewpoint. Then, it would be possible to discuss the rank decrease of the learned features under the continual learning dynamics.

Reference:

  • Springer et al. Overtrained Language Models Are Harder to Fine-Tune. 2025
  • Gidel et al. Implicit regularization of discrete gradient dynamics in linear neural networks. 2019

On the applicability of the theoretical insights to continual supervised learning

Our formal analysis in Section 4 uses a general loss function form (or the common MSE form). Thus, the derivations should apply to both continual RL and continual Supervised Learning.

However, a significant difference to note is that RL suffers from the chain of churn, formally characterized in (Tang and Berseth, NeurIPS 2024), which stems from the iterative policy improvement and policy evaluation nature (i.e., the Generalized Policy Iteration paradigm), while SL does not as it learns from static labels. This means that the exacerbation of churn during the process of NTK rank loss on both the policy network side and the value network side can further (negatively) affect each other due to the chain effect.

We note that the focus of this work is on continual RL, which is more non-stationary and less studied than continual SL. In Section 5.3, we provided an evaluation of C-CHAIN on continual SL tasks as a potentially useful reference for readers who care about continual SL. The fact that C-CHAIN (which is proposed from the perspective of churn) does not show the same superiority as in continual RL can be explained, to some extent, by the difference between RL and SL mentioned above.

We will clarify this as suggested.

On the trade-off between plasticity and stability

It is non-trivial to balance the trade-off between plasticity and stability when considering a model with finite capacity. We think that different learning scenarios will place more or less preference on one side or the other.

In this work, we follow the previous literature and focus more on plasticity loss. We do not claim that C-CHAIN addresses the trade-off between plasticity and stability (sufficiently well), as it is designed for plasticity. In principle, based on the continual churn-reduction regularization objective and the NTK analysis, we think C-CHAIN improves stability in the sense that it decorrelates the gradients of different data batches, suppresses passive function changes caused by churn, and mitigates learning degradation and even collapse in continual RL.


We believe these additional results and responses can help to address the remaining concerns. Since this is our last opportunity to respond, we sincerely hope the reviewer could re-evaluate our work accordingly.

Review
Rating: 3

This paper studies the loss of plasticity in the continual reinforcement learning problem. The authors present a method that is based on reducing the churn to help prevent the collapse of the NTK rank. Through a series of experiments, the paper shows the effectiveness of the proposed method against other baselines.


Update after rebuttal: I thank the authors for their effort to improve their empirical evaluation. The new results with more runs and also with more baselines increase my confidence in the paper's conclusion. Thus, I raised my score to reflect that change.

Questions for Authors

There are missing steps between Equation 4 and Equation 5, namely about the diagonal $S$ matrix. Also, how did both $\nabla_\theta f_\theta(\bar{x})$ and $\nabla_\theta f_\theta(x)$ become $G_\theta$?

论据与证据

  • The reason why C-CHAIN performs poorly under the simpler setting of continual supervised learning is not convincing.
  • The experimental design, where the tasks are not on the same level of complexity (shown by the inconsistent performance of Oracle PPO), makes empirical evaluation hard.
  • The authors compare their methods against other baselines and show their effectiveness. However, the improvement is not consistent across environments and domains, and little insight is provided to explain it.
  • Most learning curves have overlapping confidence intervals, which compromises the statistical significance of the results.

Methods and Evaluation Criteria

The authors selected standard benchmarking tasks accepted by the community to study loss of plasticity. In addition, they used empirical NTK, which gained popularity when investigating plasticity.

Theoretical Claims

I haven’t closely checked the theoretical claims or the math.

Experimental Design and Analysis

The empirical evaluation doesn't show conclusive results, since in most experiments, with a few exceptions, the confidence intervals overlap. I suggest the authors increase the number of independent runs to convince the reader of the validity of their approach.

Supplementary Material

I didn’t check the supplementary material.

Relation to Existing Literature

The paper studies continual RL, which is a very important problem, and the results are relevant to a large number of researchers, especially since it’s building on the concept of churn and its relationship to plasticity, which has been studied before. The empirical evaluation itself doesn’t provide conclusive results, but the analysis given might be helpful for future research.

Essential References Not Discussed

N/A

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We sincerely appreciate the reviewer’s valuable comments and the recognition of the importance of the topic studied in our paper.

On the reviewer’s comments including “the empirical evaluation doesn't show conclusive results”, “the improvement is not consistent”, and “little insight is provided to explain it”

[Additional reliable aggregate evaluation]

To obtain a more conclusive evaluation, we provide an aggregate evaluation using Reliable Metrics (Agarwal et al., NeurIPS 2021). The evaluation uses Mean, Median, Interquartile Mean (IQM), and Optimality Gap, with 95% confidence intervals and 50,000 bootstrap replications (as recommended by the official Reliable Metrics GitHub implementation). It aggregates over 9 methods and 6 seeds for each of 16 ProcGen tasks, i.e., 864 runs of 10M steps each.

The results are shown in the table (the corresponding plot will be added to the revision). We can observe that C-CHAIN performs best on all four metrics and outperforms the second-best method with no overlap of confidence intervals for Median and IQM, and only a minor overlap for Mean.

Due to the space limitation, please refer to the concrete markdown table we will upload in the common response or the discussion response after the rebuttal deadline.

[On the consistency of improvement]

As summarized in Table 1, our method C-CHAIN (PPO) consistently outperforms vanilla PPO in all 20 environments, and C-CHAIN performs best in 15 of the 16 tasks (second-best in the remaining one) on the continual ProcGen benchmark. In this sense, our method achieves overall consistent improvement.

[On the insights and explanations]

For continual CartPole and continual LunarLander, where L2 Init outperforms C-CHAIN, we provided a possible explanation in Lines 328-365. Our insight is that for simple tasks like CartPole and LunarLander, where the agent can find a good policy near the parameter initialization, L2 Init performs best; in contrast, it limits policy learning in settings like continual ProcGen, where L2 Init is outperformed by C-CHAIN in most tasks.

On the experimental design

We followed the experimental settings used in TRAC (as also acknowledged by the reviewer's comment "The authors selected standard benchmarking tasks accepted by the community to study loss of plasticity"), and we further extended the ProcGen tasks from the four used in TRAC to all 16 tasks, and added continual MountainCar, where TRAC collapses.

[On the task diversity and the varying difficulties]

Since the tasks differ in game logic, reward density, visual complexity, etc., they provide a diverse range of environments with varying difficulty, which is also a commonly adopted principle in benchmark design. Therefore, we think it is natural for Oracle PPO or vanilla PPO to reach different levels of performance across tasks.

On the math derivation

[How did both $\nabla_{\theta} f_{\theta}(\bar{x})$ and $\nabla_{\theta} f_{\theta}(x)$ become $G_{\theta}$?]

We refer the reviewer to the definition of the NTK matrix (i.e., the first line for entry definition and the second line for matrix definition in vector form) in Equation 2.

As in the second line of Equation 4, $\nabla_{\theta} f_{\theta}(\bar{x})^{\top} \nabla_{\theta} f_{\theta}(x)$ becomes $N_{\theta}(\bar{x}, x)$ by the definition in Equation 2. Then, Equation 5 is the vector form of Equation 4; correspondingly, the entry-wise definition $N_{\theta}(\bar{x}, x)$ becomes $N_{\theta} = G_{\theta}^{\top} G_{\theta}$ as in Equation 2.

[The transition from Equation 4 to Equation 5, regarding the diagonal $S$ matrix]

As mentioned above, Equation 5 is the vector form of Equation 4. Therefore, the sampling $x \sim B_{train}$ is replaced by the $S$ matrix in Equation 5. More specifically, $S$ has the same size as $N_{\theta}$, i.e., $|X|$ by $|X|$. As mentioned in Lines 155-157, $S$ is a diagonal matrix whose diagonal is $\{0, 1\}$-binary, with 1 for sampled data points and 0 for non-sampled ones.
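For illustration, a small numeric sketch of these objects (not code from the paper): stack the per-example gradients as columns of $G_\theta$, form $N_\theta = G_\theta^\top G_\theta$, and use a binary diagonal $S$ to select the entries corresponding to the sampled training batch.

```python
import numpy as np

num_params, num_data = 1000, 8
G = np.random.randn(num_params, num_data)   # column i = grad_theta f_theta(x_i)
N = G.T @ G                                 # N_theta = G_theta^T G_theta, shape |X| x |X|

train_idx = [1, 4, 6]                       # indices of the data sampled into B_train
S = np.zeros((num_data, num_data))
S[train_idx, train_idx] = 1.0               # diagonal 0/1 selection matrix

# N @ S keeps only the columns of N for the sampled training points, i.e. the
# entries N_theta(x_bar, x) with x in B_train that drive the churn on each x_bar.
masked = N @ S
print(masked.shape)                          # (8, 8)
```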

We appreciate the reviewer for pointing this out. We will make this clearer as suggested by the reviewer.

On the evaluation of C-CHAIN in continual supervised learning

We provided two explanations in Section 5.3. A significant difference to note is that RL suffers from the chain of churn, formally characterized in (Tang and Berseth, NeurIPS 2024), which stems from the iterative nature of policy improvement and policy evaluation (i.e., the Generalized Policy Iteration paradigm), while SL does not, as it learns from static labels.

This means that, during NTK rank loss, the exacerbation of churn on the policy network side and on the value network side can further (negatively) affect each other due to the chain effect. Since C-CHAIN is proposed from the perspective of churn, this explains to some extent why C-CHAIN is not as effective for CSL.

Final Decision

This paper studies the loss of plasticity problem in deep continual reinforcement learning (RL). By establishing a connection between plasticity loss and churn through the Neural Tangent Kernel (NTK) framework, it offers insightful observations regarding the relationship among churn, NTK, and plasticity loss. Additionally, extensive experiments strongly support the claims presented in the paper. Although some reviewers have concerns about the mathematical rigor of the analysis, we believe that the strengths of the paper outweigh its weaknesses, and it makes solid contributions to the research community.