PaperHub
Score: 6.8/10 · Poster · 4 reviewers
Ratings: 5, 5, 4, 3 (min 3, max 5, std 0.8)
Confidence: 4.5
Novelty: 2.3 · Quality: 2.8 · Clarity: 3.3 · Significance: 2.8
NeurIPS 2025

Measure gradients, not activations! Enhancing neuronal activity in deep reinforcement learning

OpenReview · PDF
Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

Measuring neuronal activity via activations is ineffective in complex agents, as these values do not reflect true learning capacity. We introduce GraMa, which offers robust quantification and resetting guidance across various network architectures.

Abstract

Deep reinforcement learning (RL) agents frequently suffer from neuronal activity loss, which impairs their ability to adapt to new data and learn continually. A common method to quantify and address this issue is the $\tau$-dormant neuron ratio, which uses activation statistics to measure the expressive ability of neurons. While effective for simple MLP-based agents, this approach loses statistical power in more complex architectures. To address this, we argue that in advanced RL agents, maintaining a neuron's **learning capacity**, its ability to adapt via gradient updates, is more critical than preserving its expressive ability. Based on this insight, we shift the statistical objective from activations to gradients, and introduce **GraMa** (**Gra**dient **Ma**gnitude Neural Activity Metric), a lightweight, architecture-agnostic metric for quantifying neuron-level learning capacity. We show that **GraMa** effectively reveals persistent neuron inactivity across diverse architectures, including residual networks, diffusion models, and agents with varied activation functions. Moreover, **re**setting neurons guided by **GraMa** (**ReGraMa**) consistently improves learning performance across multiple deep RL algorithms and benchmarks, such as MuJoCo and the DeepMind Control Suite. **We make our code available.**
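For readers of this page, a minimal sketch of the gradient-magnitude criterion described above (our own illustration: the per-layer normalization, the threshold name `tau`, and the helper names are assumptions, not the authors' released implementation):

    import torch
    import torch.nn as nn

    def grama_scores(layer: nn.Linear) -> torch.Tensor:
        # Per-neuron gradient-magnitude score for one linear layer. Assumes
        # loss.backward() has already been called, so layer.weight.grad holds
        # the current gradients. Each output neuron's raw score is the mean
        # |gradient| over its incoming weights, normalized by the layer-wide
        # mean so scores are comparable across layers.
        grad = layer.weight.grad.abs()            # (out_features, in_features)
        per_neuron = grad.mean(dim=1)             # mean |grad| per output neuron
        return per_neuron / (per_neuron.mean() + 1e-8)

    def inactive_mask(layer: nn.Linear, tau: float = 0.1) -> torch.Tensor:
        # Neurons whose normalized gradient magnitude falls below tau are
        # flagged as having low learning capacity (reset candidates).
        return grama_scores(layer) <= tau

A ReGraMa-style step would then reset the flagged neurons, analogously to how ReDo resets $\tau$-dormant neurons.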
Keywords
Deep Reinforcement Learning · Dormant Neurons · Neuron Resetting

Reviews and Discussion

Official Review
Rating: 5

This paper introduces gradient magnitude as a novel measure to identify and reset inactive neurons in reinforcement learning. This measure called GraMa is lightweight and effective especially in complex networks.

Strengths and Weaknesses

Strong points: S1: The paper is well written. There are lots of illustrative figures. Important results are well explained, illustrated and summarized. S2: The code is available and well documented. S3: The proposed method is runtime-efficient and increases the performance in many cases, sometimes drastically.

Weak points: W1: There are almost no formal guarantees nor formal explanations of why GraMa works.

Ad W1: I guess it is very difficult to formally explore the properties, strengths, and limitations of the method. At least, empirical evidence from well-conducted experiments shows the merits of the method. The only theoretical exploration is the equivalence to ReDo in traditional architectures. This is nice and easy to see but not very important, as more complex architectures are the central focus of this paper.

Minor weak points:

  • The clarity of Fig. 1 can be improved. I suggest removing the double title "Complex Architecture Based Policy", which is anyhow misleading in my view. Also revise the caption on the y-axis of the top subfigure. What is the unit here? That is not entirely clear. Also make sure that the axis labelling is readable; currently, it is too tiny.
  • The notation in Eq. 1 looks a bit overly complex. Consider simplifying. But if it is common and taken from the corresponding papers, you can also leave it as it is.
  • Fig. 3 is impressive but a bit more description would help the understanding. What are the light-blue and light-red areas?
  • What is the unit of measurement on the y-axis in the subfigure at the center of Fig. 5? Also, in Fig. 5 the axis labelling and the headings of the center and right-most subfigures are unreadably small.
  • The message in the text (Lines 215-220) does not really correspond to the visual information in Fig. 6. The figure suggests that during training, already inactive neurons become even less active. If this is the case, it should also be stated in the text.
  • Sometimes the figures could be displayed closer to the places in the text where they are discussed, e.g., Fig. 5.
  • In Fig. 9 the axis labelling is too small. Also, the meaning of 1x, 2x, and so on on the x-axis is not clear. How many layers do these networks actually have?

Questions

I know that this might be difficult, but could you come up with any kind of formal guarantees or explanations for GraMa? This would make the paper even stronger.

Limitations

yes

Final Justification

The rebuttal successfully addressed all my concerns.

Formatting Concerns

No concerns.

Author Response

We are grateful for your thoughtful review and for recognizing the importance of our research. Your suggestions encouraged us to significantly strengthen both the theoretical and empirical components of the paper.

Weakness

W1: There are almost no formal guarantees nor formal explanations of why GraMa works. I know that this might be difficult, but could you come up with any kind of formal guarantees or explanations for GraMa? This would make the paper even stronger.

Thanks for your valuable feedback. We appreciate your recognition of our paper’s strengths while acknowledging, as you noted, the inherent challenge of developing comprehensive and realistic theoretical guarantees for complex deep RL systems.

As the first study to comprehensively analyze gradient magnitudes for neural maintenance in deep RL, our work primarily establishes a strong empirical foundation across multiple modern architectural settings in deep RL, with a few theoretical justifications, though we fully agree that these could be better highlighted and expanded.

Based on our continued theoretical explorations since submission, we have identified several promising theoretical directions for our work, and we will include these discussions in the updated manuscript: 1) the relationship between resetting parameters with small gradient norms and escaping suboptimal saddle points, a common challenge in deep RL optimization landscapes [1, 2]; 2) the effect of resetting the parameters with minimal gradient magnitudes, which minimizes disruption to learned features while maximizing restored learning capacity. Developing these theoretical perspectives can further explain why GraMa consistently outperforms activation-based alternatives across diverse architectures and tasks, but doing so remains particularly challenging due to the complex learning dynamics of deep RL systems. Our work also opens several promising avenues for future theoretical exploration, including the connection to second-order optimization and convergence guarantees in non-stationary environments.

In our revision, we will better highlight these theoretical insights alongside the paper's strong empirical results and existing theoretical justifications. We believe our pioneering work not only establishes a practical method for neural maintenance but also provides valuable theoretical foundations for this emerging area of deep RL.

[1]: Ge, Rong, et al. "Escaping from saddle points—online stochastic gradient for tensor decomposition." COLT , 2015.

[2]: Mishkin, Dmytro, et al. "All you need is a good init." ICLR, 2015.

Minor Weaknesses

W1: I suggest removing the double title "Complex Architecture Based Policy" which is, anyhow, misleading in my view. Also revise the caption on the y-axis on the top subfigure. What is the unit here? That is not entirely clear. Also make sure that the axis labelling is readable. Currently, it is too tiny.

Thanks for the suggestion. We incorporated the recommended changes in the revision: we removed the duplicate title, revised the caption to clarify the metric (the y-axis shows the score normalized by the peak ReGraMa performance, so the maximum value is 1.0), and enlarged the axis labels for better readability.

W2: The notation in Eq. 1 looks a bit overly complex. Consider simplifying. But if it is common and taken from the corresponding papers, you can also leave it as it is.

Thanks for raising this point! To ensure consistency with established literature [1], we retained the original notation, as it facilitates direct comparison with prior work.

[1]: Sokar et al., The Dormant Neuron Phenomenon in Deep Reinforcement Learning, ICML 2023

W3: Fig. 3 is impressive but a bit more description would help the understanding. What are the light-blue and light-red areas?

Thanks for highlighting this! We have enhanced the visualizations with clearer labels: light-blue areas now explicitly mark neurons with low learning capacity and low expressivity—the primary candidates for reset, while light-red indicates neurons with high learning capacity but low expressivity. We also added new panels that explicitly show the performance impact of resetting each of the four neuron types, thereby highlighting their distinct functional roles (detailed results are provided in Response W3 to Reviewer 1 due to space limitation).

W4: What is the unit of measurement on the y-axis in the subfigure at the center of Fig. 5? Also, in Fig. 5 the axis labelling and the headings of the center and right-most subfigures are unreadable small.

We have enlarged all fonts and added explicit labeling showing that the y-axis represents percentage improvement. For intuitive interpretation, we normalized each ReGraMa score relative to the score of ReDo (ReDo’s score = 1.0), so the y-axis directly shows ReGraMa’s relative improvement over ReDo in percentage.

W5: The message in the text (Lines 215-220) does not really correspond to the visual information in Fig. 6. The figure suggest that during training, already inactive neurons become even less active...

In the revised manuscript, we now state: “The figure indicates that neurons that are already inactive at the start of training become progressively less active as training proceeds.”

W6: Sometimes the figures could be displayed closer to the places in the text where they are discussed, e.g., Fig. 5.

Great suggestion! We reorganized the layout so all figures appear adjacent to the corresponding discussion, eliminating the need for flipping pages and improving overall readability.

W7: In Fig.9 the axis labelling is too small. Also, the meaning of 1x, 2x and so on on the x-axis is not clear. How many layers do these networks actually have?

In the revised Figure 9, we enlarged all axis labels and revised the x-axis labeling with a clear caption explaining: model sizes begin with two BRO-net blocks (as per the standard setting), and larger model scales are constructed by adding additional BRO-net blocks in sequence.

Question

Q1:

Please refer to W1.


Thanks again for reviewing our work and constructive feedback! We hope our responses address your concerns. We are happy to provide additional clarification if needed.

Comment

Thank you for the rebuttal. I am convinced that all points that I have raised can be well addressed in the final version of the paper. However, I also think that more research is needed to establish stronger theoretical foundations. Nevertheless, the paper has clear merits which will stand out more clearly after the revision.

Comment

We would like to sincerely thank the reviewer once again for the recognition and support of our work, and for the time and effort dedicated to reviewing our manuscript!

We will carefully incorporate your valuable feedback into the revision, which has greatly helped us improve the quality, clarity, and overall contribution of our work.

Official Review
Rating: 5

Deep reinforcement learning often suffers from a loss of plasticity or neuronal activity, largely due to the non-stationarity of learning targets. A common metric for identifying such loss is the $\tau$-dormant neuron ratio, which measures the relative activation strength of neurons across a dataset. Neurons with low activation are considered dormant and are reset to enable adaptation to new data---a strategy known as ReDo. In this paper, the authors argue that the $\tau$-dormant ratio is limited to standard MLPs with ReLU activations and fails to capture plasticity in more complex architectures. To address this, they propose estimating plasticity using the relative magnitude of the gradient (GraMa). They demonstrate that resetting neurons based on this new signal---an approach they call ReGraMa---results in improved performance.

Strengths and Weaknesses

Strengths:

  1. The problem tackled in this paper is both relevant and impactful for the Deep RL community.
  2. It is clear, well-written, and the proposed approach is well motivated.
  3. The experiments are well-designed and validate the claims made by the authors.

However, there are a few minor concerns/clarifications I would appreciate the authors addressing:

  1. Why does the gradient magnitude imply “learning capacity” and the activation magnitude imply “expressivity”? This is not well discussed in the paper.

  2. Figure 3: This seems like circular reasoning—you define "learning capacity" as the gradient magnitude and then evaluate GraMa, which is based entirely on that same metric. Doesn’t this place ReDo at an unfair disadvantage? That said, the results do show that the two methods reset different types of neurons.

  3. Suppose a neuron consistently receives a high-magnitude gradient (greater than $\tau$) but in alternating directions (e.g., +c, -c, and so on). In such a case, can we truly say that the neuron is "learning" or that it has high "learning capacity"?

  4. (Line 158) “Less expressive at the moment” — As I understand it, [Sokar et al., 2023] show that dormant neurons tend to remain dormant over time. Does this observation not apply to the BroNet architecture discussed in this context?

  5. Figure 4:

    a) (Minor) I recommend adding the Resnet SAC architecture in the Appendix as well (since all the others are present)

    b) The results show that ReGraMa is better at resetting neurons than ReDo, which is more aligned with the results presented in Section 4 (Experiments).

    c) The analysis in Section 3 discusses how ReDo fails to identify dormant neurons in the settings considered (e.g., non-ReLU activations). To validate this claim, I suggest an experiment comparing the performance of SAC with ReLU and TanH activations, both with and without ReDo—i.e., SAC-ReLU, SAC-ReLU-ReDo, SAC-TanH, and SAC-TanH-ReDo. If ReDo improves performance in the ReLU setting but not with TanH, it would provide empirical support for the claim.

  6. Figure 7(a) —Are ReDo and ReGraMa equivalent in this setting (according to Theorem 1)? Shouldn't we then expect them to perform similarly? (ReGraMa gets a higher reward)

  7. I also recommend that the authors include plots of the $\tau$-dormant ratio for some of the reported results. This would help verify whether ReDo is behaving as expected or diverging in practice---similar to the behaviour observed with GraMa in Figure 7(c), for example.

  8. Are “Low learning Capacity neuron ratio” (Figure 1) and GraMa the same?

  9. (Typos)

    a) (Line 228) z is the output before passing through the activation?

    b) (Line 513) “Structures were showen”

Questions

  1. Can you address the questions regarding "learning capacity" and "expressivity"?
  2. Could you also plot the $\tau$-dormant neuron ratio?
  3. Could you include the results mentioned in Weakness 5(c) to support the "Analysis" section?

Limitations

Yes

Formatting Concerns

none

Author Response

We sincerely thank you for your strong support of this work. Your detailed suggestions have helped us further improve it, and we hope our responses address all of your concerns.

Weaknesses

W1: Why does the gradient magnitude imply “learning capacity” and the activation magnitude imply “expressivity”? This is not well discussed in the paper.

Thanks for the question. We have expanded our background section in the revised version, to provide a clearer explanation and foundations of learning capacity and expressivity.

  • Recent work [1] defines learning capacity as an agent’s ability to efficiently adapt its parameters to the current data distribution. Gradient magnitude directly quantifies this property; it reflects the effect of parameter updates prescribed by back-propagation upon receiving learning signals with new data. Hence, smaller magnitudes indicate lower learning capacity.
  • In [2,3], expressivity refers to the model's representational ability for the inputs. Empirical studies show that this correlates with the proportion of neurons whose activations exceed a threshold. Accordingly, subsequent work adopts mean activation magnitude as a practical proxy for expressivity.

[1]: Liu et al., Neuroplastic Expansion in Deep Reinforcement Learning, ICLR 2025

[2]: Sokar et al., The Dormant Neuron Phenomenon in Deep Reinforcement Learning, ICML 2023

[3]: Ma et al., Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning, ICML 2025

W2: Figure 3: This seems like circular reasoning—you define "learning capacity" as the gradient magnitude and then evaluate GraMa, which is based entirely on that same metric. Doesn’t this place ReDo at an unfair disadvantage? That said, the results do show that the two methods reset different types of neurons.

We use GraMa for analysis because ReDo tends to misidentify dormant neurons in complex architectures. Specifically, ReDo systematically flags a large fraction of healthy neurons as dormant while missing those that empirically harm performance. This misalignment is clearly visible in the reset-preference visualizations in Figures 1 and 3. As such, activation-based scores lose their statistical power in these cases. To facilitate direct visual comparison, we have added ReDo curves to the main experimental plots in the revised manuscript.

W3: Suppose a neuron consistently receives a high-magnitude gradient (greater than $\tau$) but in alternating directions (e.g., +c, -c, and so on). In such a case, can we truly say that the neuron is "learning" or that it has high "learning capacity"?

Our definition of learning capacity focuses on the magnitude of the gradient to reflect the agent’s "willingness" to update parameters. Even if the update direction oscillates, the neuron remains actively engaged in learning. Therefore, we consider large magnitudes, regardless of sign, as indicative of high capacity. In practice, measuring this is more tractable and stable than evaluating directional correctness on a per-neuron basis.
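As a small worked example of this point (our illustration): for a neuron whose per-step gradient $g_t$ alternates between $+c$ and $-c$ over $T$ updates, the average gradient magnitude stays large even though the signed average cancels out,

$$
\frac{1}{T}\sum_{t=1}^{T}\lvert g_t\rvert = c \gg 0,
\qquad\text{while}\qquad
\left\lvert \frac{1}{T}\sum_{t=1}^{T} g_t \right\rvert \approx 0,
$$

so GraMa does not flag it; only a neuron whose gradients have genuinely collapsed toward zero yields a small magnitude and is flagged for resetting.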

W4: (Line 158) “Less expressive at the moment” — As I understand it, [Sokar et al., 2023] show that dormant neurons tend to remain dormant over time. Does this observation not apply to the BroNet architecture discussed in this context?

Indeed! In the revised manuscript, we clarify that architectures with normalization (e.g., BRO-net) are already known to partially mitigate dormancy. In these cases, ReDo’s activation-based metric can occasionally “revive” dormant neurons, which is visible as minor oscillations in ReDo’s dormant-neuron curves. This transient rebound becomes more pronounced in deeper or more complex networks.

W5: (a,b) I recommend adding the Resnet SAC architecture in the Appendix as well (since all the others are present) ...

We have added the Resnet SAC architecture in Appendix B of the revised manuscript and reorganized this point.

W5: (c) The analysis in Section 3 discusses how ReDo fails to identify dormant neurons in the settings considered (e.g., non-ReLU activations). To validate this claim, I suggest an experiment comparing the performance of SAC with ReLU and TanH activations, both with and without ReDo—i.e., SAC-ReLU, SAC-ReLU-ReDo, SAC-TanH, and SAC-TanH-ReDo. If ReDo improves performance in the ReLU setting but not with TanH, it would provide empirical support for the claim.

We followed your suggestion and replicated the experimental setup from ReDo’s paper on the MuJoCo Humanoid task (1M steps, Table 6). We normalized all results against SAC-ReLU for clearer comparison. The table below shows the performance ratios (mean ± std across 3 runs).

The results in the table below match your conjecture. ReDo does improve the performance of the agent compared to vanilla ResNet-SAC under ReLU, but degrades it under TanH.

Table: Advantages over SAC-ReLU (Average of 3 runs):

| Method | Humanoid |
| --- | --- |
| SAC-ReLU | 1.0 |
| SAC-ReLU-ReDo | 1.14 ± 0.03 |
| SAC-TanH | 0.97 ± 0.05 |
| SAC-TanH-ReDo | 0.89 ± 0.06 |

W6: Figure 7(a) —Are ReDo and ReGraMa equivalent in this setting (according to Theorem 1)? Shouldn't we then expect them to perform similarly? (ReGraMa gets a higher reward)

Thank you for pointing this out. We apologize for the lack of clarity. The results in Figure 7 are based on BRO-net, not a standard MLP+ReLU architecture. Thus, GraMa and ReDo are no longer equivalent, and the superior robustness of ReGraMa is reflected in its higher reward. We clarify this distinction in the new version.

W7: I also recommend that the authors include plots of the $\tau$-dormant ratio for some of the reported results.

We have added $\tau$-dormant ratio plots for ReDo in the revised manuscript to better visualize how it compares with GraMa under different conditions.

W8: Are “Low learning Capacity neuron ratio” (Figure 1) and GraMa the same?

Yes, they are equivalent. We will explicitly clarify this in the revised version.

W9: (Typos)

We have thoroughly revised the manuscript to fix typos, improve clarity, and correct inconsistent notation.

Questions

Q1: Can you address the questions regarding "learning capacity" and "expressivity"?

Please refer to our response to W1 above.

Q2: Could you also plot the $\tau$-dormant neuron ratio?

Yes, we will include these plots in the revised manuscript.

Q3: Could you include the results mentioned in Weakness 5(c) to support the "Analysis" section?

Yes, please refer to our response to W5(c) and the accompanying experimental table.


Thanks again for your valuable feedback. We appreciate your thoughtful suggestions and hope our clarifications are helpful. Please let us know if anything remains unclear.

Comment

Thank you for the detailed and thoughtful responses. All of my concerns have been satisfactorily resolved.

Comment

We would like to sincerely thank the reviewer once again for their helpful and constructive comments and for the time and effort dedicated to reviewing our work.

We will carefully incorporate your valuable feedback into the revision, which has greatly helped us improve the quality, clarity, and overall contribution of our work.

Official Review
Rating: 4

This work proposes a metric for measuring inefficient learning neurons in single-agent reinforcement learning. Unlike ReDo's activation-based metric, this metric is gradient-based and can be applied to more complex networks and other activation functions besides ReLU. Good results have been obtained on many algorithms and benchmarks, such as MuJoCo and the DeepMind Control Suite.

Strengths and Weaknesses

Strengths:

  1. The gradient-based perspective is a meaningful departure from activation-based methods.

  2. The approach is simple, and the writing is clear and easy to understand.

  3. This metric is more lightweight and does not require additional overhead to store activation values.

Weaknesses:

  1. The distinction between GraMa-detected "inactive" neurons and ReDo-detected "dormant" neurons is not rigorously analyzed.

  2. Although Theorem 1 proves that the roles of GraMa and ReDo are equivalent in simple architectures, there is no theoretical analysis in complex architectures with different activation functions.

  3. For complex network architectures and different activation functions, there are too few experiments as empirical verification, and they are not very convincing.

Questions

  1. What is the distributional difference between neurons flagged as "inactive" by GraMa and "dormant" by ReDo? What is their overlap rate?

  2. Which part of neurons are reset to improve the performance of ReGraMa? Figure 3 shows that both ReGraMa and ReDo can reset neurons with low learning capacity and low expressivity. I want to know the impact of resetting these four types of neurons on performance. Such analysis is very important to prove the effectiveness of GraMa.

  3. The authors have shown that ReDo and ReGraMa are equivalent in MLP+ReLU. Why does ReGraMa reduce GraMa significantly better than ReDo with exactly the same experimental settings in Figure 5?

  4. I checked the code you provided, and it seems that the reset is only performed on Linear, Conv2d, and LayerNorm. Although experiments were conducted on Residual network, Diffusion model, and MLP-based SAC architectures, most of them are combinations of MLP and ReLU. In critic networks, RNN is a commonly used structure, and I wonder if this method has any effect on this structure.

  5. The following methods [1,2,3] are all related to the plasticity or expressiveness of neural networks; for example, CBP's utility is a discounted average of the product of a neuron's associated weights and activations. The related baselines should therefore not only include ReDo.

[1] ReBorn: The Dormant Neuron Phenomenon in Multi-Agent Reinforcement Learning Value Factorization. NeurIPS 2024.

[2] CBP: Loss of plasticity in deep continual learning. Nature 2024.

[3] S&P: On warm-starting neural network training. NeurIPS 2020.

Limitations

yes

Final Justification

The author's response addressed my concerns, but considering the other reviewers' comments, I maintain my score.

Formatting Concerns

N/A

Author Response

We thank the reviewer for the insightful suggestions and useful feedback on our work. Please find our responses to each of the concerns below.

Weaknesses

W1: The distinction between GraMa-detected "inactive" neurons and ReDo-detected "dormant" neurons is not rigorously analyzed. What is the distributional difference between neurons flagged as "inactive" by GraMa and "dormant" by ReDo? What is their overlap rate?

Thanks for raising this point! We have conducted additional experiments by performing a more thorough distributional analysis to examine the differences between neurons identified by each method. We evaluated this on the BRO-net architecture on DMC Humanoid Walk, and will include the full results in the new version.

  1. We measured the overlap between neurons flagged by ReGraMa and ReDo. The average overlap rate is only 31.56% (average of 3 runs), confirming that they identify fundamentally different neuron populations.
  2. We tracked the proportion of neurons reset by ReDo and ReGraMa throughout training. As shown in Table 1, ReDo performs aggressive resets early in training, when activation patterns are naturally unstable (potentially leading to false identifications). In contrast, ReGraMa shows a more progressive pattern (resetting more neurons as training evolves) that better aligns with the gradual loss of learning capacity, indicative of a more conservative and adaptive strategy.

Table 1. Reset neuron ratios throughout training (average of 3 runs):

| Method | 1/4 training stage | 2/4 training stage | 3/4 training stage | 4/4 training stage |
| --- | --- | --- | --- | --- |
| ReDo | 0.41 ± 0.04 | 0.24 ± 0.06 | 0.32 ± 0.14 | 0.35 ± 0.08 |
| ReGraMa | 0.13 ± 0.07 | 0.19 ± 0.03 | 0.21 ± 0.07 | 0.24 ± 0.11 |
  3. We analyzed the layer-wise distribution for each method. Table 2 shows nearly opposite layer-wise preferences: ReDo predominantly resets neurons in input/output layers, whereas ReGraMa focuses more on the middle layers (where representational learning primarily occurs).

Table 2. Layer-wise reset neuron ratios at step 500k (average of 3 runs):

| Method | first 1/4 layers | [1/4, 2/4) layers | [2/4, 3/4) layers | last 1/4 layers |
| --- | --- | --- | --- | --- |
| ReDo | 0.35 ± 0.04 | 0.12 ± 0.07 | 0.16 ± 0.09 | 0.37 ± 0.08 |
| ReGraMa | 0.11 ± 0.11 | 0.38 ± 0.06 | 0.33 ± 0.14 | 0.18 ± 0.13 |
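As a concrete illustration of how the overlap rate in point 1 could be computed (a sketch under our assumptions; we use intersection-over-union of the two boolean masks, and the exact normalization used in the paper may differ):

    import torch

    def overlap_rate(flagged_grama: torch.Tensor, flagged_redo: torch.Tensor) -> float:
        # Both inputs are boolean masks over the same neurons (concatenated
        # across layers): True where the corresponding criterion flags a neuron.
        intersection = (flagged_grama & flagged_redo).sum().item()
        union = (flagged_grama | flagged_redo).sum().item()
        return intersection / max(union, 1)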

W2: Although Theorem 1 proves … there is no theoretical analysis in complex architectures with different activation functions.

Thanks for bringing up this point. While Theorem 1 establishes equivalence for MLPs with ReLU activations, we acknowledge that a comprehensive theoretical analysis for complex architectures (like ResNet) with arbitrary activation functions remains a significant open challenge in the field. Despite this, our approach has strong theoretical intuition: gradient magnitudes adapt to any differentiable architecture's computational graph and can reflect a neuron's contribution regardless of the activation function. Our empirical results also strongly support this intuition, showing consistent performance improvements across diverse architectures and activation functions. We hope our work opens a new avenue for future theoretical advances that can formally characterize dormancy detection across arbitrary architectures.

W3: For complex network architectures and different activation functions, there are too few experiments for empirical verification.

We have expanded our evaluation to address this concern and further substantiate our conclusions, which highlight the efficiency of ReGraMa.

  • We compare ReGraMa with four related methods on RNN architectures, where ReGraMa consistently outperforms strong baselines including CBP, ReDo, and S&P (refer to Q4).
  • We further compare ReGraMa with the four mentioned recent methods, i.e., ReBorn, CBP, S&P, and ReDo, on BRO-net, where ReGraMa demonstrates superior performance (refer to Q5).
  • We conducted experiments in the non-ReLU activation setting against the four additional baselines, where ReGraMa consistently maintains its performance advantage (refer to Q5).

Questions

Q1:

Please refer to our response to W1 above.

Q2: I want to know the impact of resetting these four types of neurons on performance. Such analysis is very important to prove the effectiveness of GraMa.

To analyze which neurons contribute to ReGraMa’s performance gains, we added a study isolating the impact of resetting each neuron type from Fig. 3. The table below shows the following key findings:

  1. Resetting neurons with high learning capacity and low expressivity harms performance; these neurons are frequently misidentified by ReDo as dormant due to their low activation. Furthermore, resetting those with high learning capacity and high expressivity causes the most substantial performance degradation.
  2. Resetting neurons with low learning capacity and high expressivity yields marginal improvements, while resetting low learning capacity + low expressivity neurons provides the largest performance boost.

This analysis demonstrates that ReGraMa’s effectiveness stems from its ability to discriminate between neurons based on learning capacity.

Table. Performance impact (percentage change relative to the vanilla BRO-net policy, avg of 3 runs):

| Reset group | Humanoid Run |
| --- | --- |
| low learning capacity & low expressivity | +18% (± 4%) |
| low learning capacity & high expressivity | +6% (± 7%) |
| high learning capacity & high expressivity | -17% (± 9%) |
| high learning capacity & low expressivity | -9% (± 2%) |
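For reference, a minimal sketch of how neurons could be partitioned into the four groups above from per-neuron scores (hypothetical helper names and thresholds; the exact grouping used for this study may differ):

    import torch

    def partition_neurons(grama: torch.Tensor, activation: torch.Tensor,
                          tau_g: float = 0.1, tau_a: float = 0.1) -> dict:
        # grama: normalized per-neuron gradient-magnitude scores (learning capacity)
        # activation: normalized per-neuron mean activations (expressivity)
        low_g, low_a = grama <= tau_g, activation <= tau_a
        return {
            "low capacity, low expressivity": low_g & low_a,    # primary reset targets
            "low capacity, high expressivity": low_g & ~low_a,
            "high capacity, low expressivity": ~low_g & low_a,  # often flagged by ReDo
            "high capacity, high expressivity": ~low_g & ~low_a,
        }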

Q3: Why ReGraMa reduces GraMa significantly better than ReDo with exactly the same experimental settings in Fig. 5?

We believe this may be a misreading of Fig. 5 (right). The first panel of Fig. 5 (right) shows that ReGraMa (orange) and ReDo (pink) have nearly overlapping GraMa metrics throughout training, with only minimal differences (approximately 0.001), and the second panel shows that both methods achieve comparable accuracy curves on CIFAR-100. The slight variations are attributable to randomness in the model rather than algorithmic differences, which is expected and consistent with our theoretical analysis.

Fig. 5 (left), on the other hand, conveys a different message: the significant computational-efficiency advantage of GraMa over the $\tau$-dormant neuron calculation.

Please kindly let us know if we understand this question correctly.

Q4: I checked the code you provided, ... Does ReGraMa work with RNNs, a common architecture in critic networks?

We further extended ReGraMa to GRU-based RNN architectures based on ReBorn's official implementation [1]. ReGraMa matches or exceeds ReBorn and consistently outperforms the remaining baselines.

Table: Performance on RNN-based agents (avg. of 3 runs)

| Method | HalfCheetah | Ant |
| --- | --- | --- |
| vanilla RNN policy | 13007.57 ± 236.11 | 6963.57 ± 436.11 |
| ReGraMa | 14184.28 ± 115.48 | 7723.95 ± 104.81 |
| ReDo | 13275.94 ± 209.21 | 7091.53 ± 162.74 |
| S&P | 13191.73 ± 127.31 | 7306.83 ± 265.36 |
| ReBorn | 14106.64 ± 219.27 | 7529.42 ± 147.33 |
| CBP | 13452.71 ± 232.04 | 7254.41 ± 322.53 |

[1]: The dormant neuron phenomenon in multi-agent reinforcement learning value factorization.

Q5: Other plasticity-aware baselines (e.g., CBP, ReBorn) should be included beyond ReDo.

We have extended the comparison with additional baselines in both the BRO-net and TanH-activation settings, on 4 DMC tasks.

The results on BRO-net are summarized in Table a. ReGraMa consistently outperforms the other methods, highlighting its robustness in complex architectures. ReBorn, which considers both over-active and dormant neurons, shows moderate improvements but struggles on certain tasks (Dog Walk), while S&P and CBP show only slight improvements.

Table a. Normalized score relative to baseline based on BRO-net (average of 3 runs):

| Method | Humanoid Stand | Humanoid Run | Dog Stand | Dog Walk |
| --- | --- | --- | --- | --- |
| vanilla BRO-net (baseline) | 1.0 | 1.0 | 1.0 | 1.0 |
| ReGraMa | 1.21 ± 0.03 | 1.16 ± 0.08 | 1.12 ± 0.08 | 1.08 ± 0.04 |
| ReDo | 1.17 ± 0.07 | 0.96 ± 0.03 | 0.92 ± 0.12 | 0.94 ± 0.06 |
| S&P | 1.05 ± 0.12 | 1.08 ± 0.04 | 0.95 ± 0.07 | 1.07 ± 0.02 |
| ReBorn | 1.13 ± 0.06 | 1.05 ± 0.07 | 1.02 ± 0.06 | 0.91 ± 0.13 |
| CBP | 1.18 ± 0.02 | 0.99 ± 0.12 | 1.06 ± 0.04 | 1.09 ± 0.07 |

In the non-ReLU setting, ReGraMa and CBP demonstrate strong and robust performance, with our method achieving more stable performance across the different tasks.

Table b. Normalized score relative to baseline based on TanH based ResNet-SAC (average of 3 runs):

| Method | Humanoid Stand | Humanoid Run | Dog Stand | Dog Walk |
| --- | --- | --- | --- | --- |
| vanilla TanH-based ResNet-SAC (baseline) | 1.0 | 1.0 | 1.0 | 1.0 |
| ReGraMa | 1.29 ± 0.13 | 1.16 ± 0.04 | 1.23 ± 0.06 | 1.04 ± 0.02 |
| ReDo | 0.94 ± 0.11 | 0.93 ± 0.08 | 1.14 ± 0.07 | 0.98 ± 0.03 |
| S&P | 1.13 ± 0.04 | 0.84 ± 0.13 | 0.95 ± 0.03 | 1.08 ± 0.05 |
| ReBorn | 1.09 ± 0.12 | 1.02 ± 0.04 | 1.15 ± 0.01 | 0.94 ± 0.07 |
| CBP | 1.22 ± 0.06 | 1.03 ± 0.05 | 0.97 ± 0.06 | 1.06 ± 0.12 |

These comprehensive results across different neural plasticity methods further demonstrate ReGraMa's robustness across architectures, activation functions, and task domains.


Thanks again for reviewing our work and constructive feedback! We hope our responses address your concerns. We are happy to provide further clarification if needed.

Comment

Thank you for your response. Most of my concerns have been addressed, but I still have a few remaining questions:

Q1: Would it be more effective to reset only the neurons with both low learning capacity and low expressivity, rather than resetting all neurons with low learning capacity? My intuition is that neurons with low learning capacity but high expressivity may still be capable of stably learning useful representations. In the context of GraMa, would resetting such neurons potentially lead to a performance drop?

Q2: Although the absolute performance gap is only 0.001, the relative difference remains noticeable, especially around the 50th epoch. Did you conduct multiple runs for this experiment? If so, could you share the number of runs and the variance or standard deviation?

Q3: How do you compute GraMa for GRU layers? Could you please provide a more detailed explanation of how the metric is adapted to recurrent structures?

Comment

Thank you for the thoughtful questions, and we appreciate the opportunity to provide further clarification on these points.

Q1: Would it be more effective to reset only the neurons with both low learning capacity and low expressivity, rather than resetting all neurons with low learning capacity? My intuition is that neurons with low learning capacity but high expressivity may still be capable of stably learning useful representations. In the context of GraMa, would resetting such neurons potentially lead to a performance drop?

Thanks for your valuable question! In the updated version, we compared the performance of resetting low learning capacity neurons and resetting both low learning capacity and low expressivity neurons on the BRO-net policy, as summarized in Table 1.

While this approach shows improved efficiency in early training (we include this result in the updated version, but cannot attach a figure here), the difference in final performance is not statistically significant, and maintaining both indicators incurs some additional computational overhead.

Table 1. Performance comparison (percentage change relative to the vanilla BRO-net policy, avg of 3 runs):

| Reset strategy | Performance improvement over the baseline |
| --- | --- |
| low learning capacity & low expressivity | +18.51% ± 4.06% |
| all with low learning capacity | +17.83% ± 3.67% |

As the difference between the two reset strategies stems mainly from neurons with low learning capacity but high expressivity, we further tracked 1000 randomly sampled such neurons throughout training and measured their expressivity. Table 2 shows that these neurons gradually lose their expressivity even though they initially show high activation values, which may explain why selectively preserving them does not significantly improve final performance. We thank the reviewer for this question and believe it also represents a promising future direction for our work (particularly for specialized architectures where the correlation between expressivity and learning capacity might be weaker).

Table 2. Expressivity of the tracked neurons (average of 3 runs):

| Neurons | Expressivity during 2/4 training stage (baseline) | Expressivity during 3/4 training stage | Expressivity during 4/4 training stage |
| --- | --- | --- | --- |
| Identified as low learning capacity but high expressivity at the 2/4 training stage | 1.0 (normalized) | 0.57 ± 0.14 | 0.38 ± 0.29 |

Q2: Although the absolute performance gap is only 0.001, the relative difference remains noticeable, especially around the 50th epoch. Did you conduct multiple runs for this experiment? If so, could you share the number of runs and the variance or standard deviation?

Yes, we conducted 5 runs per configuration, and we include the mean and standard deviation in the updated version. Table 3 reports this information at the 50th and 100th epochs; the intervals overlap, and the slight variations are attributable to randomness in the model rather than algorithmic differences.

Table 3. GraMa metrics:

| Method | Epoch 50 | Epoch 100 |
| --- | --- | --- |
| ReDo | (4.17 ± 0.26) × 1e-3 | (4.08 ± 0.11) × 1e-3 |
| ReGraMa | (3.94 ± 0.43) × 1e-3 | (3.87 ± 0.24) × 1e-3 |

Q3: How do you compute GraMa for GRU layers? Could you please provide a more detailed explanation of how the metric is adapted to recurrent structures?

In the appendix of the new version, we include detailed implementation notes on the RNN-agent setting and on how GraMa is deployed in the RNN architecture, which maintains the same conceptual framework as our feedforward implementation while accounting for the temporal dependencies in recurrent networks. We will also include the RNN-based agent in our open-sourced codebase.

  • We use the standard architecture outlined in rnn_agent.py from ReBorn's official repository, which incorporates a GRUCell within the linear layers:
        self.fc1 = nn.Linear(input_shape, args.rnn_hidden_dim)
        self.net = nn.Sequential(nn.Linear(input_shape, args.rnn_hidden_dim),
                                 nn.ReLU(inplace=True),
                                 ActivateLayer(args.rnn_hidden_dim, 'A'),
                                 nn.Linear(args.rnn_hidden_dim, args.rnn_hidden_dim),
                                 nn.ReLU(inplace=True),)
        self.rnn = nn.GRUCell(args.rnn_hidden_dim, args.rnn_hidden_dim)
        self.fc2 = nn.Linear(args.rnn_hidden_dim, args.n_actions)

  • To implement GraMa for GRU architectures, we obtain the gradients assigned to all GRU layers during backpropagation at each training step and apply abs() to these gradients to obtain their magnitudes. Finally, we use ReGraMa to guide the reset of the GRU layers (a minimal sketch is given below).
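To make the second bullet concrete, here is a minimal sketch of how per-hidden-unit gradient magnitudes could be read off a GRUCell after a backward pass (an illustration under our assumptions, not the exact implementation in the codebase):

    import torch
    import torch.nn as nn

    def gru_grama_scores(gru: nn.GRUCell) -> torch.Tensor:
        # GRUCell stacks its three gates along dim 0: weight_ih has shape
        # (3 * hidden_size, input_size) and weight_hh has shape
        # (3 * hidden_size, hidden_size). After loss.backward(), take the
        # absolute gradients, average over the gates and the incoming
        # dimension, and normalize by the layer mean, as in the feedforward case.
        h = gru.hidden_size
        g_ih = gru.weight_ih.grad.abs().view(3, h, -1).mean(dim=(0, 2))
        g_hh = gru.weight_hh.grad.abs().view(3, h, -1).mean(dim=(0, 2))
        score = 0.5 * (g_ih + g_hh)
        return score / (score.mean() + 1e-8)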

We hope our responses address your concerns. We are happy to provide additional clarification if needed.

Comment

Dear Reviewer AyZs,

Thanks once again for your time in reviewing our work and for the constructive suggestions!

As the author–reviewer discussion period concludes in less than 24 hours, we would like to briefly follow up to confirm that our subsequent responses resolve your concerns. Any further comments would be greatly appreciated. Thanks!

Best, The Authors

Comment

Thank you for the detailed responses and additional experiments, they address my concerns. Taking into account this and the other reviews, I will be keeping my score. I hope the author will add these experiments to the paper.

Comment

Thank you for the constructive review and for noting that the additional experiments addressed all your concerns! If the concerns are now resolved, we would be grateful if you could update the score accordingly, as we believe the improved clarity and validation meet the bar for acceptance, in line with your own final assessment.

We will carefully incorporate your valuable feedback into the revision, which has greatly helped us improve the quality, clarity, and overall contribution of our work.

Thank you again for your time and effort.

Official Review
Rating: 3

The paper addresses the problem of neuronal activity loss in deep-RL agents. The authors argue that the standard definition of $\tau$-dormant neurons, which relies on relative activation magnitudes, fails to work for modern architecture components such as residual connections, non-ReLU activations, and normalization layers. The paper's central insight is that a neuron's learning capacity is better measured by the magnitude of its gradients than by its activations.

The paper introduces GraMa (Gradient-Magnitude Neuronal Activity metric) to measure learning capacity. The authors adapt the well-established ReDo algorithm to use the GraMa metric instead of activation magnitudes (ReGraMa). The paper provides:

  • an analysis of the failure cases of ReDo for BRO-net,
  • a theoretical proof that GraMa subsumes $\tau$-neuron dormancy in ReLU MLP layers, and
  • an evaluation against ReDo using a BRO-net agent on DMC, a diffusion-policy (DACER) on MuJoCo, and SAC with varying activations.

Compared to ReDo and a vanilla baseline, ReGraMa maintains lower inactive-neuron ratios and yields higher returns or faster learning.

Strengths and Weaknesses

My biggest problem with the paper is the lack of novelty: using gradient norms instead of activation magnitudes has already been proposed (see weaknesses). Therefore, I would require an exceptionally thorough evaluation using strong baselines or novel insights into RL training dynamics to recommend acceptance. While there are some interesting insights (neuron learning capacity loss is irreversible), overall, I find the evaluation to be lacking: ReDo does not work well outside NatureCNN-Atari, which is known (see weaknesses), and the baselines are weak. For me to move to an accept, the paper would need to benchmark against more baselines (see weaknesses) and uncover more about gradient behavior during RL training. As it stands now, the contribution is just too incremental for a NeurIPS paper.

Strengths

  • Clear writing and thorough explanations. Easy to understand.
  • Using gradient norms instead of activations is a natural extension of ReDo.
  • Great presentation including high-quality figures.
  • ReGraMa introduces two new hyperparameters, but the authors show convincingly that their settings work broadly.
  • Code is provided.

Weaknesses

  • The paper's central idea (use gradient norms instead of activations) has been developed before as a "gradient dormancy" measure for variable-strength Shrink & Perturb resets [1]. Therefore, I cannot consider this a novel contribution of the authors' paper.
  • The baselines are weak. The authors only benchmark against ReDo, which is known to not work well outside of the NatureCNN-Atari regime ([2], see Figure 12 in the Appendix). For the included benchmarks, there should be more baselines (not all are needed): Full resets (as BRO includes them), Plasticity Injection (also no catastrophic forgetting), and resets according to [1].
  • I'm not convinced that ReGraMa provides that much of a benefit outside of ReLU-style activations (see Figure 12). When using TanH, which is most common for state-based continuous control, the benefits of ReGraMa are minimal over a vanilla baseline (Figure 12).
  • The authors claim that "Recent studies [Nauman et al., 2024b, Lee et al., 2025a] have shown that integrating residual modules into deep RL agents can significantly improve representation capability on complex visual tasks". BRO is not a pixel RL agent; this statement must be clarified.
  • The introduction mentions catastrophic forgetting as motivation to discard reset approaches, but then proceeds to not evaluate catastrophic forgetting at all. The paper only evaluates off-policy approaches, which mitigate catastrophic forgetting through their replay buffer. Therefore, I believe reset approaches to be a valid baseline for ReGraMa.
  • Sometimes the writing is rather verbose. For example, the introduction is rather lengthy.

[1] Ji, Tianying, et al. "Ace: Off-policy actor-critic with causality-aware entropy regularization." (2024).

[2] Nauman, Michal, et al. "Overestimation, overfitting, and plasticity in actor-critic: the bitter lesson of reinforcement learning." (2024).

Questions

  • How dependent is ReGraMa on different optimizers? I assume the experiments use Adam, but this is not stated anywhere in the paper.
  • How does ReGraMa interact with parameter regularization, such as weight decay or L2 init [3]?

[3] Kumar, Saurabh, Henrik Marklund, and Benjamin Van Roy. "Maintaining plasticity in continual learning via regenerative regularization." (2023).

Limitations

Yes.

Final Justification

The authors have addressed all my concerns regarding the evaluation in their rebuttal. Therefore, I raise my score by one. During the rebuttal, the authors highlighted their conceptual contributions. However, I still believe that ACE has previously developed the main fundamental insights behind gradient-based reset approaches. In my eyes, this work provides an interesting, but ultimately limited extension to ACE's findings. Thus, I cannot recommend acceptance.

Formatting Concerns

None.

Author Response

Thank you for your thoughtful review and detailed feedback! We carefully considered your suggestions and conducted additional experiments and analyses to address your concerns. Below we respond to each point.

Weaknesses

W1: The paper's central idea has been developed before as "gradient dormancy" measure for variable-strength Shrink & Perturb resets in ACE. Therefore, I cannot consider this a novel contribution of the author's paper.

Thank you for highlighting this relevant prior work! After a detailed comparison and new empirical analysis, we believe our contribution remains novel and significant. We have added a discussion of similarities and differences between ACE and our work to the revised manuscript.

Key Differences with ACE [1] in focus, scope, and mechanism and our new insights:

  • ACE primarily proposes a novel causality-aware entropy term to improve exploration, with a secondary, empirical gradient-dormancy component to prevent overfitting that guides Shrink & Perturb resets (implemented in a "global" way; please see our new experimental analysis below, which further explains this difference).
  • In contrast, our work provides a comprehensive framework with novel insights for understanding neuronal learning capacity in deep RL. We provide the first rigorous analysis of why popular activation-based metrics fail with modern deep RL architectures and identify new phenomena. We show how gradient information resolves these issues and how the widely adopted ReDo mechanism can be restored to effectiveness when properly guided by gradient magnitude through our principled approach.
  • We provide detailed analysis and new perspectives that reinterpret the failure as stemming from ReDo's statistical objective, not its mechanism, and examine how modern DRL architectural components affect this previously overlooked phenomenon.

Additional empirical comparison:

Following the reviewer's suggestion for a more thorough evaluation, we have expanded our comparisons to include more recent strong baselines across diverse architectures (see W2) and non-ReLU activations (see W3); investigated the differences in learning behavior (see W5) as requested; and uncovered the neuron-level gradient behavior during training from both the temporal and model perspectives (see our response to W1 of Reviewer AyZs).

We compare ReGraMa with ACE's reset method in terms of generalization across diverse architectures. The consistent improvements demonstrated by ReGraMa across these architectures validate our theoretical insights and demonstrate their practical significance.

Table: Advantages over base policy at 3M step on Walker Walk (average of 3 runs):

| Method | BRO-net based | Unet-DACER based |
| --- | --- | --- |
| Baseline | 1.0 | 1.0 |
| ReGraMa | 1.18 ± 0.11 | 1.07 ± 0.05 |
| ACE's reset | 1.09 ± 0.14 | 0.94 ± 0.08 |

In summary, our work provides a rigorous, novel, and in-depth foundation, analysis, and comprehensive evaluation of neuronal learning capacity in deep RL, addressing key challenges previously unexplored in the literature.

[1]: Ji, Tianying, et al. "Ace: Off-policy actor-critic with causality-aware entropy regularization." arXiv preprint arXiv:2402.14528 (2024).

W2: The baselines are weak... there should be more baselines (not all are needed): Full resets, Plasticity Injection, and resets according to ACE.

We have added the related strong baselines, including full resets, Plasticity Injection, and Shrink & Perturb, on 4 DMC tasks (6M steps). All methods follow the official implementations. Results show that ReGraMa improves learning performance across all 4 tasks. The more radical full-reset approach magnifies performance uncertainty, as evidenced by a high standard deviation across seeds. Plasticity Injection performs the weakest, likely because abrupt architectural changes can interfere with learning in complex agents. Shrink & Perturb resets show stable but relatively minor gains in 3 tasks.

Table. Performance improvement over baseline (Average of 3 runs):

| Method | Quadruped Run | Humanoid Run | Walker Walk | Humanoid Walk |
| --- | --- | --- | --- | --- |
| vanilla BRO-net policy (baseline score) | 1.0 | 1.0 | 1.0 | 1.0 |
| ReGraMa | 1.08 ± 0.03 | 1.16 ± 0.07 | 1.07 ± 0.04 | 1.12 ± 0.08 |
| Shrink & Perturb reset | 1.05 ± 0.06 | 1.04 ± 0.05 | 1.09 ± 0.03 | 0.97 ± 0.06 |
| Full Resets (every 300k) | 1.03 ± 0.08 | 1.13 ± 0.16 | 0.92 ± 0.14 | 0.91 ± 0.07 |
| Plasticity Injection | 1.08 ± 0.13 | 0.94 ± 0.11 | 1.03 ± 0.09 | 1.08 ± 0.09 |

W3: When using TanH, which is most common for state-based continuous control, the benefits of ReGraMa are minimal over a vanilla baseline (Figure 12).

We replaced MuJoCo with a more challenging task (Walker Walk) based on SAC. The results below (normalized to vanilla SAC) show clear benefits from ReGraMa, outperforming both ReDo and the vanilla policy:

Table. Advantages over vanilla policy (Average of 3 runs):

| Method | Walker Walk |
| --- | --- |
| TanH-SAC (baseline score) | 1.0 |
| ReGraMa | 1.18 ± 0.11 |
| ReDo | 0.93 ± 0.14 |

In addition, we followed ReDo's setup in its Appendix B and adopted the sparse SAC to avoid the noise caused by parameter redundancy. The results show that our method significantly improves policy performance.

Table. Advantages over vanilla policy (Average of 3 runs):

| Method | Ant |
| --- | --- |
| 0.5-width SAC (baseline) | 1.0 |
| ReGraMa | 1.14 ± 0.04 |
| ReDo | 1.09 ± 0.06 |

W4: The authors claim that "Recent studies [Nauman et al., 2024b, Lee et al., 2025a] … improve representation capability on complex visual tasks". BRO is not a pixel RL agent; this statement must be clarified.

We would like to clarify that our claim is based on Table 8 of the official BRO-net paper (camera-ready, OpenReview NeurIPS’24), which evaluates BRO-net on image-based tasks. We will highlight that we use the same encoder setup (DrQv2-style convolutional encoder) and BRO’s official PyTorch wrapper.

W5: The introduction mentions catastrophic forgetting as motivation to discard reset approaches, but then proceeds to not evaluate catastrophic forgetting ... Therefore, I believe reset approaches to be a valid baseline for ReGraMa.

We use DrQ as the backbone and compare the performance on the long-term training task Cheetah Run (8M steps). It can be seen that full resets may cause instability and oscillations. ReGraMa balances the restoration of network activity with forgetting by precisely resetting individual neurons.

Table. DrQ on Cheetah Run (average of 3 seeds):

| Method | 1/5 training stage | 2/5 training stage | 3/5 training stage | 4/5 training stage | 5/5 training stage |
| --- | --- | --- | --- | --- | --- |
| DrQ | 478.39 ± 78.96 | 612.52 ± 112.34 | 549.17 ± 136.81 | 662.03 ± 173.27 | 637.61 ± 193.68 |
| ReGraMa | 492.67 ± 92.04 | 641.31 ± 107.93 | 631.58 ± 125.36 | 653.03 ± 173.27 | 673.62 ± 126.21 |
| Full Reset (every 2M steps) | 495.14 ± 108.35 | 627.89 ± 159.06 | 514.73 ± 154.57 | 629.97 ± 186.14 | 652.37 ± 177.35 |

We also added PPO as an on-policy baseline to compare the above methods. ReDo resets have only a relatively weak positive effect on performance, and full resets cause performance instability, while ReGraMa significantly enhances learning efficiency.

Table. PPO Performance on HalfCheetah (Average of 3 runs):

| Method | HalfCheetah |
| --- | --- |
| vanilla PPO | 2604.71 ± 487.61 |
| ReGraMa | 3418.38 ± 336.14 |
| ReDo | 2641.31 ± 394.27 |
| Full Reset (every 250k steps) | 2313.85 ± 564.93 |

We would like to emphasize that "mitigate catastrophic forgetting" is mainly used to introduce the contributions of related work, not to describe the approach proposed in this paper.

W6: Sometimes the writing is rather verbose. For example, the introduction is rather lengthy.

We’ve streamlined the introduction and moved the detailed related work to its own dedicated section.

Questions

Q1: How dependent is ReGrama on different optimizers?

We uniformly use the Adam optimizer for fair comparison.

We have tested ReGraMa with different optimizers on Walker Walk. The optimizers include Adaptive Moment Estimation (Adam), Stochastic Gradient Descent (SGD), and Root Mean Square Propagation (RMSprop). Results show that ReGraMa might be more suitable for optimizers with adaptive learning rate, though future work could explore optimizer-reset interactions further.

Table. Optimizer Dependence (Average of 3 seeds):

| Method | Adam | SGD | RMSprop |
| --- | --- | --- | --- |
| SAC | 1.0 | 1.0 | 1.0 |
| ReGraMa | 1.07 ± 0.05 | 0.99 ± 0.07 | 1.08 ± 0.03 |

Q2: How does ReGraMa interact with parameter regularization, such as weight decay or L2 init [3]?

We tested the effectiveness of combining ReGraMa with weight decay (following the optimal setting in [1]) and L2 init ($\lambda = 1\mathrm{e}{-2}$) using DrQv2. Results show that integrating either method with ReGraMa can further enhance learning efficiency, and L2 init appears to be more compatible with ReGraMa.

Table. ReGraMa with Regularization (Average of 3 runs):

| Method | Quadruped Run (3M steps) |
| --- | --- |
| Vanilla Policy | 649.13 ± 182.43 |
| ReGraMa | 706.59 ± 127.25 |
| ReGraMa + L2 init | 747.36 ± 127.31 |
| ReGraMa + Weight Decay | 721.31 ± 94.26 |
| ReGraMa + L2 Init + W. Decay | 739.42 ± 104.58 |

[1]: Lyle, Clare, et al. "Disentangling the causes of plasticity loss in neural networks." arXiv (2024).


Thank you again for your constructive feedback. We hope these additional results address your concerns. Please let us know if further clarification is needed.

Comment

Thank you for the extensive rebuttal, particularly the additional experiments. The new baselines address my request for broader empirical comparison, and I will raise my score by one to reflect this.

However, my concerns about the paper’s novelty remain. In my view:

> our work provides a comprehensive framework that provides novel insights for understanding neuronal learning capacity in deep RL

ReGraMa combines ACE’s gradient-dormancy criterion with the ReDo reset, replacing Shrink & Perturb. While this performs better, it does not amount to the “comprehensive framework” claimed.

> We provide the first rigorous analysis of why popular activation-based metrics fail with modern deep RL architectures and identify new phenomena

The failure of activation-based metrics in continuous control was already demonstrated by ACE. Your results on BRO-net provide a minor extension of ACE's insights for a particular network architecture. The authors of ACE have already shown that gradient dormancy works on more modern architectures (see Appendix J, results with DrQ-v2). To me, the main new phenomenon established by your paper is the permanent loss of learning capacity over time, which is minor.

> We provide detailed analysis and new perspectives that reinterpret the failure as stemming from ReDo’s statistical objective

The stated insight that ReDo’s issues stem from its statistical objective is the same as ACE’s original motivation for introducing gradient dormancy ("However, the dormant phenomenon defined in these works is not evident in state-based RL", p. 4), so I do not see a genuinely new conceptual contribution.

In short, the paper is technically solid and now empirically stronger, but I still find the conceptual advance incremental. Thus, after the score increase noted above, my recommendation remains below the acceptance threshold.

Comment

We thank the reviewer for the detailed feedback and for acknowledging that the additional experiments have mitigated several concerns. After careful consideration, however, we believe our core technical contribution may have been overlooked, and we respectfully ask that the novelty and contributions of our work be given further consideration.

On the orthogonality between our contributions and those of ACE

In the revised version, we will further emphasize the orthogonality between our contributions and those of ACE. While ACE made valuable contributions through its causality-aware entropy term for exploration and briefly observed gradient-based measurements in specific contexts, our work addresses a fundamental gap in the literature: the lack of a systematic analysis explaining why gradient-based measures outperform activation-based approaches across diverse neural architectures.

Fundamental Technical Innovation

Our work analytically uncovers the limitations of widely adopted activation-based metrics in various modern deep RL architectures and systematically establishes the following points, which we believe deepen the RL community's understanding of neuronal activity:

  • Systematically exposes critical failures of widely adopted activation-based metrics across diverse modern deep RL architectures (with analysis of the key aspects in Sec. 3.1)
  • Identifies the mechanistic reasons behind these failures (beyond merely demonstrating the phenomenon) and the weaknesses of conventional dormancy detection that explain its inconsistent performance across architectures
  • Shows that the proposed measure is generally applicable across diverse neural architectures, supported by broad empirical evidence

These findings do not merely extend prior observations; they establish critical new insights and advances for understanding neuronal functionality in deep RL. Moreover, our work:

  • Provides critical architectural insights and broad empirical evaluations that explain why and when activation-based metrics succeed or fail, and why gradient-based approaches remain reliable across diverse architectures
  • Reveals the equivalence of the two measures in specific cases (a sketch of the stylized ReLU-MLP case follows this list), a previously unreported relationship that validates the gradient-based view, and analyzes the temporal dynamics of both measurements
  • Reveals temporal dynamics: our analysis shows that activation-based metrics trigger aggressive resets early in training, when activation patterns are naturally unstable and resets can disrupt learning, while gradient-based measurement follows a more progressive pattern that aligns with learning dynamics; we also quantify the layer-wise reset preferences of each measure
  • Demonstrates cross-architectural applicability: we validate our approach on multiple widely used policy architectures, from simple MLPs to complex diffusion policies, with consistent results
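
A hedged sketch of the stylized ReLU-MLP case referenced above (our own restatement with illustrative notation, not a verbatim derivation from the paper): for a ReLU unit, zero activation over a batch forces a zero gradient on its incoming weights, so activation dormancy implies gradient dormancy, while the converse need not hold.

```latex
% ReLU unit: pre-activation z_i, output h_i, batch B (notation illustrative).
z_i(x) = w_i^\top x + b_i, \qquad h_i(x) = \max\bigl(0,\, z_i(x)\bigr)
% Gradient of the loss w.r.t. the incoming weights over the batch:
\frac{\partial \mathcal{L}}{\partial w_i}
  = \sum_{x \in \mathcal{B}} \frac{\partial \mathcal{L}}{\partial h_i(x)}\,
    \mathbf{1}\!\left[z_i(x) > 0\right] x
% If h_i(x) = 0 for every x in B (the zero-activation notion of dormancy), every
% indicator vanishes, so the gradient is zero and the neuron is gradient-dormant
% as well. The converse fails in general, e.g. with non-ReLU activations or
% residual paths, which is where the two measures diverge.
```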

Our experiments show consistent gains across environments with varying complexity levels, empirically validating that our approach represents a critical advance, not an incremental combination. These analyses also provide insights that were not previously available in the literature and help explain why ReGraMa outperforms previous reset strategies.

Through the above analysis, our work delivers a general framework for studying neuronal functionality in deep RL, which leads to ReGraMa and enables more reliable measurement, better performance, and a deeper understanding of agent learning behavior.

> While this performs better, it does not amount to the “comprehensive framework” claimed.

Our use of “comprehensive framework” may have come off as stronger than we intended. What we meant to convey was the breadth of experiments we ran to justify our resulting analyses:

  • Through a number of carefully designed experiments, we summarized and analyzed the characteristics of neuronal activity and the correlation between gradient-based and activation-based measures from multiple perspectives, such as the types of reset neurons (Fig. 3), policy model types (BRO-net, UNet-based diffusion policy, Tanh-based SAC, DrQ), and the reversibility of dormancy (Fig. 6).
  • We tracked the proportion of neurons reset by ReDo and ReGraMa over the course of training and uncovered differences in their identification preferences along the temporal dimension (see our response to W1 from Reviewer AyZs).
  • We further analyzed the layer-wise distribution of resets for both types of methods to uncover their layer-wise preferences (Table 2 of our response to Reviewer AyZs).

We believe these contributions collectively represent a meaningful advance in understanding neuronal dynamics in deep RL, beyond the initial observations in ACE, and that they provide critical insights into neuronal behavior that will benefit the community.


Thanks again for reviewing our work and for the constructive feedback! We hope our responses address your concerns, and we are happy to provide additional clarification if needed.

Comment

Thank you again for the clarifications and for the new experiments.

I still see the foundational contributions as coming from ACE. Namely, 1) revealing the failure of activation-based metrics in common Deep-RL settings, 2) proposing a gradient-based alternative, and 3) showing that gradient-based approaches can transfer across architectures.

That said, your paper extends ACE's findings: you

  • broaden the evaluation to a wider range of architectures
  • analyze the temporal dynamics of (gradient) dormant neurons
  • apply a different (but already established) reset formulation that yields practical gains
  • show that gradient dormancy subsumes activation dormancy in a stylized ReLU-MLP setting.

To reflect these additional contributions and your new experiments, I have raised my score by one.

As our views on the relative novelty of the core insight and contributions still differ, I suggest we agree to disagree and let the final decision rest with the chair.

Comment

We greatly appreciate the reviewer's recognition, in light of the additional experiments and analysis provided in our responses, that our work is 'technically solid and now empirically stronger', as well as the resulting score increase. Thanks once again for your time and input, which have helped improve our manuscript.

Final Decision

This paper introduces gradient magnitude as a novel and practical metric to identify and reset inactive neurons in reinforcement learning. The proposed measure, GraMa, is both lightweight and highly effective, particularly in complex network architectures. The manuscript is well-written, and the key results are clearly explained, well-illustrated, and thoughtfully summarized. Notably, the method is runtime-efficient and significantly enhances performance in many scenarios, sometimes yielding substantial improvements. While the paper offers limited formal guarantees or theoretical explanations for why GraMa works, its empirical effectiveness and innovative approaches make it a valuable contribution to the field.