PaperHub
Score: 5.5/10 · Decision: Rejected · 3 reviewers
Ratings: 2, 3, 4 (min 2, max 4, std 0.8)
ICML 2025

Hadamard Representations: Augmenting Hyperbolic Tangents in RL

OpenReview | PDF
Submitted: 2025-01-20 · Updated: 2025-06-18

Keywords

Representation Learning, Reinforcement Learning, Hadamard, Latent Representation

Reviews and Discussion

Review (Rating: 2)

This paper addresses the issue of "dying neurons" in reinforcement learning, focusing particularly on continuously differentiable activation functions like hyperbolic tangent (tanh). The authors demonstrate that the dying neuron phenomenon is not exclusive to ReLU activations but also affects tanh, where saturation turns weights into "hidden biases" rather than pruning them as in ReLU. The paper proposes a "Hadamard representation" (HR) that defines a hidden layer as the Hadamard product of two separately parameterized activation vectors. Experiments on various Atari games using DQN, PPO, and PQN show that tanh with HR leads to significant performance improvements, reduced dead neurons, and increased effective rank compared to standard tanh and ReLU activations.
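
For concreteness, a minimal sketch of such a layer is given below, assuming a standard PyTorch-style implementation; the names and sizes are illustrative and this is not the authors' code.

```python
import torch
import torch.nn as nn

class HadamardTanhLayer(nn.Module):
    """Hidden layer built as the Hadamard (elementwise) product of two
    independently parameterized, tanh-activated linear maps of the same
    input, as described in the review summary above. Illustrative sketch
    only; the paper's exact architecture may differ."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.branch_a = nn.Linear(in_dim, out_dim)
        self.branch_b = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch can saturate on its own; the product only loses its
        # gradient when both branches saturate for the same unit.
        return torch.tanh(self.branch_a(x)) * torch.tanh(self.branch_b(x))
```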

Questions for the Authors

  1. Have you conducted experiments combining Hadamard representations with layer normalization? Since layer normalization can also mitigate neuron saturation, it's important to understand whether your approach provides complementary benefits or if the simpler normalization approach alone achieves similar results.
  2. The paper focuses primarily on Atari environments. Have you tested the approach in continuous control tasks or other RL domains where the state representation is different? This would help establish the generality of your findings beyond Atari.

Claims and Evidence

The core claims regarding reduced dead neurons and improved performance when using Hadamard representations with tanh are supported by the experimental evidence presented. The theoretical analysis of why Hadamard products help with tanh but not with ReLU or sigmoid is sound and verified by the experimental results shown in Figure 14. However, the claim that this is the best approach to address dying neurons lacks comparative evidence against other existing solutions like layer normalization, which is only briefly explored in one ablation experiment.

Methods and Evaluation Criteria

The methods proposed are logical for addressing the stated problem. The evaluation on multiple reinforcement learning algorithms (DQN, PPO, PQN) across Atari environments is appropriate for benchmarking in RL. The metrics used (dying neuron rate, effective rank, and performance) directly measure the phenomena being studied. However, the evaluation would be stronger if it included more comparative baselines from the literature that specifically address neuron saturation and plasticity (such as DMC).
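
For reference, one common way to compute the effective-rank metric mentioned here is the exponential of the entropy of the normalized singular values (Roy & Vetterli, 2007), sketched below; the paper may use a different estimator, so this is illustrative only.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank of a (batch x latent) feature matrix, computed as
    exp(entropy) of the normalized singular-value distribution
    (Roy & Vetterli, 2007). Illustrative; the paper's exact estimator
    may differ (e.g. a cutoff-based srank)."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    p = p[p > 0]  # drop zeros before taking the log
    return float(np.exp(-(p * np.log(p)).sum()))
```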

Theoretical Claims

I have checked the theoretical derivations in Section 4, particularly the product saturation probability analysis and Theorem 4.1. The mathematical reasoning appears sound, where the authors show that the probability of dying neurons is reduced from p to p² when using Hadamard representation with tanh. The proof that weights connected to dead tanh neurons become biases (Theorem 4.1) is valid and provides a useful insight into why dying tanh neurons can be more problematic than dying ReLUs.
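
A rough way to see the p to p² step (a paraphrase of the intuition with illustrative notation, not the paper's exact proof): writing the Hadamard unit as a product of two tanh factors, its gradient vanishes only when both factors saturate.

```latex
% Sketch of the p -> p^2 intuition (paraphrase; notation is illustrative).
% Let z = z_a \cdot z_b with z_a = \tanh(w_a^\top x) and z_b = \tanh(w_b^\top x).
\[
  \frac{\partial z}{\partial x}
  = z_b \,\frac{\partial z_a}{\partial x} + z_a \,\frac{\partial z_b}{\partial x}.
\]
% tanh saturates near \pm 1, where its derivative vanishes but its value does not.
% If only z_a saturates, the second term still carries gradient (z_a \approx \pm 1),
% so the unit remains trainable. The gradient is lost only when both factors
% saturate, which under (approximate) independence of the branches happens with
\[
  \Pr[\text{both factors saturate}] \approx p \cdot p = p^2,
\]
% versus probability p for a single tanh unit.
```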

Experimental Design and Analysis

The experimental design is generally sound. The authors test their approach across multiple RL algorithms and environments, use appropriate metrics, and include ablation studies. The kernel density estimation approach for detecting saturated neurons is reasonable. However, I found the learning rate sensitivity experiment for the larger 1024-dimensional network in Figure 13(d) to be incomplete, as it doesn't fully explore whether the performance gains could be achieved through simpler hyperparameter tuning alone.
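
As an aside, a kernel-density-based saturation check of the kind mentioned here could look roughly like the sketch below; the threshold and decision rule are assumptions for illustration, not the paper's exact criterion.

```python
import numpy as np
from scipy.stats import gaussian_kde

def saturated_mass(activations: np.ndarray, edge: float = 0.95) -> float:
    """Estimate how much of a tanh neuron's activation density sits near
    the asymptotes +-1 over a batch of inputs. The `edge` threshold and
    the use of a Gaussian KDE are illustrative assumptions."""
    if np.std(activations) < 1e-6:
        return 1.0  # constant output: treat as fully saturated/dead
    kde = gaussian_kde(activations)
    grid = np.linspace(-1.0, 1.0, 401)
    density = kde(grid)
    density /= density.sum()  # normalize the discretized density to a mass
    return float(density[np.abs(grid) >= edge].sum())

# A neuron could then be flagged as (near-)dead when, e.g., saturated_mass(...) > 0.99.
```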

Supplementary Material

No

Relation to Prior Literature

This work connects to literature on network capacity in RL, particularly the dying ReLU problem, but expands it to continuously differentiable activations.

Missing Important References

The paper lacks sufficient discussion of layer normalization (Ba et al., 2016) as a potential solution for the dying neuron problem. While it's mentioned briefly in Figure 8(a), a more thorough comparison is needed given that normalization techniques are widely used to address activation saturation. The authors should also discuss more recent work on maintaining plasticity in RL networks, such as weight reinitialization techniques or specialized regularization approaches that target the same problem but may be simpler to implement.

Other Strengths and Weaknesses

The paper provides a novel perspective on dying neurons in tanh activations and offers a conceptually simple solution with the Hadamard representation.

Other Comments or Suggestions

The paper would benefit from clearer exposition in some sections, particularly in explaining the practical implementation details of HR.

Author Response

We appreciate Reviewer KgES's detailed review and recognition of our work's experiments and theoretical claims. Your suggestions are valuable, and we hope to address them here:

  1. Layer Normalization (LN): We now see that we could have emphasized the effect of LN with respect to dying neurons and effective rank more strongly. We have therefore run experiments examining the effect of LN on dead neurons and the effective rank (ER), similar to Fig. 7a,b. These experiments further show that a reduction in dead neurons does not always correlate with an increase in a layer's ER. As stated in [1], LN enforces a 'simplicity bias' throughout the layer, generally favoring lower-frequency, less complex solutions. This could further explain the performance discrepancy we observe between the Hadamard Representation (HR) and other regularizing techniques that force neuron resets or impose specific distributions. We will update Figures 7a,b in the revised manuscript to include dead neurons and ER during training with LN. Additionally, in the corresponding Experiments sections, we will further highlight the differences between LN and HR with respect to dead neurons, ER, and their ratio. We will also highlight this difference in the introduction.

    Activation    Effective Rank (ER)    Dead Neurons
    Tanh (HR)     428.3                  0.297
    Layernorm     347.5                  0.227
    Tanh          363.4                  0.393
    Sigmoid       376.8                  0.449
    ReLU          245.6                  0.618
  • Combining LN & HR: You are intuitively correct to suggest combining them, and we overlooked mentioning this in the manuscript. We would therefore like to point out that the baseline of the 51-Atari-Environment PQN [4] experiments already uses LN. We implemented HR on top of this baseline, meaning it combines LN with HR (Layer Output -> LN -> Nonlinear (Tanh) -> HR; a sketch of one possible reading of this composition is given after this list). Thus, they can very well be used together. As noted, we did not clarify this in the manuscript, and we will add a detailed explanation of the PQN implementation in the final Experiments subsection. We will also update the Figure 1 caption in the revised manuscript to explicitly state that the PQN baselines are layer-normalized.

  • Continuous Control Tasks: We have now conducted experiments applying HR to state-based continuous control (Mujoco) with PPO. Preliminary results indicate that both HR and LN perform similarly to the baseline. We attribute this to: 1) A negligible occurrence of dead neurons in the Baseline, and 2) The low-dimensional state observation and resulting encoder structure. We hypothesize that encoders compressing high-dimensional input spaces (e.g., Atari, pixels) benefit more from HR due to its increased representation capacity, as measured by the ER. Recent related work [2] also shows a strong simplicity bias preference in state-based RL. However, we will add an Appendix section discussing these results and explaining the differences between encoders in low-dimensional state-based RL and high-dimensional pixel-based RL. Additionally, regarding environment diversity, we also have strong HR results in another pixel-based (50x50 pixels) toy RL environment that was used early in our HR research. For completeness, we will include this in another Appendix section.

  • Literature: We identified a recent related paper [3] investigating the causes of plasticity loss in deep value-based RL. They found that combining regularizations yields good results, as a standalone L2 norm or Batchnorm underperforms in Atari compared to LN alone. We will include this paper in our related work and discuss its relevance to our research in the revised manuscript. If there is any other research missing, please let us know.

  • Implementation and Hyperparameters: We thank the Reviewer for highlighting this and will provide a clearer overview of the precise HR implementation at the beginning of the Experiments section. We will also make sure to emphasize the possibility of combining HR with LN. Furthermore, we would like to point out that no hyperparameter tuning was used for any of the HR implementations. We only used a learning-rate sweep in the ablation for the 1024-latent-dimension baseline (Fig. 8a, sweep shown in Fig. 13d), to make the 1024-dimensional baseline stronger and to support the point that HR does not derive its benefits from a mere parameter increase.
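
To make the composition described in the LN & HR point above concrete, a sketch of one possible reading is given below; the rebuttal does not state whether LN is applied once before branching or per branch, so this version, with one LayerNorm per branch, is an assumption for illustration only and not the authors' implementation.

```python
import torch
import torch.nn as nn

class LayerNormHadamardBlock(nn.Module):
    """One possible reading of the PQN-style pipeline described above
    (Linear -> LayerNorm -> tanh per branch, combined by a Hadamard
    product). Illustrative only; not the authors' code."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin_a = nn.Linear(in_dim, out_dim)
        self.lin_b = nn.Linear(in_dim, out_dim)
        self.norm_a = nn.LayerNorm(out_dim)
        self.norm_b = nn.LayerNorm(out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = torch.tanh(self.norm_a(self.lin_a(x)))
        b = torch.tanh(self.norm_b(self.lin_b(x)))
        return a * b
```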

Finally, we are grateful for your input, and believe that the revisions based on your review will enhance the paper’s scope and clarity.

[1] Teney et al., 2024. Neural Redshift: Random Networks are not Random Functions. CVPR 2024.

[2] Lee et al., 2024. SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning.

[3] Lyle et al., 2024. Disentangling the Causes of Plasticity Loss in Neural Networks. CoLLAs 2024.

[4] Gallici et al., 2025. Simplifying Deep Temporal Difference Learning.

Review (Rating: 3)

The paper demonstrates through experiments that activation functions such as tanh and sigmoid suffer from dying neurons at a scale comparable to that of ReLUs in RL settings. A Hadamard product with a carry gate is used to mitigate the dying-neuron issue. It is shown that the Hadamard representation method effectively reduces the dying-neuron rate.

Questions for the Authors

See the strengths and weaknesses section.

Claims and Evidence

Yes

Methods and Evaluation Criteria

Makes sense to me.

Theoretical Claims

Looks correct to me.

Experimental Design and Analysis

Looks reasonable to me.

Supplementary Material

N/A

Relation to Prior Literature

The paper discusses the dying-neuron effect for different types of activation functions in RL settings, bringing more insight into why ReLU works better than other activation functions.

Missing Important References

N/A

Other Strengths and Weaknesses

Strengths: The paper conducts a large number of experiments to support the discussion and provides evidence for why ReLU is preferred over other activation functions in RL settings. In addition, a Hadamard-product-based method is proposed to mitigate the dying-neuron effect for the tanh activation function. The paper is well written.

Weaknesses: This is not a weakness but a question. How different is the Hadamard representation method in the paper from the idea of using a product of hidden layers and a 'gate' proposed in (Srivastava et al., 2015) for supervised learning?

Other Comments or Suggestions

N/A

Author Response

We thank Reviewer BWA7 for the review, and for acknowledging our experimental efforts and discussion around the dying neuron effect.

We are happy to clarify your question about Srivastava et al. [1]. While both approaches use products of hidden layers, they differ in purpose and design. Highway Networks use a learnable gate T(x) in H(x) = F(x) · T(x) + x · (1 - T(x)), with x as the input and T(x) typically sigmoid-activated, targeting improved gradient flow in deep supervised learning. Our Hadamard Representation (HR) instead defines the hidden layer as the Hadamard product of two independently parameterized, tanh-activated layers computed from the same input, with no carry gate or residual connection. HR uses tanh's properties to reduce dead neurons in RL and to boost effective rank, whereas Highway Networks focus on preserving gradients throughout deep networks via learnable, partial residual connections. In other words, the gate learns in what proportion the input travels directly to the next layer instead of passing through the layer's weights and activation.
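
For contrast with the HR layer sketched earlier in this discussion, a minimal Highway-style layer following the formula above could look as follows; the ReLU transform branch is an assumption, and the variant the authors tested inside PQN may differ in detail.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """Highway-style layer: H(x) = F(x) * T(x) + x * (1 - T(x)) with a
    sigmoid transform gate T (Srivastava et al., 2015). The ReLU choice
    for F is illustrative; input and output widths must match for the
    residual path."""

    def __init__(self, dim: int):
        super().__init__()
        self.transform = nn.Linear(dim, dim)  # produces F(x)
        self.gate = nn.Linear(dim, dim)       # produces T(x)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = torch.sigmoid(self.gate(x))
        f = torch.relu(self.transform(x))
        return f * t + x * (1.0 - t)
```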

Additionally, following the Reviewer's question, we ran experiments testing a Highway Network variant, following the Highway Networks paper and using a sigmoid gate in PQN. Interestingly, it appears to perform slightly better than the ReLU baseline. However, it does not show the same performance benefits as the Hadamard Representation, although further research into these architectures would be interesting future work.

Activation Function    Median Human-Normalized Score (51 Games)
ReLU                   1.057
ReLU (Highway)         1.154
Tanh                   0.597
Tanh (HR)              1.340

Following the Reviewer's question, we will add a clearer overview of the distinction between Highway Networks and the Hadamard Representation to our revised manuscript in the corresponding section (Line 197). We will also add a new Appendix section showing the Highway Network result on PQN and comparing it with the Hadamard Representation.

[1] Srivastava et al., 2015. Highway Networks.

Review (Rating: 4)

The paper is about developing new strategies to mitigate the dead neurons that are prominent in typical reinforcement learning settings. The authors propose Hadamard representations, which use two hidden layers and an activation function. Experiments were done on Atari games using DQN, PPO, and PQN (which is a parallel Q-learning algorithm). The proposed method showed fewer dead neurons compared with methods that use other types of activation functions such as ReLU, Continual ReLU, Tanh, and Sigmoid. While not directly contributing to better performance, the use of Hadamard representations also leads to a better effective rank of the representations.

Questions for the Authors

I only have one question, which is to see some analysis done on a continuous-action benchmark such as MuJoCo. It would be interesting to see how state observations and the use of the tanh activation function in the actor network may or may not lead to a different set of results.

Claims and Evidence

Yes, the authors did sufficient work to support their claims. Beyond the analysis of the performance of the agents, the authors also presented studies relating to dead neurons.

Methods and Evaluation Criteria

Yes, the authors chose the complex setting of Atari games to evaluate their hypothesis.

Theoretical Claims

Yes, I have verified that they are correct to the best of my knowledge.

Experimental Design and Analysis

Yes, the experimental design is sound.

Supplementary Material

Only briefly, since I was familiar with the topic discussed in the paper. Most of the concepts in the paper are covered sufficiently, and there isn't a lot of reference from the main paper to the supplementary material that requires much attention.

Relation to Prior Literature

It relates to current work on improving the continual learnability of deep RL. Many current studies have looked at dead neurons and introduce various types of activation functions to ensure better gradient propagation to the shallower layers. Here, the authors use a combination of layers with tanh activations to show how dead neurons can be reduced. Therefore, this piece of work aligns well with the body of work exploring representations and activation functions for reducing dead neurons.

Missing Important References

None.

Other Strengths and Weaknesses

Strengths

  1. The paper is well written and clear, which made it easy to read and understand.
  2. Relevant literature was discussed.

Weaknesses

  1. Only evaluated on the Atari benchmark with discrete actions.

Other Comments or Suggestions

Line 143: "Proof. Let us consider a set of neurons α_i^j and forward-connected weights w^j_{α_i} in layer z_j." Should it be layer j instead of layer z_j?

Author Response

We’re grateful for Reviewer X4xE’s positive comments on our paper!

  • You're correct about the typo on Line 143. It should be "layer j," or alternatively "the hidden layer z^j"; we will fix it in the revision. We thank the Reviewer for pointing it out!

  • We have now run preliminary experiments applying HR to state-based continuous control (Mujoco) with PPO. Preliminary results show that both HR and LN perform similarly to the baseline. We credit this to two reasons: 1) a near-zero occurrence of dead neurons in the baseline, and 2) the different encoder structure: we believe that an encoder that compresses high-dimensional input spaces (Atari, pixels, etc.) benefits more from the Hadamard Representation due to the added representation capacity, measured through the Effective Rank (ER). We also see that recent related work [1] shows a preference in state-based RL for a simplicity bias [2]. We will add a section to our Appendix showing the results and explaining the differences between encoders in low-dimensional state-based RL and high-dimensional pixel-based RL.

  • Additionally, for completeness and to add another environment different from Atari, we also have strong results using HR in a 50 x 50 greyscale pixel toy RL environment, which we will add to the Appendix. These results were acquired in the early stages of our research as a predecessor to the Atari domain.

[1] Lee et al., 2024. SimBa: Simplicity Bias for Scaling Up Parameters in Deep Reinforcement Learning.

[2] Teney et al., 2024. Neural Redshift: Random Networks are not Random Functions. CVPR 2024.

Reviewer Comment

Thank you to the authors for running the additional experiments on MuJoCo. The request was mainly to ensure that similar phenomena are also observed in the continuous-action setting and that the proposed method does not hurt. Nonetheless, these additional studies will make this work more comprehensive, and they can be added to the appendix.

Having read the other reviews from the other reviewers, I will keep my score as it is.

Final Decision

The work proposes Hadamard Representations, a particular parametrization of tanh neural networks that allows for better gradient flow and hence fewer dead units, permitting the network to maintain plasticity. The authors explore the proposed method on Atari, showing good performance.

Reviewers agree that the theory and proposed method are sound. The experiments also seem fine. I read the rebuttal and reviews, as well as parts of the paper, carefully.

Overall, while I think the work is interesting, the manuscript is not yet ready for publication at ICML. Some specific reasons and directions for improvement:

  • Baselines. Reviewer KgES raises this point and it is not properly answered in the rebuttal: the work mostly compares to a standard baseline, but not to methods aimed at solving the plasticity-loss problem (e.g. continual backprop, NaP from Lyle et al., etc.). Even if some of these methods are ReLU-specific -- though I would argue that is not exactly true -- a comparison to them applied to a ReLU architecture is still meaningful. Otherwise the work is not well contextualized in existing work.

  • Better understanding of when the method is needed -- with more diverse environments.

The authors have run experiments on MuJoCo where it seems the method does not improve on the baseline, potentially because loss of plasticity is not a big issue there (I think the hypothesis mentioned by the authors is that the network does not need to compress the observation much, since the signal is low-dimensional). Could there be an ablation showing whether this intuition is correct? Or could more complex environments in the continuous-control setting that might suffer from this issue be explored? Showing when the method is useful and when it is not would greatly improve the paper, as would finding other settings beside Atari where the method helps.

  • Better motivation or discussion of why improving tanh-activated models is useful/important. In some sense the proof is in the pudding: the HR-tanh model performs well. However, it would be good to have an intuition for why switching to a continuously differentiable activation is worth exploring (maybe in RL, or maybe in a wider context).

Overall, in its current form I think the paper will have limited impact within the wider ICML community and is not sufficiently well positioned among existing methods aimed at solving this problem. I urge the authors to improve the work along any of the dimensions mentioned and resubmit.