PaperHub
Overall rating: 4.8/10 · Rejected · 4 reviewers (min 3, max 6, std 1.1)
Individual ratings: 3, 6, 5, 5
Confidence: 4.0 · Correctness: 2.3 · Contribution: 2.3 · Presentation: 3.3
ICLR 2025

Hadamard Representations: Augmenting Hyperbolic Tangents in RL

OpenReview · PDF
Submitted: 2024-09-19 · Updated: 2025-02-05

Abstract

Keywords

Representation Learning, Reinforcement Learning, Activation Functions

Reviews and Discussion

Review (Rating: 3)

This work aims to counteract the dead neuron and rank-collapse phenomena in RL. It proposes the 'Hadamard representation': simply duplicating the final layer of the 'Nature DQN' encoder and taking the elementwise product of the two outputs. Empirically, the results show an overall moderate increase in learning speed, together with fewer dead neurons and a higher effective rank, averaged over 8 Atari games.
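For readers unfamiliar with the construction, a minimal sketch of what such a Hadamard layer could look like (a PyTorch-style illustration under our reading of the summary above; module and attribute names are hypothetical and not the authors' code):

```python
import torch
import torch.nn as nn

class HadamardTanhHead(nn.Module):
    """Two parallel linear branches whose tanh outputs are multiplied elementwise."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.branch_a = nn.Linear(in_features, out_features)
        self.branch_b = nn.Linear(in_features, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # z = tanh(A1 x + b1) * tanh(A2 x + b2)
        return torch.tanh(self.branch_a(x)) * torch.tanh(self.branch_b(x))
```

In the Nature DQN encoder, this would replace the single tanh-activated linear layer mapping the flattened convolutional features to the 512-dimensional representation.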

Strengths

  1. Overall, I found the paper well-written and easy to parse through.

  2. Already in the introduction, the paper concisely introduces the problem settings and effectively conveys the tradeoff between ReLU and Tanh with results (fig. 1) and toy examples (fig. 2).

  3. Evaluation is conducted with both canonical off-policy and on-policy architectures/algorithms (Nature DQN and PPO) widely used in the broader literature.

Weaknesses

Major:

  1. The experiments are conducted on a very small fraction of only 8 hand-picked Atari games. While the performance of the proposed method is slightly higher than ReLU on average, this does not hold consistently even across these 8 games (as reported in the Appendix). Moreover, while these environments are notably sensitive to stochasticity, evaluation is carried out with only 5 random seeds. I believe this to be very much insufficient for a paper with a simple architectural contribution that should be thoroughly validated. Thus, I am very much not convinced of the generality of these results, even across the full set of Atari environments, and would strongly encourage the authors to expand this section. In case Atari is too demanding, I would suggest considering other more computationally-friendly benchmarks such as the 16 environments in the highly-parallelizable Procgen benchmark.

  2. Even in the results of this paper, there does not seem to be a clear relationship between dead neurons/effective rank and performance (e.g., ReLU outperforms Tanh while having a higher number of dead neurons). Thus, even analytically, I feel the paper in its current state leaves many unresolved questions and provides very little novel information.

  3. I believe the claims about the probabilities of neurons 'dying', such as: "Taking a product of hyperbolic tangent activated neurons thus reduces the probability of neuron saturation from p to p^2." (ln242-243), are not correct and potentially misleading, as they make independence assumptions that are not even supported by the paper's own results in Appendix C.3.

Minor:

Figures 3 and 5 show 32 very small graphs, each with the value distribution of 16/512 hand-picked neurons at two hand-picked points during training. I did not find that these figures convey much meaningful information and would suggest replacing them with a graph showing the percentage of dead neurons for tanh, ReLU, and Hadamard representations throughout training.

I would suggest adding Subsection numbers.

Questions

I would appreciate it if the authors could address, in their response, the criticism raised above. In addition:

  • In the MLP block of several modern LLMs such as Llama, the hidden representation is computed as (using this paper's notation) f(A_1 x + B_1) * (A_2 x + B_2). According to the paper's own analysis in Section 3, this simpler version should entirely prevent the dead neuron phenomenon with respect to the input activations. I wonder if the authors have considered and/or have tested this version.
Comment

We thank reviewer HMtR for the rigorous review of our paper.

  1. The experiments are conducted on a very small fraction of only 8 hand-picked Atari games. While the performance of the proposed method is slightly higher than ReLU on average, this does not hold consistently even across these 8 games (as reported in the Appendix). Moreover, while these environments are notably sensitive to stochasticity, evaluation is carried out with only 5 random seeds. I believe this to be very much insufficient for a paper with a simple architectural contribution that should be thoroughly validated. Thus, I am very much not convinced of the generality of these results, even across the full set of Atari environments, and would strongly encourage the authors to expand this section. In case Atari is too demanding, I would suggest considering other more computationally-friendly benchmarks such as the 16 environments in the highly-parallelizable Procgen benchmark.

We understand the reviewer’s point about the limited evaluation. Although we do believe that our paper provides insights into the dying neuron phenomenon and an activation function’s correlation with it, we have additionally run a set of experiments on the full Atari suite. We have evaluated the Hadamard representation on a very recent vectorized version of DQN called Parallelizable Q-Network (PQN) [1]. Using PQN, we were able to run the (nearly) full Atari suite of 51 games and compared against tanh and ReLU.

  2. Even in the results of this paper, there does not seem to be a clear relationship between dead neurons/effective rank and performance, (e.g., ReLU outperforms Tanh while having a higher number of dead neurons). Thus, even analytically, I feel the paper in its current state leaves many unresolved questions and provides very little novel information.

We agree with the reviewer’s point that there remains an analytical gap regarding the strong ReLU performance compared to Tanh. We have added a section in the main paper that goes into more depth on this subject. This section is highlighted in red and can be found on page 9. Specifically, we argue that, in contrast to the ReLU, dying hyperbolic tangents generate a bias on the next layer.
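To spell out that added argument in symbols (our restatement of the response above, not text quoted from the paper): if the $i$-th tanh unit has collapsed to a constant $c_i \approx \pm 1$ for every input, the next layer's pre-activation becomes

$$
W h(x) + b \;=\; \sum_{j \neq i} W_{:,j}\, h_j(x) \;+\; \big(c_i\, W_{:,i} + b\big),
$$

i.e. the column $W_{:,i}$ acts as an extra, non-zero bias. A dead ReLU unit instead has $h_i(x) \equiv 0$, so its column simply drops out, which resembles pruning.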

  3. I believe the claims about the probabilities of neurons 'dying', such as: "Taking a product of hyperbolic tangent activated neurons thus reduces the probability of neuron saturation from p to p^2." (ln242-243), are not correct and potentially misleading, as they make independence assumptions that are not even supported by the paper's own results in Appendix C.3.

You are correct. We will state these independence assumptions. We have slightly rewritten the corresponding explanation in the main paper as well as in Appendix C.3.
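For completeness, the intended calculation with the independence assumption made explicit (our paraphrase of the argument, not text from the paper): if each pre-activation saturates with probability $p$ and the two branches saturate independently, then

$$
P\big(\,|\tanh(a_1)\tanh(a_2)| \approx 1\,\big) \;=\; P\big(|\tanh(a_1)| \approx 1\big)\, P\big(|\tanh(a_2)| \approx 1\big) \;=\; p^2,
$$

since the product can only saturate when both factors do; positively correlated branches would push this probability back up toward $p$.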

  4. In the MLP block of several modern LLMs such as Llama, the hidden representation is computed as (using this paper's notation) f(A_1 x + B_1) * (A_2 x + B_2). According to the paper's own analysis in Section 3, this simpler version should entirely prevent the dead neuron phenomenon with respect to the input activations. I wonder if the authors have considered and/or have tested this version.

We were not aware of this version, as we had only seen these multiplicative interactions before in the Highway Network paper [3]. We have now added a reference to the Gated Linear Unit [4] in our algorithm section, as we found that the function you proposed is derived from [4] (and our function is similar to it). Although we think this could also be a good activation, we do not believe that this version would completely prevent the dead neuron phenomenon. The f in f(A_1 x + B_1) * (A_2 x + B_2) represents the Swish function. The Swish also saturates to 0 for decreasing inputs, meaning that the final product will remain 0 if the Swish-activated representation saturates to 0.
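As a small numerical illustration of this saturation argument (assuming the Swish/SiLU gate used in LLaMA-style gated MLPs; this is our sketch, not an experiment from the paper):

```python
import torch
import torch.nn.functional as F

pre_gate = torch.tensor([-20.0, -5.0, 0.0, 5.0])
gate = F.silu(pre_gate)                 # Swish/SiLU: x * sigmoid(x), tends to 0 for very negative inputs
value = torch.full_like(pre_gate, 3.0)  # arbitrary output of the ungated linear branch

print(gate)          # roughly [-0.0000, -0.0335, 0.0000, 4.9665]
print(gate * value)  # the product stays pinned near 0 wherever the gate has saturated
```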

5 (minor): To allow readers to visualize the problems with activations in hidden layers, we respectfully chose to keep Figures 3 and 5 intact. In the experiment section, we already have quantitative dying-neuron plots.

We hope to have addressed most of the reviewer's issues with the old manuscript. We kindly request that you review the updated manuscript in light of these clarifications and revisions, and update the score accordingly. If there are any more clarifications needed, please do not hesitate to reply.

[1] Gallici et al. Simplifying Deep Temporal Difference Learning. arXiv 2024

[2] Delfosse et al. Adaptive Rational Activations to Boost Deep Reinforcement Learning. ICLR 2024

[3] Srivastava et al. Highway Networks. 2015

[4] Dauphin et al. Language modeling with gated convolutional networks. PMLR 2017

Comment

Thanks for your response and for running some additional experiments.

However, I do not find my initial criticism to have been fully addressed:

  1. In the caption of Figure 10, it reads that the Hadamard representations (in contrast to the other results) are being applied at every layer of the network:
  • I would really appreciate it if the authors would clarify why the implementation diverges from their previous applications.
  • Most importantly (please, correct me if I happen to be wrong in my understanding): doesn't this imply the number of parameters is doubled for each layer? This seems like a major limitation that is never discussed and that makes the PQN comparison unfair.
  2. The new PQN results also lack clear details. The authors mention they do not evaluate Atari's hard-exploration games, yet I could not find which games exactly these are. Moreover, only the median results are reported over 5 seeds, without per-task results.

  3. As the authors seemed to have agreed that the independence assumptions about neurons dying contradict the paper's own empirical results, I do not think adding a single sentence "Lastly, we make an independence assumption between two individual neurons." makes the paper's analysis any more valid.

Once again, I would like to state that I appreciate the research direction and its potential value. However, it seems to me that the paper in its current form is still not sufficient, especially from an empirical perspective. Furthermore, much information and many limitations seemed not to be clearly stated in the text (something which I found to be worsened in the current revision). For these reasons, I will not be modifying my score at this moment.

Comment
  • Like in Gulcehre et al. 2022, we originally focused on the final hidden layer in all architectures. We argued that this was the most important representation, and by focusing on this layer we could do a more concise qualitative analysis like in Gulcehre et al. 2022. However, as per the requests of reviewers Vagz and 3nuW, we decided to additionally run tests using a Hadamard Representation in all hidden (CNN + MLP) layers. This is what we did for 51 Atari games using the PQN algorithm. In this evaluation, we see that using a Hadamard Representation in every layer significantly increased performance over the baseline and over CReLU.

We have now updated the manuscript to describe this better on page 6, line 276: "In line with [Gulcehre et al. 2021], for our qualitative experiments, a Hadamard Representation is only applied to the last hidden layer $z_{t} \in \mathbb{R}^{512}$." And further on line 282: "Finally, the PQN experiments are conducted with an HR on all hidden layers, showing the Median Human-Normalized scores on 51 Atari games." Interestingly, changing from normal hyperbolic tangents to Hadamard hyperbolic tangents more than doubles the performance in PQN.

  • Thank you for pointing this out. Yes, using a full Hadamard representation nearly doubles the incoming (not outgoing) parameters of the hidden layer. To make this clearer, we have now added this prominently in our Limitations section in line 518: "Using a Hadamard representation will double the incoming weights connected to a hidden layer. However, in Fig. 10, a comparison against the CReLU is shown, which equally doubles the network's parameters. Furthermore, recent work shows that scaling baselines in Atari often leads to reduced performance (Obando-Ceron et al., 2024a; Obando-Ceron et al., 2024b)."

To add to this, we notice relatively similar training times for Hadamard Representations as for normal representations. In contrast, the Rational activation baseline takes around 30% longer to train due to the additional learnable component, while using fewer parameters than the Hadamard Representation. Furthermore, we have now compared PQN-Baseline with PQN-CReLU and PQN-Hadamard, where PQN-CReLU has 3,366,886 parameters, PQN-Hadamard has 3,376,070 parameters and PQN-Baseline has 1,688,550 parameters. PQN-Hadamard outperforms both!
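To make the parameter accounting concrete, a small sketch (layer sizes are illustrative Nature-DQN-style values we assume here; the exact counts reported in this discussion differ slightly from this back-of-the-envelope version):

```python
def linear_params(n_in: int, n_out: int, hadamard: bool = False) -> int:
    """Count weights + biases of one fully connected layer.

    A Hadamard layer uses two weight matrices (and two biases) feeding the same
    n_out units, so its incoming parameters roughly double, while the number of
    connections leaving the layer stays the same.
    """
    branches = 2 if hadamard else 1
    return branches * (n_in * n_out + n_out)

print(linear_params(3136, 512))                 # baseline MLP head:  1,606,144
print(linear_params(3136, 512, hadamard=True))  # Hadamard MLP head:  3,212,288
```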

  2. The environments we are not using are: MontezumaRevenge-v5, Pitfall-v5, PrivateEye-v5, Skiing-v5, Solaris-v5 and Pooyan-v5. These were obtained by simply taking the full Atari suite (57 games) minus the hard-exploration games. As you pointed out, we had forgotten to add the plots of the 51 individual games, which we have now done in Appendix D.2. The per-game analysis also shows convincing outperformance of Hadamard Representations over CReLU and the baseline. As per the seed comment here and in your original review: we understand that 5 seeds might be relatively low in general Deep Learning, but we are no different from most of the top-tier RL papers working in Atari. For instance: CReLU uses 5 seeds, Munchausen-DQN uses 3 seeds, Rational Activations uses 5 seeds, ReDo uses 5 seeds, InFeR uses 3 seeds. Does this alleviate your concerns?

  3. We apologize, but we do not understand what you mean by a contradiction with the empirical results. Our proposed dying-neuron probability differences using a Hadamard representation (Table 1) strongly correlate with the empirical results (Table 2). If possible, could you please explain more clearly how this analysis is invalid, so that we can use your feedback to improve our reasoning?

Due to concerns about limited evaluations in the initial feedback of multiple reviewers, we have added an additional evaluation on 51 Atari games. To summarize, we now have empirical results on (simple) supervised learning, DQN on 8 Atari games, PPO on 8 Atari games and PQN on 51 Atari games, where we significantly outperform CReLU. We would be happy to know whether this empirical evaluation still seems limited, or what would be needed to further convince you?

Comment
  1. The methodology proposed in the original paper was to add Hadamard representations to a single layer. As compared to standard architectures (i.e., ReLU), this seemed to have some performance gains and promise in a small set of handpicked Atari tasks. As detailed in my review, the limited evaluation made me question the method's actual generality across other tasks. The newly added results are, instead, doing something quite different from a methodology perspective: doubling the number of parameters in each layer (something that was never explicitly stated in the previous rebuttal revision). Now, there are so many potential confounding factors that have been added (e.g., faster learning due to more parameters) and downsides (doubling inference time and memory) that it is quite difficult for me to draw any conclusive evaluation of the proposed methodology from these new results. I believe it would have been much more logical for the authors to collect/report the results for their much cheaper original architecture, which I hope would have still shown some improvements.

  2. I am fully aware that there exist some other papers that also evaluate very few seeds (even though these papers, such as Munchausen RL, focus on less stochastic problem settings after much more expensive training runs of hundreds of millions of steps). I do not think that the fact this paper is not the only one with such a limitation invalidates my concerns.

  3. Table 2 in your paper shows the actual fraction of dead neurons. The relationship seems to me far from quadratic.

Comment

Thank you for your reply.

  1. We are not doing something different from a methodology perspective. Instead of using our technique on 1/4 layers in DQN and PPO, we have now also tested it on 4/4 layers in PQN. Adding Hadamards to the CNN activation adds negligible parameters, as seen in the difference between row 1 and row 2 of the table below (we can add this table to the paper, if you'd like). Furthermore, as we have stated before, CReLU also doubles the parameters but reduces performance in PQN compared to the baseline. We believe there is too strong a focus on the number of parameters we use in our research, as we compare with equal-parameter baselines and have repeatedly stated that training speed is relatively unaffected.
| Architecture | Convolutional Parameters | MLP Parameters | Output Layer Parameters | Total Parameters |
| --- | --- | --- | --- | --- |
| Conv tanh_HR + MLP tanh_HR | 155,968 | 3,214,336 | 3,078 | 3,373,382 |
| Conv ReLU + MLP tanh_HR | 77,984 | 3,214,336 | 3,078 | 3,295,398 |
| Conv CReLU + MLP CReLU | 77,984 | 3,211,776 | 3,078 | 3,292,838 |
| Conv ReLU + MLP ReLU (Baseline) | 77,984 | 1,607,168 | 3,078 | 1,688,230 |

Also, the claim of confounding factors due to more parameters is something we thought would be important to clarify for reviewers, which is why our original manuscript contained Fig. 7b and Appendix C.2, and we recently extended these ablations with different learning rates. This shows that simply scaling up parameters helps only slightly when the learning rate $\alpha$ is explicitly re-tuned, and decreases performance when keeping the same learning rate. Note that we do not change any hyperparameters for Hadamard Representations. We would appreciate it if the Reviewer could comment on this.

  2. You are claiming that a lot of the top-conference RL algorithms provide limited evaluations on Atari. Although this is up for discussion, we do not believe we should be judged for following the evaluation methods used in these papers. Furthermore, you claim that training for more frames in M-DQN reduces evaluation stochasticity. Could the reviewer explain this claim more clearly?

  3. Thank you for pointing this out. This is why we use the term "correlate" in line 308 of our paper. A clear correlation, with (1) fewer dying neurons for tanh, (2) more dying neurons for ReLU and (3) equal dying neurons for Sigmoid, is, in our opinion, a fair empirical result consistent with our probability analysis. If you'd like, we could more prominently explain the difference between our probability analysis and the empirical results in our paper.

Review (Rating: 6)

The paper investigates the role of activation functions on performance in reinforcement learning. Continuously differentiable activation functions, such as Tanh, can have advantages, including bounded output and smooth gradients. In practice, however, ReLU activations are preferred. The authors investigate the reason and identify that the number of tanh activations that saturate for all inputs increases throughout training. This decreases the network’s representation power (measured by the effective rank). To resolve this problem, the authors introduce Hadamard representations (HR) with Tanh. The authors show that HR with Tanh experiences less saturation and mitigates the loss in effective rank, which is also reflected in performance. Through a series of experiments, the paper demonstrates the effectiveness of HR with Tanh.

Strengths

This paper presents a novel approach to have more effective representations for reinforcement learning. Based on insights from previous works on dying/dormant neurons and their relationship to reduced network capacity, the authors present the Hadamard representations that experience less dormancy and thereby maintain the network representational capacity. I think this research is fundamental and important, and the proposed method is very simple to use and integrate into a wide range of methods, further enhancing its potential impact.

Weaknesses

  • Some definition inaccuracies:
    • The authors claim that the dying neuron problem is not exclusive to ReLU but also extends to Tanh and provide a definition where the activation output $|\alpha_i| \approx 1$. This definition is problematic for two reasons: 1) Intuitively, the neuron is not ‘dead’ if its output is not zero, and 2) the definition doesn’t work for ReLU. Instead, I suggest making this definition about saturating neurons, which would only work for tanh and sigmoid. The figures measuring the fraction of dead units (e.g., Figure 6a) need to reflect the fraction of dead or saturated units instead.
    • The authors mentioned that the definition of a dead/saturating neuron applies when we test the neuron in all data in the buffer. Is this quantity actually being measured in the experiments? I am asking this because going through all the data in the buffer at each evaluation time seems very computationally expensive, so I was wondering if some approximation is made in measuring this quantity in the experiments section (e.g., sampling a fixed small number of samples from the buffer instead).
  • Relatively limited empirical evaluation:
    • The empirical evaluation could be more substantial. The authors used 8 Atari environments with only five independent runs (with some overlapping error bars). I encourage the authors to use a larger number of independent runs (e.g., 10) to have more statistically significant results. Adding more environments would be great, although I recognize the hurdle of running extra Atari games.
    • More relevant activation function baselines are needed. Specifically, concatenated ReLU has been shown to improve plasticity in reinforcement learning [2], in addition to adaptive rational activations [1]. I think adding these two baselines is very relevant to understanding the effectiveness of HR Tanh against other recently investigated activation functions.

Minor issues:

  • The discussion in lines 207-212 needs revising.
  • In line 103, $r_t$ should be $r_{t+1}$.
  • In line 205, $h^\prime(x)$ should be $z^\prime(x)$.
  • “Stronly” in line 357 is a typo

   

Overall, I would like this paper to be accepted. There are a few fixable issues, so I’m willing to increase my score if the authors address those issues based on my feedback in a paper revision.

   

References:

[1] Delfosse, Q., Schramowski, P., Mundt, M., Molina, A., & Kersting, K. (2024). Adaptive Rational Activations to Boost Deep Reinforcement Learning. International Conference on Learning Representations.

[2] Abbas, Z., Zhao, R., Modayil, J., White, A., & Machado, M. C. (2023). Loss of plasticity in continual deep reinforcement learning. Conference on Lifelong Learning Agents (pp. 620-636).

Questions

  • The authors mentioned in line 367 that they considered an addition operation instead of the Hadamard product and showed it performs poorly. No reason or postulate was given for why this is the case. Can the authors provide a probability analysis similar to the one given on page 5?
  • If the score is normalized with respect to the ReLU activation baseline score, why doesn’t the final ReLU baseline reach 1 at the end of training in Figure 7a and Figure 9a but slightly less than 1?
  • Why did the authors use HR Tanh only for the last layer? What happens if you use it in all layers? The authors did mention that the success might not translate to convolutional layers. Can the authors present an experiment showing that and suggest what the reason might be? Additionally, why wasn't HR Tanh applied for the last two fully connected layers in the architecture?
  • In line 13, “linear-unit functions” should be “piece-wise linear functions”
  • I don’t see the reason for the paragraph starting at line 467. It is not relevant to the conclusion of the paper, and hence it should be removed.
  • In some environments, such as Seaquest and SpaceInvaders, HR hurts performance compared to ReLU. Is there a reason why?
Comment

We thank reviewer Vagz for the review, and the positive note at the end. We will reply to all the weaknesses and questions the reviewer has noted.

  1. "The authors claim that the dying neuron problem is not exclusive to ReLU but also extends to Tanh and provide a definition where the activation output . - - (Omitted for response length)"

You are right about the definition not working for ReLU. We have changed the definition and the explanation for it in line 137. However, what do you mean by stating that, intuitively, the neuron is not 'dead' if its output is not zero? We think of dying neurons as having no variation with the input observation, meaning they are basically collapsed. We feel that a dead ReLU is a saturated ReLU, just like a dead Tanh is a saturated Tanh (only into one end, so always only 1 or only -1). Please let us know if you have any other thoughts on this.

  1. "The authors mentioned that the definition of a dead/saturating neuron applies when we test the neuron in all data in the buffer. Is this quantity actually being measured in the experiments? - “” “”."

Thanks for the observation; we indeed forgot to mention our approximation of this. We sample a batch from the buffer to calculate the dead neurons. We have added this explanation in line 138 of the revised manuscript.
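A minimal sketch of this kind of batch-based approximation (the buffer API, thresholds, and names are our assumptions for illustration; the criterion actually used for tanh units, a KDE spike, is described later in this discussion):

```python
import torch

@torch.no_grad()
def dead_fraction(encoder, replay_buffer, batch_size: int = 256, eps: float = 1e-3) -> float:
    """Approximate the fraction of dead/saturated units on one sampled batch."""
    obs = replay_buffer.sample(batch_size)           # hypothetical buffer interface
    h = encoder(obs)                                 # activations, shape [batch, n_units]
    relu_dead = (h.abs() < eps).all(dim=0)           # ReLU: output stuck at 0 on the whole batch
    tanh_saturated = (h.abs() > 1 - eps).all(dim=0)  # tanh: output stuck near +/-1 on the whole batch
    return (relu_dead | tanh_saturated).float().mean().item()
```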

  1. "The empirical evaluation could be more substantial. - "" """

As this was an overlapping theme across reviewers, we have added experiments using a parallelizable version of DQN called PQN [1] on the (nearly) full Atari suite of 51 games. We report the median Human-normalized scores on page 9 of the new manuscript.

  1. "More relevant activation function baselines are needed. - "" """

We have run additional experiments on our 8-environment DQN baseline, where we compared against the recent novel Rational [2] activation over 40M frames. We have updated Figure 7a and Fig. 19 accordingly. As for the concatenated ReLU [3], this paper itself does not report any performance improvement over the non-environment-resetting baseline (see Fig. 6 of [3]), which is why we have added the paper to related work but have not compared against it. We hope this clarifies things.

Questions

1. "The authors mentioned in line 367 that they considered addition operation instead of Hadamard product and showed it performs poorly. There was no reason or postulate for why this is the case. Can the author provide a probability analysis similar to the one given on page 5?"

We believe that multiplicative representations also have different function approximation properties compared to additive ones. As for the dying-neuron analysis, this should give the same result as for a product of hyperbolic tangents, i.e., the final representation is only dead if both neurons are dead. However, due to the poor performance of these additive representations, we have not looked into their exact functioning in more depth.

2. "*If the score is normalized with respect to the ReLU activation baseline score, why doesn’t the final ReLU baseline reach 1 at the end of training in Figure 7a and Figure 9a but slightly less than 1?"

If you look closely, the top of the ReLU curve's uncertainty band should hit 1. We normalize with respect to the highest and lowest score achieved by the ReLU per environment.
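One plausible reading of this per-environment normalization (our formula, inferred from the description above rather than copied from the paper): for environment $e$ and score $s$,

$$
\tilde{s}_e \;=\; \frac{s - \min(\text{ReLU}_e)}{\max(\text{ReLU}_e) - \min(\text{ReLU}_e)},
$$

so while the best ReLU score in each environment maps to 1, the mean over seeds and environments can sit slightly below it.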

3. "* Why did the authors use HR Tanh only for the last layer?*"

Preliminary results showed a strong preference of the CNNs for ReLU. However, for more clarity, we are now running additional experiments showing tanh and Hadamard representations on the CNNs as well. We hope to add them to the Appendix at the beginning of next week. As to the second part of your question, the architecture only has 1 hidden linear layer. Or do you mean the Q-values?

4. - Thanks for noticing, we changed this throughout the manuscript.

5. - Agreed. We have removed this paragraph and slightly rewritten this section.

6. "In some environments, such as Seaquest and SpaceInvaders, HR hurts performance compared to ReLU. Is there a reason for why"

Possibly due to a low-frequency/simplicity bias for these environments, but we still see substantial improvement over the tanh. However, we are hesitant to draw causal conclusions for performance differences per environment.

  • Note that due to response length limitations, we have visually shortened some of your questions.

We hope to have alleviated the reviewer's concerns for the paper. In light of our response and the revised manuscript, we kindly ask the reviewer to reconsider the given score.

[1] Gallici et al. Simplifying Deep Temporal Difference Learning. arXiv 2024

[2] Delfosse et al. Adaptive Rational Activations to Boost Deep Reinforcement Learning. ICLR 2024

[3] Abbas et al. Loss of Plasticity in Continual Deep Reinforcement Learning. CoLLAs 2023

Comment

I would like to thank the authors for their detailed response. Here is my response:

Weaknesses:

  1. My issue is that the authors consider saturated and dead units to be the same, while the former may give a large output (e.g., 1) with zero slope and the latter gives zero output with zero slope. They are not the same. I suggest the authors say something like “can be classified as saturated or dead if...”. Thus, we cannot speak of a saturated ReLU.
  2. Thank you for explaining this and adding the information in the paper.
  3. Adding results on more Atari environments is highly appreciated. This increases my confidence in your approach.
  4. Thank you for comparing against rational activation functions. I think comparing against concatenated ReLU needs to be added as well, since it shares similarity with your Hadamard Tanh. Concatenated ReLU has been shown to improve plasticity, which reflects on the performance of reinforcement learning agents. Please refer to the rational activation paper for comparisons in the single-task RL domain. The authors can compare against Concatenated ReLU and/or PeLU to show the effectiveness of their Hadamard Tanh.

Questions:

  1. I thank the authors for running the additional experiment. Please let me know when the results are ready.
Comment

We thank the reviewer for the quick response.

1. As this is only a minor change to the paper, we have no problem with the recommended phrasing and have changed it accordingly in line 133.

4. We are now running tests comparing against Concatenated ReLU on the 51-game Atari domain for PQN. We will update the new Fig. 10 tomorrow with the CReLU added.

Furthermore, as reviewer 3nuW had also recommended, we have run additional evaluations in PQN using Tanh Hadamard on all the hidden layers in the network. Interestingly, we have noticed a significant increase over the baseline PQN in the Median Human-Normalized scores over 51 Atari environments. The updated figure can be seen in Fig. 10 on page 9. (Note that, as said before, we will add the CReLU scores tomorrow.)

Finally, we have now added a figure in Appendix C.5 showing what happens when, in DQN on 8 Atari environments, the baseline ReLU is combined with normal hyperbolic tangent CNN activations.

Comment

We have now updated the PQN experiments to include the CReLU. Over 51 Atari games, our plots show significant improvement over both the baseline and the CReLU. We have also added the individual game scores in Appendix D.2. Interestingly, switching from hyperbolic tangents to Hadamard hyperbolic tangents more than doubles the score in PQN.

Comment

Thank you for your efforts to address my concerns. I raised my score accordingly.

Review (Rating: 5)

The paper proposes a novel activation function based on Hadamard representations to address the limitations of traditional activation functions in reinforcement learning. It highlights issues like vanishing gradients and dying neurons, demonstrating that the proposed method improves learning efficiency and representation capacity in deep Q-networks and proximal policy optimization.

Strengths

  1. The proposed Hadamard representation effectively addresses the dead neuron issue that arises with existing activation functions, particularly in reinforcement learning contexts.
  2. The experimental results show clear improvements in learning speed and performance metrics, providing solid empirical evidence for the claims made.
  3. The paper presents a well-defined problem and clearly articulates its contributions, making the paper easy to understand.

Weaknesses

  1. The lack of a diverse range of network architectures tested limits the generalizability of the findings. The paper should test HR on various networks and test HR on different layers. It would be better if HR were a universal activation function suitable for all layers, rather than only suitable for the last layer.
  2. Absence of detailed comparisons with other novel activation functions that have emerged recently.
  3. The paper did not conduct tests on diverse tasks, e.g., MuJoCo or DMControl. Using only Atari does not demonstrate that this is a universal issue in reinforcement learning.
  4. As mentioned in Line 279, the experiments are all performed on 8 Atari games. Testing on only 8 Atari games does not sufficiently demonstrate a stable improvement in performance.
  5. In the PREVENTING DYING NEURONS section in Line 191, the analysis is not related to reinforcement learning tasks at all. So, claiming that HR is an activation function particularly suitable for reinforcement learning feels somewhat forced.

Questions

  1. How do Hadamard representations compare with other recent activation functions not covered in the paper?
  2. What potential challenges might arise when applying this method to different neural network architectures?
  3. What is the result if HR is tested on various tasks, e.g. MuJoCo, DMControl, and more tasks on Atari?
Comment

We thank reviewer 3nuW for the review of our paper.

  1. The lack of a diverse range of network architectures tested limits the generalizability of the findings. The paper should test HR on various networks and test HR on different layers. It would be better if HR were a universal activation function suitable for all layers, rather than only suitable for the last layer.

In the Limitations section on line 533, we have explained that preliminary experiments showed a very strong preference for ReLU activations in the convolutional section of our network. As the networks used for most Pixel-based environments only have 1 hidden linear layer, this is the reason for our choice of this final hidden layer. Furthermore, we believe that this is the most important hidden layer as the Q-values or policies are usually linearly extracted from here. However, we agree with the reviewer that it would make sense to test the Hadamard representation on more layers. We are running additional experiments using a Hadamard representation on our convolutional hidden layers as well. We hope to add this to our Appendix at the beginning of next week.

  2. Absence of detailed comparisons with other novel activation functions that have emerged recently.

We have conducted experiments on our original 8 Atari games comparing the Hadamard representation to the novel Rational [1] activation function. We have updated Figure 7a in the main paper comparing against this activation. Furthermore, we have updated figure 19 in Appendix D.2 to show the individual runs of the Rational activation.

  3. The paper did not conduct tests on diverse tasks, e.g., MuJoCo or DMControl. Using only Atari does not demonstrate that this is a universal issue in reinforcement learning.

Preliminary experiments have tested our hypothesis on a pixel maze environment, after which we experimented in the Atari domain. We respectfully feel that state-based continuous environments that do not support DQN are out of the scope of our paper. However, to make our evaluation more complete, we have done additional experiments using PQN [2] on the (nearly) full Atari suite of 51 games. The median Human-Normalized scores can be found in our updated manuscript in Figure 10.

  1. "As mentioned in Line 279, the experiments are all performed on 8 Atari games. Testing on only 8 Atari games does not sufficiently demonstrate a stable improvement in performance."

See our response to 3.

  1. "In the PREVENTING DYING NEURONS section in Line 191, the analysis is not related to reinforcement learning task at all. So, claiming that HR is an activation function particularly suitable for reinforcement learning feels somewhat forced."

We agree that this might not have been evident from the old manuscript. We have now added a section in our introduction which explicitly states that the dying neuron phenomenon is stronger in RL compared to supervised learning. We leave further tests in supervised learning for future work. However, thanks to reviewer HMtR, we have noticed that modern LLMs also use (half-linear) multiplicative representations. We have cited the paper using these architectures [3] in the algorithm section.

Questions:

1. See our answer to Weakness 2.

2. In our opinion, the biggest challenge that arises might be finding the right combination of activation functions to use in a Hadamard representation. As seen in related work on LLMs [4], there are many possible combinations. Furthermore, as our paper shows, the ReLU does not gain the same benefits as the hyperbolic tangent. As most algorithms use ReLUs in their network architecture, changing all these ReLUs to Hadamard hyperbolic tangents fundamentally changes the representation layers and might require different hyperparameter settings (learning rate, network width).

3. See our answer to Weakness 3.

We hope to have addressed most of the reviewer's concerns. We kindly ask the reviewer to review the updated manuscript in light of our revisions, and reconsider the assigned score. If there are any more questions, please do not hesitate to send another message.

[1] Delfosse et al. Adaptive Rational Activations to Boost Deep Reinforcement Learning. ICLR 2024

[2] Gallici et al. Simplifying Deep Temporal Difference Learning. arXiv 2024

[3] Dauphin et al. Language modeling with gated convolutional networks. PMLR 2017

[4] Noam Shazeer. GLU Variants Improve Transformer. 2020

Comment

Dear reviewer 3nuW,

Following your (and other reviewers') comments, we want to remind you that we have added the Rational activation to our DQN experiments, and the CReLU to our 51-game Atari PQN experiments. In PQN, we show significant improvement over the CReLU, as well as over the recently released baseline. Interestingly, switching from normal hyperbolic tangents to Hadamard hyperbolic tangents more than doubles the performance in PQN!

There is also a section on page 9 explaining the difference between the effects of dying ReLUs and dying tanhs on the next layer in a network, providing insights into the poor performance of hyperbolic tangents.

Next to this, many minor changes have been made and highlighted in red. We would kindly ask you to review the paper in light of the current changes, and reconsider the assigned score.

Review (Rating: 5)

This paper studies the dying neuron problem with tanh-type activations. The paper's first result is that bounded activation functions like tanh perform worse than ReLU-type activations. It is argued that the poor performance is probably due to the vanishing-gradient/dying-neuron problem faced by the tanh-type activations. The paper then proposes to solve the dying-neuron problem using Hadamard representations. Finally, it is shown that Hadamard representations outperform the standard DQN architecture.

Strengths

The paper is generally well-written. The proposed solution is simple to use and novel. The proposed method is shown to be more effective than standard DQN.

Weaknesses

The biggest issue with the paper is the weak empirical evaluation—particularly the following two points.

  • The number of hidden units in Hadamard Representations is twice that of the base system. However, the comparison with the base system with 1024 hidden units is not done correctly. The base system with 1024 hidden units is not tuned. At least the learning rate of the base system should be tuned. I suspect that something like 5e-5 could be a good value for the learning rate for the base system with twice the number of hidden units. Using default hyper-parameters is insufficient when the architecture is changed by making it twice as wide. The paper has to convincingly show that Hadamard representations provide a benefit over a well-tuned base system with twice as many hidden units.
  • Missing baseline. The paper does not compare Hadamard representations to resetting baselines like ReDo [1] and continual backprop [2]. Why not just reset the units that have saturated in the tanh network?

[1] Sokar et al. The dormant neuron phenomenon in deep reinforcement learning. ICML 2023

[2] Dohare et al. Loss of plasticity in deep continual learning. Nature 2024

Questions

See Weaknesses

Comment

We thank reviewer S1DP for the review of our paper. We will reply to both weaknesses raised by the reviewer.

  1. The paper has to convincingly show that Hadamard representations provide a benefit over a well-tuned base system with twice as many hidden units.

To accommodate the reviewer's valid concerns, we have tested the 1024-dimensional latent setup for three different learning rates and added these results to Appendix C.2 in our revised manuscript. Over the three learning rates (1e-5, 5e-5, 1e-4) with a 1024-dimensional latent state, performance first increases when decreasing the learning rate but eventually becomes worse again. Our Hadamard representation still proves to be stronger.

  2. Missing baseline. The paper does not compare Hadamard representations to resetting baselines like ReDo [1] and continual backprop [2]. Why not just reset the units that have saturated in the tanh network?

We understand the reviewer's concern about missing baselines. However, a deeper dive into ReDo [1, Fig. 10] shows that ReDo does not really improve DQN performance in the normal setting, but only when the algorithm is changed by increasing the gradient steps per frame from 0.25 to 1. Further work in [3, Fig. 12] also shows that ReDo (resetting) can even harm performance. For this reason, although ReDo remains interesting related work, we did not implement their algorithm as a baseline but use their work for fundamental insights.

As for [2], we thank the reviewer for this addition, which we will add to our related work section. However, as this paper was published 7 days before the ICLR 2025 submission deadline, we were unfortunately not aware of it and did not compare with it. To accommodate the reviewer and compare against a suitable baseline, we have run experiments comparing our Hadamard Representation to the novel ICLR 2024 learnable Rational Activation [3] and have added this comparison to our main paper.

We kindly request that you review the updated manuscript in light of these clarifications and revisions and reconsider your score.

[1] Sokar et al. The Dormant Neuron Phenomenon in Deep Reinforcement Learning. ICML 2023

[2] Dohare et al. Loss of plasticity in deep continual learning. Nature 2024

[3] Delfosse et al. Adaptive Rational Activations to Boost Deep Reinforcement Learning. ICLR 2024

Comment

Thank you for your response. I appreciate the work you've put in the rebuttal.

As you find a different learning rate performs better, please use that in Figure 7 and anywhere else you're showing results for "Latent Dim 1024". I've increased my score based on this new result.

The argument that "ReDo does not really improve DQN performance in the normal setting" is not convincing. The reason ReDo doesn't help could be that there are not too many dormant units in the normal setting. But in your case, you see an improvement in performance when the number of dead tanh units is reduced. I suspect a resetting baseline like ReDo will improve performance in this case.

In Figure 6, when do you consider a tanh or sigmoid unit to be dead?

Comment

Thank you for your response.

  • We have now changed Fig. 7b in the main text to incorporate the 1024-dimensional latent with the lower (5e-5) learning rate.

  • We see your point; resetting values for tanhs might indeed have a more profound instant effect, as the difference between the initialization value and the saturation value is generally larger than for a ReLU. In the short term, we are willing to add some additional experiments to the paper incorporating this. For now, however, we have added CReLU as a baseline for the new 51-game Atari PQN experiments on page 8. (We could not find a JAX-based implementation of Rational activations in order to add it to the PQN experiments.)

  • In Fig. 6, we define a tanh or sigmoid neuron as dead (or collapsed/saturated) when the neuron's KDE exhibits a spike larger than 20, as can be seen in Fig. 3b. We further explain this in Appendix B.1.
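As an illustration of how such a KDE-spike criterion could be implemented (our sketch under the description above; the threshold of 20 is taken from the comment, everything else is an assumption):

```python
import numpy as np
from scipy.stats import gaussian_kde

def is_dead_kde(unit_activations, spike_threshold: float = 20.0) -> bool:
    """Flag one unit as dead/collapsed if the KDE of its activations over a
    batch spikes above the threshold, i.e. the values pile up on one point."""
    acts = np.asarray(unit_activations, dtype=np.float64)
    if np.allclose(acts, acts[0]):  # constant output: KDE is degenerate, clearly collapsed
        return True
    density = gaussian_kde(acts)
    grid = np.linspace(acts.min() - 0.05, acts.max() + 0.05, 512)
    return bool(density(grid).max() > spike_threshold)
```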

Based on all the reviewers' comments, we have made quite a few changes since the original manuscript and would kindly ask you to review them. To summarize, the biggest changes are:

  • Added Rational Activations to the DQN experiments.
  • Added PQN experiments on 51 games, testing Hadamard representations on all layers and significantly outperforming CReLU. Interestingly, moving from a hyperbolic tangent to a Hadamard hyperbolic tangent in PQN more than doubles performance.
  • Added a large section on page 9, stating the effects of dead ReLUs and dead tanhs on the subsequent layer, as an insight into each activation's performance correlation with dying neurons.
Comment

Dear Reviewer S1DP,

As the review period is coming to an end, we want to summarize the changes we have made:

  • Added the Rational activation to our DQN experiments on 8 Atari Games.
  • Extended our evaluation by adding PQN, and testing on 51 Atari Games for 5 seeds (Median Human-Normalized scores are available in Fig. 10, Individual Game scores can be seen in Appendix D.2)
  • Compared against the CReLU activation on the 51 Atari Games.
  • Added a theorem on page 9 to show the difference of effects on the next layer between dying hyperbolic tangents and dying ReLUs.
  • Minor changes throughout the paper based on all the reviewers' suggestions. Changes can be seen in red.

We would kindly ask you to review the final manuscript based on the changes made.

Comment

Thank you for your response. I appreciate the changes made in the manuscript, they certainly make the paper stronger.

However, a missing comparison with a selective reinitialization baseline is a critical flaw. This paper shows that the dying neuron problem exists in RL even with tanh activation and proposes a solution. However, there is prior work that has shown the dying neuron problem in RL (Sokar et al.), and they've proposed a solution based on selective reinitialization. For the paper to be accepted, it must include a comparison with the previously proposed solution.

Comment

Thank you for appreciating our changes. Please consider that the Hadamard Representation (HR) also has function approximation benefits (Fig. 2), to which we credit a part of the performance increase. We therefore understand the concern, but believe we are already making a substantial contribution to the field without the specific ReDo baseline. Coming back to your point on using ReDo for tanh: we think that ReDo and the HR might even be mutually beneficial in the case of tanh, as Fig. 6 shows that using HR does not eliminate all dying hyperbolic tangents. As said before, we want to run the additional ReDo experiments in the short term for the camera-ready version.

Comment

We thank all reviewers for their reviews. Based on your reviews, we have run additional experiments and made significant changes (highlighted in red) to our manuscript. The most important changes are summarized here:

To address concerns regarding comparison with a valid baseline, we have added a comparison with the ICLR 2024 Rational [1] activation function in our original 8-environment DQN setting. This activation function has been added to the main plots in the Experiments section, as well as the per-game scores in Appendix D.2.

To address multiple reviewers’ concerns about the lack of Atari environments, we have run the (nearly) full Atari suite of 51 games for 5 seeds on a new parallelizable version of DQN called “PQN” [2]. Here, we compare with ReLU, tanh, and the tanh Hadamard. As per standard practice, we have added the median-human normalized scores to our experiment section, showing consistent results across the nearly full Atari suite.

Finally, as reviewer HMtR pointed out, there seems to be a slight gap in understanding why the ReLU might be less affected by performance issues correlating with dead neurons, as opposed to the hyperbolic tangent. We have added an additional section in the main paper on page 9 to give insights into this problem. Specifically, we show that dead hyperbolic tangents transform weights into biases, while dead ReLUs resemble network pruning.

We will additionally reply to each reviewer individually.

[1] Delfosse et al. Adaptive Rational Activations to Boost Deep Reinforcement Learning. ICLR 2024

[2] Gallici et al. Simplifying Deep Temporal Difference Learning. arXiv 2024

Comment

As per the comments of Reviewers 3nuW and Vagz, we have run additional experiments on the 51-game Atari suite for PQN, using both full Hadamard representations (CNN + MLP) and MLP-only Hadamard representations. We have updated Fig. 10 on page 9 with the Median Human-Normalized scores. Using the full Hadamard representation, we show a significant improvement over the novel PQN baseline over 51 games.

In addition, also as per reviewer requests, we are running the CReLU baseline on the 51 Atari environments and plan on adding it to the PQN results in Fig. 10 tomorrow before the deadline!

Comment

Dear reviewers,

Based on the reviews, we have done a significant amount of additional experiments since the first manuscript. We have compared against the Rational activation in DQN on 8 Atari games, and compare against the CReLU in PQN on 51 Atari games. In PQN, we significantly outperform CReLU, which, like the Hadamard, has nearly double the parameters of the baseline. Interestingly, in PQN, switching from normal hyperbolic tangents to Hadamard hyperbolic tangents more than doubles the performance, without any hyperparameter changes. Individual games for PQN can now also be seen in Appendix D.2.

Furthermore, we have added a large section on Page 9 explaining the difference in the effects on the next layer for dying ReLUs and dying tanhs. Specifically, we show that dying tanhs essentially turn weights into biases.

Based on each reviewer's comments, we have also made minor changes throughout the text. We have highlighted all changes in red.

We hope to have clarified the reviewers' concerns, specifically in the empirical evaluation of the Hadamard representation.

AC Meta-Review

In this paper, the authors propose a novel activation function based on Hadamard representations (HR) to address limitations of traditional activation functions in reinforcement learning, such as vanishing gradients and dying neurons. The authors demonstrate improved learning efficiency and representation capacity using HR in Deep Q-Networks (DQN) and Proximal Policy Optimization (PPO).

The major concerns raised by the reviewers for the first draft lie in:

1. Limited evaluation and comparison: the empirical study in the first draft covers only a limited set of Atari games and is missing comparisons with other activation functions.

2. Weak justification in RL: the analysis in the "Preventing Dying Neurons" section is not related to RL tasks, making the claim that HR is particularly suitable for RL feel forced.

In the rebuttal period, the authors completed comparisons on more Atari games, partially addressing the first weakness. However, it is not clear whether the approach is universally applicable to other environments without experiments on the MuJoCo and DMC suites. Meanwhile, it is not clear whether the dying-neuron issue is specific to RL or also present in supervised learning. I suggest the authors further investigate the issue and improve the draft.

Additional Comments from Reviewer Discussion

In the rebuttal period, the authors completed comparisons on more Atari games, partially addressing the limited evaluation and comparison issue. The draft has been largely improved.

Final Decision

Reject