PaperHub
Rating: 6.5/10 (Poster, 4 reviewers)
Individual scores: 6, 6, 8, 6 (min 6, max 8, std dev 0.9)
Confidence: 3.0
ICLR 2024

DEEP NEURAL NETWORK INITIALIZATION WITH SPARSITY INDUCING ACTIVATIONS

Submitted: 2023-09-22 · Updated: 2024-03-13
TL;DR

Sparsity inducing nonlinear activations are designed, analysed to be stable for training, and shown to be effective with hidden layer sparsity as high as 85%.

Abstract

Inducing and leveraging sparse activations during training and inference is a promising avenue for improving the computational efficiency of deep networks, which is increasingly important as network sizes continue to grow and their application becomes more widespread. Here we use the large width Gaussian process limit to analyze the behaviour, at random initialization, of nonlinear activations that induce sparsity in the hidden outputs. A previously unreported form of training instability is proven for arguably two of the most natural candidates for hidden layer sparsification; those being a shifted ReLU ($\phi(x)=\max(0, x-\tau)$ for $\tau\ge 0$) and soft thresholding ($\phi(x)=0$ for $|x|\le\tau$ and $x-\text{sign}(x)\tau$ for $|x|>\tau$). We show that this instability is overcome by clipping the nonlinear activation magnitude, at a level prescribed by the shape of the associated Gaussian process variance map. Numerical experiments verify the theory and show that the proposed magnitude clipped sparsifying activations can be trained with training and test fractional sparsity as high as 85% while retaining close to full accuracy.
Keywords

Deep neural network, random initialisation, sparsity, Gaussian process
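As a quick reference for the activations discussed in the abstract and throughout the reviews below, here is a minimal NumPy sketch. The clipped forms, which cap the output magnitude at a level m, are one natural reading of the description above, and the τ and m values are taken from the experiment tables quoted in the discussion; none of this is the authors' code.

```python
import numpy as np

def relu_tau(x, tau):
    """Shifted ReLU: phi(x) = max(0, x - tau)."""
    return np.maximum(0.0, x - tau)

def soft_threshold(x, tau):
    """Soft thresholding: 0 for |x| <= tau, x - sign(x) * tau otherwise."""
    return np.sign(x) * np.maximum(0.0, np.abs(x) - tau)

def crelu(x, tau, m):
    """Clipped shifted ReLU (one natural form of CReLU_{tau,m}): output capped at m."""
    return np.minimum(relu_tau(x, tau), m)

def cst(x, tau, m):
    """Clipped soft thresholding (one natural form of CST_{tau,m}): |output| capped at m."""
    return np.clip(soft_threshold(x, tau), -m, m)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.standard_normal(100_000)   # pre-activations at initialisation are approximately Gaussian
    for name, y in [("ReLU_tau      ", relu_tau(z, 0.52)),
                    ("ST_tau        ", soft_threshold(z, 1.04)),
                    ("CReLU_{tau,m} ", crelu(z, 0.52, 1.05)),
                    ("CST_{tau,m}   ", cst(z, 1.04, 0.81))]:
        print(name, "fraction of zeros:", round(float(np.mean(y == 0.0)), 2))
```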

Reviews and Discussion

Official Review (Rating: 6)

The paper studies how very deep neural networks, including densely connected and convolutional networks, behave at initialization when using sparsity inducing activation functions. The two natural sparsity-inducing functions studied in the paper are the shifted ReLU activation, which is just a ReLU with a fixed bias, and the soft thresholding function, an activation that evaluates to zero in some fixed interval. The main result shows that these activations make the initialization unstable for very deep networks. This instability can be fixed by using a clipped version of these activation functions. The authors show some experiments, demonstrating that deep networks can be trained with a clipped version of the above activation functions, with minor drop in accuracy.

Strengths

Using sparse networks, with sparse activations and/or sparse weights, is an important field of study, and it seems that the paper gives some contributions on how the activation function of the network should be adapted to support training networks with sparse activations. This work can potentially serve as a basis for future works on building sparse neural networks. The theoretical analysis of sparse activation functions, their problematic behavior at initialization of deep networks, and the solution of clipping the activations is to my understanding novel.

Weaknesses

The main weakness I find in the paper is that while the motivation for the paper comes from a practical perspective, namely building neural networks with sparse activations that can be implemented more efficiently in practice, it seems that the applicability of the results is not clear. To my understanding, the results only apply for very deep neural networks (the experiments use 100-layer networks). The authors should clarify whether or not their results apply to networks of reasonable depth. Specifically, it would be good to show some experiment for networks of reasonable depth and show how the activation choice affects the behavior. It seems that in this setting depth is only hurting performance, so while it is in theory interesting to analyze how such deep networks should be trained, it seems that the applicability of this method is limited.

The authors should also discuss the effect of adding residual connections on the stability of network training. As residual networks have been the main solution for training very deep networks, the authors should clarify whether their results also apply to residual networks.

Additionally, it seems that some of the experiments were left out of the main paper, and only appear in the appendix (for example, studying the CST activation and comparing the clipped activations with non-clipped ones). These are key experiments in the paper and should appear in the main text.

Questions

See above.

==================

Score is updated after author response.

Comment

We are very grateful to the reviewer for their careful reading of our paper, thoughtful feedback, and constructive suggestions. Below we respond to the concerns and/or questions raised.


Comment:

The main weakness I find in the paper is that while the motivation for the paper comes from a practical perspective, namely building neural networks with sparse activations that can be implemented more efficiently in practice, it seems that the applicability of the results is not clear. To my understanding, the results only apply for very deep neural networks (the experiments use 100-layer networks). The authors should clarify whether or not their results apply to networks of reasonable depth. Specifically, it would be good to show some experiment for networks of reasonable depth and show how the activation choice affects the behavior. It seems that in this setting depth is only hurting performance, so while it is in theory interesting to analyze how such deep networks should be trained, it seems that the applicability of this method is limited.

Response:

While we agree that our paper provides primarily a theoretical contribution -- insofar as the experiments presented are on relatively simple tasks, and were primarily designed to verify the theory -- we believe that the contributions are important and of interest to the community nonetheless. The key goal of EoC initialisations is to preserve signal propagation and avoid vanishing and exploding gradients, and the best proof of the ability to do this is to show that we can train networks with very many layers. This is what motivated our choice of experiments. We should also note that our work is not unique in this regard, and similar experimental setups are relatively common in experiments which develop EoC theory.

Having said that, we think that your comment is fair and important, and that it is worthwhile to test how the comparisons play out in practice when the networks are not quite so deep. To explore this, we have repeated the DNN experiments from the paper, but with depth=30 instead of depth=100. The results are shown in the table below, and are now included in Tables 4 and 5 in Appendix H.

Comment

Experimental results with 30-layer DNNs:

| Activation | $s$ | $\tau$ | mean accuracy | accuracy std | mean sparsity | sparsity std |
|---|---|---|---|---|---|---|
| $ReLU_\tau$ | 0.70 | 0.52 | 0.93 | 0.01 | 0.67 | 0.01 |
| $ReLU_\tau$ | 0.80 | 0.84 | 0.60 | 0.46 | 0.47 | 0.42 |
| $ReLU_\tau$ | 0.85 | 1.04 | 0.27 | 0.37 | 0.18 | 0.38 |
| $ReLU_\tau$ | 0.90 | 1.28 | 0.10 | 0.00 | 0.02 | 0.00 |
| $ST_\tau$ | 0.70 | 1.04 | 0.27 | 0.38 | 0.14 | 0.31 |
| $ST_\tau$ | 0.80 | 1.28 | 0.10 | 0.00 | 0.00 | 0.00 |
| $ST_\tau$ | 0.85 | 1.44 | 0.10 | 0.00 | 0.01 | 0.00 |
| $ST_\tau$ | 0.90 | 1.64 | 0.27 | 0.37 | 0.19 | 0.38 |
| Activation | $s$ | $\tau$ | $m$ | $V'(q^*)$ | $V''(q^*)$ | mean accuracy | accuracy std | mean sparsity | sparsity std |
|---|---|---|---|---|---|---|---|---|---|
| $CReLU_{\tau,m}$ | 0.70 | 0.52 | 1.05 | 0.5 | -0.37 | 0.91 | 0.01 | 0.70 | 0.01 |
| $CReLU_{\tau,m}$ | 0.70 | 0.52 | 1.45 | 0.7 | -0.31 | 0.92 | 0.00 | 0.70 | 0.01 |
| $CReLU_{\tau,m}$ | 0.70 | 0.52 | 2.05 | 0.9 | -0.04 | 0.93 | 0.01 | 0.69 | 0.01 |
| $CReLU_{\tau,m}$ | 0.80 | 0.84 | 0.89 | 0.5 | -0.24 | 0.90 | 0.02 | 0.80 | 0.01 |
| $CReLU_{\tau,m}$ | 0.80 | 0.84 | 1.27 | 0.7 | -0.12 | 0.92 | 0.01 | 0.80 | 0.01 |
| $CReLU_{\tau,m}$ | 0.80 | 0.84 | 1.85 | 0.9 | 0.21 | 0.92 | 0.01 | 0.78 | 0.02 |
| $CReLU_{\tau,m}$ | 0.85 | 1.04 | 0.81 | 0.5 | -0.14 | 0.90 | 0.01 | 0.85 | 0.01 |
| $CReLU_{\tau,m}$ | 0.85 | 1.04 | 1.17 | 0.7 | 0.02 | 0.91 | 0.01 | 0.84 | 0.01 |
| $CReLU_{\tau,m}$ | 0.85 | 1.04 | 1.74 | 0.9 | 0.41 | 0.92 | 0.01 | 0.84 | 0.01 |
| $CReLU_{\tau,m}$ | 0.90 | 1.28 | 0.72 | 0.5 | 0.00 | 0.85 | 0.06 | 0.90 | 0.01 |
| $CReLU_{\tau,m}$ | 0.90 | 1.28 | 1.06 | 0.7 | 0.23 | 0.90 | 0.03 | 0.89 | 0.01 |
| $CReLU_{\tau,m}$ | 0.90 | 1.28 | 1.61 | 0.9 | 0.69 | 0.91 | 0.01 | 0.74 | 0.02 |
| $CST_{\tau,m}$ | 0.70 | 1.04 | 0.81 | 0.5 | -0.14 | 0.91 | 0.01 | 0.70 | 0.00 |
| $CST_{\tau,m}$ | 0.70 | 1.04 | 1.17 | 0.7 | 0.02 | 0.92 | 0.01 | 0.69 | 0.01 |
| $CST_{\tau,m}$ | 0.70 | 1.04 | 1.74 | 0.9 | 0.41 | 0.93 | 0.01 | 0.66 | 0.02 |
| $CST_{\tau,m}$ | 0.80 | 1.28 | 0.72 | 0.5 | 0.00 | 0.90 | 0.01 | 0.80 | 0.01 |
| $CST_{\tau,m}$ | 0.80 | 1.28 | 1.06 | 0.7 | 0.23 | 0.91 | 0.01 | 0.79 | 0.01 |
| $CST_{\tau,m}$ | 0.80 | 1.28 | 1.61 | 0.9 | 0.69 | 0.92 | 0.01 | 0.49 | 0.16 |
| $CST_{\tau,m}$ | 0.85 | 1.44 | 0.67 | 0.5 | 0.11 | 0.88 | 0.02 | 0.84 | 0.01 |
| $CST_{\tau,m}$ | 0.85 | 1.44 | 1.00 | 0.7 | 0.39 | 0.91 | 0.01 | 0.84 | 0.01 |
| $CST_{\tau,m}$ | 0.85 | 1.44 | 1.53 | 0.9 | 0.89 | 0.27 | 0.37 | 0.44 | 0.21 |
| $CST_{\tau,m}$ | 0.90 | 1.64 | 0.62 | 0.5 | 0.28 | 0.77 | 0.17 | 0.90 | 0.01 |
| $CST_{\tau,m}$ | 0.90 | 1.64 | 0.93 | 0.7 | 0.63 | 0.91 | 0.01 | 0.89 | 0.01 |
| $CST_{\tau,m}$ | 0.90 | 1.64 | 1.44 | 0.9 | 1.20 | 0.11 | 0.00 | 0.25 | 0.01 |

We can see from these new results that the key conclusions from the 100-layer experiments hold true with 30 layers too. While $ReLU_\tau$ and $ST_\tau$ networks fail to train consistently once sparsity reaches 80% and 70% respectively, $CReLU_{\tau, m}$ and $CST_{\tau, m}$ networks can train consistently to high accuracy with activation sparsity of 90%.


Comment:

"The authors should also discuss the effect of adding residual connections on the stability of the network training. As residual-networks has been the main solution for training very deep networks, the authors should clarify whether their results also apply for residual networks."

Response:

In this paper we develop and analyse EoC theory for sparsifying activation functions for feedforward and convolutional networks only. It unfortunately does not directly apply to residual networks, which we agree were a very important architectural development for stably training deep networks; ResNets would need their own EoC theory worked out for these activation functions. This is an important and promising avenue for future work, and we have noted it as such in the conclusion. However, we feel that it falls beyond the scope of the present paper, and that the insights and contributions of the paper in its current form merit publication in their own right.

Comment

Comment:

"Additionally, it seems that some of the experiments were left out of the main paper, and only appear in the appendix (for example, studying the CST activation and comparing the clipped activations with non-clipped ones). These are key experiments in the paper and should appear in the main text."

Response:

Thank you for this suggestion. Putting $ReLU_\tau$, $ST_\tau$, and $CST_\tau$ experimental results in the Appendix was originally done purely due to space constraints, but we acknowledge and agree that it is important for all key experimental results to appear in the main paper. In order to make space, we simplified and combined Figures 2 and 3, and so we have now included all the important experimental results in Table 2.

Comment

Thank you for your response.

These experiments are indeed more convincing, showing the potential applicability of the suggested sparsity inducing activation function. Therefore, I have raised my score.

About the comparison to ResNets: I understand that presenting EOC analysis for ResNets is beyond the scope of the paper, but I believe that the authors should add experiments comparing networks with and without residual connections, studying how the modification of the activation function interacts with the residual connections. It is interesting to understand whether the stability issue of sparse networks can be solved by adding residual connections.

Official Review (Rating: 6)

The authors study sparsity-inducing non-linear activation functions in neural network training. They provide a theoretical analysis of the instability effects of using shifted ReLU and SoftThreshold when training feedforward networks and CNNs, motivating their clipped versions of these functions. The authors use the previously introduced EoC initialization and show why training is stable, and why it becomes unstable with the introduction of a modified ReLU activation function that should induce sparsity. To remedy the instability in training, a further modification (clipping) is introduced and proven to be theoretically stable. They then demonstrate the feasibility of their CReLU and CST activation functions for training deep and wide networks on MNIST and CIFAR10, showing only minor degradation in prediction accuracy with tunable levels of activation sparsity.

Overall, the paper is nicely written and relatively easy to follow, despite the math-heavy theoretical section. The argumentation on the instability of the shifted ReLU and SoftThreshold seems valid, and the experiments, though not extensive, provide a proof of concept of the authors' claim. The idea of clipping the activation functions to achieve stability is as simple as it is effective, providing a valid contribution that can actually be translated into application. However, a bit more in-depth evaluation of certain aspects would be nice. The findings are interesting, though the presentation is lacking and the introduced concept deserves more exploration.

Strengths

  • The proposed clipped functions are an easy and natural extension of already existing activation functions.
  • The introduced parameters tau and m are able to control sparsity and training stability as shown theoretically and experimentally.

Weaknesses

  • The stddevs in Table 2 are a bit odd to me; either higher numerical precision is needed, or it needs to be explained why there are many cases of zero std.
  • Results on ST are only shown in the appendix. They should be shown and discussed in the main paper, given that they are an essential part in the rest of the manuscript.
  • The higher sparsity regime in Table 2 (s=0.85) needs to be explored/explained more; there are interesting things happening.
  • There is no comparison to existing methods, but the authors clearly describe the lack of related research.
  • No source code provided; implementation details on experiments appear only in the appendix.

Questions

  • I don’t fully understand the meaning of Figure 4 and it is not sufficiently discussed in the manuscript.
  • The important EoC concept is not motivated and explained enough.
Comment

We are very grateful to the reviewer for their careful reading of our paper, thoughtful feedback, and constructive suggestions. Below we respond to the concerns and/or questions raised.


Comment:

"The stddev in table 2 are a bit odd to me, either a higher numerical precision is needed, or it needs to be explained why the are many cases of zero std."

Response:

Indeed, most of the 0.00 standard deviations arose simply because the results in the table were rounded to 2 decimal places. We have amended Table 2 to include higher precision to make the results clearer. A 0.00 standard deviation remains accurate at higher precision only in the cases where all of the seeds completely fail to train. Thank you for this suggestion.


Comment:

"Results on ST are only shown in the appendix. They should be shown and discussed in the main paper, given that they are an essential part in the rest of the manuscript."

Response:

Thank you for this suggestion. Putting $ReLU_\tau$, $ST_\tau$, and $CST_{\tau, m}$ experimental results in the Appendix was originally done purely due to space constraints, but we acknowledge and agree that it is important for all key experimental results to appear in the main paper. In order to make space, we simplified and combined Figures 2 and 3, and so we have now included all the important experimental results in Table 2.


Comment:

"The higher sparsity regime in table 2 (s=0.85) needs to be explored/explained more, there are interesting things happening."

Response:

We agree that the high sparsity regime is very interesting! But we suggest that our analysis of the impact of $s$ and $m$ on the shape of the corresponding variance maps, and the related failure modes, accounts for the results quite nicely. In particular, we see that initially increasing $m$ is necessary to achieve good accuracy at high sparsity, as expected. However, as predicted by the theory, increasing $m$ too much results in $V'(q^*)$ and $V''(q^*)$ together being too large, causing $q^*$ to fail to be stable in practice. This relates to your other question about Figure 4. The phenomenon described here is exactly what is illustrated in Figure 4, which shows the variance maps of $CReLU_{\tau,m}$ and $CST_{\tau, m}$ in the high sparsity, large $m$ regime. When the variance map curves up to trace or even cross the $q=V(q)$ line, the result is that $q^l$ does not converge to $q^*$, and instead remains larger than $q^*$ layer after layer. Figure 4 shows that the corresponding $\chi_1$ value in this case is larger than one, which causes exploding gradients and training failure. We have tweaked the wording at the bottom of page 8 in Section 4 to make it clearer which point from the main text Figure 4 is illustrating.
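To make this concrete, the following is a minimal numerical sketch of the layer-wise variance recursion (illustrative only, not taken from the paper's code). It assumes the standard mean-field/GP form $q^l = V(q^{l-1}) = \sigma_w^2\,\mathbb{E}_z[\phi(\sqrt{q^{l-1}}\,z)^2] + \sigma_b^2$ with $(\sigma_w, \sigma_b)$ set so that $\chi_1 = 1$ at a target $q^* = 1$, writes the clipped shifted ReLU as $\min(\max(x-\tau,0),m)$, and borrows $\tau$ and $m$ values close to the $s=0.90$ rows of the table above.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal(2_000_000)     # Monte Carlo samples for E_{z ~ N(0,1)}[...]

def crelu(x, tau, m):
    """Clipped shifted ReLU, one natural form: min(max(x - tau, 0), m)."""
    return np.minimum(np.maximum(x - tau, 0.0), m)

def eoc_params(tau, m, q_star=1.0):
    """Choose sigma_w^2, sigma_b^2 so that chi_1 = 1 and V(q_star) = q_star."""
    s = np.sqrt(q_star)
    e_grad2 = np.mean(((s * Z > tau) & (s * Z < tau + m)).astype(float))  # E[phi'(sqrt(q*) z)^2]
    sw2 = 1.0 / e_grad2
    sb2 = q_star - sw2 * np.mean(crelu(s * Z, tau, m) ** 2)
    return sw2, sb2

def variance_map(q, tau, m, sw2, sb2):
    return sw2 * np.mean(crelu(np.sqrt(q) * Z, tau, m) ** 2) + sb2

tau = 1.28                              # shift giving roughly 90% sparsity at q* = 1
for m in (0.72, 1.61):                  # a moderate and a large clipping level
    sw2, sb2 = eoc_params(tau, m)
    q = 1.5                             # start a little above the fixed point q* = 1
    for _ in range(100):                # propagate through 100 layers
        q = variance_map(q, tau, m, sw2, sb2)
    print(f"m = {m:.2f}: q after 100 layers = {q:.3f}   (q* = 1)")
```

In runs of this sketch the smaller clipping level pulls $q^l$ back to $q^*$, whereas the larger one leaves $q^l$ well above $q^*$ after 100 layers, mirroring the failure mode described above.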

If you still think something is not clear, could you perhaps explain what analysis you think is missing?


Comment:

No source code provided; implementation details on experiments appear only in the appendix.

Response:

The implementation details were left to the appendix due to space constraints. We ask that this please not be considered a weakness of our paper. We are working to make the source code available and will alert you if this is possible before the discussion period concludes.

Official Review (Rating: 8)

The paper aims to encourage high sparsity in the activations of neural networks with the motivation to reduce computational cost. To this end, the authors study the activation dynamics induced by common sparsity-inducing nonlinearities such as ReLU and Soft-Thresholding (ST) under random initialization. The authors, via the large width Gaussian process limit of neural networks, discover a training instability for ReLU and ST nonlinearities. They show that the instability can be resolved by clipping the outputs of these nonlinearities. The authors validate the theory through experiments and show that the modification allows training MLPs and CNNs with very sparse activations with little or no reduction in test accuracy.

Strengths

  • The writing is very clear and easy to follow.
  • The theory is elegant.
  • The theory works in practice and the authors effectively demonstrate being able to train neural networks while maintaining high activation sparsity.

Weaknesses

Minor weaknesses:

  • There could have been a study of the computational efficiency since that is the main motivation of the work.

Questions

Questions:

  • Is this the first time that the variance map equations are being derived for these non-linearities?

Minor:

  • Plots of Figure 2 are missing axis labels and the plot legends are not readable on printed paper.
Comment

We are very grateful to the reviewer for their careful reading of our paper, thoughtful and positive feedback, and constructive suggestions. Below we respond to the concerns and/or questions raised.


Comment:

"Is this the first time that the variance map equations are being derived for these non-linearities?"

Response:

Yes, to the best of our knowledge this is the first time they have been derived.


Comment:

"Plots of Figure 2 are missing axis labels and the plot legends are not readable on printed paper."

Response:

Thank you for highlighting this. We have added axis labels and enlarged the legend font.


Comment:

"There could have been a study of the computational efficiency since that is the main motivation of the work."

Response:

Unfortunately it was not possible for us to perform an empirical study of the computational efficiency gains due to sparser activations, because the sparse operations necessary to leverage sparse activations are not yet well supported on the accelerator hardware on which we are running our experiments. However, better support for efficient sparse operations is a high priority and ongoing research area for both deep learning hardware and software developers, given the potential performance gains. The number of flops in each matrix multiplication $AB$, for $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times d}$, in the forward pass could in theory shrink from $\mathcal{O}(mnd)$ to $\mathcal{O}(msd)$.
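As a rough illustration of this accounting (hypothetical layer shapes; here $s$ is read as the typical number of nonzero activation entries per column of $B$, only multiply-accumulates are counted, and sparse-format and hardware overheads are ignored):

```python
import numpy as np

def dense_macs(m, n, d):
    # one multiply-accumulate per (i, k, j) triple in A @ B
    return m * n * d

def sparse_macs(m, B):
    # only nonzero entries of B contribute; if each column of B has about s
    # nonzero activations, this is roughly m * s * d
    return m * np.count_nonzero(B)

rng = np.random.default_rng(0)
m, n, d = 512, 1024, 256                # hypothetical layer shapes
B = rng.standard_normal((n, d))
B[rng.random((n, d)) < 0.85] = 0.0      # ~85% activation sparsity, as in the experiments

print("dense MACs :", dense_macs(m, n, d))
print("sparse MACs:", sparse_macs(m, B),
      f"({np.count_nonzero(B) / B.size:.0%} of B nonzero)")
```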

Comment

Thank you for the response. I very much enjoyed reading your paper and I retain my positive opinion of the work.

Official Review (Rating: 6)

The paper studies the effect of sparsity on the activation function for deep neural network initialization using the existing dynamical isometry and edge of stability theory. In particular, the authors compute the so-called variance map and correlation maps for sparse activating functions, namely, shifted ReLU and soft thresholding, and interpret the shape of these maps, in particular the values $V'(q^*)$ and $V''(q^*)$, to explain the failure. Then they propose magnitude clipping as a remedy and empirically show that with these magnitude-clipped sparse activation functions, it is possible to train the deep net without losing test accuracy and with high test fractional sparsity.

Strengths

The paper introduces two natural classes of activation functions with a tunable parameter $\tau$. I think understanding which activation function works better for which purpose is an active and fascinating area which I also believe appeals to general interest. Sparsity is a particularly important goal to achieve for modern large-scale deep learning. Making sure that the network has non-exploding and non-vanishing gradients at initialization is indeed a sufficient condition for the applicability of an activation function.

The introduction is well-written.

On the flip side, I am not entirely sure why sparse ReLU fails to train. Table 2 only shows results for magnitude clipping for CReLU and the usual ReLU with $\tau=0$.

Weaknesses

I have three major questions that I wasn't able to resolve by reading the main text only.

  1. What is the criterion on $V_\phi$ for successful training? For example, let's see Figure 2. Which of the shapes are good and expected to train well vs which are the ones that are expected to fail? I see that for $\tau=1$ the curves have higher curvature. In particular, the blue curve intersects the line $x=y$ at one point where the derivative is non-zero but the curvature is positive. Is this expected to fail because the derivative is non-zero? I found the explanations in the text somehow repetitive and hard to parse. Can the authors explain the criterion on $V_\phi$ in words just from Figure 2?

  2. Can the authors please provide experiments with CReLU $m=0$ as well and also for ReLU in Table 2? Why is there only one row for ReLU? As the table stands now, I am not convinced that sparse activation functions without magnitude clipping fail to train. Is the 'unstable training dynamics' reported in the paper for very large $s$ and small $m$ as claimed in the conclusion?

Questions

Also, I do not understand the heuristics given in Section 3.1 for how to choose $m$. I understand the dependence of $V'(q^*)$ and $V''(q^*)$ on the magnitude value $m$ is non-trivial from Figure 4. Still, the curves follow regular shapes so maybe it is possible to give simple heuristics based on Figure 4?

I will consider increasing my score based on the author's response to my questions.


Post-rebuttal: The authors sufficiently addressed my concerns. I am leaning acceptance.

Comment

We are very grateful to the reviewer for their careful reading of our paper, thoughtful feedback, and constructive suggestions. Below we respond to the concerns and/or questions raised.


Comment:

"I am not entirely sure why sparse ReLU fails to train. Table 2 only shows results for magnitude clipping for CReLU and the usual ReLU with τ=0\tau=0." and "What is the criterion on VϕV_\phi for successful training? For example, let's see Figure 2. Which of the shapes are good and expected to train well vs which are the ones that are expected to fail? I see that for τ=1\tau=1 the curves have higher curvature. In particular, the blue curve intersects the line x = y at one point where the derivative is non-zero but the curvature is positive. Is this expected to fail because the derivative is non-zero? I found the explanations in the text somehow repetitive and hard to parse. Can the authors explain the criterion on VϕV_\phi in words just from Figure 2?"

Response:

According to the theory, two things are required for stable training:

  1. we require that $\chi_1 = 1$, which guards against vanishing and exploding gradients as well as instability to input perturbations at initialisation.

  2. we require that the variance map $V(q)$ is sufficiently stable around $q^*$ such that $q^l$ does in practice converge to and remain at $q^*$. This is a necessary requirement for the EoC initialisation in practice, since $q^*$ is used in the calculation of the values of $\sigma_w$ and $\sigma_b$ in order to ensure $\chi_1 = 1$. If in practice $q^l$ does not stably converge to $q^*$, and instead grows larger, then we no longer have $\chi_1=1$ and we experience the associated training failure modes of exploding or vanishing gradients. (Both criteria are written out as equations just below.)
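In symbols, using the standard mean-field/Gaussian-process forms that these quantities take (this is only a restatement of the two criteria above, with notation as in the paper):

$$\chi_1 = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal{N}(0,1)}\big[\phi'(\sqrt{q^*}\,z)^2\big] = 1, \qquad q^{l} = V(q^{l-1}) = \sigma_w^2\,\mathbb{E}_{z\sim\mathcal{N}(0,1)}\big[\phi(\sqrt{q^{l-1}}\,z)^2\big] + \sigma_b^2 \;\to\; q^* \text{ as } l \to \infty.$$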

The problem we have shown with $ReLU_\tau$ and $ST_\tau$ is precisely that it is impossible for them to satisfy both criteria: $\chi_1=1$ and $q^l$ converging stably to $q^*$.

In light of the above, we agree that the original Figures 2 and 3 could be misleading, since they included variance maps for $ReLU_\tau$ and $ST_\tau$ which did not correspond to EoC initialisation. To address your question and to improve clarity on this point we have combined Figures 2 and 3 into a single figure, now showing only the variance maps for both activation functions corresponding to $\chi_1=1$. We have also edited Section 2 of the paper to make this point more clearly (in particular, see paragraphs 2, 3, 4, and 5 of Section 2 in the updated version). We think it reads better now, so thank you for this helpful feedback.


Comment:

"Why is there only one row for ReLU?"

Response:

Standard ReLU was simply included in Table 2 as a baseline, a commonly used activation function against which we can compare the accuracy and activation sparsity in our experiments.

Comment

Comment:

"Can the authors please provide experiments with CReLU m = 0 as well and also for ReLU in Table 2? ... As the table stands now, I am not convinced that sparse activation functions without magnitude clipping fail to train."

Response:

Indeed Table 2 is not intended to provide evidence that $ReLU_\tau$ and $ST_\tau$ fail to train at high sparsities. Due to space constraints, the experiments showing that $ReLU_\tau$ and $ST_\tau$ networks do indeed fail to train at higher sparsities were included only in the appendix, in Table 3 at the top of page 17 of the original version. We fully agree that the pointer to these experimental results was not sufficiently clear (it was made on page 4 at the end of the following sentence: "To make matters worse, ... and consequently prove effectively impossible to train for large sparsity; see App. C."). Having made some additional space available by merging Figures 2 and 3 as described above, we have now included all experimental results in Table 2. For ease of reference, these are the experimental results for $ReLU_\tau$ and $ST_\tau$:

| Activation | $s$ | $\tau$ | $m$ | $V'_\phi$ at $q^*$ | $V''_\phi$ at $q^*$ | DNN accuracy mean | DNN accuracy std | DNN sparsity mean | DNN sparsity std | CNN accuracy | CNN sparsity |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $ReLU_\tau$ | 0.50 | 0.00 | N/A | 1.0 | 0.0 | 0.94 | 0.002 | 0.50 | 0.001 | 0.70 | 0.52 |
| $ReLU_\tau$ | 0.60 | 0.25 | N/A | 1.0 | 0.12 | 0.76 | 0.37 | 0.49 | 0.27 | 0.68 | 0.6 |
| $ReLU_\tau$ | 0.70 | 0.52 | N/A | 1.0 | 0.3 | 0.10 | 0.00 | 0.00 | 0.00 | 0.1 | 0.0 |
| $ST_\tau$ | 0.5 | 0.67 | N/A | 1.0 | 0.43 | 0.10 | 0.00 | 0.00 | 0.00 | 0.1 | 0.0 |
| $ST_\tau$ | 0.6 | 0.84 | N/A | 1.0 | 0.59 | 0.10 | 0.00 | 0.00 | 0.00 | 0.1 | 0.0 |
| $ST_\tau$ | 0.7 | 1.04 | N/A | 1.0 | 0.81 | 0.10 | 0.00 | 0.00 | 0.00 | 0.1 | 0.0 |

The results clearly show the failure of the $ReLU_\tau$ networks to train once sparsity reaches 70%, and the failure of $ST_\tau$ at even 50% sparsity.

Note that $CReLU_{\tau,m}$ with $m=0$ would just be the zero function for any $\tau$, and so would not work as a candidate activation function; perhaps we are misunderstanding the request for "experiments with CReLU m = 0".


Comment:

"Is the `unstable training dynamics' reported in the paper for very large s and small m as claimed in the conclusion? Is the 'unstable training dynamics' reported in the paper for very large s and small m as claimed in the conclusion?"

Response:

Table 2 shows the failure to consistently train when combining the largest $s$ together with the smallest $m$ considered in the experiments. Table 2 also shows that stability is recovered by increasing $m$ and/or decreasing $s$ sufficiently.


Comment:

"Also, I do not understand the heuristics given in Section 3.1 for how to choose mm. I understand the dependence of V(q)V'(q^*) and V(q)V''(q^*) on the magnitude value m is non-trivial from Figure 4. Still, the curves follow regular shapes so maybe it is possible to give simple heuristics based on Figure 4?"

Response:

As you say, the dependence of the shape of $V_\phi$ on $s$ and $m$ is non-trivial, and we are hesitant to suggest a simple and clean heuristic, since as far as we know none can be derived analytically. Our experiments in Table 2 suggest that a decent starting point for fully connected networks would be to maximise $m$ for a given sparsity subject to ensuring that $V''(q^*) \lessapprox 0.2$, while allowing $V''(q^*) \lessapprox 0.4$ appears to work in the CNN case.
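For what it is worth, here is a minimal sketch of how one might operationalise that rule of thumb numerically. It assumes the standard GP variance map, writes the clipped shifted ReLU as $\min(\max(x-\tau,0),m)$, fixes the target $q^*=1$, and estimates $V''(q^*)$ by a central finite difference; none of this is taken from the paper's code, and the 0.2 cap is only the fully connected rule of thumb above.

```python
from math import erf, exp, pi, sqrt

def Phi(x):   # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pdf(x):   # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

def e_phi2(q, tau, m):
    """E_z[phi(sqrt(q) z)^2] in closed form, for phi(x) = min(max(x - tau, 0), m)."""
    s = sqrt(q)
    a, b = tau / s, (tau + m) / s
    i0 = Phi(b) - Phi(a)                     # int_a^b pdf(z) dz
    i1 = pdf(a) - pdf(b)                     # int_a^b z pdf(z) dz
    i2 = i0 - (b * pdf(b) - a * pdf(a))      # int_a^b z^2 pdf(z) dz
    return q * i2 - 2.0 * s * tau * i1 + tau * tau * i0 + m * m * (1.0 - Phi(b))

def variance_map(q, tau, m, q_star=1.0):
    # EoC choice of (sigma_w^2, sigma_b^2): chi_1 = 1 and V(q_star) = q_star
    sw2 = 1.0 / (Phi((tau + m) / sqrt(q_star)) - Phi(tau / sqrt(q_star)))
    sb2 = q_star - sw2 * e_phi2(q_star, tau, m)
    return sw2 * e_phi2(q, tau, m) + sb2

def v_second(tau, m, q_star=1.0, h=1e-2):
    """Central finite-difference estimate of V''(q^*)."""
    return (variance_map(q_star + h, tau, m) - 2.0 * variance_map(q_star, tau, m)
            + variance_map(q_star - h, tau, m)) / (h * h)

def pick_m(tau, cap=0.2):
    """Largest m on a coarse grid with V''(q^*) at or below the cap."""
    grid = [round(0.05 * k, 2) for k in range(4, 61)]    # m in [0.2, 3.0]
    feasible = [m for m in grid if v_second(tau, m) <= cap]
    return max(feasible) if feasible else None

print(pick_m(tau=1.28))   # tau giving roughly 90% activation sparsity at q* = 1
```

Because $V(q)$ is available in closed form for this activation, the finite-difference step can be kept small without numerical noise; an analogous scan can be run for the clipped soft threshold after swapping in its variance map, or with the 0.4 cap for CNNs.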


Comment:

"I will consider increasing my score based on the author's response to my questions."

Response:

Thank you very much. We hope that we have sufficiently addressed your questions and suggestions.

Comment

Thank you to all reviewers once again for your valuable time and insightful comments which have helped us improve the paper further.

As the deadline for the Author/Reviewer discussion is approaching, we would very much appreciate it if you could let us know whether you are satisfied with our answers to your questions and the corresponding updates we have made to the paper, reflected in the most recent revision. If so, we would humbly ask that you consider adjusting your scores as you feel appropriate. We are of course happy to provide any additional clarifications that you may need.

AC Meta-Review

The paper uses ideas from the infinite-width limit and Gaussian process to propose stable alternatives for sparsity-inducing activations. The overall response by the reviewers is positive. The results are novel, and the empirical analysis for justifying the ideas has been performed thoroughly. There were a few concerns, such as the criterion on $V_\phi$ for successful training and the heuristic for choosing $m$, which were sufficiently addressed by the authors.

Why Not a Higher Score

The empirical analysis is done meticulously, justifying each step in the construction. Some theoretical results are novel but need to be stronger for a higher score.

Why Not a Lower Score

The paper is good and the reviewers all agree on acceptance.

Final Decision

Accept (poster)