Neural Synaptic Balance
The paper presents a theory of neural synaptic balance based on systematic relationships between the input weights and the output weights of neurons.
Abstract
Reviews and Discussion
This paper aims to study and explain the phenomenon of neural synaptic balance, where a neuron is balanced when the total norm of its input weights equals the total norm of its output weights. In particular, the authors study why and when randomly initialized balanced models (i.e., models whose neurons are all balanced) tend to remain balanced at the end of training. The study takes into account many different components of neural networks (activations, layer types, regularisers).
Strengths
The study is very comprehensive, and sheds light on some interesting properties of deep neural networks.
Weaknesses
While it is true that, as the authors state in the conclusion, neural synaptic balance is a theory that is interesting on its own, I would encourage the authors to expand the discussion on possible application domains of this theory. Why is it interesting? What are the advantages that a complete understanding of such phenomena could bring to the table?
Questions
Backpropagation is not biologically plausible, so does it really make sense to state that the methods proposed by the authors are, if they are then applied to backprop-based models? I would suggest either removing such a discussion, or expanding on it by showing, even empirically on small models, that the results extend to different kinds of neural networks where both neural activities and synapses are updated locally in a bio-plausible way (e.g., predictive coding). A third way of addressing this would be to add a discussion of the issue without running the experiments.
Limitations
No concerns here
We thank reviewer VcLX for the positive review of this work and insightful comments.
While we have focused here on developing the theory of neural synaptic balance, neural synaptic balance has practical applications. It can be viewed as an additional, complementary method of regularization, on par with other methods such as dropout. It is based on a rigorous theory that connects it to convex optimization. Finally, it may have additional applications in biological or neuromorphic systems, due to the locality of the balancing operations. The interesting fact about neural balance is that, while balancing a single neuron may ruin the balance of adjacent neurons, iterated stochastic balancing of all the neurons in a network leads to a unique, stable configuration of the weights (the globally balanced state).
Regarding the biological implausibility of backpropagation, we will add a discussion to the final version. Note that the balancing algorithm presented in our work can be applied to a network after training with any learning rule. In other words, Theorem 5 does not depend on the training algorithm, and balancing can be applied to any set of weights, at any time, during or after learning, and with any cost function. [For example, one could train a network with L2 regularization and apply L1 balancing to the weights after the training is complete.]
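To make the balancing operation concrete, here is a minimal sketch (our own illustration, not code from the paper) of balancing a single neuron under a generic Lp cost, assuming a fully connected neuron with a homogeneous (BiLU-type) activation and ignoring biases; the function name `balance_neuron` is hypothetical.

```python
import numpy as np

def balance_neuron(w_in, w_out, p=2):
    """Rescale one neuron's weights so its incoming and outgoing Lp costs match.

    For a neuron with a homogeneous (BiLU-type) activation, multiplying the
    incoming weights by lambda and the outgoing weights by 1/lambda leaves the
    network's input-output function unchanged.  Minimizing
    lambda^p * ||w_in||_p^p + lambda^(-p) * ||w_out||_p^p over lambda gives
    lambda = (||w_out||_p^p / ||w_in||_p^p)^(1 / (2p)).
    """
    c_in = np.sum(np.abs(w_in) ** p)
    c_out = np.sum(np.abs(w_out) ** p)
    lam = (c_out / c_in) ** (1.0 / (2 * p))
    return lam * w_in, w_out / lam

# Example: L1 balancing applied to weights obtained with any training procedure.
rng = np.random.default_rng(0)
w_in, w_out = rng.normal(size=20), rng.normal(size=5)
w_in_b, w_out_b = balance_neuron(w_in, w_out, p=1)
assert np.isclose(np.abs(w_in_b).sum(), np.abs(w_out_b).sum())
```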
The authors present a theory of neural synaptic balance, defined as the condition in which the total cost of the input weights to a neuron equals the total cost of its output weights. This is different from the well-studied E/I balance in the neuroscience and machine learning literature. The authors show mathematical derivations of how to balance a neuron without affecting the output of the network, and show that balancing a network is a convex optimization process.
Strengths
The paper is overall clear and detailed, the mathematical proofs are sound, and the paper is structured well, moving from straightforward claims to less trivial points.
Weaknesses
The paper is about neural synaptic balance, but the authors do not provide convincing motivation for why we should care about such balancing. As they mention, adding a simple L2 regularizer will balance the network naturally (in a distributional sense, not necessarily each neuron individually) during training and has other well-known benefits, so the elaborate mathematical derivations on the general balancing process seem redundant. In addition, in the authors' own plots, unbalanced networks sometimes outperform the balanced networks (e.g., Fig. 3E), which just emphasizes the point. One of the mentioned motivations is biological neurons, but the authors claim that biological neural data about synapses do not exist. However, they could test their hypothesis against the currently available connectomes, e.g., the Drosophila fly brain connectome. They mention spiking networks, but the notion of input-output homogeneity is unclear in spiking networks. Finally, physical neurons' energy consumption is mentioned without details.
Questions
Why is the energy consumption of physical neurons lower when they are balanced? Why not just have a regularizer to keep the overall activation low and weights small? Why does each neuron need to be balanced separately?
Limitations
The whole framework is specific to BiLU neurons or perhaps to other power-law functions. The relevance to spiking neurons is therefore questionable. It is also questionable as a general principle for machine learning.
We thank reviewer QTyq for the positive review of this work and insightful comments.
"Why is the energy consumption of physical neurons lower when they are balanced?" Because the balancing algorithm also decreases the norm of weights.
"Why not just have a regularizer to keep the overall activation low and weights small?" The balancing algorithm is indeed another way to achieve a balanced state while keeping the overall activation low and the weights small; it shows that a regularizer is not the only route to such a state.
"Why does each neuron need to be balanced separately?" It is more elegant and biologically (or neuromorphically) more plausible to be able to achieve a global balanced state through local rules that each neuron can apply independently of all the other neurons in the network, at any point in time, in a completely asynchronous way. In other words, neurons do not need to exchange information with each other in order to achieve a global balanced state. Global order emerges from local order.
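As an illustration of this point, below is a minimal sketch (our own, assuming a two-hidden-layer ReLU MLP with L2 costs and no biases, not the paper's code) of the stochastic balancing procedure: each step picks one hidden neuron at random and rescales only its own fan-in and fan-out, yet the per-neuron balance gaps shrink toward zero, consistent with the convergence result stated in the paper.

```python
import numpy as np

def l2_lambda(w_in_row, w_out_col):
    """L2-optimal rescaling factor for one neuron's fan-in/fan-out."""
    return (np.sum(w_out_col ** 2) / np.sum(w_in_row ** 2)) ** 0.25

def stochastic_balancing(W1, W2, W3, n_steps=20000, seed=0):
    """Iterate local L2 balancing over randomly chosen hidden neurons.

    Network: input -> W1 -> ReLU -> W2 -> ReLU -> W3 -> output.
    Balancing a first-layer neuron rescales a row of W1 and a column of W2;
    balancing a second-layer neuron rescales a row of W2 and a column of W3,
    so the two layers interact through W2.  Each rescaling leaves the
    network's input-output function unchanged, and no neuron needs any
    information about the other neurons.
    """
    rng = np.random.default_rng(seed)
    h1, h2 = W1.shape[0], W2.shape[0]
    for _ in range(n_steps):
        k = rng.integers(h1 + h2)                 # pick any hidden neuron
        if k < h1:                                # first hidden layer
            lam = l2_lambda(W1[k, :], W2[:, k])
            W1[k, :] *= lam
            W2[:, k] /= lam
        else:                                     # second hidden layer
            j = k - h1
            lam = l2_lambda(W2[j, :], W3[:, j])
            W2[j, :] *= lam
            W3[:, j] /= lam
    return W1, W2, W3

rng = np.random.default_rng(1)
W1, W2, W3 = rng.normal(size=(32, 10)), rng.normal(size=(16, 32)), rng.normal(size=(3, 16))
W1, W2, W3 = stochastic_balancing(W1, W2, W3)
gap1 = np.abs((W1 ** 2).sum(axis=1) - (W2 ** 2).sum(axis=0)).max()
gap2 = np.abs((W2 ** 2).sum(axis=1) - (W3 ** 2).sum(axis=0)).max()
print(gap1, gap2)   # both approach zero as n_steps grows
```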
This paper provides a thorough characterization of regularizers which lead to synaptic balance (when the "cost" of input weights to a neuron or pool of neurons is tied to the cost of output weights) in trained neural networks. Their results apply to many different activation functions and architectures.
Strengths
The paper is very well-written and easy to follow. I was able to read everything, including the math, smoothly. The mathematical arguments themselves are crisp and correct, which I really appreciated.
Weaknesses
The paper is strongly lacking in motivation. I never really understood why I should care about synaptic balance. Also, it is clear from the numerical experiments that synaptic balance only emerges in networks when it is enforced via a regularizer (except in the case of an infinitely small learning rate), but why is this surprising? It seems obvious that adding a regularizer for some property tends to result in that property. It would be shocking if synaptic balance occurred without some regularization towards the property. Thus, while the "what" and "how" of the paper are nicely addressed, I feel the paper is missing the "why". I believe if the authors could address this from the outset, it would make the paper much stronger, and I would of course be willing to increase my score.
Questions
- It is claimed throughout the paper that "network balance can be used to assess learning" progress. I do not really understand how. If my total loss is the sum of a task loss and a regularizer, $\mathcal{L} = \mathcal{L}_{\text{task}} + \mathcal{R}$, then there is nothing preventing a situation where I get $\mathcal{R} \to 0$ while $\mathcal{L}_{\text{task}}$ remains large, meaning that the task loss is decoupled from the network balance loss. If the authors could clarify this point, that would be great.
Small typos:
- Line 128: alpha is not rendered in LaTeX
- Figure 4 caption, subplot (D-F) "CFAR10" -> "CIFAR10"
Limitations
Yes.
We thank reviewer cT2m for the positive review of this work and insightful comments.
Synaptic balance does not necessarily emerge in networks trained with a regularizer (unless they are trained very carefully, e.g., with very small learning rates). Our work shows that one can obtain synaptic balance without a regularizer, simply by applying the balancing algorithms described in the paper during training or just at the end of training. However, reviewer cT2m is right that we could have provided a clearer motivation. In addition to the theoretical motivations, there are also practical motivations, as discussed in the overall rebuttal. In particular, balancing can be viewed as an alternative way of regularizing networks, in the same way that dropout is viewed as an alternative or complementary way of regularizing networks. This will be made clear in the revised version.
The surprising result in our work is that without any regularization, if each neuron tries to balance its input and output synapses independently (without any coordination with any other neurons), the network reaches a unique, stable, and globally balanced state. Thus, a unique global order emerges from local, independent balancing operations.
By the term “network balance can be used to assess learning” we mean that if a network trained by regularized SGD is in a balanced state and does not move from it, then the gradient must be zero and the learning must have converged. Conversely, if the state is not globally balanced, then learning has not fully converged.
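For instance, one could monitor a per-neuron balance gap during training as a rough convergence diagnostic. The following is a minimal sketch (our own illustration, assuming a plain fully connected MLP with biases left out of the cost, and under the paper's assumptions on activations and regularizers); the helper name `balance_gap` is hypothetical.

```python
import numpy as np

def balance_gap(weights, p=2):
    """Maximum per-neuron |Lp cost(fan-in) - Lp cost(fan-out)| over all hidden layers.

    `weights` is the list [W1, ..., WL] of an MLP's weight matrices, with
    W_l of shape (fan_out, fan_in).  A persistently large gap under
    Lp-regularized SGD indicates that training has not fully converged,
    since at a stationary point of the total (regularized) loss every
    hidden neuron must be balanced.
    """
    gaps = []
    for W_in, W_out in zip(weights[:-1], weights[1:]):
        c_in = (np.abs(W_in) ** p).sum(axis=1)    # incoming cost per hidden neuron
        c_out = (np.abs(W_out) ** p).sum(axis=0)  # outgoing cost per hidden neuron
        gaps.append(np.abs(c_in - c_out))
    return max(g.max() for g in gaps)
```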
All the minor points are fixed in the revised version.
I thank the authors for their reply. I have increased my score to a borderline accept.
I am still confused by this point:
"By the term “network balance can be used to assess learning” we mean that if a network trained by regularized SGD is in a balanced state and does not move from it, then the gradient must be zero and the learning must have converged. Conversely, if the state is not globally balanced, then learning has not fully converged."
Could the authors please be more precise as to which gradient they are talking about? The total gradient (i.e., including the regularizer), or the gradient of the "task" component of the overall loss function?
We thank this reviewer for appreciating our reply. By "regularized SGD" we refer to the "total gradient". In any case, we will revise the text to remove any confusion.
The authors provide a theoretical approach to the analysis of balanced neurons and networks. Their theoretical work includes proof of the convergence of stochastic balancing. In addition, they investigate the effect of different regularizers and learning rates on balance, training loss, and network weights, including practical simulations for two classification problems.
Strengths
The paper tries to reveal the inner structure of neural networks during the training phase. This is a very important but difficult problem; its solution could provide new insights for developing better training algorithms. The work proposed can ultimately be an important step toward more transparent networks as opposed to their current black box character.
Weaknesses
The paper has some weaknesses, most notably how the material is presented and part of the evaluation.
Theorem 5.1, dealing with the convergence of stochastic balancing, is arguably the central piece of the paper. However, its formulation is bulky and should be reduced to a shorter, more manageable size, potentially with the help of lemmata. This becomes apparent when seeing that its proof contains the proof of another proposition.
In Figure 4, the authors say that these panels are not meant for assessing the quality of learning. However, measuring not only the training loss but also the accuracy on a test set will give important insights. How does the classification performance relate to the degree of balancing? Why did the authors not include this analysis? It could give important insights into the relationships between overtraining, generalization capability, balance, and accuracy.
The authors should discuss the consequences of their work for network training. They do not discuss the immediate practical consequences or any recommendations they can make based on their results.
Questions
It would help the paper's clarity if the authors answered their own questions in a brief summary at the end of the paper, as concise as possible:
Why does balance occur? Does it occur only with ReLU neurons? Does it occur only with L2 regularizers? Does it occur only in fully connected feedforward architectures? Does it occur only at the end of training? And what happens if we balance neurons at random in a large network?
Limitations
The authors could be more specific about the consequences of their work, including limitations. For example, can they recommend any specific learning rate, network structure, or other features for optimal training?
We thank reviewer TDzF for their positive review of this work and insightful comments.
Regarding Theorem 5.1, the reviewer raises a fair point. In the revised version, we will shorten Theorem 5.1 and move Proposition 5.4 and its proof outside of the proof of Theorem 5.1.
Regarding Figure 4, for a fixed set of weights, synaptic balancing does not change the input-output function of the network, as shown by the theory. Thus, for a fixed set of weights, we do not expect to see any change in performance after applying the balancing algorithm. The new figure attached to our rebuttal does what this reviewer is asking for, which is to show the regularizing effect of balancing throughout learning.
As explained in our general response, we will add text on the application of synaptic balancing to regularization and cite additional work.
We will add a brief summary at the end of the revised version to improve the paper's clarity.
Thanks for adding the figure, which should improve the quality of the presentation. I have upgraded this part of my grading. Compared to the other reviewers, I am more easily convinced that this work could lead to a better understanding of how neural networks operate. However, I agree with the other reviewers that this needs to be better motivated. The feeling is that something important is missing. If we only knew what.
We thank the reviewers for appreciating our work and for their insightful comments. We have provided a separate response to each reviewer. The primary goal of our paper is to present the theory of synaptic balancing in neural architectures and the main theorem (Theorem 5.1) connects synaptic balancing to convex optimization. The simulations included are meant to corroborate the theory.
Overall, the main criticism is that we should have included additional information regarding the regularization value of synaptic balancing in the motivation section or in the conclusion. This is a fair point. The reason we did not make this point as clear as we should have is that we focused primarily on the main result (Theorem 5.1) establishing the properties of the balancing algorithm. Although we discuss the applications of the balancing algorithm, we should have given more space to this. While Theorem 5.1 remains the cornerstone of synaptic balancing, in the revised version we will make space for additional text and a new figure describing the regularization applications of synaptic balancing. We will free up space primarily by shortening the proof of Theorem 5.1 [as described below], since the complete proof is available anyway in the supplementary material. The new figure is attached to this rebuttal.
We will add a few sentences on regularization in the motivation section, and in the new conclusion we will make very clear that:
- synaptic balancing is a novel approach to regularization;
- synaptic balancing is very general in the sense that it can be applied with all usual cost functions, including all L_p cost functions;
- synaptic balancing can be carried out fully or partially (due to the convexity property in Theorem 5.1);
- full or partial synaptic balancing can be applied effectively at any time during the learning process: at the start of learning, at the end of learning, or during learning, by alternating balancing steps with stochastic gradient steps (a sketch of such an interleaved schedule is given after this list);
- simulations show that these approaches can improve learning in terms of speed (fewer epochs), accuracy, or generalization ability (see examples in the new figure). Thus, in short, balancing is a novel, effective approach to regularization that can be added, alongside dropout and other tools, to the set of methods available for regularizing networks.
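As an illustration of the interleaved schedule mentioned above, here is a minimal PyTorch sketch (our own, not the paper's code) that alternates ordinary SGD steps with a full L2 balancing pass over a single hidden ReLU layer every K steps; the model, the toy data, and the interval K are placeholders.

```python
import torch
import torch.nn as nn

# Toy model: one hidden ReLU layer.  Balancing rescales each hidden unit's
# fan-in (weights and bias) by lambda and its fan-out by 1/lambda, which
# leaves the network's input-output function unchanged.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

def balance_hidden_layer(fc_in: nn.Linear, fc_out: nn.Linear) -> None:
    """Full L2 balancing of every hidden unit between two Linear layers."""
    with torch.no_grad():
        c_in = fc_in.weight.pow(2).sum(dim=1) + fc_in.bias.pow(2)  # bias counted with fan-in
        c_out = fc_out.weight.pow(2).sum(dim=0)
        lam = (c_out / c_in).pow(0.25)
        fc_in.weight.mul_(lam.unsqueeze(1))
        fc_in.bias.mul_(lam)
        fc_out.weight.div_(lam.unsqueeze(0))

K = 100  # balance every K SGD steps
for step in range(1000):
    x = torch.randn(32, 20)                   # placeholder batch; use real data here
    y = torch.randint(0, 2, (32,))
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()
    if (step + 1) % K == 0:
        balance_hidden_layer(model[0], model[2])
```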
We hope the reviewers will agree that this addresses their main concern and that synaptic balance is a novel theoretical and practical topic worthy of being presented at the NeurIPS conference.
This paper studies neural synaptic balance, which occurs when the total cost of a neuron's input weights is equal to the total cost of its output weights. The paper presents a clear and mathematically solid study of when this phenomenon arises. However, the motivation for studying neural synaptic balance is not convincingly articulated.