Understanding the Initial Condensation of Convolutional Neural Networks
We take a step towards a better understanding of the non-linear training behavior exhibited by neural networks with specialized structures.
Abstract
Reviews and Discussion
This paper analyzes the phenomenon called initial condensation in simple CNN models. Initial condensation refers to the occurrence of weight grouping in the early stages of neural network (NN) training, which has been discussed in previous literature mainly with respect to fully connected NNs. A two-layer CNN is theoretically analyzed and it is shown that initial condensation occurs in one CNN layer with respect to convolutional kernels.
Strengths
- Initial condensation in CNN models is analyzed theoretically, although the assumed model is restricted to a simple two-layer one.
Weaknesses
- It is difficult to follow the technical details of the paper for the following reasons:
- is not well defined for and , although it is required to define .
- I could not understand the meaning of . What is ?
- The description of the paper does not follow the notations defined in Section 3.1. For example, the matrix notation introduced in Section 3.1 is not used in the main technical discussion in Section 4. Also, the operator norm is defined in Section 3.1, but does not appear in the manuscript.
Questions
- Why are only 500 samples used in the CIFAR10 experiment?
- What is the test accuracy of the final model in Figure 3? I wonder if the final model overfits the training set.
- It is better to use \citet{} for textual citations.
- It is better to avoid using multiple in one line.
- On page 4: .. -> .
It is difficult to follow the technical details of the paper for the following reasons:
- is not well defined for and , although it is required to define .
- I could not understand the meaning of . What is ?
- The description of the paper does not follow the notations defined in Section 3.1. For example, the matrix notation introduced in Section 3.1 is not used in the main technical discussion in Section 4. Also, the operator norm is defined in Section 3.1, but does not appear in the manuscript.
We disagree that it is necessary to define for and . This notation denotes the weights of the convolution kernels, where represent the width and height of a convolution kernel, and these values do not reach in real convolutional networks.
We apologize for any confusion regarding the term . Our theoretical analysis is based on gradient flow, the continuous-time limit of gradient descent, and here $t$ denotes time in this continuous sense. The gradient flow of a parameter $\theta$ (in particular of $\boldsymbol{W}_{p,q,\beta}$) is formulated as $\frac{d\theta}{dt} = \lim\limits_{\eta\rightarrow 0}\frac{\theta_{k+1}-\theta_{k}}{\eta}$, and the continuous time $t$ corresponds to $\eta \times \text{epochs}$ in discrete training.
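For intuition, a minimal sketch with a hypothetical one-dimensional quadratic loss (not the paper's objective): gradient descent with a small learning rate $\eta$ traces the gradient-flow trajectory, and after $k$ steps the continuous time reached is approximately $t = \eta k$, i.e. $\eta \times \text{epochs}$ for full-batch training.

```python
import numpy as np

# Toy loss L(theta) = 0.5 * theta**2, so the gradient flow is d(theta)/dt = -theta
# with exact solution theta(t) = theta(0) * exp(-t).
theta0, eta, steps = 1.0, 1e-3, 2000

theta = theta0
for _ in range(steps):
    theta -= eta * theta               # gradient-descent step: theta <- theta - eta * L'(theta)

t = eta * steps                        # continuous time reached by the discrete iteration
print(theta, theta0 * np.exp(-t))      # the two values nearly coincide for small eta
```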
The notation such as and is indeed critical to the theoretical derivations in the appendix. While it may not appear prominently in the main technical discussion in Section 4, its inclusion is essential for maintaining the coherence and completeness of our paper. We have reviewed our manuscript to ensure that the notations introduced in Section 3.1 are consistently applied and clearly articulated throughout the paper.
Why are only 500 samples used in the CIFAR10 experiment?
There are two key reasons. First, our theory demonstrates that the condensation phenomenon at the initial stage is independent of the training set size. Second, using 500 samples greatly reduces the computational cost of performing a large set of experiments.
What is the test accuracy of the final model in Figure 3? I wonder if the final model overfits the training set.
Since only 500 samples from CIFAR10 were employed to train the model, the test error cannot give any useful information. This experiment is intended to show the condensation at the final learning stage rather than the model's generalization.
For your reference, we have found that many pretrained large neural networks also exhibit clear condensation, which is easy to verify.
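For completeness, a minimal sketch of how such a check could be done (the weight tensor below is random and merely stands in for the kernels of a real convolution layer, e.g. one extracted from a pretrained model; the 0.95 threshold is an arbitrary choice for illustration): flatten each kernel and inspect the pairwise cosine similarities, since condensation shows up as many kernel pairs with cosine similarity close to ±1.

```python
import numpy as np

def kernel_cosine_matrix(weight):
    """weight: array of shape (out_channels, in_channels, k, k);
    returns pairwise cosine similarities of the flattened kernels."""
    flat = weight.reshape(weight.shape[0], -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T

# Stand-in weights; in practice substitute the kernels of a trained conv layer.
rng = np.random.default_rng(0)
W = rng.standard_normal((32, 3, 3, 3))

cos = kernel_cosine_matrix(W)
off_diag = cos[~np.eye(len(cos), dtype=bool)]
# For a condensed layer, a large fraction of these values is close to 1;
# the random stand-in here gives a fraction near zero.
print("fraction of kernel pairs with |cos| > 0.95:", np.mean(np.abs(off_diag) > 0.95))
```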
It is better to use \citet{} for textual citations.
Thanks for your advice; we have switched the textual citations to \citet{}.
It is better to avoid using multiple in one line.
Thanks for your advice; we have modified the mathematical expression on page 6.
Thank you for your responses.
I still do not understand why the definition of is not necessary for and , even though they are summed over and .
I am still confused about matrix notations. For example, in Section 3.1 it is defined to use to denote the -th row of a matrix . In Assumption 3, however, both and appear. What do these expressions mean?
If notations in Section 3.1 are only used in the appendix, it is better to define them in the appendix.
we have modified the mathematical expression on page 6.
I have checked the current version of your manuscript and find no update on page 6.
I would like to keep my score.
We are sorry that we forgot to upload the revised manuscript. We have now uploaded the newest version, which includes the revisions described in the current reply.
I still do not understand why the definition of is not necessary for and , even though they are summed over and .
As we define at the end of page 3,
the term in the summation outside and is zero for both and ; the summation over and is written this way only for notational convenience. Thus the definition of is not necessary for and .
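To make the convention concrete, here is a toy 1D illustration with hypothetical values (the indices and kernel size are not the paper's): terms with offsets outside the kernel support (the indicator $\chi(p,q)$ in the convolution formula) or outside the image contribute zero, so extending the summation bounds is purely a notational convenience.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # a 1D "image" (hypothetical values)
w = {-1: 0.5, 0: -1.0, 1: 0.25}          # kernel weights on their support {-1, 0, 1}

def conv_at(u, half_width):
    """Sum over offsets p in [-half_width, half_width]; terms outside the kernel
    support (indicator chi(p) = 0) or outside the image (zero padding) vanish."""
    total = 0.0
    for p in range(-half_width, half_width + 1):
        chi = 1.0 if p in w else 0.0                       # kernel support indicator
        x_val = x[u + p] if 0 <= u + p < len(x) else 0.0   # zero outside the image
        total += x_val * w.get(p, 0.0) * chi
    return total

# Widening the summation range only adds zero terms, so the value is unchanged.
print(conv_at(1, 1), conv_at(1, 100))    # both equal 0.5*1 - 1.0*2 + 0.25*3 = -0.75
```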
I am still confused about matrix notations. For example, in Section 3.1 it is defined to use to denote the -th row of a matrix . In Assumption 3, however, both and appear. What do these expressions mean?
If notations in Section 3.1 are only used in the appendix, it is better to define them in the appendix.
Thanks for your advice. We have modified Section 3.1 and added some notation for high-dimensional tensors, for example a four-dimensional tensor, which may help with understanding.
The former is for multiple input channels and the latter is for one input channel. In our revised manuscript, we clarify at the end of page 4 that "We consider the dataset with one input channel, i.e. , and omit the third index in in the following discussion. The multi-channel analysis is similar and is shown in Appendix D." We have also modified the notation on page 5. Since the case of multiple input channels is analyzed only in the appendix, we have moved this notation (along with the notation used only in the appendix) to the appendix and keep only the single-channel notation in the main text.
This manuscript presents a study of condensation in convolutional neural networks. The main theorem states that under certain assumptions on the data, activation function and initial weights, the following two things hold:
- the final weights go arbitrarily far away from the starting point
- the final weights all point in the direction of the principal eigenvector of some data-dependent matrix.
The experiments confirm the theoretical results, even in settings where assumptions are broken.
Strengths
- (significance) Insights in the learning dynamics of neural networks typically help to guide model development and speed up learning.
- (originality) The condensation problem has been studied for fully-connected networks, but this appears to be the first work on convolutional networks.
Weaknesses
- (clarity) The paper is quite chaotic and therefore hard to read. In particular, the frequent notation changes and notation that is used in only one place make the paper difficult to follow. E.g. something like would be much clearer than the current formula above equation (13).
- (clarity) The variables and come out of nowhere and no intuition or explanation is provided about what these variables represent.
- (clarity) The experiment section mentions that every CNN has an additional 2 fully-connected layers to produce outputs. As a result I do not understand how it is possible to do experiments with the theoretical setting which only considers convolutional networks.
- (significance) I am unable to distill from the manuscript whether condensation is a good thing or a bad thing. Figure 1 seems to suggest condensation enables learning smaller networks. However, the main theorem implies that only two possible directions of weights survive, which intuitively feels like a bad thing that would hinder expressivity. As a result, I also do not quite understand why having experiments where the assumptions are violated is supposed to be a selling point (unless it is a good thing). On the other hand, if it were a good thing, I am concerned about the captions of Figures 2 and 3, where it is stated that the network attains less than 20% accuracy.
- (originality) It is not clear which parts of the analysis are taken from prior work and what new insights are necessary to make this work for convolutions. The use of the eigenvectors seems to be one of the most obvious differences but it would be good to highlight where exactly the differences are.
- (quality) I am unable to properly assess the derivations and proofs because I understand too little of what is going on. However, I briefly skimmed the proof of the main theorem and noticed a transition where the norm of sums becomes the sum of norms without any comments. Also, some non-obvious statements seem to be planted without proper explanation.
- (quality) The theoretical results seem to build on an analysis of the dynamics of gradient descent. However, the experiments make use of adaptive optimisers like Adam, which should lead to significantly different dynamics than plain GD.
Minor Comments
- The hyperlinks in the paper seem to be broken. Reading the paper required more scrolling than I'm used to.
- Assumption 4 seems to be more of a definition than an assumption.
- I don't quite understand what the infinity norms are supposed to do in equation (6).
Questions
- Please, reduce the noise in the mathematical notation for the sake of readability. Note that there is more noise than the one example I provided (e.g. and , , and , ...)!
- Why is ?
- Where is the activation function in the time derivatives of the parameters? E.g. I would have expected the following for the derivative of the parameters in the last layer:
- Why is the supremum necessary in Theorem 1? Is there a chance that condensation stops again before ?
- Do the assumptions correspond to the condensation regime from (Luo et al., 2021)? If not, what do the assumptions stand for?
- In which exact steps does this analysis differ from the analysis for fully-connected networks? It seems like the analyses have a lot in common.
- Is condensation a good or a bad thing?
- Why are the experiments conducted with networks that attain less than 20% accuracy?
- Does the theoretical setting in the experiment section also have 2 fully-connected layers at the end (as claimed in the experimental setup section)?
- Can the theory be directly applied to adaptive gradient methods, as used in the experiments?
- Does condensation also occur when the last layer is initialised with zeros?
- If the results apply when assumptions are violated, shouldn't it be possible to loosen the assumptions?
Where is the activation function in the time derivatives of the parameters? E.g. I would have expected the following for the derivative of the parameters in the last layer:
We agree that the term $\sigma(x_{u,v,\beta}^{[1]}(i))$ appears in the exact dynamics. In our settings, $x_{u,v,\beta}^{[1]}(i) = \sum\limits_{\alpha=1}^{C_{0}} \sum\limits_{p=-\infty}^{+\infty} \sum\limits_{q=-\infty}^{+\infty} x_{u+p, v+q, \alpha}^{[0]}(i) \cdot W_{p,q,\alpha,\beta}^{[1]} \cdot \chi(p, q) = \varepsilon \bar{x}_{u,v,\beta}^{[1]}(i)$ and $a = \varepsilon\bar{a}$. Thus $\sigma(x_{u,v,\beta}^{[1]}(i)) = \sigma(\varepsilon \bar{x}_{u,v,\beta}^{[1]}(i)) = \sigma(0) + \varepsilon \bar{x}_{u,v,\beta}^{[1]}(i)\sigma'(0) + O(\varepsilon^{2})$ by Taylor expansion. As a result, $\frac{d \bar{a}_{u,v,\beta}}{d t} \approx \frac{1}{n} \sum\limits_{i=1}^n y_i \bar{x}^{[1]}_{u,v,\beta}(i)$, under the assumptions $\sigma(0)=0$ and $\sigma'(0)=1$ (Tanh satisfies this assumption).
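A quick numerical check of this linearization (a sketch with arbitrary sample values, not the paper's data): for Tanh the residual $\sigma(\varepsilon\bar{x}) - \varepsilon\bar{x}\sigma'(0)$ shrinks like $\varepsilon^{3}$ (and like $\varepsilon^{2}$ for a generic smooth activation with $\sigma(0)=0$), so the leading-order dynamics above become accurate at small initialization scales.

```python
import numpy as np

x_bar = np.array([0.7, -1.3, 2.1])              # arbitrary O(1) pre-activation values
for eps in [1e-1, 1e-2, 1e-3]:
    exact = np.tanh(eps * x_bar)                # sigma(eps * x_bar) with sigma = Tanh
    linear = eps * x_bar                        # sigma(0) + eps * x_bar * sigma'(0)
    print(eps, np.max(np.abs(exact - linear)))  # residual shrinks like eps**3 for Tanh
```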
Why is the supremum necessary in Theorem 1? Is there a chance that condensation stops again before ?
The supremum is necessary since the limiting procedure does not hold for an arbitrary time . For instance, if we take , the first limit will be , but the second limit will not be , since the orientations at initialization do not condense in two opposite directions. is the time during which our initial model is effective, which means the model will condense at some point between and . We remark that condensation never stops before .
Do the assumptions correspond to the condensation regime from (Luo et al., 2021)? If not, what do the assumptions stand for?
Assumption 3 corresponds to the condensation regime described by Luo et al. (2021), while Assumption 1 aligns with the assumption proposed by Zhou et al. (2022). Assumption 2 requires the data to be bounded, which is reasonable in real-world applications, and Assumption 4 indicates the presence of a pair of opposing directions.
Does condensation also occur when the last layer is initialized with zeros?
Our analysis suggests that condensation can indeed occur under these conditions, as the initialization with zeros still satisfies all the assumptions outlined in our manuscript.
However, while this zero initialization meets the theoretical conditions for condensation, it is not a common practice in real-world neural network training. The primary reason is the homogeneity it introduces into the training dynamics of the different parameters within the last layer, which can hinder the network's ability to diversify its learning paths and explore the parameter space effectively.
If the results apply when assumptions are violated, shouldn't it be possible to loosen the assumptions?
Yes. We are still working on loosening the assumptions, but there are many technical details that need to be addressed.
I would like to let the authors know that I will take into account their rebuttal for my final decision.
Upon a quick read-through, the following questions still remain:
- I see how However, with the inequalities and , which only hold with high probability(!), I would have expected which is not upper bounded by . Furthermore, it should be stated explicitly that the inequality only holds with high probability.
- The Taylor expansion is only a good approximation when is close to zero. Is this captured by one of the assumptions or does this require an additional assumption?
(quality) I am unable to properly assess the ... norms without any comments. Also, some non-obvious statements seem to be planted without proper explanation.
We understand your concerns regarding the complexity of the derivations and proofs presented in our manuscript, particularly in the main theorem. It's crucial for our work to be accessible and comprehensible to our readers, including the detailed mathematical aspects.
Regarding your specific observation about the transition from the norm of sums to the sum of norms, we realize that this step in our proof may not have been adequately explained, leading to potential misunderstandings. In the revised manuscript, we have clarified this transition more explicitly and provided the necessary mathematical justification.
Additionally, we have thoroughly reviewed these sections and augmented them with more comprehensive explanations and justifications to ensure that our mathematical reasoning is transparent and well grounded.
The theoretical results seems to build on an analysis of the dynamics of gradient descent. However, the experiments make use of adaptive optimisers like Adam, which should lead to significantly different dynamics than plain GD.
Question 10: Can the theory be directly applied to adaptive gradient methods, as used in the experiments?
Our theoretical analysis primarily focuses on gradient flow, which represents the continuous form of gradient descent. This approach allows for a rigorous and detailed understanding of the dynamics inherent in gradient descent algorithms.
First of all, in Figure 2 the experimental parameters strictly follow the theoretical settings; we did not state this clearly in the article at first, but we have now clarified it. Besides, the objective of our experimental design, as detailed in Section 6, was to explore the boundaries of our theoretical predictions. We aimed to investigate whether the condensation phenomenon, which our theory predicts for simple CNNs under gradient flow, would still be observable under various activation functions and optimization methods, including adaptive ones. The results indeed confirm the presence of this phenomenon across these different settings, thus providing empirical support for the robustness and relevance of our theoretical findings, even in scenarios that extend beyond the specific conditions of our theoretical model.
Therefore, while our theory is grounded in the context of gradient flow, the experimental outcomes suggest that the underlying principles may have broader applicability, including in settings involving adaptive gradient methods. We acknowledge, however, that a more detailed theoretical analysis tailored to these methods would be a valuable direction for future research.
Please, reduce the noise in the mathematical notation for the sake of readability. Note that there is more noise than the one example I provided (e.g. $\theta_{\beta}$, $UV$, ...)!
Thanks for your suggestion. We have revised the mathematical notation and made the manuscript more readable.
Why is
We wonder whether this refers to the inequality on page 5. We have that by the triangle inequality and the Cauchy–Schwarz inequality. In our settings, since , by the concentration of Gaussian variables we obtain that, with high probability (i.e., the probability of the event is at least for some constant ), the following holds: and . In the manuscript, we have already rescaled the parameters to order and taken the of both elements outside the inner product, so that appears outside the . Also, compared with large , the magnitudes of and are both of order . Thus the inequality on page 5 holds.
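As an illustration of the concentration step (a simulation sketch with hypothetical dimensions; the paper's actual parameter sizes differ): for a Gaussian vector whose entries have variance $1/d$, the norm concentrates tightly around 1, so bounds of the kind used on page 5 hold for all but an exponentially small fraction of initializations.

```python
import numpy as np

rng = np.random.default_rng(0)
d, trials = 1024, 10000
# Gaussian vectors with entry variance 1/d, so E||g||^2 = 1.
g = rng.standard_normal((trials, d)) / np.sqrt(d)
norms = np.linalg.norm(g, axis=1)
# With high probability the norm stays within a constant factor of its mean.
print("mean norm:", norms.mean())
print("fraction with norm > 1.1:", np.mean(norms > 1.1))   # exponentially small in d
```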
(clarity) The paper is quite chaotic and therefore hard to read. ... the theoretical setting which only considers convolutional networks.
First of all, we agree that a consistent and clear notation, such as $\theta_{W,v_1} := \operatorname{vec}(\theta_{W,\beta}\cdot v_1)$, would enhance the readability, particularly around equation (13). We are committed to revising the manuscript to ensure that the notation is more coherent and understandable throughout the text.
Besides, regarding the abrupt introduction of the variables and , we recognize the need to provide better context and explanation for these terms. In the revised manuscript, we have included a detailed description of as the time period relevant to relation (13) and crucial to the applicability of our asymptotic model. Similarly, , which appears in Appendix C.5, has been explained more thoroughly to enhance the reader's understanding.
Finally, the error in Section 5.1 about the CNN structure was unintentional. To clarify, only the experiment depicted in Figure 3 involves two fully connected layers. In contrast, the experiments in Figure 2 use a CNN with convolutional layers followed directly by an output layer with neurons, aligning with our theoretical framework. Additionally, the network used in Figure 4 differs from the theoretical setting only in its optimizer, which we have clarified in the revised manuscript.
(significance) I am unable to distill from the manuscript whether condensation is a good thing or a bad thing. ... On the other hand, if it were a good thing, I am concerned about the caption of Figure 2 and 3 where it is stated that the network attains less than 20% accuracy.
Question 7: Is condensation a good or a bad thing?
Question 8: Why are the experiments conducted with networks that attain less than 20% accuracy?
As outlined in our "Introduction" section, condensation plays a crucial role in reducing the initial complexity of over-parameterized neural networks. This reduction in complexity is significant from the standpoint of complexity theory (referenced in Bartlett and Mendelson, 2002), as it contributes to our understanding of why such networks often exhibit good generalization performance despite being over-parameterized.
The reviewer wonders how condensation in only two directions can be a good thing. Here is an explanation. This paper focuses on the initial stage. In this stage, condensation in two directions resets the neural network to a simple state whose expressivity is small. Then, driven by the non-zero loss, we find in the experiments that the network condenses in more and more directions in order to increase its expressivity. Such condensation with a growing number of directions allows the network to fit the target data points with as low a complexity as possible. Therefore, the initial condensation is an extremely important feature that enables neural networks to fit data well in the over-parameterized regime.
Since we only study the initial stage, as indicated by our paper title, it is no surprise that the network attains less than 20% accuracy.
Our experiments show that the conclusion from the theory also applies to more general settings. This is a good thing because it indicates that a similar theory can be developed for more general cases.
(originality) It is not clear which parts of the analysis are taken from prior work and what new insights are necessary to make this work for convolutions. The use of the eigenvectors seems to be one of the most obvious differences but it would be good to highlight where exactly the differences are.
Question 6: In which exact steps does this analysis differ from the analysis for fully-connected networks? It seems like the analyses have a lot in common.
The difference between FCNs and CNNs led us to add a commentary at the end of Section 4: 'Through our careful analysis, we discovered that the primary difference between condensation in fully-connected neural networks and CNNs at the initial stage is that in fully-connected neural networks, condensation occurs among different neurons within a given layer (Zhou et al., 2022), whereas in CNNs, condensation arises across different convolution kernels within each convolution layer. This difference in condensation is mainly caused by the structure of the local receptive field and the weight-sharing mechanism in CNNs.'
The exact step that differs is that the weight-sharing mechanism in CNNs changes the structure of the matrix in the linearized dynamics, enabling more possible directions for the clustering effect of the convolution kernels. This multiplicity of directions may affect the subsequent training process of CNNs. Thus, it is essential to develop a condensation theory tailored to CNNs.
We agree with the reviewer that, in the context described by the reviewer, it is upper bounded by . However, in our context, where and are constants independent of , by ignoring constants and letting , we immediately obtain that
The manuscript has been revised accordingly. Since the scale of is of order for some (based on our assumption on the initial scale ), the magnitude of can be ignored compared with . The above relation is guaranteed by a basic calculus exercise: for any , however small, the following holds , and hence the estimate in our context reads . Therefore, we have .
Regarding the other question about the Taylor expansion, we have on page 4 that , where . Thus, in our settings, and with Assumption 2 that is bounded, is close to zero.
This paper studies the initial condensation phenomenon of training CNNs, supported by both experimental results and theoretical analysis.
Strengths
Understanding the training dynamics of gradient-based method is a crucial theoretical issue. While previous research has primarily concentrated on fully connected networks (FCNs), this paper represents a significant advancement in comprehending the training dynamics of CNNs. It investigates the initial condensation dynamics in CNNs, supported by comprehensive experimental evidence. Furthermore, the authors provide a precise mathematical characterization and time estimation for this initial condensation phenomenon in CNN training.
Weaknesses
Assumption 4, while somewhat strict, effectively explains the initial condensation phenomenon. As discussed in my first question below, I think that by relaxing this assumption we can gain a more profound insight into the condensation phenomenon.
Questions
- If Assumption 4 becomes , how will the initial condensation phenomenon change? My guess is that there will be two condensation directions, corresponding to the first two eigendirections. Is that right?
- In the initial stage of training, if we decompose the dynamics in polar coordinates, the radial velocity is substantially smaller than the tangential velocity. Does the time estimation provided in Theorem 1 relate to this property of the training dynamics?
If Assumption 4 becomes , how will the initial condensation phenomenon change? My guess is that there will be two condensation directions, corresponding to the first two eigendirections. Is that right?
Yes. However, this will make the situation extremely complicated. Since solving the linear dynamics involves diagonalizing the matrix, it could happen that the matrix decomposes only into a Jordan form. But such a case of is very rare.
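To illustrate the generic case with a strict spectral gap, here is a minimal sketch in which a random symmetric matrix stands in for the data-dependent matrix (an assumption for the sketch only): solving the linearized dynamics $\frac{dw}{dt} = Aw$ in closed form shows the normalized solution condensing onto $\pm$ the principal eigendirection, which is exactly what fails to single out one direction when the top eigenvalue is repeated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
M = rng.standard_normal((d, d))
A = (M + M.T) / 2                      # symmetric stand-in for the data-dependent matrix

evals, evecs = np.linalg.eigh(A)
v_top = evecs[:, -1]                   # principal eigenvector (largest eigenvalue)

w0 = rng.standard_normal(d) * 1e-3     # small random initialization
for t in [1.0, 5.0, 20.0]:
    # Closed-form solution of dw/dt = A w:  w(t) = V exp(Lambda t) V^T w(0)
    w_t = evecs @ (np.exp(evals * t) * (evecs.T @ w0))
    cos = np.dot(w_t, v_top) / np.linalg.norm(w_t)
    print(t, abs(cos))                 # |cos| -> 1: the direction condenses onto +/- v_top
```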
In the initial stage of training, if we decompose the dynamics in polar coordinates, the radial velocity is substantially smaller than the tangential velocity. Does the time estimation provided in Theorem 1 relate to this property of the training dynamics?
The radial velocity is indeed much smaller than the tangential velocity, but the time estimated in Theorem 1 is not related to this property. Our linear approximation holds throughout the time estimated in Theorem 1, which is the time required for to grow from to , i.e., ; thus we have , and then .
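Schematically, the time estimate follows the usual scaling argument (a hedged sketch of our reading rather than the paper's exact notation: $\lambda$ denotes the growth rate of the leading mode of the linearized dynamics and $\varepsilon$ the initialization scale):

```latex
\[
\|W(t)\| \approx \varepsilon\, e^{\lambda t}
\quad\Longrightarrow\quad
\varepsilon\, e^{\lambda T} \sim 1
\quad\Longrightarrow\quad
T \sim \frac{1}{\lambda}\,\log\frac{1}{\varepsilon},
\]
```

so the estimated time is governed by the growth rate of the linearized dynamics and the initialization scale, not by the radial/tangential decomposition.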
This paper looks at condensation in convolutional neural networks from both a theoretical and empirical perspective. Reviewers found the question important, but they struggled to follow the paper due to challenges with the writing, and they were confused about the significance of the eventual results. I defer to the reviewers and suggest rejecting the paper.
Why not a higher score
The reviewers struggled to follow the paper. Even the authors acknowledged that the reviewers were asking "basic questions." I think that was due to confusing writing rather than a lack of knowledge on the part of the reviewers (although this is a pretty niche topic that I don't really understand either).
Why not a lower score
N/A
Reject