Towards better generalization: Weight Decay induces low-rank bias for neural networks
Both theoretically and empirically, we show that weight decay leads to a low-rank bias in neural networks and to better generalization performance.
Abstract
Reviews and Discussion
The authors prove an implicit bias toward low-rank (one or two) matrices when training two-layer ReLU neural networks with stochastic gradient descent and weight decay. By exploiting this implicit bias towards the set of rank-two matrices, the authors were able to derive tighter generalization bounds.
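For concreteness, the setting I have in mind (written in my own notation, which may differ from the paper's) is a two-layer ReLU network trained by mini-batch SGD with weight decay:

$$
f_{W,a}(x) = a^\top \sigma(Wx), \qquad W_{t+1} = (1-\eta\lambda)\,W_t - \eta\,\nabla_W \widehat{L}_{B_t}(W_t, a_t),
$$

where $\sigma$ is the ReLU, $\widehat{L}_{B_t}$ is the empirical loss on mini-batch $B_t$, $\eta$ is the learning rate and $\lambda$ the weight-decay coefficient; the claim is that after sufficient training $W_t$ is close in Frobenius norm to a matrix of rank at most two.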
Strengths
The paper is overall well written. The claims, proofs and motivation of the work are clearly presented. Moreover, the problem of implicit biases and their implications in generalization theory is a very significant topic.
Weaknesses
The proposed work doesn't study any dynamical conditions; thus, no kind of convergence to the claimed points was presented. I would say the results are more along the lines of studying fixed points of the SGD dynamics. While I can intuitively believe Assumption 2.5, the authors don't present any theoretical guarantee or even empirical evidence suggesting that the dynamics in practice drive the parameters into regions of parameter space for which that assumption is satisfied. Given that the second part of the paper relies on the first one as motivation, I also believe that the soundness of the results in the second part decreases (please see the points above). Also, the experimental setting is too restricted.
Questions
I would personally appreciate some clarification on the weaknesses. Moreover, I would appreciate discussing with the authors the following, hoping there was no misunderstanding:
- It is not clear to me what the interest is of studying the case in which the regularization parameter is not constant. I understand that for a constant parameter this reduces to the typical case, but what is the gain in considering this more complicated case in the analysis? In the proof this leads to rank-2, but I cannot think of an example in which this kind of choice for the regularization parameter is made in applications.
- The analysis doesn't rule out the possibility of being in for some , so to me it is essentially just the analysis of a deep linear model with layers, which has already been done multiple times in the literature.
TLDR: this paper contains zero contribution -- all of the results presented are either trivial or wrong.
Strengths
None
Weaknesses
This paper is beyond any redemption! I will just point out a few of the issues that I am most frustrated with.
The introduction is quite a wild ride. I feel that someone without intimate knowledge of the area will immediately be confused. Notably, the term "implicit regularization/bias" is introduced without a proper explanation. And the problem as defined in eqs. (1-3) is lacking context (I think a few sentences would be enough, but jumping directly into the equations certainly looks very awkward).
Dubious choice of references:
- Some well-known works in this space were not mentioned. For example, [1, 2] are two of the earliest and most-cited works on the analysis of GD's implicit bias, but they are not referenced in this paper.
- On line 50, NTK is definitely not the only reason why the optimization error goes to zero as model size grows. There are plenty of other works covering this topic, e.g. [3, 4]. The authors should consider citing a more diverse set of works.
- On line 58, the authors claim that "these implicit biases are not theoretically understood yet", while the references on the previous line all provide theoretical contributions to this question.
- Table 1 needs to be better explained. I struggle to understand what exactly it means.
The proof of Theorem 2.4 is broken. First, line 665 does not work when the function is constant, which corresponds to the standard practice of a fixed, data-independent weight decay factor. Secondly, Assumption 2.3 contradicts the claim that the results do not rely on the convergence of the weights.
The entirety of Section 3 contains no new results at all. All of the stated results are either direct citations or trivial applications of existing results. The proofs for this section do not even span half a page, yet this is advertised as a major technical section.
The experiments in Section 4 are very lacking. Results on MNIST have very little implication for generalization, since plenty of "small" models work well on it. This is especially damning since there is very little theoretical contribution that needs numerical verification.
In summary, this paper contains no meaningful contribution, and the authors fail to demonstrate a deep understanding of this topic. I recommend rejection without a shred of doubt.
[1] Soudry D, Hoffer E, Nacson MS, Gunasekar S, Srebro N. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research. 2018.
[2] Ji Z, Telgarsky M. The implicit bias of gradient descent on nonseparable data. In Conference on learning theory. 2019.
[3] Belkin M, Hsu D, Ma S, Mandal S. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences. 2019.
[4] Du S, Lee J, Li H, Wang L, Zhai X. Gradient descent finds global minima of deep neural networks. In International conference on machine learning. 2019.
Questions
None.
The paper investigates how weight decay (WD) induces a low-rank bias in neural networks, leading to better generalization. The authors prove that, under sufficient training with weight decay and stochastic gradient descent (SGD), a two-layer ReLU neural network’s weight matrix approaches a rank-two structure. Empirical evidence supports this claim across regression and classification tasks, showing that weight decay encourages low-rank matrices even without conventional assumptions about data distribution or specific network architectures.
This low-rank bias contributes to reduced generalization error. The authors theoretically derive improved error bounds by leveraging the low-rank property and confirm these bounds with experiments on the California housing and MNIST datasets. Specifically, larger weight decay values reduce the rank of the weight matrix, decreasing generalization error. This study offers insights into the regularizing effect of SGD with weight decay, suggesting that low-rank bias is an implicit mechanism for improved generalization in neural networks. The paper's findings could extend to other architectures and optimization methods in deep learning.
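As an aside, here is a minimal sketch of the kind of experiment summarized above (hypothetical code, not the authors'; the architecture, synthetic data, and hyperparameters are placeholders), tracking the numerical rank of the first-layer weight matrix as the weight-decay coefficient varies:

```python
import torch
import torch.nn as nn

def numerical_rank(W: torch.Tensor, tol: float = 1e-3) -> int:
    """Count singular values larger than tol times the largest singular value."""
    s = torch.linalg.svdvals(W)  # singular values in descending order
    return int((s > tol * s[0]).sum())

# Illustrative sizes and synthetic regression data (standing in for, e.g., California housing).
d_in, width, n = 8, 512, 1024
X, y = torch.randn(n, d_in), torch.randn(n, 1)

for wd in [0.0, 1e-3, 1e-2, 1e-1]:  # weight-decay values to compare
    net = nn.Sequential(nn.Linear(d_in, width), nn.ReLU(), nn.Linear(width, 1))
    opt = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=wd)
    for _ in range(5000):
        idx = torch.randint(0, n, (64,))              # random mini-batch of size 64
        loss = ((net(X[idx]) - y[idx]) ** 2).mean()   # squared loss
        opt.zero_grad(); loss.backward(); opt.step()
    W = net[0].weight.detach()                        # first-layer weight matrix
    print(f"weight_decay={wd}: numerical rank of W = {numerical_rank(W)}")
```

If the paper's claim holds in this toy setting, the printed numerical rank should decrease as the weight-decay coefficient increases.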
Strengths
I believe that the overall goal of the paper is interesting. The authors attempt to study the implicit complexity of neural networks trained with SGD and regularization and to connect it to the test performance of the resulting neural network. I think it's an interesting problem to explore how the rank of the weights contributes to the model's performance.
Weaknesses
I believe the paper is quite misleading. The authors claim they do not rely on any assumptions about the convergence of the weights, but this is inaccurate.
- Table 1 compares their results on low-rank bias with previous work. However, the authors' findings are very similar to those of Galanti and Poggio (2021) and Xu et al. (2023). The analysis follows a nearly identical approach (see the proof-technique sketch at the end of this list): it involves writing out the formula for the training step, assuming small gradient steps (as in Assumptions 2.3 and 2.5 of the current paper) across all batches, examining the difference between the gradient steps for two batches that differ by one sample, and then using this gap to suggest that the learned weights are close to a low-rank weight matrix. The authors never acknowledge that this proof technique was introduced in Galanti and Poggio (2021) and Xu et al. (2023). The main differences across these works are that Xu et al. (2023) used a slightly different optimization setting (e.g., training with SGD, WD and WN) and an assumption similar to 2.3 to justify that the weights converge to a low-rank matrix, while Galanti and Poggio (2021) relaxed the assumption, considering a scenario similar to 2.5. Both previous papers offer results that hold for a wider range of architectures than the current one.
- The authors claim their results hold without assuming the convergence of the weights, but this is incorrect. Both Assumptions 2.3 and 2.5 are essentially convergence assumptions. While Assumption 2.3 makes this more explicit, Assumption 2.5 implies the same concept. Furthermore, the authors do not justify the practicality of the quantity , which is, unfortunately, unrealistic. To illustrate, following the analysis in Galanti et al. 2023 [https://arxiv.org/abs/2206.05794], we can consider the last equation on page 10 of their paper and easily derive that . This implies , which is a constant larger than 1. Therefore, the bound in Theorem 2.6 is trivially true.
This issue also applies to Galanti and Poggio (2021), which was addressed in their recent paper Galanti et al. 2023 [https://arxiv.org/abs/2206.05794]. It appears that the authors of the current paper overlooked this work, which similarly aims to demonstrate that SGD + WD induces a low-rank bias in deep learning. In this paper, the authors show that the rank of weight matrices is bounded by a function dependent on batch size, regularization parameter, and learning rate. These results hold for most architectures of interest without requiring assumptions about the data. By unrolling the training process of SGD + WD, they avoid assumptions like 2.3 and 2.5, making only the weaker assumption that the norms of the weights converge.
- The result in 3.11 is also not very new. The result is based on a VC-style generalization bound applied to a neural network whose weight matrices are low-rank. When combined with a UV decomposition, we can represent the network as a three-layer network in which each layer has only a small number of parameters (see the parameter-counting sketch at the end of this list). I admit that I don't recall a specific paper that explicitly runs this derivation, but it's fairly trivial, if I am being honest.
- The narrative presented in this paper is somewhat simplistic. By relying on a very strong assumption (2.3 or 2.5), it derives an unrealistic result about matrix rank, which then leads to a very tight generalization bound for these specific solutions. However, in practice, the weight matrices of neural networks do not converge to rank-2 matrices. Furthermore, the rank of these matrices does not fully explain neural networks' success and has only a marginal effect on performance (we can easily train neural networks that generalize well on common datasets without weight decay). A more compelling direction would be to explore the "in-between" scenario: understanding the actual behavior of the rank in practice, without strong convergence assumptions (similar to what is done in Galanti et al. 2023), and its marginal impact on performance.
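Proof-technique sketch (schematic, in my own notation, so details may differ from all three papers): for a two-layer ReLU network $f(x) = a^\top\sigma(Wx)$, the gradient of the loss at a single sample $(x,y)$ with respect to $W$ is the rank-one matrix

$$
\nabla_W \ell(x,y) = \ell'\bigl(f(x),y\bigr)\,\bigl(a \odot \sigma'(Wx)\bigr)\,x^\top .
$$

At an (approximately) stationary point of SGD with weight decay, every mini-batch $B$ of size $b$ satisfies $\lambda W \approx -\tfrac{1}{b}\sum_{i\in B}\nabla_W \ell(x_i,y_i)$. Subtracting these identities for two batches that differ in a single sample shows that all per-sample gradients are nearly equal, so $\lambda W$ is, up to the stationarity error, close to one rank-one term or a combination of two of them, which is where the rank-one/rank-two conclusions come from.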
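Parameter-counting sketch (again in my own notation): if $W \in \mathbb{R}^{m\times d}$ has rank $r$, write $W = UV^\top$ with $U\in\mathbb{R}^{m\times r}$ and $V\in\mathbb{R}^{d\times r}$, so that

$$
f(x) = a^\top\sigma(Wx) = a^\top\sigma\bigl(U(V^\top x)\bigr),
$$

i.e., the network is equivalent to a three-layer network with $r(d+m)+m$ parameters instead of $dm+m$. Plugging this into a standard VC-/covering-number-style bound that scales with the number of parameters immediately yields an improved rate whenever $r$ is small, which is essentially the content of the result in 3.11.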
Questions
- The paper states that the results do not assume weight convergence, yet Assumptions 2.3 and 2.5 appear to imply it. Could you clarify how you define convergence in this context and why Assumptions 2.3 and 2.5 are not considered convergence assumptions?
- Could you elaborate on how your approach to analyzing the low-rank bias fundamentally differs from that of Galanti and Poggio (2021) and Xu et al. (2023)? Given the apparent similarities in proof technique and assumptions, what distinct contributions does this paper offer in Sec. 2.2 beyond those works, other than weakening Assumption 2.3 by taking an approximation error into account?
- Could you provide practical examples or empirical evidence to support the feasibility of the term ? Given that previous analyses (such as in Galanti et al. 2023) can show that this term is typically large, how do you justify its practicality in your bounds?
- Given that neural network weight matrices in practice do not typically converge to rank-2 matrices, how do you envision the practical implications of your theoretical rank assumptions?
Details of Ethics Concerns
I believe that the paper does not properly acknowledge that it uses techniques that appeared in the previous work of Galanti and Poggio (2021) and Xu et al. (2023), neither in Sec. 2.2, where the results appear, nor in the proofs themselves, where the techniques are employed.
The paper studies the implicit bias towards low-rankness in two-layer ReLU networks trained with SGD and weight decay. In the regime where the network nearly interpolates the data/achieves stationarity across all batches, the authors show that the network's weight matrix is close to a rank-one matrix in the Frobenius norm. Along with existing tools in learning theory, this implies an improved generalization bound for two-layer ReLU networks. The authors also conducted experiments on small datasets to verify their predictions.
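To spell out what "close to a rank-one matrix in the Frobenius norm" means here (my notation; the paper may phrase it differently): the claim is that, at approximate stationarity,

$$
\min_{\operatorname{rank}(A)\le 1}\|W - A\|_F \;=\; \Bigl(\sum_{i\ge 2}\sigma_i(W)^2\Bigr)^{1/2}
$$

is small, where $\sigma_1(W)\ge\sigma_2(W)\ge\dots$ are the singular values of $W$; the equality is the Eckart-Young-Mirsky theorem, so the statement is equivalent to the tail of the singular values being small.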
Strengths
The paper studies an important problem of understanding the regularization effect of weight decay and SGD optimization on learning with Neural Networks that can memorize the training data. This is an active area of research, and the paper has correctly identified a gap in the literature. The paper is also written clearly, and little background knowledge (beyond what is discussed) is needed to understand the results. I like the brief mention of the extension in the discussion, which hints at a data/batch-adaptive regularization strategy. Such a result/algorithm would be interesting.
Weaknesses
- Theorems 2.4 and 2.6: I have several concerns about these theorems, which I believe are the main results of this paper. Section 2.2.1 seems redundant and does not serve any purpose, given that Theorem 2.4 is only a special case of Theorem 2.6. Why do the authors choose to discuss Theorem 2.4? Beyond the setting where , why does the upper bound in 2.6 not gradually decline as we reduce down to zero? Intuitively, should be approximable by matrices of higher and higher rank, even when weight decay is driven to zero. Also, a minor but important point here is that both theorems assume . The upper bound in 2.6 does not look tight when batch sizes are high. Why would that be the case? Should the rank-minimizing effect decrease when we use GD instead of SGD? If so, why? Also, ensuring that is small for all batch sizes is not impossible in the interpolation regime. In that regime, how does the result compare to the more general (albeit without SGD optimization and weight decay) result of Ongie and Willett? Finally, the results seem too weak, as the authors do not show the effect of depth, and the proofs themselves are not very insightful about what will happen when depth increases.
- Generalization results: The calculations for the generalization error of low-rank neural networks seem incremental. They are likely not novel, as all the tools needed to derive these results already exist in the learning theory literature. Having said that, I cannot confirm this, as I could not find the exact result in the existing literature myself. More importantly, the results only seem to make sense in the interpolating regime. This is because Theorem 3.11 assumes Assumption 2.3. Is it not possible to give a result in terms of the epsilon from Assumption 2.5, to understand the effect of approximate stationarity?
- The experiments are too simplistic and unsurprising: It seems that the empirical results here are on less diverse tasks than those in [1], which looks at more practical deep learning tasks with more complex architectures, showing a low-rank bias across them. While [1] does not give a theoretical result, it provides a more comprehensive picture of the effect of weight decay. Moreover, the authors do not verify their assumptions or even plot their predicted upper bound in their experiments, which would have verified (or shown the slackness of) their theorem.
- Missing related works: The paper is missing several related works about the effect of weight decay on neural network training. I encourage the authors to go through Section 5 of [1].
Questions
See the questions asked in the weaknesses above.
Details of Ethics Concerns
N/A
The paper studies the implicit bias towards rank minimization and its implications on generalization. They show that mini-batch SGD with WD converges to low-rank solutions in two-layer networks (under certain assumptions). They also provide a generalization bound for low-rank networks, which might explain how rank minimization helps generalization.
Strengths
The paper considers an interesting and fundamental question for understanding generalization in overparameterized networks. It provides some nice insights on this question.
Weaknesses
- The authors claim that the assumption about the small norm of the gradients simultaneously for all batches might be too strong. But even in Theorem 2.6 they make this assumption approximately up to some epsilon, and it’s not clear whether the epsilon in this assumption is small.
- The experiments compute this epsilon, but I don’t understand whether they imply that this epsilon is sufficiently small. For example, in Figure 3 they show that epsilon is at most 12. If we use , then the bound from Theorem 2.6 becomes . I don’t see in the paper what the value of is in this experiment (maybe it’s as in Figures 1 and 2?). Then, the authors compare it to the Frobenius norm of a random Gaussian matrix with variance , which is . I don’t understand why the authors chose , and I don’t see why is large compared to .
- Lemmas 2.1 and 2.2 hold for almost all matrices V. However, in the proofs of Theorems 2.4 and 2.6, they use this property for the convergence point . The convergence point is not a random point, and hence the fact that almost all matrices satisfy a certain property does not imply that this property holds for the convergence point. Essentially, the authors assume that at convergence there are no hidden neurons with input , but they don’t state this assumption explicitly.
- Regarding the generalization bounds:
- Corollary 3.9, as well as Theorem 3.11, depend on . If we assume w.l.o.g. that the inputs are in the unit ball, then can be upper bounded by the product of the spectral norms of the layers (see the sketch after this list). Then, the authors should compare their bound to well-known sample complexity bounds for neural networks (see, e.g., Golowich, Rakhlin, Shamir, "Size-Independent Sample Complexity of Neural Networks", and the references therein).
- The generalization bound holds only for low-rank matrices, and not for the "approximately low-rank" case. That is, it does not work with Theorem 2.6. Moreover, the authors claimed in Section 2.2.2 that the assumption in Theorem 2.4 might not be feasible, and hence Theorem 2.6 is the more realistic result.
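To spell out the comparison I am suggesting in the first point above: for a depth-$L$ ReLU network $f(x) = W_L\,\sigma(W_{L-1}\cdots\sigma(W_1 x)\cdots)$ with inputs in the unit ball, the 1-Lipschitzness of the ReLU (together with $\sigma(0)=0$) gives

$$
\|f(x)\|_2 \;\le\; \prod_{j=1}^{L}\|W_j\|_2 ,
$$

and norm-based generalization bounds (e.g., the size-independent bounds of Golowich, Rakhlin and Shamir) control the sample complexity through products of layer norms rather than parameter counts; this is the natural baseline against which Corollary 3.9 and Theorem 3.11 should be compared.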
Questions
Please address the issues from the “weaknesses” section.
Thanks for the valuable feedback provided during the review process. After careful consideration, we have decided to withdraw our paper. We sincerely appreciate the time and effort the reviewers have dedicated to evaluating our submission.