On the Local Complexity of Linear Regions in Deep ReLU Networks
Reviews and Discussion
This paper suggests a new measure for analyzing feedforward neural networks based on the density of points where the model's gradient is discontinuous ("kinks") near training samples, which the authors term Local Complexity (LC). It builds upon prior findings that neural networks tend to form fewer linear regions, especially in later training stages, which relates to the generalization question in over-parameterized networks. To establish the measure's relevance, the authors connect it to: (a) the dimension of the feature manifold, as previous results suggest that networks that generalize better tend to form low-dimensional representations; (b) robustness to gradient-based adversarial attacks, where they show that LC is at least as indicative as the TV of the model over the data distribution; and (c) the "representation cost", which is the smallest possible norm of the model's parameters needed to exactly compute some function. These connections are proved theoretically as upper and lower bounds for the LC, and shown empirically on toy examples.
Questions for Authors
See above.
Claims and Evidence
Strengths
- It offers an intuitive and mathematically rigorous measure for analyzing models and can be used to formalize previous results. Most, if not all, previous results referenced by this paper are only tested on very small networks (1D or 2D inputs) that can be analyzed analytically; this paper offers a method to generalize their results to more complex settings.
- The math is well-developed, extensive and easy to follow.
- Simple and intuitive examples.
- The authors provide experimental results to highlight the limitations of some connections.
Weaknesses
- The LC term was previously introduced in Humayun, A. I., Balestriero, R., and Baraniuk, R., "Deep networks always grok and here is why", High-dimensional Learning Dynamics 2024: The Emergence of Structure and Reasoning (https://openreview.net/pdf?id=NpufNsg1FP). This raises questions about the novelty of the concept, even though the purposes of the two papers appear to be different. There are quite similar ideas, definitions and illustrations.
- It is not clear that the paper offers new insight into model generalization or otherwise. The dynamics of linear regions in the input space and their connections to training procedures and generalization are referenced, but no new insight is given, and no results demonstrate that a drop in LC correlates with a drop in generalization error. While LC is an interesting measure, it is not clear that it currently provides a useful novel perspective.
- The paper is lacking in results. The only somewhat practical result is the connection to adversarial robustness, which also needs to be tested in more complex settings than MNIST. While it seems that prior work in this area also lacks large-scale experiments, this paper could have used LC to demonstrate analytical results in more complex settings.
- The paper shows that the connection to the feature manifold's dimension holds empirically only in toy examples and breaks on MNIST, but the counterexample is in the appendix and, although referenced, is not discussed in detail. This contradicts parts of the paper's insights and conclusion and feels "hidden".
- There is not much discussion on the tightness of the bounds.
- In the case of Total Variation (TV), both Figure 3 in the main paper and Figure 9 in the Appendix were expected to provide clear evidence that TV is bounded by Local Complexity (LC). However, the figures suggest that TV does not always remain strictly below LC, raising questions about the tightness of the theoretical bound and whether additional factors influence TV's behavior in late-stage training.
Methods and Evaluation Criteria
See above
Theoretical Claims
The theory appears sound, relating LC to TV. Its relevance and importance are less clear, as is the novelty with respect to previous papers, as stated above.
Experimental Design and Analysis
Not sufficient. Very simple networks and datasets.
Supplementary Material
Reads well; the theory and proofs are discussed in sufficient detail.
Relation to Prior Work
Relation to generalization properties at later stages of training, which is an important problem in DNNs.
Missing Essential References
The relation to the Humayun et al. (2024) paper, and the additional contributions made here, should be discussed thoroughly.
Other Strengths and Weaknesses
see above
Other Comments or Suggestions
• Contains multiple incorrect references (e.g., “Figure 5” on page 7 should reference Figure 3). Additionally, the Impact Statement is missing a period.
• The MNIST example was not fully clear. It did not specify how many or which classes were used for training, only stating that a very small subset of the data was taken.
• In Figure 2, why is there a spike rising after 1e6 iterations? Does this always happen (as it seems from Figure 7)?
• In Figure 2, the bounding connection between LC and LR was not completely clear when the drop occurred in each. Adding a grid might help improve readability.
• Lines 158, 307 – Missing capital letters, affecting consistency.
• Left paragraph, Lines 77-78 – Sentence unclear.
• Right paragraph, Line 137 – "that" should be "than".
• Line 290 – Unclear sentence: "an 4" and "to estimate the learn a map" need correction.
We sincerely thank the reviewer for their thoughtful and detailed comments which have helped us improve our paper. We have addressed your thorough list of minor errata for the camera-ready version. Below we address each major point raised.
The LC term was previously introduced in a paper ... This raises questions about the novelty of the concept, even though the purposes of the two papers appear to be different.
Response: We appreciate this point and wish to clarify the novelty of our contribution. The referenced prior work introduced the LC term predominantly in an empirical context, highlighting an intriguing phenomenon without providing any theoretical analysis or claims. In contrast, our work explicitly develops a theoretical foundation for the Local Complexity. Specifically, we derive rigorous theoretical explanations and demonstrate connections between LC and other established theoretical aspects of deep learning. Therefore, while the previous paper contributed an interesting empirical observation, our paper significantly extends the theory of Local Complexity in neural networks.
... It's not clear that it currently provides a useful novel perspective
Response: We respectfully emphasize that our theoretical framework does indeed provide a novel and valuable perspective, as it unifies different areas of study in deep learning theory under the common theme of the local complexity. We are able to derive quantitative insights into important phenomena, including adversarial robustness, grokking, and neural collapse. To the best of our knowledge, this has not been explored or established in previous literature.
It is very lacking in results / ... needs to be tested on more complex settings than MNIST.
Response: We acknowledge this feedback and note that readers seeking extensive empirical results may find Humayun et al. particularly relevant. Our paper, however, focuses primarily on the rigorous development of a theoretical framework that lays a significant foundation for understanding and analyzing related empirical phenomena. We provided experiments to better illustrate and motivate our theoretical contributions, which is the focus of our paper. Nevertheless, we will be glad to add more extensive experiments on the local complexity on CIFAR-10, CIFAR-100 and Imagenette for the camera-ready version.
There is not much discussion on the tightness of the bounds. ... raising questions about the tightness of the theoretical bound and whether additional factors influence TV's behavior in late-stage training.
Response: We thank the reviewer for raising this important point. For an extensive discussion of the tightness of our theoretical bounds and the conditions under which they hold, we direct the reviewer to Appendix B.4. In that appendix, we clarify the relationship between total variation (TV) and local complexity (LC), particularly addressing situations where TV increases while LC does not. We demonstrate that such discrepancies are driven by increases in the Lipschitz constants of the networks, which are explicitly accounted for in our bound, thus clarifying the precise circumstances affecting the tightness of our theoretical predictions.
Relations to Humayun et al 2024 paper should be well discussed and the additional contributions here.
Response: We agree that our paper is closely related to the work of Humayun et al. (2024). Indeed, our work aims to provide a theoretical angle on how to understand the empirical observations from their work. We are adding more discussion of the similarities and differences for the camera-ready version.
Please let us know if any aspects of our work would benefit from further clarification.
Thanks for the detailed rebuttal response. I agree the theoretical contribution is significant and hope more clarifications regarding prior art and additional experiments are provided, as promised. I raise my rating.
The authors introduce two novel metrics, Local Complexity (LC) and Local Rank (LR), to analyze the structure of linear regions in ReLU networks. These metrics provide insights into the relationship between network complexity, feature representations, adversarial robustness, and representation cost. Theoretical bounds are established for LC and LR, and empirical experiments validate these findings, offering a new framework for studying the geometric properties of neural networks.
Questions for Authors
It is known that constraining the Lipschitz constant K of a neural network enhances its robustness. Following your results on robustness, do you think LC and K could be correlated, and if so, in what ways?
Claims and Evidence
All theorems/propositions/corollaries are supported with convincing proofs detailed in the supplementary materials. Some of them are summarized below:
Claim: lower Local Complexity corresponds to learned low-dimensional representations. Evidence: theoretical bound (Theorem 5, proved in A.5) + empirical study (Figure 2).
Claim: Local Complexity is an upper bound on total variation and relates to adversarial robustness. Evidence: theoretical bound (Theorem 7, proved in A.8) + empirical study (Figure 3).
Claim: Training dynamics drive networks toward lower local complexity solutions. Evidence: theoretical bounds (Proposition 9, proved in A.11, Corollary 10) + empirical study (Figure 13).
Methods and Evaluation Criteria
Yes:
- The metrics LC and LR are mathematically well-defined.
- The Stanford Bunny experiment is a good toy problem for illustrating linear regions in an interpretable 2D setting.
- MNIST is appropriate for evaluating adversarial robustness in real-world settings.
However, the choice of noise variance σ affects LC estimation. A sensitivity analysis on different noise levels would improve confidence in the robustness of the results. For example, what happens in Figure 5 for other values of σ? Do the effects remain qualitatively the same in the two setups? It is also unclear how this value is chosen: why is the middle panel in Figure 4 better than its left or right counterpart?
Theoretical Claims
The proofs seem sound and well-structured, but I didn't check their validity closely.
Experimental Design and Analysis
The experiments seem valid, although it would be great to provide access to a GitHub repository to reproduce the results.
Supplementary Material
I reviewed mostly Sections B and C. There are linking/naming problems in Section B.3: 1. "on Figure 5" references Section 5 but should reference Figure 3? 2. "as in Figure B.1" should be "as in Section B.1". 3. "In Figure 5" should be "In Figure 3". The same problem also appears in line 40 of the paper: "We replicate similar experiments in Figure 5." Other than that, the supplementary material provides useful information.
Relation to Prior Work
The study builds upon prior work on the expressivity and complexity of ReLU networks (Montufar et al., 2014, Pascanu et al., 2014, Serra et al., 2018…). While these previous papers were focused on counting/bounding the number of linear regions, this work is the first to introduce a notion of density of linear regions over the input space.
The discussion on grokking aligns with recent work on the training dynamics of these networks and the link with linear regions (Humayun et al., 2024, Liu et al., 2022). The paper empirically confirms findings from Humayun et al., 2024 that grokking corresponds to a reduction in the number of linear regions. The link between weight decay and reduced LC aligns with prior work on representation cost (Jacot, 2023).
Prior works (Croce et al., 2019) showed that larger linear regions correlate with increased robustness. This paper formalizes that idea by proving a theoretical upper bound between local complexity and total variation, showing that networks with lower LC are more robust.
Missing Essential References
Not to my knowledge.
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and clear about its assumptions, and every claim is proved.
- The contribution seems to be relevant to the field and so could impact future research.
- The idea of using the density of linear regions instead of their number is original.
Weaknesses:
- The method for estimating LC via bias perturbations could be better motivated, particularly regarding sensitivity to noise variance.
- The paper would benefit from a discussion of how LC compares to standard complexity metrics like VC dimension.
Other Comments or Suggestions
List of typos:
- line 78 : “the a network” → “of the network”
- line 155 : “is bears” → “bears”
- line 409 : “properities” → “properties”
- line 1790 : “intialization” → “initialization”
We sincerely thank the reviewer for their thoughtful and detailed comments which have helped us improve our paper, and their positive review. We have addressed your minor errata for the camera-ready version.
The method for estimating LC via bias perturbations could be better motivated, particularly regarding sensitivity to noise variance.
Response: Our theory centers around taking the expectation of a random level set of a function. Our theoretical framework builds on the analysis of the threshold at which a neuron transitions from inactive to active. The biases directly control the level at which a ReLU transitions from being inactive to being active, and that is why we focus on bias perturbations. This is illustrated in the proof of Lemma 12. Perturbations of the biases control the level sets directly, which makes the resulting formulas more tractable. A more in-depth discussion can be found in Section 3.1.
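To make this concrete, here is a minimal sketch of the idea (not the exact estimator from the paper; the two-layer architecture, σ, and sampling parameters below are hypothetical placeholders). It counts how often a first-layer neuron flips its activation between a probe point and nearby points under random bias perturbations, which serves as a crude kink-density proxy:

```python
import torch

# Minimal sketch, not the paper's estimator: a hypothetical single ReLU
# layer with random weights; only first-layer kinks are tracked.
torch.manual_seed(0)
W1, b1 = torch.randn(64, 2), torch.randn(64)

def preactivations(x, bias_noise):
    # First-layer pre-activations with perturbed biases; the kinks of the
    # first layer are exactly the zero crossings of these values.
    return x @ W1.T + (b1 + bias_noise)

def local_kink_rate(x, sigma=0.05, n_noise=100, eps=0.1, n_dirs=200):
    """Fraction of (bias-noise, direction) pairs for which some neuron flips
    its activation between x and a nearby point x + eps * u."""
    flips = 0.0
    for _ in range(n_noise):
        noise = sigma * torch.randn_like(b1)
        h0 = preactivations(x, noise) > 0                       # (1, 64)
        dirs = torch.nn.functional.normalize(torch.randn(n_dirs, 2), dim=1)
        h1 = preactivations(x + eps * dirs, noise) > 0          # (n_dirs, 64)
        flips += (h0 != h1).any(dim=1).float().mean().item()
    return flips / n_noise

x = torch.zeros(1, 2)  # probe point standing in for a training sample
print(local_kink_rate(x))
```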
The paper would benefit from a discussion of how LC compares to standard complexity metrics like VC dimension.
Response: We agree that relating LC to traditional complexity measures such as VC dimension is an intriguing direction. However, establishing a quantitative connection between LC, which captures the local behavior and density of linear regions in the network, and global complexity measures of a family of functions, such as the VC dimension, is challenging. We believe that this presents an interesting avenue for future work, and we plan to include a mention of this in the discussion section for the camera-ready version.
It is known that constraining the Lipschitz constant K of a neural network enhances its robustness. Following your results on robustness, do you think LC and K could be correlated, and if so, in what ways?
Response: Our bounds relating the total variation (TV) of the network over the input space and the local complexity (LC) can shed some light on the relationship between the Lipschitz constant K and LC. In particular, the TV is controlled by the LC together with the Lipschitz constants of the layers. This suggests that controlling the Lipschitz constant may indirectly influence the local complexity, thereby impacting robustness.
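Schematically, and omitting the exact constants and assumptions (which are stated in Theorem 7), the relation has the following shape, where the Lipschitz constants enter only through the prefactor:

```latex
% Schematic only: TV over the data distribution is controlled by the local
% complexity, up to a factor depending on the layer-wise Lipschitz constants.
\mathrm{TV}(f) \;\le\; C(K_1,\dots,K_L)\cdot \mathrm{LC}
```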
Please let us know if any aspects of our work would benefit from further clarification.
**Sorry, the comment was not for this article.**
Thank you for your response. However, the content of your comment seems to be unrelated to our previous discussion. Could you please confirm if this was intended for our paper, or if there might have been a mixup?
The paper studies ReLU networks and proposes local complexity, which estimates the average density of gradient discontinuities under an input distribution and a parameter set. Three theorems are provided. First, it is shown that local complexity can be estimated by gradients at each neuron. Second, it is established that the local rank, defined as the average rank of the Jacobian, can be bounded in terms of local complexity. Third, local complexity can be used to bound the total variation of the ReLU network if the Lipschitz constant is given, under several assumptions. Since a network with lower total variation can be more adversarially robust, a lower local complexity may be an indicator of robustness. Finally, it is shown that the local complexity can be bounded by the representation cost, which can be connected to weight decay, implying that local complexity might be minimized during training.
Questions for Authors
- Could you highlight the difference between the definition of local complexity and the one in (Hanin & Rolnick, 2019b)?
- How does σ affect the local complexity in Theorem 2? This parameter seems to be hidden by the notation.
- Are the constants in Theorems 5 and 7 the same ones defined in Corollary 3?
- Reverse engineering a ReLU network may require the knowledge of all of the discontinuities in Eq. (2). If the local complexity is known, would it be possible to shed light on the hardness of reverse engineering?
- If the weight noise is also used, would this change the theoretical results? For example, would it be possible to still be able to estimate the total variation?
- For a given ReLU network, we can create a corresponding ResNet by adding skip connections. Would this ResNet have lower local complexity?
Claims and Evidence
The claims in the paper are supported by proofs.
Methods and Evaluation Criteria
This is a theoretical paper providing insights into ReLU networks. Several results are provided to establish the notion of local complexity and how it can be used to understand robustness.
Theoretical Claims
The reviewer did not verify the correctness of the proofs.
Experimental Design and Analysis
The notion of local complexity is a natural way to study the density of linear regions in a ReLU network. The analysis seems fairly reasonable.
Supplementary Material
The reviewer did not verify the proof in the supplementary material.
Relation to Prior Work
The results shed light on ReLU networks and the findings are of interest to the broader community of machine learning.
Missing Essential References
None.
Other Strengths and Weaknesses
Strengths:
- The notion of linear regions in ReLU networks is an important concept, and local complexity provides a way to estimate the density of the linear regions.
- Links to local rank and total variation seem interesting and appear to have some implications for the training dynamics of ReLU networks.
Weaknesses:
- It is not clear to the reader why the local rank is defined through the Jacobian. What is its relation to the rank of the weight matrix? In addition, what are the implications of having lower ranks? Would this be directly connected to the robustness mentioned in Section 5?
- It seems the bound in Theorem 7 is hard to compute, as it depends on the maximum Lipschitz constant across different layers of the network. Does it imply that one needs to estimate the Lipschitz constants of all of the subnetworks to estimate the total variation? How does the maximum Lipschitz constant relate to the maximum norm of the gradient? Some comments on how to estimate the bound would be useful to enhance the clarity of the theorem.
- Proposition 9 illustrates the relation between local complexity and depth. However, it is only valid at the global optimum of the problem in (15) which minimizes the representational cost. Are there more general bounds that describe the local complexity in terms of depth? What was the main difficulty in the analysis?
- Although the local complexity can be connected to total variation and thus develop some understandings of robustness, it is not known how to use it to design a more robust model or motivate architecture design.
Other Comments or Suggestions
- Line 199: Is σ the standard deviation? Should the same quantity appear in Theorem 2?
- Line 290: "to estimate the learn a map between" should be corrected.
We sincerely thank the reviewer for their thoughtful and detailed comments. We have addressed minor errata for the camera-ready version.
It is not clear to the reader why the local rank is defined through the Jacobian. What is its relation to the rank of the weight matrix? ... Would this directly connect to the robustness mentioned in Section 5?
The rank of the Jacobian at a point quantifies the local dimension of the feature manifold, that is, the number of directions in which the input may be locally transformed. For a ReLU network, the Jacobian is given by $J(x) = W_L D_{L-1}(x) W_{L-1} \cdots D_1(x) W_1$, where each $D_\ell(x)$ is a diagonal matrix with entries in $\{0, 1\}$ indicating which neurons are active. This implies that the rank of the Jacobian is bounded above by the smallest rank among the weight matrices. One might expect that networks mapping data to a low-dimensional manifold could exhibit simpler, and potentially more robust, decision boundaries.
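As a minimal illustrative sketch (the three-layer network below is a hypothetical placeholder, not one of the networks from the paper), the local rank at a point can be probed directly from the input-output Jacobian via automatic differentiation:

```python
import torch

# Hypothetical small fully connected ReLU network.
torch.manual_seed(0)
net = torch.nn.Sequential(
    torch.nn.Linear(10, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 32), torch.nn.ReLU(),
    torch.nn.Linear(32, 5),
)

def local_rank(x):
    # Input-output Jacobian at x; within a linear region this equals
    # W3 @ D2 @ W2 @ D1 @ W1, with D_l the 0/1 activation pattern.
    J = torch.autograd.functional.jacobian(net, x)
    return torch.linalg.matrix_rank(J).item()

x = torch.randn(10)
print(local_rank(x))  # never exceeds the smallest weight-matrix rank
```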
Some comments on how to estimate the bound would be useful to enhance the clarity of the theorem.
We estimate the bound on local complexity by considering the operator norms of the weight matrices. In particular, we compute an upper bound via the product of the layer-wise operator norms, which is tractable to compute and provides a useful approximation. Additional details are provided in the appendix, where we also discuss how the remaining term behaves in practice.
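A minimal sketch of this computation (the weight shapes below are hypothetical placeholders, assuming a plain fully connected ReLU network):

```python
import torch

def lipschitz_upper_bound(weights):
    """Product of layer-wise spectral norms: a tractable (generally loose)
    upper bound on the Lipschitz constant of a plain ReLU network."""
    bound = 1.0
    for W in weights:
        bound *= torch.linalg.matrix_norm(W, ord=2).item()  # largest singular value
    return bound

# Hypothetical weight matrices of a three-layer network.
weights = [torch.randn(32, 10), torch.randn(32, 32), torch.randn(5, 32)]
print(lipschitz_upper_bound(weights))
```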
Are there more general bounds that describe the local complexity in terms of depth?
We also derive a bound that explicitly incorporates network depth via the notion of representation cost (see Proposition 8) which is valid for any parameters. We can substitute the representation cost for the norm of the weights as well.
... it is not known how to use it to design a more robust model or motivate architecture design.
While our current results do not provide a recipe for designing inherently more robust models, our findings suggest that controlling the rank of the weight matrices could be a promising direction for enhancing robustness and guiding architectural choices. We also demonstrate a connection between weight decay and local complexity, which may guide future work.
Could you highlight the difference between the definition of local complexity and the one in (Hanin & Rolnick, 2019b)?
Hanin & Rolnick investigate how the depth and number of neurons affect the expected number of linear regions for a generic distribution of parameters. In contrast, in our definitions and results we are concerned with how the linear regions vary for a specific choice of parameters, and we are concerned with the distribution of linear regions over the data distribution, which are aspects that are not discussed in the work of Hanin & Rolnick. We also explore properties like Local Rank and Total Variation, which are not addressed in their work.
How does σ affect the local complexity in Theorem 2? This parameter seems to be hidden by the notation.
The choice of σ affects the local complexity through the probability density function of the normal distribution, as well as through the magnitude of the noise that is added to the biases, since we take an expectation over this noise in the computation of LC. We show the effect of increasing σ in Figure 4.
Are the constants in Theorem 5 and 7 the same ones defined in Corollary 3?
Yes, these are the same constants.
If the local complexity is known, would it be possible to shed light on the hardness of reverse engineering?
This is an interesting point for further work. To our knowledge, our current framework does not directly address this, though one would expect that a low local complexity makes this easier.
If the weight noise is also used, would this change the theoretical results? For example, would it be possible to still be able to estimate the total variation?
It is possible to extend the definitions to include perturbations of the weights though this may introduce changes in the direction of the boundaries, which are naturally more complicated. In Figure 5 in the appendix, we illustrate how adding perturbations to the weight matrices does not qualitatively change our measure of the local complexity. The computation of the total variation would also be similar.
For a given ReLU network, we can create a corresponding ResNet by adding skip connections. Would this ResNet have lower local complexity?
Consider a simple one-layer residual block $f(x) = x + \mathrm{ReLU}(Wx + b)$. Its Jacobian is $J(x) = I + D(x)W$, where $D(x)$ is the diagonal matrix of activations. Here, the discontinuities in the Jacobian occur only where $D(x)$ is discontinuous, that is, at the same locations as for the plain ReLU layer. So, the local complexity of the ResNet does not differ markedly from that of the corresponding plain ReLU network.
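A small numerical sanity check of this argument (with a hypothetical one-layer block, not an experiment from the paper) is to walk along a segment and count activation-pattern changes, which determine the kink locations for both the plain layer and the residual block:

```python
import torch

# Hypothetical one-layer block with random weights.
torch.manual_seed(0)
W, b = torch.randn(16, 2), torch.randn(16)

def activation_pattern(x):
    # Both ReLU(Wx + b) and x + ReLU(Wx + b) are non-smooth exactly where a
    # pre-activation Wx + b crosses zero, i.e. where this pattern flips.
    return (x @ W.T + b) > 0

# Walk along a random segment and count pattern changes; the count is the
# same with or without the skip connection, since the skip term x is smooth.
x0, x1 = torch.randn(2), torch.randn(2)
ts = torch.linspace(0, 1, 1000).unsqueeze(1)
patterns = activation_pattern((1 - ts) * x0 + ts * x1)
kinks = (patterns[1:] != patterns[:-1]).any(dim=1).sum().item()
print(kinks)
```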
Please let us know if further clarification is needed.
The reviewer would like to thank the authors for their detailed responses. The rating will be kept unchanged.
The paper proposes a measure of local complexity for piecewise linear functions, corresponding to the density of linear regions over the input data distribution. This measure is connected to several aspects of ReLU networks, including low-dimensional representations, adversarial robustness, and representation cost. Theoretical results are provided to establish these connections, along with brief experimental validation.
The reviewers acknowledge that the paper offers a reasonable theoretical analysis that helps explain phenomena previously observed empirically, and extends earlier theoretical work from a global to a local perspective. Some of the concerns raised during the review process were addressed in the rebuttal phase. However, a few issues remain unresolved, particularly regarding the significance of the theoretical results, the depth of the empirical validation, and the overall level of novelty.
Nevertheless, the merits appear to outweigh the limitations. I therefore recommend acceptance and encourage the authors to consider the feedback and update the final version of the paper accordingly.