Uncovering Critical Sets of Deep Neural Networks via Sample-Independent Critical Lifting
We propose a sample-independent critical lifting operator and investigate the sample dependence of output-preserving critical points.
Abstract
Reviews and Discussion
This paper studies critical points of neural networks, especially how a critical point of a narrow network can be mapped to critical points of a wide network while preserving the output function. For this purpose, the authors extend previous work and provide a broader characterization of sample-independent lifted critical points, and also identify sample-dependent critical points that emerge for sufficiently large sample sizes. Simulations on simple examples are provided to illustrate the theoretical findings.
Strengths and Weaknesses
Strengths: The theoretical work is solid, and it expands the understanding of the lifted critical points.
Weakness: My major concern is the importance of the problem itself. Although it is a meaningful mathematical problem, understanding the critical lifting operator is not directly relevant to any practically important problem in deep learning. Conceptually, we would hope to obtain a smaller network that mostly preserves the capacity of a large network, while the reverse direction is relatively trivial and less useful. As a result, the motivation of this paper is unclear, which limits its contribution and impact. It would be good to comment on at least the general picture of why the studied problem is important.
Questions
-
How important is the analytic assumption in Assumption 3.1? For the given analysis, would second-order differentiability be enough?
-
I don't understand Proposition 4.1.1; can you provide a more mathematical statement with a proper definition of critical embedding operators?
-
The main contribution claimed in the paper is the characterization of sample-dependent critical points. Why would sample-dependent lifted critical points be more important than sample-independent ones?
-
Essentially, the condition for the sample-dependent critical points is that the sample size needs to be larger than the dimension. But wouldn't it be trivial that, with enough degrees of freedom, the gradients at the samples can span the whole parameter space, and therefore can easily include all of the critical points?
Limitations
Yes
Final Justification
Overall, the paper is technically solid. My main concern is about the importance of the problem itself, and the authors didn't provide enough justification for its significance. I would like to keep my score.
Formatting Issues
No concern
We thank the reviewer for providing a detailed review and giving valuable suggestions on improving our paper. Below we first comment on the motivation and importance of the paper, and then address the reviewer's questions.
Understanding the global convergence and training dynamics of neural networks remains a fundamental challenge. A key obstacle is the prevalence of non-global critical points and manifolds, which hinder efficient training and convergence to global minima. Although recent work has identified high-dimensional critical manifolds—embedded from narrower networks' critical points—the geometry of these sets and their dependence on training data are still poorly characterized. Without this understanding, it is difficult to analyze the distribution of local minima, saddles, and strict saddles, or to estimate escape probabilities, convergence rates, and acceleration strategies near them.
Our work takes a significant step toward characterizing critical sets geometrically in two ways. First, building on existing work, we demonstrate the prevalence of low-complexity critical points lifted from narrower networks, which exhibit favorable generalization properties. The tendency of training dynamics to stagnate at these low-complexity critical points—combined with early stopping—may help networks generalize well regardless of sample noise. Second, we show that saddles exist among sample-dependent lifted critical points, thus establishing a foundation for further studying escape dynamics from these saddles. We particularly emphasize that for one hidden layer networks, all sample-dependent lifted critical points are saddles, thus narrowing the potential presence of local minima to the sample-independent subset.
The assumption that the activation function is analytic is mainly used to 1. establish the linear independence of neurons (Lemma A.1.1) and 2. guarantee that the level sets of the function have zero measure (Lemma A.1.3). These are used in, e.g., Proposition 4.2.1 to show that any critical point of the form (2) is a saddle, and in Theorem 4.2.1 to construct a sample-dependent lifted critical point of the form (2). A twice-continuously differentiable activation satisfying 1. and 2. also works.
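As a concrete illustration of point 2 (this example is ours, not taken from the paper): a non-constant real-analytic function on a connected domain has level sets of Lebesgue measure zero, whereas a merely twice continuously differentiable function can be constant on a whole interval.

```latex
% A C^2 function whose level set has positive measure, showing that property 2
% can fail without analyticity:
\[
  g(x) =
  \begin{cases}
    (x - 1)^3, & x \ge 1,\\
    0,         & -1 \le x \le 1,\\
    (x + 1)^3, & x \le -1,
  \end{cases}
  \qquad g \in C^2(\mathbb{R}), \qquad g^{-1}(0) = [-1, 1].
\]
```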
In this paper we mention three critical embedding operators, namely the null embedding operator, the splitting embedding operator, and the general compatible embedding operator. Intuitively speaking,
(a) The null embedding operator adds neurons of zero input weight. In [1] the authors define the null embedding operator for neural networks with bias terms. For unbiased neural networks, we additionally need to set the output weight to zero.
(b) The splitting embedding operator copies one neuron and "splits" the output weight among the copies; for example, a parameter (1/6a, w, 1/3a, w, 1/2a, w) is obtained from (a, w) via the splitting embedding operator (see the worked identity after this list).
(c) A general compatible embedding operator generalizes the previous two operators by taking into account, e.g., their compositions and permutations of indices across layers. We will add more formal definitions of these critical embedding operators in the paper.
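To make (a) and (b) concrete, here is a small worked identity for a one-hidden-layer network f_\theta(x) = \sum_k a_k \sigma(w_k^\top x) (our notation for this response, not a quote from the paper):

```latex
% Null embedding (bias-free case): append a neuron with zero input weight and
% zero output weight; the network function is unchanged for every input x.
\[
  (a, w) \longmapsto (a, w,\, 0,\, 0), \qquad
  a\,\sigma(w^\top x) + 0 \cdot \sigma(0^\top x) \;=\; a\,\sigma(w^\top x).
\]
% Splitting embedding: copy a neuron and split its output weight, here with the
% ratios (1/6, 1/3, 1/2) used in the example above.
\[
  (a, w) \longmapsto \bigl(\tfrac{a}{6}, w,\; \tfrac{a}{3}, w,\; \tfrac{a}{2}, w\bigr), \qquad
  \bigl(\tfrac{a}{6} + \tfrac{a}{3} + \tfrac{a}{2}\bigr)\,\sigma(w^\top x) \;=\; a\,\sigma(w^\top x).
\]
```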
[1] Y. Zhang, Y. Li, Z. Zhang, et al., “Embedding Principle: a hierarchical structure of loss landscape of deep neural networks”, arXiv: 2111.15527, 2021.
The main contributions of this paper are introducing a critical lifting operator (contribution (a)), discovering sample-independent lifted critical points which do not arise from previously studied embedding operators (contribution (b)), and identifying sample-dependent lifted critical points and showing that saddles exist among them (contribution (c)). None of them is prioritized: each advances a different aspect of the understanding of deep learning.
Yes, that is the basic idea behind the condition for sample-dependent lifted critical points to exist. More rigorously, it is related to the rank and kernel of the network's Jacobian matrix (which follow from the linear independence of neurons), as well as the structure of the loss function (which yields the existence of such samples). Moreover, we give lower bounds on the sample size required for sample-dependent lifted critical points to exist, further clarifying the interplay between critical points and samples.
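A minimal numerical sketch of this degrees-of-freedom argument (our own illustration with an assumed tanh activation, not the construction used in the paper): for a bias-free one-hidden-layer network, the loss gradient is J^T e, where J is the n x p Jacobian of the network outputs with respect to the parameters and e is the residual vector; a non-zero residual annihilated by J^T can exist only when rank(J) < n, which is guaranteed once n exceeds the parameter count p.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2                      # input dimension, hidden width (toy sizes)
a = rng.standard_normal(m)       # output weights a_k
W = rng.standard_normal((m, d))  # input weights w_k

def jacobian(X):
    """Row i: gradient of f(x_i) = sum_k a_k * tanh(w_k . x_i) w.r.t. (a, W)."""
    act = np.tanh(X @ W.T)                                   # (n, m)
    d_a = act                                                # df/da_k
    d_w = (a * (1.0 - act**2))[:, :, None] * X[:, None, :]   # df/dw_k, shape (n, m, d)
    return np.concatenate([d_a, d_w.reshape(len(X), -1)], axis=1)

p = m * (d + 1)                  # total number of parameters
for n in (4, p, p + 5):
    J = jacobian(rng.standard_normal((n, d)))
    r = np.linalg.matrix_rank(J)
    print(f"n={n:2d}  rank(J)={r:2d}  exists e != 0 with J^T e = 0: {r < n}")
```

Once n > p, the left kernel of J is non-trivial, so one can choose labels whose residual vector lies in it; whether the resulting point is additionally a lifted critical point of a narrower network is exactly what the lower bounds on the sample size quantify.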
I would like to thank the authors for their efforts in addressing my questions. However, given the response, I am still not fully convinced that the current theoretical analysis truly provides critical insights towards the understanding of neural network landscapes, as I discussed in my review. At this point, I am not able to suggest acceptance of the current work.
Thank you for your feedback. Your current rating suggests that our paper is not technically sound. We would really appreciate it if you could explicitly point out any technical flaws, weak evaluation, or inadequate reproducibility. Regarding the significance of our results, we are also happy to address any remaining concerns. Your response would be incredibly valuable for strengthening our work.
The authors investigate critical points of the loss landscape of multi-layer perceptrons without biases. Specifically, they study so-called lifted critical points that are obtained by embedding a network in a wider network. Some embeddings, like neuron duplication, are well known in the field. Neuron duplication lifts critical points to sample-independent critical points, in the sense that the parameters of the wide network are a critical point for any dataset for which the parameters of the narrow network are a critical point. The main contributions in this paper are: 1) a clear definition of sample-dependent and sample-independent lifted critical points, 2) an example of a three-layer network with zero weights in the first two layers that leads to a sample-independent lifted critical point, which demonstrates that the well-known embeddings are not the only sources of sample-independent lifted critical points, 3) proofs that non-zero-loss sample-dependent lifted critical points with zero output weights exist (given enough data), have zero measure in parameter space, and are saddles (in the case of one hidden layer). The theoretical findings are further supported by a well-designed illustrative experiment.
Strengths and Weaknesses
Quality
The proofs look sound and well written; it would be good to have them double-checked by a mathematician, however. In particular, the appendix is of higher quality than other works I know in this field.
Clarity
For a specialist, the paper is clear, but it is pretty dense and formal, and may be hard to digest for a broad audience. See specific comments below.
Significance
The thorough investigation of sample-dependent and sample-independent critical points in bias-free multi-layer perceptrons is interesting to me, but the findings are quite specialized. For theoreticians, the example of a sample-independent lifted critical point with zero weights in the first two layers, and the existence of sample-dependent critical points with zero output weights, may not be surprising, but a nice contribution to our knowledge about critical points in neural network loss landscapes. The discussion of open theoretical problems is also valuable. For machine learning practitioners I do not see (yet) practical implications of these results.
Originality
The proof techniques look pretty standard, but the results are novel. The distinction between sample-dependent and sample-independent lifting of critical points is original and useful.
Questions
- In the example starting on line 142, the condition \sigma(0) = 0 is missing (this is fine in the appendix).
- The construction of sample-independent critical lifting is smart, but took me a moment to digest. I think it would be helpful to explain the reasoning and intuition behind the construction.
- Font size in the figures is too small; I could not read the numbers when printed on A4.
- Figure 2: The three panels in Figure 2 are a bit confusing; in particular because the tick labels are so small. Would it be possible to show the contour plot of the loss landscape in one wide panel, with a2 in [-0.1, 0.1] and w2 in [-1.2, 1.2], and maybe another, zoomed-in panel with w2 in the range [-0.2, 0.2], if the critical points at w2 = 0 and w2 = 0.1236 cannot be distinguished in the wider panel? Also, I think it would be nice to mark sample-dependent and sample-independent points differently (I guess (0, 0) is sample-independent).
- Figure 3: Could you extend the w range to [-1.2, 1.2]? I am curious to see, if the zero curve of phi "bends back", or if there is another reason to explain the critical point at w2=1.0258 in Figure 2. Also, could you mark the critical points that we see in Figure 2 (at t = 0)?
- line 325: 1 \leq i < j \leq n (should be n instead of m)
- line 329: d, m \in \mathbb N
- line 330: non-zero (instead of non=zero)
- I would move Definition A.1 and Remark A.1 to the beginning of the appendix, because Lemma A.1.1 already relies on analytic functions.
- line 450: refer the reader to Lemma A.1.3.
- formula below line 466: what do the vertical lines above and below \nabla_\theta H indicate?
Limitations
Final Justification
I appreciate the good discussion with the authors. I feel my assessment continues to be justified and I do not change my overall rating.
Formatting Issues
no
We thank the reviewer for providing a detailed review and giving valuable suggestions on improving our paper. We will correct the typos in the revision of this paper. Below we briefly discuss how our work relates to broader questions in machine learning:
Understanding the global convergence and training dynamics of neural networks remains a fundamental challenge. A key obstacle is the prevalence of non-global critical points and manifolds, which hinder efficient training and convergence to global minima. Although recent work has identified high-dimensional critical manifolds—embedded from narrower networks' critical points—the geometry of these sets and their dependence on training data are still poorly characterized. Without this understanding, it is difficult to analyze the distribution of local minima, saddles, and strict saddles, or to estimate escape probabilities, convergence rates, and acceleration strategies near them.
Our work takes a significant step toward characterizing critical sets geometrically in two ways. First, building on existing work, we demonstrate the prevalence of low-complexity critical points lifted from narrower networks, which exhibit favorable generalization properties. The tendency of training dynamics to stagnate at these low-complexity critical points—combined with early stopping—may help networks generalize well regardless of sample noise. Second, we show that saddles exist among sample-dependent lifted critical points, thus establishing a foundation for further studying escape dynamics from these saddles. We particularly emphasize that for one hidden layer networks, all sample-dependent lifted critical points are saddles, thus narrowing the potential presence of local minima to the sample-independent subset.
Thank you for your reply. I agree that this paper reports interesting progress on our understanding of critical points in neural network loss landscapes, and I would like to reemphasize that I think it is original and solid work with a particularly high-quality appendix. However, I still think that it speaks to a pretty specialized audience in its current form. I will therefore leave my scores unchanged.
We want to thank you again for your valuable efforts in reviewing the paper!
The paper studies the notions of sample-dependent and sample-independent critical lifting operators, following the previous work of Zhang et al. on critical embeddings. This operator maps a critical parameter of a narrower neural network to a wider network without changing the functional form of the network's output, while preserving criticality. If this map from theta_1 to theta_2 holds for the loss of the neural net on any arbitrary dataset, then theta_1 is a sample-independent critical set; otherwise it is sample-dependent. The authors show an example of sample-independent critical points that are not covered by the affine embeddings of the work of Zhang, examples of sample-dependent critical points in general, that for one-hidden-layer networks under some assumptions all sample-dependent critical points have to be saddles, and finally some examples of saddle points for multi-layer networks. They further plot some toy loss functions to illustrate the concepts they define pictorially.
Strengths and Weaknesses
The theoretical results are not novel to me in that: the construction of the sample-independent critical lifting operator in line 145 is trivial and not interesting. It only zeros out two layers and claims that the previous affine mappings introduced in the work of Zhang do not cover them. Furthermore, the authors don't characterize all such sample-independent critical lifting operators beyond this trivial example.
The study of the sample-dependent critical lifting operator is also not comprehensive; again, they just give an example showing that, for a large enough sample size, there exists a non-empty set of sample-dependent lifted critical points. They further show that such points must be saddle points in a specific case.
Moreover, the writing is very vague, without defining the concepts used, and some arguments are contradictory or illogical. Please see below for a few examples:
Other issues: After line 145: why is the index of w and w' up to k_3? Isn't k_3 the dimension of a_1?
Remark 4.5 assumes for simplicity that the activation is even or odd. For which result exactly does this assumption hold?
The argument of Remark 4.5 does not make sense at all, in particular the phrase 'On the other hand, if w_k' \in \{w_k, -w_k\}_{k=1}^m then \theta_{wide} is a sample-independent lifted critical point. Therefore, up to permutation of the entries, a sample-independent lifted critical point from \theta_{narr} takes the form (2).'
The null-embedding, splitting embedding, and compatible embedding from previous work are not even defined here.
Lines 153 and 154 are very vague: "we cannot avoid the sample-independent lifted critical points which are not produced by these embedding operators".
The notion of ‘saddle’ is not defined.
Questions
Can the authors clarify the importance of these illustrated cases of sample-independent and sample-dependent critical points for the purpose of understanding neural network optimization, generalization, or representation power?
What is the importance of the authors' conjecture on the non-existence of other sample-dependent critical points? More generally, why should the property of a critical point being sample-dependent or sample-independent be studied?
Limitations
In addition to the fact that none of the analysis is comprehensive and the paper appears to be a sparse set of results and examples, the overall goal of this study and why the community should care about these operators in general are not discussed.
Formatting Issues
The formatting seems consistent.
We thank the reviewer for providing a detailed review and giving valuable suggestions on improving our paper. Below we first comment on the motivation and importance of the paper, and then address the reviewer's questions.
Understanding the global convergence and training dynamics of neural networks remains a fundamental challenge. A key obstacle is the prevalence of non-global critical points and manifolds, which hinder efficient training and convergence to global minima. Although recent work has identified high-dimensional critical manifolds—embedded from narrower networks' critical points—the geometry of these sets and their dependence on training data are still poorly characterized. Without this understanding, it is difficult to analyze the distribution of local minima, saddles, and strict saddles, or to estimate escape probabilities, convergence rates, and acceleration strategies near them.
Our work takes a significant step toward characterizing critical sets geometrically in two ways. First, building on existing work, we demonstrate the prevalence of low-complexity critical points lifted from narrower networks, which exhibit favorable generalization properties. The tendency of training dynamics to stagnate at these low-complexity critical points—combined with early stopping—may help networks generalize well regardless of sample noise. Second, we show that saddles exist among sample-dependent lifted critical points, thus establishing a foundation for further studying escape dynamics from these saddles. We particularly emphasize that for one hidden layer networks, all sample-dependent lifted critical points are saddles, thus narrowing the potential presence of local minima to the sample-independent subset.
The index k_3 is used both for enumerating the entries of w and those of w'. The dimension of a_1 differs between the narrower and the wider network.
This assumption is used for the result ''By linear independence of neurons … then \theta_{wide} is a sample-independent lifted critical point'' in Remark 4.5.
We apologize for the typo. Here it should be "Therefore, up to permutation of the entries, a lifted critical point from \theta_{narr} takes the form (2)".
In this paper we mention three critical embedding operators, namely the null embedding operator, the splitting embedding operator, and the general compatible embedding operator. Intuitively speaking,
(a) The null embedding operator adds neurons of zero input weight. In [1] the authors define the null embedding operator for neural networks with bias terms. For unbiased neural networks, we additionally need to set the output weight to zero.
(b) The splitting embedding operator copies one neuron and "splits" the output weight among these copies; for example, a parameter (1/6a, w, 1/3a, w, 1/2a, w) is obtained from (a, w) via the splitting embedding operator.
(c) A general compatible embedding operator generalizes the previous two operators by taking into account, e.g., their compositions and permutations of indices across layers. We will add more formal definitions of these critical embedding operators in the paper.
[1] Y. Zhang, Y. Li, Z. Zhang, et al., “Embedding Principle: a hierarchical structure of loss landscape of deep neural networks”, arXiv:2111.15527, 2021.
Please refer to the intuitive definitions of the critical embedding operators given above. Hopefully this makes the statement clearer.
A saddle of a real-valued, differentiable function is a critical point which is neither a local minimum nor a local maximum.
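A standard textbook example, added here only for concreteness:

```latex
% The origin is a critical point of f but neither a local minimum nor a local
% maximum: f increases along the x-axis and decreases along the y-axis.
\[
  f(x, y) = x^2 - y^2, \qquad \nabla f(0, 0) = (0, 0), \qquad
  f(t, 0) = t^2 > 0 > -t^2 = f(0, t) \quad \text{for } t \neq 0.
\]
```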
Under mild assumptions on the activation, we are able to discover all sample-independent lifted critical points for one-hidden-layer neural networks. For example, when \sigma is odd, we have linear independence of neurons and their derivatives: {\sigma(w_k \cdot x), \sigma'(w_k \cdot x)x_1, ..., \sigma'(w_k \cdot x)x_d : 1 <= k <= m} are linearly independent for any m and non-zero w_k's such that w_i \notin {\pm w_j} for distinct i, j (here x_t denotes the t-th entry of the input x). This result implies that \theta' = (a_k', w_k') is a sample-independent lifted critical point from \theta = (a_k, w_k) if and only if:
(a) w_{k'}' \in {\pm w_k}_{k=1}^m for any k' such that a_{k'}' \neq 0;
(b) w_{k'}' \in {\pm w_k}_{k=1}^m \cup {0} for any k' such that a_{k'}' = 0;
(c) Given k, the output weights of all k' with w_{k'}' \in {\pm w_k} must combine, with signs matching the sign flips, to a_k, i.e. \sum_{k': w_{k'}' = w_k} a_{k'}' - \sum_{k': w_{k'}' = -w_k} a_{k'}' = a_k.
In short, the null embedding, the splitting embedding, and flipping the signs of the w_k's (together with the corresponding a_k's) produce all sample-independent lifted critical points. However, the case for deeper networks is unclear to us because the linear independence result is much more complicated. Very few works, e.g. [2], discuss this problem, and even with such a result it is hard to explicitly characterize these critical points.
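For intuition, the sign-flip part of this characterization comes from the oddness of the activation (again a sketch in our own notation, not a quote from the paper):

```latex
% For odd sigma, flipping the sign of an input weight can be absorbed into the
% output weight without changing the network function on any input:
\[
  a\,\sigma(w^\top x) \;=\; (-a)\,\sigma\bigl((-w)^\top x\bigr)
  \qquad \text{for all } x \in \mathbb{R}^d,
\]
% so combining such flips with null and splitting embeddings yields wider
% parameters that realize the same function on every dataset.
```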
For sample-dependent lifted critical points, we mentioned in Remark 4.5 that they take the form (2). However, the case for deeper networks is much more complicated. We hope to find a way to characterize them in the future.
[2] L. Zhang, “Linear Independence of Generalized Neurons and Related Functions”, arXiv:2410.03693, 2024.
This paper develops a new theoretical tool to better understand the relationship between critical points across neural networks with different architectures and their dependence on training data. The authors introduce a new sample-independent critical lifting operator that maps parameters from a narrower to a wider network, preserving both output and criticality regardless of data samples. They show that previous embedding operators do not capture all such critical points. Additionally, they identify output-preserving critical sets that, for large sample sizes, generally contain sample-dependent critical points, which are typically saddle points, especially in multi-layer networks.
Strengths and Weaknesses
Strengths:
-
This paper introduces a new sample-independent critical lifting operator with a theoretical guarantee that enables a more general analysis of the loss landscape structure. This operator maps parameters from a narrower neural network to a set of parameters in a wider network, while preserving both the output function and the criticality of the point, independently of the training data.
-
This paper also provides examples showing that there exist sample-independent critical points that cannot be generated by previously studied embedding operators (such as Embedding Principle), thereby expanding the understanding of the loss landscape in wider neural networks.
-
This paper also identifies a class of output-preserving critical sets containing sample-dependent critical points, primarily saddle points.
Weaknesses:
-
However, the construction of the critical lifting operator is restricted to feedforward multilayer perceptron (MLP) networks, and seems hard to extend to more complex architectures and highly non-trivial tasks.
-
The paper focuses on theory and simple network examples, providing limited experiments on large, complex networks or real-world data.
-
It is hard to follow the proofs without additional guidance.
Questions
-
Is it necessary to discover all sample-independent lifted critical points?
-
Concerning the sample dependence of critical points, it is not clear to me why your work complements the previous studies. Could the authors explain this point more clearly?
Limitations
Yes
Formatting Issues
No
We thank the reviewer for providing a detailed review and giving valuable suggestions on improving our paper. Below we first comment on the motivation and importance of the paper, and then address the reviewer's questions.
Understanding the global convergence and training dynamics of neural networks remains a fundamental challenge. A key obstacle is the prevalence of non-global critical points and manifolds, which hinder efficient training and convergence to global minima. Although recent work has identified high-dimensional critical manifolds—embedded from narrower networks' critical points—the geometry of these sets and their dependence on training data are still poorly characterized. Without this understanding, it is difficult to analyze the distribution of local minima, saddles, and strict saddles, or to estimate escape probabilities, convergence rates, and acceleration strategies near them.
Our work takes a significant step toward characterizing critical sets geometrically in two ways. First, building on existing work, we demonstrate the prevalence of low-complexity critical points lifted from narrower networks, which exhibit favorable generalization properties. The tendency of training dynamics to stagnate at these low-complexity critical points—combined with early stopping—may help networks generalize well regardless of sample noise. Second, we show that saddles exist among sample-dependent lifted critical points, thus establishing a foundation for further studying escape dynamics from these saddles. We particularly emphasize that for one hidden layer networks, all sample-dependent lifted critical points are saddles, thus narrowing the potential presence of local minima to the sample-independent subset.
Yes, from the motivation of our paper it is. Currently, under a mild assumption on the activation, we are able to discover all sample-independent lifted critical points for one-hidden-layer neural networks, as they are all produced by critical embedding operators. The case for deeper networks is still unclear to us.
Several works, such as [1, 2], discover that embedding operators can produce sample-independent lifted critical points. We include this as Proposition 4.1.1 in our paper. However, they do not address 1. the sample (in)dependence property of critical embeddings, and 2. the fact that these operators cannot produce all sample-independent lifted critical points for deep neural networks. We address both in our paper. We also discover sample-dependent critical points which were not identified in previous works.
[1] Y. Zhang, Z. Zhang, T. Luo, et al., “Embedding Principle of Loss Landscape of Deep Neural Networks”, NeurIPS, 2021.
[2] B. Simsek, F. Ged, A. Jacot, et al., “Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances”, ICML, 2021.
The paper analyzes critical points in bias-free multi-layer perceptrons, focusing on lifting them from narrower to wider networks. It defines sample-dependent and sample-independent cases, presents a three-layer example beyond known embeddings, and proves that certain sample-dependent points with zero output weights exist for large datasets. The findings are supported by illustrative experiments.
This work presents a sound technical contribution to the theory of neural network loss landscapes, supported by careful analysis and a well-prepared appendix. While two reviewers considered it borderline acceptable, others questioned its significance beyond illustrative examples. The rebuttal clarified key definitions and motivation but did not fully address concerns regarding broader impact. Overall, the work advances theoretical understanding, yet would benefit from broader contextualization and clearer exposition to engage a wider audience. The authors are encouraged to incorporate the valuable feedback provided by the reviewers.