Hardness of Learning Neural Networks under the Manifold Hypothesis
Abstract
Reviews and Discussion
It is widely believed that the low-dimensional structure in natural data is the key to the success of modern machine learning methods. Characterizing this structure and designing algorithms that leverage it, and the complementary problem of understanding when learning under this structure is hard, are thus important problems. In this work the authors provide a hardness result for learning data lying on manifolds using a shallow neural network in the statistical query model. This result is based on reductions from learning Boolean functions, by constructing space-filling manifolds that cover exponentially many quadrants of the Boolean cube.
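For intuition, the quadrant-covering idea behind such constructions can be illustrated with a toy numpy sketch (an illustrative stand-in, not the paper's actual manifold): a one-dimensional piecewise-linear curve whose vertices are the quadrant centers of [-1, 1]^n, listed in reflected Gray-code order, visits all 2^n quadrants even though its intrinsic dimension is one.

```python
import numpy as np

def gray_code(n):
    """Yield all 2^n binary n-tuples in reflected Gray-code order
    (consecutive tuples differ in exactly one coordinate)."""
    for i in range(2 ** n):
        g = i ^ (i >> 1)
        yield tuple((g >> j) & 1 for j in range(n))

def quadrant_path(n):
    """Vertices of a piecewise-linear curve through the centers of all 2^n
    quadrants of [-1, 1]^n; consecutive centers differ in a single coordinate,
    so each straight segment only crosses one coordinate hyperplane."""
    return np.array([[1.0 if b else -1.0 for b in bits] for bits in gray_code(n)])

if __name__ == "__main__":
    n = 4
    path = quadrant_path(n)
    visited = {tuple(np.sign(p)) for p in path}
    print(f"curve visits {len(visited)} of the {2 ** n} quadrants")
    # Sanity check: consecutive vertices differ in exactly one coordinate.
    assert np.all((path[1:] != path[:-1]).sum(axis=1) == 1)
```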
Strengths
The construction of manifolds that are hard to learn is novel, and the paper is clearly written.
Weaknesses
In terms of novelty and impact, I am concerned that it is not particularly surprising that one can construct space-filling manifolds that are hard to learn, while at the same time other highly structured classes of manifolds can be learned easily.
In terms of the empirical results, it's unclear to what extent they corroborate the theoretical results. There is no attempt for instance to check empirically whether the scaling of the sample complexities in the hardness result matches the experiment. The novelty of the experiments estimating the intrinsic dimension of standard image datasets is also unclear.
One weakness of the hardness results is that they only apply to shallow networks. Note that for data lying on curves, there are results proving that networks of sufficient depth and width can generalize when trained with gradient descent [1]. These results essentially show that when depth is large compared to a scale set by the curvature of the class manifolds and the distance between them, they can be distinguished by a neural network trained in the NTK regime (with polynomial training time and sample complexity). It would be interesting to better understand how these results can be reconciled, and I suspect either shallow depth or the additional noise inherent in SQ access make the setting considered in the submission harder.
[1] Wang, Tingran, et al. "Deep networks provably classify data on curves." Advances in neural information processing systems 34 (2021): 28940-28953.
Questions
I'm curious about whether SQ access is crucial for obtaining such hardness results, or whether the authors believe they can be strengthened to the PAC setting.
Could the authors comment on whether their results should still hold for deeper networks? I am particularly curious given the results of [1] mentioned above.
Limitations
Limitations have been addressed.
We thank the reviewer for their insightful comments. We answer their questions and comments below.
There is no attempt for instance to check empirically whether the scaling of the sample complexities in the hardness result matches the experiment.
If we understand the reviewer's point correctly, they are referring to the runtime complexity of the interpolation-based argument in section 3.1. As mentioned in the text, we do not believe the scaling of this algorithm is tight. We only wanted to show that it can be polynomial. Hence, it is unlikely that the experiments which use a neural network to learn the function will match the scaling of this bound.
The novelty of the experiments estimating the intrinsic dimension of standard image datasets is also unclear.
We wanted to use this to stress that our formal results may need to be expanded to better apply to realistic datasets. Our goal was primarily to show that image datasets are heterogeneous in nature and, for example, do not have just a single intrinsic dimension.
One weakness of the hardness results is that they only apply to shallow networks.
Indeed, our formal results are stated for bounded-depth networks. We should note, however, that the lower bounds in Section 3.2 apply to single-hidden-layer networks, so they trivially apply to any greater depth as well.
Note that for data lying on curves, there are results proving that networks of sufficient depth and width can generalize when trained with gradient descent [1]. These results essentially show that when depth is large compared to a scale set by the curvature of the class manifolds and the distance between them, they can be distinguished by a neural network trained in the NTK regime (with polynomial training time and sample complexity). It would be interesting to better understand how these results can be reconciled, and I suspect either shallow depth or the additional noise inherent in SQ access make the setting considered in the submission harder.
Thank you for pointing out this work; we will cite it. The question studied in the paper you reference, though, is rather different. They look at classification problems where the classes are different manifolds. We look at learning a neural network where the input distribution is over a manifold. An analogous classification task in our case would be given by the output of a neural network and may no longer even be a regular manifold.
We should remark that there are other papers that study fitting a manifold or classifying data that is a manifold itself. We cite them in the related works section of our paper.
I'm curious about whether SQ access is crucial for obtaining such hardness results, or whether the authors believe they can be strengthened to the PAC setting.
SQ access is not crucial in general. In fact, our results also hold under cryptographic hardness assumptions. We should remark that proving hardness of efficient PAC learning is generally very difficult. We stress the word efficient because, in general, PAC learnability is framed as a question of sample complexity only and not runtime complexity.
In fact, without any assumptions or restrictions on the learning model, a proof that one cannot learn a network in polynomial time would imply that P ≠ NP. So we have to make restrictions. Ours is the SQ setting, which lets us quantify hardness in the number of queries (or via cryptographic hardness reductions).
Could the authors comment on whether their results should still hold for deeper networks?
As mentioned earlier, our lower bounds in Section 3.2 apply trivially to any network that has more than a single hidden layer. The upper bounds in Section 3.1 likely will not apply if the depth of the network grows with the input dimension n. Here, other techniques likely need to be used, as you mention.
I thank the authors for their comments. I would like to keep my score.
The paper establishes bounds on the learning hardness for a class of low-dimensional manifolds in the statistical query (SQ) model. It extends existing hardness results for neural network training in Boolean and Gaussian input models to more general geometries. These theoretical results are validated through synthetic experiments. Additionally, the proposed method offers a framework for studying the geometry of data manifolds.
Strengths
- The paper is clearly written and easy to read. The authors effectively state their motivations and results before delving into the mathematical details, enhancing the paper's readability.
- The paper makes theoretical contributions by extending the proofs of hardness in the Statistical Query (SQ) model for Boolean and Gaussian inputs to the geometric setting. This broadens the understanding of the learnability of neural networks beyond traditional data distributions.
- The theoretical results are validated through experiments, and the paper also provides new inspiration for studying the geometry of data manifolds.
Weaknesses
- The synthetic experiment itself does not seem convincing enough. The authors only use the hypersphere as the input data. It remains unclear whether the results hold for other geometries besides the hypersphere.
Questions
The results appear very similar to those in papers discussing the sample complexity of learning a manifold via a ReLU network. This paper, however, reformulates the problem from a Statistical Query (SQ) perspective. How do the authors differentiate their results from those existing studies of sample (or network) complexity?
Limitations
See weaknesses. Overall, this is an interesting paper.
We thank the reviewer for their positive comments and feedback. We address their questions below.
The synthetic experiment itself does not seem convincing enough. The authors only use the hypersphere as the input data. It remains unclear whether the results hold for other geometries besides the hypersphere.
We believe the reviewer is referring to Figure 3a, but in Figure 3b we actually include another synthetic experiment, which uses the space-filling curve as the input distribution rather than a hypersphere. For positively bounded manifolds, we chose the hypersphere because it is a canonical instance of a bounded positive curvature manifold. Other bounded positive curvature manifolds are, in some sense, contained within a hypersphere, as can be formally shown via Bishop-Gromov inequalities. The additional experiments in Appendix C using unit hyperspheres are devoted to sanity-checking an intrinsic dimension estimation routine before it is applied to MNIST data. The focus of that experiment is more on describing real-world manifolds, which likely lie in the middle, heterogeneous regime with varying intrinsic dimensions.
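For concreteness, here is a minimal sketch of the kind of hypersphere sanity check described above, using the Levina-Bickel maximum-likelihood intrinsic-dimension estimator on points sampled uniformly from unit hyperspheres (an illustrative stand-in, not necessarily the exact routine used in Appendix C):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sample_sphere(n_points, ambient_dim, seed=0):
    """Sample points uniformly from the unit sphere S^{ambient_dim-1} in R^{ambient_dim}."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, ambient_dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mle_intrinsic_dim(points, k=20):
    """Levina-Bickel MLE estimate of intrinsic dimension using k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(points)
    dists, _ = nn.kneighbors(points)   # column 0 is the point itself (distance 0)
    dists = dists[:, 1:]               # drop the self-distance
    # Per-point inverse estimate: mean of log(T_k / T_j) over the closer neighbors j < k.
    inv_dims = np.log(dists[:, -1:] / dists[:, :-1]).mean(axis=1)
    # Aggregate by averaging the inverse estimates, then inverting.
    return 1.0 / inv_dims.mean()

if __name__ == "__main__":
    # A d-sphere embedded in R^{d+1} should give an estimate close to d.
    for d in (2, 5, 10):
        pts = sample_sphere(5000, d + 1)
        print(f"S^{d}: estimated intrinsic dimension ~ {mle_intrinsic_dim(pts):.2f}")
```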
The results appear very similar to those in papers discussing the sample complexity of learning a manifold via a ReLU network. This paper, however, reformulates the problem from a Statistical Query (SQ) perspective. How do the authors differentiate their results from those existing studies of sample (or network) complexity?
Sample complexity is not unrelated to runtime complexity and SQ complexity, but it is nonetheless a separate question. In fact, the bounded depth and width neural networks we study have polynomial sample complexity since their VC dimension is always polynomial in the input dimension (see, for example, Bartlett et al., Almost Linear VC-Dimension Bounds for Piecewise Polynomial Networks). This does not imply that one can learn these functions efficiently, i.e., in polynomial time. This question of efficient learnability is the focus of our study.
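For reference, the standard uniform-convergence bound behind this statement (a textbook bound, not a result specific to the paper) is that for a hypothesis class $\mathcal{H}$,

$$m = O\!\left(\frac{d_{\mathrm{VC}}(\mathcal{H}) + \log(1/\delta)}{\epsilon^{2}}\right)$$

samples suffice to learn to accuracy $\epsilon$ with probability $1-\delta$; with $d_{\mathrm{VC}}(\mathcal{H}) = \mathrm{poly}(n)$ for bounded depth and width networks, this gives polynomial sample complexity while saying nothing about the runtime needed to find a good hypothesis.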
I have read the authors' reply, and my concerns have been addressed. I still believe this is a strong paper and should be accepted.
The present work considers the hardness of learning under the manifold hypothesis, i.e., the identification of low-dimensional structure by reconstructing low-dimensional manifolds from data and how the latter impacts the computational complexity of learning algorithms. The authors investigate the data geometry in order to derive minimal assumptions that render the learning problem, e.g., for the class of feedforward neural networks, efficiently solvable. Hardness results for a class of low-dimensional manifolds are provided in the framework of statistical query models or under cryptographic hardness assumptions. The latter shows that assuming the manifold hypothesis alone is insufficient for guaranteeing learnability, and learning is proved hard for input manifolds of bounded curvature. However, additional assumptions, e.g., on the volume of the manifold, can alleviate these fundamental limitations and are proven to ensure learnability. The authors conclude by providing empirical evidence that illustrates the learnability results through computational experiments on neural network training in the learnable and provably hard regimes.
Strengths
To the best of my knowledge, this work is novel and can be described as a valuable contribution to the field of manifold learning, opening up new research perspectives. It provides interesting insights and results of a theoretical and empirical nature on the problem of the learnability of neural networks from a data geometry perspective by drawing connections to statistical query and cryptographic settings. The results' assumptions are based on intrinsic manifold properties, which is another plus. Interesting examples for the learnable and provably hard cases increase the value of this work. The submission is clearly written, well organized, and complete, which made reviewing an enjoyable task.
Weaknesses
The following concerns can all be considered moderate, and the first two are also discussed transparently.
(1) The sampleable approach restricts the class of included manifolds.
(2) The presented approach requires prior knowledge about the distributions over the sequence of efficiently sampleable manifolds. This is a disadvantage in real-world data problems and is prone to biases, e.g., when the true unknown manifold is inferred from the data.
(3) Only one class, namely feedforward neural networks, is considered.
Minor
- l 115: to to
Questions
- The learnability results are provided with respect to the broad class of feedforward neural networks. Can you provide arguments that support or preclude extending your approach to other neural network classes, such as residual or recurrent neural networks?
- l. 80: You state that your formal hardness results apply to single-layer networks. Does the hardness result automatically apply to multi-layer networks, or is the single-layer regime a necessary restriction?
- Does your analysis capture hyperbolic spaces, i.e., n-dimensional Riemannian manifolds of constant sectional curvature equal to -1, since the latter are shown to be adequate for image learning (see Khrulkov et al.: Hyperbolic Image Embeddings, CVPR, 2020)?
Limitations
Limitations are provided regarding real data manifolds that may fall into the heterogeneous regime and those regarding the sampleable manifold approach that restrict the class of included manifolds. However, I would have liked to see an additional discussion – if any – of the limitations imposed by the chosen class of feedforward neural networks.
We thank the reviewer for their insightful questions and positive feedback. We answer their questions below.
The following concerns can all be considered moderate, and the first two are also discussed transparently. (1) The sampleable approach restricts the class of included manifolds. (2) The presented approach requires prior knowledge about the distributions over the sequence of efficiently sampleable manifolds. This is a disadvantage in real-world data problems and is prone to biases, e.g., when the true unknown manifold is inferred from the data. (3) Only one class, namely feedforward neural networks, is considered.
We tried our best to find the broadest setting that is still closely tied to practice. Regarding point (3), we found this class of networks a good starting point as it is canonical in some sense and encompasses other types of architectures as well (e.g., CNNs can be written as fully connected networks of large enough width). In the revised manuscript, we will expand the discussion section to emphasize these restrictions of our setting.
Typo: l 115: to to
Thank you. This is now corrected.
The learnability results are provided with respect to the broad class of feedforward neural networks. Can you provide arguments that support or preclude extending your approach to other neural network classes, such as residual or recurrent neural networks?
We suspect many of the techniques for lower bounds extend to other network classes. For example, a recent work (ICLR 2024: On the Hardness of Learning Under Symmetries) analyzes such a question for Gaussian inputs in various equivariant and invariant architectures. Replacing the Gaussian input assumption with one of a manifold is an interesting question. We do not immediately see why proof techniques would fall apart, but it is worth exploring.
You state that your formal hardness results apply to single-layer networks. Does the hardness result automatically apply to multi-layer networks or is the single-layer regime a necessary restriction?
If we understand the question correctly, hardness does indeed also apply to multi-layer networks. If an algorithm cannot learn a single-hidden-layer network, then it cannot learn multi-layer ones, which encompass even more functions. E.g., one can effectively ignore the added layers by setting their weights to the identity, so that the output of the first layer is passed unchanged through the later layers.
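To make this concrete, here is a minimal numpy sketch of the padding argument (an illustrative sketch, not the construction in the paper), assuming ReLU activations so that identity-weight layers act as a no-op on the nonnegative post-activation values:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def shallow_net(x, W1, b1, w2, b2):
    """One-hidden-layer ReLU network: x -> w2 . relu(W1 x + b1) + b2."""
    return w2 @ relu(W1 @ x + b1) + b2

def padded_net(x, W1, b1, w2, b2, extra_layers=3):
    """Same network with `extra_layers` identity ReLU layers inserted after the
    first layer. ReLU(I h) = h whenever h >= 0, which holds for post-ReLU
    activations, so the output is unchanged."""
    h = relu(W1 @ x + b1)
    eye = np.eye(h.shape[0])
    for _ in range(extra_layers):
        h = relu(eye @ h)  # identity weights, zero biases: a no-op on nonnegative h
    return w2 @ h + b2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, width = 8, 16
    W1, b1 = rng.standard_normal((width, n)), rng.standard_normal(width)
    w2, b2 = rng.standard_normal(width), rng.standard_normal()
    x = rng.standard_normal(n)
    assert np.allclose(shallow_net(x, W1, b1, w2, b2), padded_net(x, W1, b1, w2, b2))
```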
Does your analysis capture hyperbolic spaces, i.e., n-dimensional Riemannian manifolds of constant sectional curvature equal to -1, since the latter are shown to be adequate for image learning (see Khrulkov et al.: Hyperbolic Image Embeddings, CVPR, 2020)?
We suspect our analysis would be able to capture hyperbolic spaces, though the details would depend on the formal structure of these spaces. Changes to the setup would also need to be made. For example, we restrict manifolds via the reach, which is different than sectional curvature.
Dear authors, thank you for your response and your efforts to address my concerns.
I would like to emphasize once again that your results add valuable insights to the field of manifold learning and I hope that this work will be expanded in the near future.
The authors do not foresee any major obstacles in extending their work to other classes of manifolds and neural networks, which addresses my main concerns. However, to be fully convinced, I need a more detailed elaboration. For this reason, I maintain my score.
The paper relates the sample complexity required for learning a manifold to its curvature – the curvature is characterized by its "reach", defined as the minimum distance from a point on the manifold to some point that has multiple nearest neighbors on the manifold. For a manifold in an n-dimensional hypercube they show that if the reach is \omega(sqrt n) then the manifold can be learned with polynomial sample complexity. On the other hand, if it is o(sqrt n) then exponentially many samples may be required. This is proved by constructing a space-filling curve that has exponential complexity. They obtain lower bounds under SQ and cryptographic hardness assumptions.
Strengths
The paper presents a nice clean connection between the "worst case" curvature and sample complexity of learning a manifold. There is a tight split of sqrt(n) for the curvature around which the problem becomes hard.
It is well written with clearly defined theory that flows smoothly.
Weaknesses
That at high curvature exponentially many samples would be required is not too surprising.
The worst case curvature may not be the main determinant of the hardness of learning a manifold – there may be real world manifolds with very high curvature somewhere in the manifold but still learnable with low sample complexity.
Questions
Do you really need the SQ and the cryptographic hardness assumptions for the exponential lower bound? Can't you just get unconditional hardness bounds? With a space-filling curve that goes to a large fraction of the quadrants, the number of degrees of freedom is exp(n), so the number of such functions is exp(exp(n)). So shouldn't the sample complexity automatically be exp(n)?
The worst case curvature may not be the main determinant of the hardness of learning a manifold – there may be real world manifolds with very high curvature somewhere in the manifold but still learnable with low sample complexity. For example if you take a V-shaped manifold with the very small angle at the vertex then this has a high curvature. Any thoughts on a tighter characterization of the complexity of learning a manifold?
Might there be other, simpler ways of lower bounding the number of functions of high curvature? For example, if you take polynomials of high degree, then the number of coefficients could be exponentially many and the curvature would increase with the degree.
Limitations
Yes they have.
We thank the reviewer for their positive comments and feedback. We address their questions below.
Do you really need the SQ and the cryptographic hardness assumptions for the exponential lower bound? Can't you just get unconditional hardness bounds? With a space-filling curve that goes to a large fraction of the quadrants, the number of degrees of freedom is exp(n), so the number of such functions is exp(exp(n)). So shouldn't the sample complexity automatically be exp(n)?
The number of degrees of freedom is indeed exp(n) in the sense of the number of quadrants touched. However, note that we restrict to finite depth and width neural networks, so it is not true that there are exp(exp(n)) many such functions. There are only poly(n) weights in the neural network, so the number of possible functions is at most exp(poly(n)). There are additional restrictions as well, such as the form of the nonlinearity, bounded weights, etc., which limit the number of unique functions. This is why we can't just use a counting argument as stated in your question.
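To make the gap concrete, under the illustrative assumption of a network with $W = \mathrm{poly}(n)$ weights, each stored to $B$ bits of precision (a back-of-the-envelope discretization, not the paper's exact argument),

$$\#\{\text{representable functions}\} \le 2^{B W} = \exp(\mathrm{poly}(n)) \ll \exp(\exp(n)),$$

so a purely information-theoretic counting argument over this class only yields poly(n) sample complexity, and the exponential lower bounds have to target runtime (number of SQ queries, or cryptographic reductions) instead.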
The worst case curvature may not be the main determinant of the hardness of learning a manifold – there may be real world manifolds with very high curvature somewhere in the manifold but still learnable with low sample complexity.
We agree with the reviewer, but we do want to point out one caveat. Our results look at runtime complexity in the SQ formalism or via cryptographic hardness reductions. Sample complexity is not unrelated to runtime complexity, but it is a different question.
For example if you take a V-shaped manifold with the very small angle at the vertex then this has a high curvature. Any thoughts on a tighter characterization of the complexity of learning a manifold?
We agree with the reviewer. These types of manifolds have localized regions with high curvature. They are probably learnable but fall outside the formal scope of our results. It is an interesting next question to explore how to formalize such manifolds.
Might there be other, simpler ways of lower bounding the number of functions of high curvature? For example, if you take polynomials of high degree, then the number of coefficients could be exponentially many and the curvature would increase with the degree.
If we understand the question correctly, there may be other ways to form classes of manifolds with high curvature, e.g. via polynomial equations or algebraic varieties. We suspect the results and techniques we use extend to such settings, though we do not have any formal results thereof. It is an interesting question nonetheless and one we hope to explore later.
This paper studies the hardness of learning neural networks when their input data lies on a low-dimensional manifold, establishing learnability when the reach of the manifold is large, and constructing an example requiring exponentially many samples when the reach is small. Reviewers agree the paper is well written, and that the results are novel and of interest to the NeurIPS community. I thus recommend acceptance.