A Theoretical Study of Neural Network Expressive Power via Manifold Topology
Abstract
Reviews and Discussion
The paper provides results on:
- an upper bound on the network size required to approximate the indicator function of a union of balls / solids
- a lower bound for the training set size necessary to learn the homeomorphism between a complex data manifold and union of balls / solids, approximating it by a simplicial homeomorphism
- disentangling topological and geometric complexity by reasoning about two separate properties rather than a single quantity.
The main theorem brings these together.
Strengths
I like the ideas in this paper relating the data, the network, and the classification task. Although preliminary, the results can be used to derive further bounds that encompass other aspects of deep learning.
Weaknesses
- Since the setting is not defined until the Main Theorem, its statement remains very hard to grasp. For example, which manifold is this theorem talking about? Some manifold constructed from the network weights/architecture, or the data manifold? This does not become clear until the problem setup. Why not define the ReLU network earlier?
- The following two works (currently uncited) present the connection between the intrinsic dimension and the topology of weight trajectories:
- Andreeva et al. Topological Generalization Bounds for Discrete-Time Stochastic Optimization Algorithms. NeurIPS 2024
- Birdal et al. Intrinsic Dimension, Persistent Homology and Generalization in Neural Networks. NeurIPS 2021
This is also why I believe that the claim in Ln. 72 should be softened, e.g. we can say 'data topology' instead of topology.
- This paper only investigates network size with respect to data topology. However, other crucial elements such as the optimizer, the task, and the network architecture are disregarded. I don't see how the simplistic view of this paper can fully capture the intricate nature of expressive power.
- Betti numbers are quantities that are somewhat hard to work with. Could we use, for example, silhouettes or other descriptors extracted from persistent homology? This would also better relate the paper to the existing literature. Note that PH is also computed through gradual thickening.
- Can we have a better name for 'topological complexity'? This notion is extremely general and does not convey the particular concept used in this paper.
- Unfortunately, we cannot see any empirical validation of the bounds presented. Could we validate at least the approximation errors, maybe on toy setups? Do the estimated quantities correlate well with the actual expressive power?
Questions
Please see the weaknesses.
We sincerely thank the reviewer for their detailed feedback and thoughtful questions. Below, we address the specific points raised in the review.
- Clarity of the problem setup and manifold. We appreciate the reviewer’s feedback regarding the clarity of our problem setup. In response, we have revised the manuscript (e.g., Line 72) to explicitly state that our study focuses on the data manifold. Regarding the placement of the ReLU network definition, we deliberately introduced it within the problem setup section to minimize the use of mathematical notation in the introduction. We believe this approach enhances the paper’s readability by using conceptual language to explain foundational concepts, such as neural network size, before delving into more technical details. This structure aims to make the paper more accessible to a broader audience.
- Investigation of other crucial elements such as the optimizer, task, and network architecture. We acknowledge the reviewer’s concern about the narrow focus on network size in relation to the data manifold, without considering other factors such as the optimizer, task, and network architecture. Our work focuses on providing foundational insights into the interplay between the data manifold (its topology and geometry) and network expressiveness. We agree that a broader framework incorporating factors like optimization dynamics and task-specific requirements would be essential for a more comprehensive understanding of expressiveness. This is an exciting avenue for future research, and we will highlight this limitation and the potential extensions.
- Betti numbers are quantities that are somehow hard to work with. For a detailed response, we kindly refer the reviewer to Point 3 of our summary response.
- Can we have a better name for 'topological complexity'? The term topological complexity (Definition 2 in our paper) is directly adopted from [1], where it is defined as the summation of Betti numbers across all dimensions (a schematic formula is given after this list). In another work [2], a vector of Betti numbers (representing different dimensions) is referred to as homological complexity. As Betti numbers represent the ranks of homology groups, we have maintained consistency with prior terminology. However, we are open to suggestions for refining the terminology further to improve clarity.
- Empirical validation of the bounds. For details on the empirical validation of the bounds, we kindly refer the reviewer to Point 1 of our summary response.
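For concreteness, the definition can be restated schematically as follows (the notation here is illustrative; please see Definition 2 of the paper for the precise statement):

$$\omega(\mathcal{M}) \;=\; \sum_{i \ge 0} \beta_i(\mathcal{M}), \qquad \beta_i(\mathcal{M}) = \operatorname{rank} H_i(\mathcal{M}),$$

i.e., the total topological complexity is the sum of the Betti numbers of the data manifold over all homology dimensions.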
We hope these responses address the reviewer’s concerns and provide clarity on the contributions and limitations of our work. Thank you again for your valuable feedback, which has been instrumental in improving our manuscript. Please feel free to reach out with any additional questions or suggestions.
[1] Gregory Naitzat, Andrey Zhitnikov, and Lek-Heng Lim. Topology of deep neural networks. The Journal of Machine Learning Research, 21(1):7503–7542, 2020.
[2] William H Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. arXiv preprint arXiv:1802.04443, 2018.
I do not remain fully convinced by the author response. First, I do not think that the data manifold and network expressivity can be completely decoupled. So how would this apply to practical networks? Second, the newly provided experiment does not constitute a remotely practical example, whereas most bounds out there have already been evaluated on a variety of large networks. What differentiates this work remains unclear. Connected to this, it also seems that the authors stay away from discussing their work within the recent body of work that uses topology to bound generalization (the suggested papers are not cited). I cannot justify this choice. Finally, the paper can use the term topological complexity, but it should be aware of the contexts in which it is defined and discuss them in the paper.
I would love to hear authors' thoughts before lowering my rating.
We thank the reviewer for their additional feedback and the opportunity to address your concerns further. Upon reflection, we realized that we misunderstood some of your points in the initial review. We would like to provide further clarification:
- Decoupling of Data-manifold and Network Expressivity. It is true that, in practice, the data-manifold and network expressivity cannot be fully decoupled due to the actual optimization process of a neural network. However, in terms of existence, these two can be separated from the optimization, reframing the problem as constructing a network capable of approximating a given function—in our case, the indicator function of the data-manifold. Empirical studies [1,2] have observed that training on datasets with more complex data-manifolds requires larger networks for effective classification. Building on this observation, our work employs two metrics—reach and total Betti number—to quantify the complexity of the data-manifold and study their impact on network size.
- Evaluation on Large Networks. We appreciate the reviewer highlighting the difficulty of evaluating our work on large networks. This limitation stems from the challenge of recovering the latent manifold in real-world datasets, as our bound depends on the reach and total Betti number of the manifold. We acknowledge this as a limitation of our work and propose developing computationally practical metrics as a direction for future research. At the same time, we note that existing works proposing bounds (e.g., sample complexity bounds or neural network size bounds) based on manifold assumptions—such as [3,4,5,6]—are typically not verified even on synthetic experiments due to the unknown nature of the manifold and the significant sample requirements for recovery. Despite these challenges, the manifold assumption—that real data lies on a low-dimensional manifold—remains a critical and compelling area of research, as it provides a potential solution to the curse of dimensionality.
- Discussion on Related Works. We sincerely apologize for not addressing the related works mentioned by the reviewer in our initial response. This omission was due to our focus on addressing what we believed to be the reviewer’s primary concern—the strength of our claims—while intending to include the discussion of related works in the final version. We now realize it would have been more appropriate to provide this discussion earlier and have since revised the related works section of the manuscript (lines 157–161) to include these papers. The suggested works propose persistent homology generalization bounds that incorporate the topology of the training trajectory. They introduce a comprehensive metric encompassing the neural network, optimization process, and network activations. In contrast, our work focuses on approximating the indicator function of the data-manifold using a neural network and studying the relationship between network size and the properties of the input data-manifold. We deeply regret any inconvenience caused by this oversight and appreciate the reviewer’s understanding and thoughtful feedback.
- Topological Complexity. We appreciate the reviewer’s emphasis on this point. We have revised the manuscript (line 260) to clarify and emphasize that the topological complexity we employ is specifically defined for the data-manifold.
Thank you again for your detailed feedback and thoughtful suggestions. We look forward to your further comments.
[1] Gregory Naitzat, Andrey Zhitnikov, and Lek-Heng Lim. Topology of deep neural networks. The Journal of Machine Learning Research, 21(1):7503–7542, 2020.
[2] William H Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. arXiv preprint arXiv:1802.04443, 2018.
[3] Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems, 2019.
[4] Sam Buchanan, Dar Gilboa, and John Wright. Deep networks and the multiple manifold problem. In International Conference on Learning Representations, 2021.
[5] Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. In Advances in Neural Information Processing Systems, 2010.
[6] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 2008.
The paper presents a novel bound on the size of ReLU neural networks, which, as presented in the paper, are fully-connected feed-forward neural networks (NNs) with ReLU activations. Previous work has focused on geometric descriptors in order to offer bounds on the size of NNs, but this work's novel contribution adds topological descriptors in order to calculate upper bounds on the size. In particular, the size is bounded by a function of the square of the sum of the Betti numbers of the manifold, among other variables.
The paper restricts itself to characterization of manifolds that are described as being from the family of thickened 1-manifolds, which are essentially the combinations (i.e. connected sums and disjoint unions) of spheres and tori. Defining an overall network size then is simply the aggregation of the network sizes on these individual ‘pieces’.
The main result is given as a constructive proof and offers a method for creating a ReLU neural network capable of approximating the indicator function of the class labels within a given error $\epsilon$ with probability at least $1-\delta$. It is important to point out that the class of manifolds for which this works is assumed to consist of disjoint sub-manifolds for the classes, i.e., the objective is assumed to be binary classification.
Strengths
The authors offer some originality in their use of topological descriptors to give upper bounds on the size of neural networks, which is an oversight in other methods that seem to ignore the topology of data manifolds altogether.
The paper is well written in that it makes use of various results from previous authors with relevant related work. The relevant citations are present and their inclusion has a clear purpose. The authors also give ample justification for their results in the form of mathematical proofs.
Being able to determine an upper bound on the size of a neural network given a data manifold (for classification problems) is significant, given that the size of a neural network is a major factor in determining its efficacy, trainability and usability. Therefore, in that respect this paper makes a significant contribution in a narrow domain (see Weaknesses).
Weaknesses
Weaknesses in this particular method spring from two main sources: 1) computational complexity of computing Betti numbers and 2) its applicability and usability.
Betti Numbers: a well-known problem with computing Betti numbers is their computational time-complexity, which depends on the size of the input and the dimension up to which we wish to compute. In the paper the authors define the total topological complexity of the manifold as the sum of its Betti numbers, $\beta = \sum_i \beta_i$, where $\beta_i$ is the $i$-th Betti number of the manifold. Having to compute the Betti numbers up to the manifold's dimension is a serious limitation once the point cloud is even a few thousand points large. Therefore, computing the proposed upper bound, which requires first computing these Betti numbers, is already infeasible for most real-world applications.
The paper describes the main process as working for classification problems only. While the number of classification problems is significant, this is a serious limitation for anyone wanting to perform regression or any other objective. Further, given that this process is already computationally infeasible, it seems that practitioners are better off by simply increasing the size of their model, training it and then iterating if results aren’t good enough, instead of taking more time to find out the worst case size that their network would need to be.
Questions
- In the paper you define the size of the network as the sum of the hidden-layer widths, i.e., $\sum_i d_i$ where $d_i$ is the width of the $i$-th hidden layer. Why do you not take into account the biases and the output layer when determining the size of the network?
- From Proposition 3 we can approximate the simplex by only taking samples from $\mathcal{M}$. However, the number of samples depends on the volume of the $d$-dimensional $\epsilon$-balls in the covering. If $d$ is very large, then the volume of the balls is almost zero (depending on the metric used); the standard volume formula is recalled after this list of questions. Therefore, the required number of samples could also be very large, and even larger than the number of samples present. What should be done in this case? Have you tried to use this method with high-dimensional data?
- As was mentioned, it is currently intractable to calculate high-dimensional Betti numbers for even a few thousand sample points. Have you used, or do you recommend using, some sort of approximation or heuristic to calculate the Betti numbers?
- Could you elaborate on any potential theoretical or analytical insight from employing this method, given that it may not be practical in all scenarios?
- Also, is this method easily extensible to other objectives like regression?
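For reference, the volume of a $d$-dimensional ball of radius $\epsilon$ referred to in the second question is

$$V_d(\epsilon) \;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\frac{d}{2}+1\right)}\,\epsilon^{d},$$

which tends to zero as $d$ grows for any fixed $\epsilon$, so covering a fixed-volume region requires on the order of $1/V_d(\epsilon)$ balls.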
We sincerely thank the reviewer for their constructive questions and valuable insights. Below, we address the specific points raised in the review.
- Computational Complexity of Betti Numbers. Please kindly refer to Point 3 in our summary response.
- Applicability to Other Objectives (e.g., Regression). We thank the reviewer for this insightful observation. Our current framework is tailored to classification tasks, focusing on the expressiveness required to approximate indicator functions of manifolds, i.e., targets of the form $f : \mathcal{M} \to \{0, 1\}$. However, the methodology can indeed be extended to regression problems by considering the approximation of continuous functions defined on the manifold, i.e., $f : \mathcal{M} \to \mathbb{R}$. The approximation of such functions has been studied in [1], which provides a foundation for exploring this extension within our framework (disentangling of data geometry and data topology). Furthermore, it is widely recognized that approximating discrete indicator functions can be more challenging due to their discontinuities. Indicator functions require the network to partition the manifold sharply, which often demands a higher level of expressiveness and complexity. In contrast, continuous functions are smoother and can often be approximated with simpler architectures.
- High-Dimensional Data and Sampling Challenges. We appreciate the reviewer’s concerns regarding Proposition 3 and the challenges of high-dimensional sampling. The “curse of dimensionality” is a well-recognized limitation in machine learning theory, highlighting the importance of the manifold assumption, namely that high-dimensional data often lies on a lower-dimensional manifold. This assumption is fundamental to theoretical studies like ours. In Proposition 3, the dimension $d$ refers to the intrinsic dimension of the manifold, which is typically much smaller than the ambient dimension. While the sparse sampling characteristic of real-world data can pose challenges for directly recovering the underlying manifold, we note that the principal curve framework (discussed in Point 2 of our summary response) assumes a simplified data manifold. This approach makes it feasible to estimate manifold properties even with limited samples, such as MNIST, addressing some of the practical limitations.
- Clarification on Network Size. Our definition of network size is adapted from [2], where it is defined as the number of hidden neurons, or equivalently, the sum of the hidden-layer dimensions, rather than the total number of parameters (a small illustrative snippet follows this list). This definition yields a more concise value that is both tractable and well-suited for analytical purposes. The term in question refers to the dimension of the output space, rather than the hidden layer dimensions. For this reason, it is not included in the calculation of the network size under our definition. This distinction ensures consistency with prior work and facilitates clear theoretical analysis.
- Theoretical Insights Beyond Practicality. Our work provides a formalization of the interplay between geometry (e.g., condition number) and topology (e.g., Betti numbers) in determining the expressiveness of neural networks. This formalization underscores the influence of topological complexity in shaping neural network architectures. By focusing on classification tasks, we introduce a tractable theoretical framework that disentangles the impact of manifold topology and geometry on network size. Importantly, this framework extends beyond classification and can be applied to other machine learning tasks, such as regression or feature learning (see Point 3 in our response to Reviewer CbQd).
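To make the size convention concrete, here is a minimal illustrative snippet (the function name and the example architecture are hypothetical, not taken from the paper):

```python
def network_size(layer_widths):
    """Network size under the convention described above: the number of hidden
    neurons, i.e. the sum of the hidden-layer widths. `layer_widths` lists all
    layer widths; the input and output layers are excluded from the count."""
    return sum(layer_widths[1:-1])

# Example: input dimension 784, three hidden layers of width 128, scalar output.
print(network_size([784, 128, 128, 128, 1]))  # -> 384
```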
We hope our responses address the reviewer’s concerns and clarify the scope and contributions of our work. Thank you again for your thoughtful feedback, which has been invaluable in improving our submission. Please let us know if there are any further questions or points requiring clarification.
[1] Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In Advances in Neural Information Processing Systems, 2019.
[2] Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units. In ICLR, 2018.
Thank you for your reply to my previous questions.
I better understand now your choice of notation in Definition 4.
I am still unconvinced, however, regarding the practicality of using the sum of the Betti numbers in your upper bound. I am aware that using k-means to reduce the number of points for the computation has been done, but this method is, of course, not guaranteed to work on an arbitrary data manifold. Further, since your method proposes an upper bound, decreasing the number of terms in the sum of the Betti numbers could decrease the square of the sum significantly in some cases, thus lowering your original upper bound. This raises the question: how tight is this bound, and if you change your definition of the total topological complexity by approximating it with only two terms, how does your theorem change?
I currently remain skeptical regarding the practical contributions of this paper, so I will keep my current score. I still think, however, that in spite of the proposed method's practicality, it has value for its theoretical contribution.
We thank the reviewer for recognizing our theoretical contribution. Regarding your question about the total Betti number, our bound is derived for the thickened 1-manifold family. Within this family, the manifold has non-trivial Betti numbers only in dimensions 0 and 1. Therefore, the total topological complexity is represented precisely by two terms.
In practice, when calculating Betti numbers from a point cloud, as shown in [1] and in the proof of our Theorem 2, constructing an Alpha complex with a sufficiently small radius (relative to the reach) yields a complex with the same homology as the original manifold. However, achieving this requires a large number of samples to correctly recover the manifold. In real-world datasets, the data is often sparse, which poses challenges for manifold recovery. With this in mind, we acknowledge the limitations of the practical contributions in this work. Nevertheless, this paper seeks to lay the foundation for understanding the interplay between data manifold topology and neural network size. As part of future work, we will focus on developing a practically computable metric for manifold topology.
[1] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 2008.
Thank you for your response. Yes, indeed your manifolds are only disjoint unions of 1-manifolds, therefore the sum would be precise in that case. Unfortunately though, I feel that further evidence is needed to make the case for treating arbitrary data manifolds as being from the family of thickened 1-manifolds. Perhaps, showcasing your theorem on various datasets is in order.
Again, I'm not completely convinced about the practicality of the proposed method, but I would recommend this paper for publication based on its theoretical contribution and rigor.
We sincerely thank the reviewer for their positive feedback on our theoretical contribution and rigor. We also acknowledge the limitations in our empirical validation. The primary challenge lies in the fact that, for real-world data, prior knowledge about the latent manifold—including its dimensionality and condition number—is unavailable. Moving forward, we aim to develop a practically computable metric to address this issue.
This paper discusses geometric and topological aspects of the latent manifold in relation to the expressive power of neural networks.
The main results are as follows. Theorem 1 gives an upper bound on the network size required to approximate the indicator function of a certain manifold linked topologically to the data. Theorem 2 gives another bound on this complexity, again in terms of the network size.
The paper is reasonably written and readable enough; examples are presented with figures to help the intuition, and concepts like Betti numbers are communicated in a satisfactory way also to non-specialists.
However, the authors mix differential-geometric techniques, such as the condition number, which is linked to the curvature of a manifold with respect to its embedding in R^D, with purely topological concepts, such as the Betti numbers, which are topological invariants of a manifold; this can leave the reader confused about the purpose of the paper.
Moreover, the results are presented without a clear connection to possible applications in the machine learning domain, not even a small practical example that could serve, as a proof of concept, to elucidate a future application of their main result.
Strengths
- This is a novel topological approach to the expressive power of neural networks.
- The main result (Theorem 2) is an improvement of an existing bound, extending the result of the ICLR 2016 paper.
Weaknesses
- I have some concerns about practical relevance, since I am unable to see the importance of thickened 1-manifolds in machine learning, especially in connection with the latent manifold. Perhaps (see questions below) the authors could present a simple example, as a proof of concept, to elucidate their findings in some simple application. There is no evidence in this paper that the topological structure of thickened 1-manifolds plays a role in a better practical description of the latent manifold.
- Though the authors generalize the work "Depth-width tradeoffs in approximating natural functions with neural networks" published in ICLR 2016, they do not provide any experimental evidence that their results go beyond that work from a practical point of view.
- Technical remark: the condition number is linked with the Riemannian structure (curvature, etc.), while this paper focuses mostly on topology. Once we take a homeomorphism, the condition number of a manifold changes, so I am not sure about the soundness of the arguments regarding the condition number and topology.
Questions
- Can you provide some easy example elucidating your main results? Even a proof of concept is ok.
- In particular (refinement of question 1), can you provide a specific example or case study demonstrating how thickened 1-manifolds relate to real-world machine learning problems? Can you illustrate how the theoretical results apply to a practical ML task? This would help elucidate the importance of this concept in machine learning.
- Regarding weakness (2): can you provide a specific experiment on benchmark datasets for binary classification that demonstrates the practical advantages of your approach compared to the ICLR 2016 work? This would provide the empirical validation that is now missing.
- Can you explain better how topology (Betti numbers) and differential structure (condition number) can be considered together as means of bounding the complexity of a network for a task? (Here, binary classification.)
- In particular (refinement of question 4), how do you reconcile the use of the condition number (a geometric property) with topological invariants such as the Betti numbers? More specifically, explain how the homeomorphism affects the condition number and how this impacts your theoretical results.
Details of Ethics Concerns
There are no ethical concerns.
We sincerely thank the reviewer for their detailed feedback and valuable insights. Below, we address the specific points and questions raised in the review.
- Practical Relevance and the Role of Thickened 1-Manifolds. We appreciate the reviewer’s concern regarding the practical relevance of thickened 1-manifolds. Thickened 1-manifolds were chosen as they provide a simple yet representative family of manifolds for theoretical analysis, and their structure aligns well with the latent representations observed in real-world datasets. As discussed in Point 2 of our summary response, the concept of principal curves [Hastie and Stuetzle, 1989] aligns with our framework, where the thickened 1-manifold represents a plausible approximation of the data’s supporting set. To provide further clarification, we have added a synthetic experiment (see Point 1 of our summary response). This experiment uses a binary classification task on synthetic datasets generated from thickened 1-manifolds, verifying the tightness of our bound. Additionally, we highlight the feasibility of estimating Betti numbers of real-world datasets in Point 3 from our summary response.
- Reconciling Topology (Betti Numbers) and Geometry (Condition Number). Consider a manifold $\mathcal{M}$. A binary classifier can be modeled as a function $f : \mathcal{M} \to \mathbb{R}$, where the manifold’s topological and geometric properties are simplified into a one-dimensional real representation. To achieve successful classification, the classifier must inherently account for all the manifold properties, both topological and geometric, that were collapsed in this process. This is why both topology and geometry are crucial for determining the expressive power of a neural network. For manifold reconstruction, however, the relevant map acts as a homeomorphism: the topology of the manifold is preserved, but geometric characteristics, such as the condition number, are altered. This is why works in manifold recovery [1,2] only present geometric characteristics in their results. In our theoretical framework, we disentangle the contributions of manifold geometry and topology to the classifier $f$ (a schematic summary is given after this list). We achieve this by constructing a neural network $\mathcal{N}_1$ that first learns a homeomorphism mapping the original manifold to a simplified manifold with trivial geometric features (e.g., a uniform curvature). The fact that a homeomorphism alters the condition number of a manifold is a critical insight linking the condition number to the size of $\mathcal{N}_1$. We then construct another network $\mathcal{N}_2$, which works as a classifier. Since the geometric features have been simplified (e.g., reduced to structures like balls and tori), the size of $\mathcal{N}_2$ is primarily determined by the topology of the manifold, such as its Betti numbers. The overall classifier is thus a composition of $\mathcal{N}_1$ and $\mathcal{N}_2$, combining their respective roles in handling the manifold’s topological and geometric complexities.
- Relation to the ICLR 2016 Work. The paper “Depth-width tradeoffs in approximating natural functions with neural networks” focuses on the role of depth in neural networks. Specifically, it demonstrates that a 2-layer neural network requires a size exponential in the dimension $d$ to approximate the indicator function of a $d$-dimensional unit ball, whereas a 3-layer neural network can achieve it with a size on the order of $d^2$. Their work emphasizes the depth-width tradeoff in a specific setting. In contrast, our work investigates the relationship between network size and the topology and geometry of the data manifold. While there is some overlap in focus, our work is not a generalization of the ICLR 2016 paper. Instead, we adapted their theory regarding the approximation of a $d$-dimensional unit ball as a proposition to support our theorem, which examines how the topology of the data manifold (e.g., Betti numbers) influences neural network size. Therefore, our work explores a broader scope that includes both geometric and topological properties, and as such, a direct comparison with the ICLR 2016 work is not applicable. However, we have provided empirical validation of the tightness of our bound, as discussed in Point 1 of the summary, further highlighting the practical relevance of our theoretical contributions.
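As a schematic summary of the decomposition described in the second point above (the notation follows that response and is illustrative rather than the paper’s):

$$f \;=\; \mathcal{N}_2 \circ \mathcal{N}_1, \qquad \mathrm{size}(f) \;\lesssim\; \underbrace{\mathrm{size}(\mathcal{N}_1)}_{\text{driven by geometry (condition number)}} \;+\; \underbrace{\mathrm{size}(\mathcal{N}_2)}_{\text{driven by topology (Betti numbers)}}.$$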
We hope these clarifications address the reviewer’s concerns and help illustrate the scope and practical implications of our work. Thank you again for your valuable feedback, which has significantly helped us strengthen our submission.
[1] Hariharan Narayanan and Partha Niyogi. On the sample complexity of learning smooth cuts on a manifold. In COLT, 2009.
[2] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 2008.
I thank the authors for their clarifications to all the points raised by the referees. I have particularly appreciated the validation on a proof-of-concept example, which they have now added to their paper. I now understand the interplay between the topology (Betti numbers) and the geometry (curvature, linked with the condition number). However, I maintain my overall judgement at 5, since I still think that more validation is necessary.
We appreciate the reviewer’s constructive suggestions and will focus on conducting further validations to explore the interplay between manifolds and various network designs.
The paper investigates a neural network's expressive power in relation to the underlying geometry and topology of the data manifold. The authors provide theoretical bounds on the size of ReLU classifiers in terms of the manifold's Betti numbers and condition number. The condition number measures how much a manifold "curves" in its ambient space.
Strengths
The paper is very well written. The introduction surveys the field and the current state of the art. The authors provide a nice overview of their problem with many references. The authors provide explicit and intuitive background on all the mathematical objects they intend to use. Their description of their family of thickened 1-manifolds is clear. They clearly state their problem and provide a reasonable answer.
Weaknesses
The theory seems to be a bit backwards. It appears the authors found something they could prove about a small class of manifolds, and then tried to write a paper about this class. They claim that the thickened 1-manifolds are representative of data manifolds, but provide neither empirical nor theoretical support for this claim.
The geometric and topological characteristics they use are also somewhat random. The reach (or condition number) is a min over the entire manifold, and as such only captures geometry at a single point. The authors bemoan how difficult it is to incorporate topological information because of its discrete nature, but the data is discrete so they should be able to use TDA to assist them here.
Questions
Why is TDA not used in this paper? Is there empirical evidence that real datasets can be modelled as 1-manifolds? There are no experimental results in the paper.
We sincerely thank the reviewer for their supportive feedback and valuable comments. Below, we address the specific points raised.
- Is there empirical evidence that real datasets can be modeled as 1-manifolds? Please refer to Point 2 in our summary, where we provide reasons to support our belief that the proposed thickened 1-manifold family is representative of data manifolds.
- The geometric and topological characteristics they use are also somewhat random. We appreciate the reviewer’s concern. We chose the reach (or condition number) and the total Betti number as the geometric and topological characteristics based on established literature and studies. The reach is the minimum over the entire manifold and captures the overall flatness of the manifold, rather than focusing on a single point (the standard definition is recalled after this list). This measure has been widely used in fields such as surface reconstruction [1] and manifold recovery [2, 3]. The total Betti number characterizes the topological complexity of the manifold. These have been employed in [4], where an empirical study explores the variation in topological complexity across neural network layers. We believe these metrics provide meaningful insights into the interplay between manifold properties and neural network size.
- Why is TDA not used in this paper? This is an important question. We agree that topological data analysis (TDA) tools, such as persistent homology (PH), capture much richer information than Betti numbers and could potentially offer more precise bounds for network size. However, despite promising empirical evidence linking PH with model performance [5], to the best of our knowledge, no theoretical network size bound based on PH has yet been established. We view this paper as an initial step towards exploring this direction.
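For reference, the standard definition of the reach (due to Federer), which underlies the condition number, is

$$\tau(\mathcal{M}) \;=\; \sup\bigl\{\, r \ge 0 : \text{every point within distance } r \text{ of } \mathcal{M} \text{ has a unique nearest point on } \mathcal{M} \,\bigr\},$$

and the condition number is its inverse, $1/\tau$. A small reach (large condition number) indicates that the manifold curves sharply or nearly self-intersects somewhere, which is why it is a global worst-case quantity rather than a pointwise one.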
We hope these clarifications address your concerns, and we thank you again for your insightful comments, which have helped us improve our work. Please let us know if there are any further points we can address.
[1] Nina Amenta and Marshall Bern. Surface reconstruction by voronoi filtering. In Proceedings of the fourteenth annual symposium on Computational geometry, pp. 39–48, 1998.
[2] Hariharan Narayanan and Partha Niyogi. On the sample complexity of learning smooth cuts on a manifold. In COLT, 2009.
[3] Partha Niyogi, Stephen Smale, and Shmuel Weinberger. Finding the homology of submanifolds with high confidence from random samples. Discrete & Computational Geometry, 2008.
[4] Gregory Naitzat, Andrey Zhitnikov, and Lek-Heng Lim. Topology of deep neural networks. The Journal of Machine Learning Research, 21(1):7503–7542, 2020.
[5] William H Guss and Ruslan Salakhutdinov. On characterizing the capacity of neural networks using algebraic topology. arXiv preprint arXiv:1802.04443, 2018.
Thanks to the authors for their responses. However, I am still fairly unconvinced by point #2 in their summary. Citing an unpublished arXiv paper and the classic principal curves work does not establish much to me. Perhaps they should restrict their domain of applicability to 1-dimensional trajectories or something like this. So I would maintain the same score.
We appreciate the reviewer’s constructive suggestion. We acknowledge that citing an unpublished arXiv paper [1] may affect the overall credibility. However, [1] is referenced solely as empirical evidence demonstrating that principal curves can be computed from real-world data, such as MNIST. This has been clearly supported by their experimental results. We thank the reviewer again for their valuable feedback.
[1] Cui, Elvis Han, and Sisi Shao. “A Metric-based Principal Curve Approach for Learning One-dimensional Manifold.” arXiv preprint arXiv:2405.12390 (2024).
I thank the authors for their clarifications to all the points raised by the referees. I have particularly appreciated the validation on a proof-of-concept example, which they have now added to their paper. Thanks to their clarification, I now understand the interplay between the topology (Betti numbers) and the geometry (curvature linked with the condition number). I have therefore raised soundness to 3; however, I maintain my overall judgement at 5, since I still think that more validation is necessary.
This paper explores the relationship between neural network size and the topology and geometry of data manifolds, providing a theoretical framework that integrates these manifold properties into network size bounds. The authors present an upper bound on neural network size based on Betti numbers and the reach (inverse of the condition number), representing topological and geometric aspects of the data manifold, respectively. This work aims to address the gap in the literature regarding how topological complexity influences network size, which has been previously underexplored compared to geometric factors. The theoretical results, exemplified through the construction of topological representatives, offer a novel perspective on data manifolds and potential implications for future neural network architectures.
Strengths
- The theory presented is well-structured and mathematically elegant, making effective use of both topological and geometric complexity measures, notably Betti numbers and the reach. This duality could be beneficial for refining future analyses in manifold learning and deep learning.
- The construction of a neural network classification framework that factors in manifold topology is innovative, especially in how it leverages topological representatives for improved upper bounds.
Weaknesses
- Relevance to Neural Representations: While the theoretical results are indeed intriguing, there is limited discussion on their practical relevance for the study of neural representations. Without a thorough analysis of how these bounds translate into tighter or more representative metrics, the applicability of these results to real-world neural networks may be difficult to gauge. Further connections to neural representations in active fields like feature or metric learning would enhance this work’s value to the ICLR community.
- Upper Bound Tightness: The paper provides an upper bound on network size influenced by manifold topology and condition number, but it does not clarify the tightness of this bound. The condition number could potentially introduce loose dependencies on manifold dimensions, especially since only an upper bound is given without a complementary lower bound or empirical justification. It would be useful if the authors could discuss the conservatism of their bound and any assumptions that might affect its applicability.
- Lack of Empirical Examples: Although the paper includes theoretical constructs (such as Figure 3), there is a noticeable lack of concrete examples or even toy simulations that showcase the behavior of neural networks with varying topological complexities. Empirical demonstrations—even simple ones—would support the theoretical claims and help readers better understand the practical implications of the upper bound.
- Connection to Broader Research Themes: Given the potential implications of this work for a range of ML themes (e.g., representational learning, metric learning, compositional modeling, reinforcement learning), the authors could benefit from speculating on or directly addressing possible applications of their results. It would be insightful to hear how they envision the integration of these theoretical insights with active areas like uncertainty quantification, large-scale learning, and non-convex optimization.
Questions
- On the Upper Bound Dependency: Could the authors clarify whether the dependence on the manifold dimension d in the Main Theorem is a consequence of the upper bound’s conservativeness? Specifically, might the condition number be introducing an artificial inflation in the bound?
- Empirical Justification: Are there plans to include practical or toy examples of neural network training on manifolds with varying topological complexities, beyond the “construction” cases in the paper? Such examples could help verify the utility of the theoretical bounds.
- Potential Applications: Could the authors share any insights or speculations on how their findings might extend to practical applications? For example, how might this framework influence neural network design choices in contexts such as feature or representational learning?
We sincerely thank the reviewer for their detailed feedback and thoughtful suggestions. Below, we address the specific points and questions raised in the review.
- Empirical justification of the upper bound. Please refer to Point 1 in our summary, where we provide an example of network size with varying topological complexity.
- On the Upper Bound Dependency. We appreciate the reviewer’s observation regarding the potential influence of the condition number on dependencies related to manifold dimensions. The dependency introduced by the condition number arises primarily from the high sample requirements necessary to accurately recover the manifold. However, in real-world data, samples tend to be sparse, and the estimated condition number is typically small. Under the manifold assumption that the intrinsic dimension is low, and given that the condition number estimated from real-world datasets is generally small, the impact of the condition-number term in the bound is less significant than it might appear at first glance. We believe these considerations help contextualize the dependencies in our bounds and their applicability to practical scenarios.
- Relevance to other research topics, such as representational learning and metric learning. Our work introduces a framework for studying the relationship between data manifolds and neural network size. While our theoretical results are focused on classification tasks, where data topology plays a critical role, this framework can be extended to other domains, such as regression and feature learning. For example, in [1], the authors empirically investigate the topology changes of neural activations across layers. Building on this idea, our framework and the thickened 1-manifold assumption could be adapted to analyze the number of neurons required to reduce a specific level of topological complexity in these settings. We believe this adaptability underscores the broader relevance of our work and its potential applications in representational learning, metric learning, and beyond.
We hope these clarifications address your concerns, and we thank you again for your insightful comments, which have helped us improve our work. Please let us know if there are any further points we can address.
[1] Gregory Naitzat, Andrey Zhitnikov, and Lek-Heng Lim. Topology of deep neural networks. The Journal of Machine Learning Research, 21(1):7503–7542, 2020.
I thank the authors for addressing my questions.
I found that the extra experiments, although simple, complement the theory nicely. While it is understandable that it is hard to perform systematic experiments in the short rebuttal period, I encourage the authors to consider a few more settings (as sanity checks and comparisons), such as more depth and the effect of different orientations of the manifolds, in the next version of the paper.
While other reviewers questioned the impact of this paper, I personally think this connection between network size and Betti numbers is of mathematical interest and fundamental. I’ll raise my rating from 5 to 6 to reflect my updated evaluation of this paper.
We sincerely thank the reviewer for recognizing our contributions and acknowledging the mathematical significance and fundamental nature of the connection between network size and Betti number. We also greatly appreciate your feedback on our experiments. Moving forward, we will work on further validations to explore the interplay between manifolds and various network designs.
We thank all the reviewers for their time and insightful comments. Your feedback is invaluable in enhancing the clarity and impact of our work. We have revised the manuscript based on your constructive feedback and included a new experimental section (Section 5) to provide empirical validation of the bound. We would like to address three common concerns from the reviewers as follows.
- Empirical validation of the derived bounds. We added an empirical evaluation in Section 5 to show that the bound is tight regarding the topology, i.e., the network size is quadratic in the total Betti number for specific cases. In Figure 4 of the revised manuscript, we present an analysis using manifolds characterized as solid genus-g tori, with g ranging from 1 to 10. We deploy a 5-layer ReLU network to classify points sampled from the tori against points sampled from the background. The required network size for this task is determined by the product of the network depth and the minimum width at which the training accuracy exceeds 0.95 at convergence. Figure 4(c) presents a regression line charting the relationship between network size and the squared total Betti number. This regression underscores a pronounced linear association between network size and the squared total Betti number, with a high correlation coefficient. A toy sketch of this experimental protocol is included after this list.
- Effectiveness of representing data manifolds as the family of thickened 1-manifolds. We argue that the proposed thickened 1-manifold family is highly applicable to real-world datasets. Data such as images are commonly assumed to reside on lower-dimensional manifolds. However, estimating the manifold's intrinsic dimension is challenging due to the limited sample size relative to the high ambient dimension. Despite this, assuming that the data lie on a thickened 1-manifold is not unreasonable. Recent approaches, such as [1], have successfully learned thickened and deformed 1-manifolds from real-world data (e.g., MNIST), building on the concept of principal curves [2]. Principal curves serve as a foundation where data are assumed to be sampled from a probability distribution centered around these curves.
- Tractability of Betti numbers on real-world datasets. We acknowledge that computing manifold Betti numbers can be computationally expensive. However, within the framework of our proposed thickened 1-manifold family, only the 0- and 1-dimensional Betti numbers are required. Moreover, Betti numbers are typically computed from the k-nearest neighbor (k-NN) graph of the data points, and constructing a k-NN graph is computationally efficient. These factors make the computation feasible (an illustrative snippet is included after this list). For example, calculating the 0- and 1-dimensional Betti numbers for all 10 classes in the MNIST dataset (70,000 points in total) takes approximately 9 seconds. Therefore, we believe it is practical to compute Betti numbers from data, provided this computation is not repeatedly performed during training.
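As an illustration of the Betti-number computation mentioned in Point 3, here is a minimal sketch using the `ripser` package; it is only an illustration, not the exact pipeline behind the reported MNIST timing, and the point cloud, scale, and function name are placeholders:

```python
import numpy as np
from ripser import ripser  # pip install ripser

def betti_01(points, scale):
    """Estimate beta_0 and beta_1 of a point cloud at a fixed scale by counting
    Vietoris-Rips persistence intervals that are alive at that scale."""
    dgms = ripser(points, maxdim=1)["dgms"]
    return tuple(int(np.sum((d[:, 0] <= scale) & (d[:, 1] > scale))) for d in dgms)

# Toy point cloud: noisy samples from two disjoint unit circles (beta_0 = 2, beta_1 = 2).
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, size=400)
circle = np.stack([np.cos(theta), np.sin(theta)], axis=1)
cloud = np.concatenate([circle, circle + [4.0, 0.0]]) + 0.05 * rng.normal(size=(800, 2))
print(betti_01(cloud, scale=0.5))  # expected (2, 2) at a suitable scale
```

Similarly, the genus-g experiment described in Point 1 could be prototyped roughly as follows; the data generator (a chain of circles as a toy stand-in for a thickened genus-g handlebody), the use of scikit-learn's MLPClassifier, and all hyperparameters are simplifying assumptions for illustration only:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def sample_chain_of_circles(g, n, thickness=0.1):
    """Points near a chain of g unit circles (a toy 'thickened 1-manifold' whose
    first Betti number is g), embedded in the plane."""
    theta = rng.uniform(0, 2 * np.pi, size=n)
    centers = 2.0 * rng.integers(0, g, size=n)  # circle i is centred at (2i, 0)
    pts = np.stack([centers + np.cos(theta), np.sin(theta)], axis=1)
    return pts + thickness * rng.normal(size=pts.shape)

def min_size_for_accuracy(g, widths=(2, 4, 8, 16, 32, 64), target=0.95, depth=4):
    """Smallest network size (depth x width, ReLU hidden layers) at which the
    training accuracy on manifold-vs-background classification exceeds `target`."""
    pos = sample_chain_of_circles(g, 2000)
    neg = rng.uniform([-2.0, -2.0], [2.0 * g, 2.0], size=(2000, 2))  # background points
    X = np.concatenate([pos, neg])
    y = np.concatenate([np.ones(2000), np.zeros(2000)])
    for w in widths:
        clf = MLPClassifier(hidden_layer_sizes=(w,) * depth, activation="relu",
                            max_iter=1000, random_state=0)
        clf.fit(X, y)
        if clf.score(X, y) >= target:
            return depth * w
    return None

for g in range(1, 6):
    print(g, min_size_for_accuracy(g))  # network size vs. total Betti number (about g + 1)
```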
[1] Cui, Elvis Han, and Sisi Shao. "A Metric-based Principal Curve Approach for Learning One-dimensional Manifold." arXiv preprint arXiv:2405.12390 (2024).
[2] T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84(406):502–516, 1989.
We sincerely thank all the reviewers for their thoughtful feedback and valuable insights. We have provided clarifications and addressed your concerns. Could you kindly let us know if our responses adequately address your concerns, or if there are any additional points that require further clarification?
The current paper explored how the network size scales up with respect to the topological complexity and curvature of the data manifold via a constructive proof. All reviewers acknowledged that the theoretical contribution is novel and interesting, and that the authors took a promising direction on incorporating the topological complexity of data. The reviewers are still concerned on the validity of modeling real data as thickened 1-manifolds, and on the practical applicability of computing the proposed bounds. Due to these limitations, the paper needs more iterations before getting published. Moreover, the authors are encouraged to strengthen the empirical verification as the reviewers have suggested, and properly position this work in the literature in future revisions.
Additional Comments on Reviewer Discussion
All reviewers have indicated their concerns after the rebuttal, especially in the author-reviewer discussions.
Some reviewers suggested experimental verification, and in response the authors provided a toy experiment in Section 5. It provides some intuition, but the reviewers felt more validation is necessary.
Most reviewers are not convinced about using the thickened 1-manifold to model real data, and how to compute the Betti numbers in practice when the data is high dimensional and sparse. The authors have acknowledged them as limitations of their method. Addressing them needs some major rewriting.
Related work should be discussed more carefully, as there is existing work on applying topological properties to neural network expressiveness, as the reviewers have suggested.
Reject