Minimum Width for Deep, Narrow MLP: A Diffeomorphism Approach
Abstract
We present a geometric framework that reduces the problem of finding the minimum width for universal approximation by deep, narrow MLPs to a dimension-based function w(d_x, d_y), yielding tight upper and lower bounds under the uniform norm.
Reviews and Discussion
This work studies the optimal network width, in a minimax sense, for deep but narrow MLPs under the uniform norm, with a primary focus on the regime where the output dimension is around (or at least) twice the input dimension. Toward understanding the optimal width, the authors draw interesting connections to a geometric function that describes the minimum dimension needed for restoring a continuous function from a `lifting' embedding. They provide (as the main result) upper and lower bounds on the optimal width, for different classes of activations, using this quantity w(d_x, d_y). The authors then provide a few bounds on this proposed quantity and translate them into bounds on the MLP optimal width using the main result.
Strengths and Weaknesses
Strengths:
- The proposed bottleneck-like quantity is interesting and insightful, and it closely connects to the minimum UAP width of interest.
- The paper is nicely written and well explained.
- The results are a nice addition to this line of research on deep narrow MLPs.
- The main results are stated in terms of w(d_x, d_y) and are general, while later results and remarks in Sections 4.3 and 4.4 draw nice connections to existing bounds on the minimum width under various settings.
Weaknesses:
- Eq. (8) uses the fact that the projection and the inclusion (or essentially the identity) can be represented by deep narrow MLPs. Is this true for general activations other than ReLU or Leaky-ReLU? Is Condition 1 necessary and/or sufficient? This may deserve an elaboration in the text, if I am not missing something trivial. This question also relates to Line 183 in the proof of Theorem 4.4.
- (minor) The authors should mention the assumption that is implicitly made starting from Eq. (8).
Typos:
- Line 67 Related Works
- Eq.(31) should be instead of Leaky-ReLU
- is not explicitly defined in Definition 4.13
- instead of = in Line 300
Questions
- Please see weaknesses.
- Can the authors point to the specific place where the misconception mentioned in Remark 4.17 arises, and how it is used in the proof?
Limitations
yes
Final Justification
My questions are addressed in the response. Overall, I believe this is a nice addition to this line of research and is worth accepting, but I am less familiar with the novelty and contribution of this work relative to the existing literature.
Formatting Issues
no
Identity function) Yes, Condition 1 provides a sufficient condition for this, as established in Lemma 4.1 of [1]. We will add this clarification to the paper.
Condition in Eq. (8)) Thank you for pointing this out. We will include this condition in the revised version.
Typos) Thank you for identifying the errors. We will correct them accordingly.
Misconception) In the proof of Lemma 2.4 of [2], the stated equation does not imply that the corresponding map is a diffeomorphism.
References
[1] Kidger, Patrick, and Terry Lyons. "Universal approximation with deep narrow networks." Conference on learning theory. PMLR, 2020.
[2] Li, Li’Ang, et al. "Minimum width of leaky-ReLU neural networks for uniform universal approximation." International Conference on Machine Learning. PMLR, 2023.
I appreciate the response from the author(s) and this interesting theoretical work. I have no further questions.
This paper presents a new framework for developing novel upper and lower bounds on the minimum width required for deep MLPs to universally approximate continuous functions under the uniform norm. The paper develops this framework by first proving that a deep narrow MLP can approximate any diffeomorphism and then using that result to show how any continuous function can be approximated by composing linear transformations and diffeomorphisms. Using this framework, they provide a novel upper bound of max(2d_x + 1, d_y) on the minimum width. Their results hold for a variety of activation functions, including Leaky-ReLU and ReLU.
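As a schematic of the decomposition described above (this is our reading of the framework, not the paper's exact statement; the width w and the maps ι, φ, P below are assumed names used only for illustration):

```latex
% Illustrative pipeline: lift the input, deform by a diffeomorphism, project down.
\[
  f \;\approx\; P \circ \varphi \circ \iota, \qquad
  \iota : \mathbb{R}^{d_x} \hookrightarrow \mathbb{R}^{w} \ \text{(affine inclusion)}, \quad
  \varphi \in \mathrm{Diff}(\mathbb{R}^{w}), \quad
  P : \mathbb{R}^{w} \to \mathbb{R}^{d_y} \ \text{(affine projection)}.
\]
```

Under this reading, a width-w network suffices once it can approximate each of the three factors, which is where the diffeomorphism-approximation result enters.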
Strengths and Weaknesses
Strengths
- To the best of my knowledge, the diffeomorphism approach is a novel framework, and it may pave the way to applying these tools from geometric topology to other deep learning theory problems.
- The paper contributes to a clear gap in the literature on universal approximation with deep narrow MLPs. In particular, using their framework they demonstrate that a width of max(2d_x + 1, d_y) always suffices for UAP of deep (narrow) MLPs. This improves upon earlier works which placed the upper bound between max(d_x + 1, d_y) and d_x+d_y.
- While the paper makes use of very technical math, the contributions are clear and the high-level proof strategy is easy to follow.
Weaknesses:
- While I understand that this is a theoretical contribution, the paper has a very high technical barrier, making the proofs inaccessible to most members of the NeurIPS community. For instance, the term "diffeomorphism" is never defined in the paper.
- The optimal results are for very specific choices of d_x and d_y. It is unclear whether this framework can be useful for obtaining optimal minimum widths for arbitrary d_x and d_y.
Questions
- How is Definition 3.1 different from saying that A is dense in B?
Limitations
Yes.
Final Justification
I ultimately recommend a 5 for this work. I believe the work provides a strong theoretical contribution and opens the door to a new set of mathematical tools for understanding neural networks. The authors alleviated all of my questions and concerns in the rebuttal.
Formatting Issues
None.
Regarding Technical Barrier) Thank you for pointing this out. We will add definitions of several technical terms, including diffeomorphism, embedding, and certain topological and algebraic concepts used in Section 4.5.
Usefulness of our results) It is true that, at the current stage, the optimal width is achieved only in specific cases. However, our framework—which reduces the problem of identifying the optimal width to analyzing the quantity w(d_x, d_y)—is fundamental. Many existing lower bounds (at least for networks with Leaky-ReLU activation) can be naturally interpreted and easily rederived within our framework.
For instance, the lower bound presented in Kim et al. (2024) can be reproduced by constructing a non-injective continuous function with d_x-dimensional inputs and d_y-dimensional outputs. We believe that future work on width-based universal approximation should build upon this foundational framework.
Definition 3.1) Regarding Definition 3.1, note that B is not required to contain A, so the definition is slightly different from the standard notion of density.
Thank you for the response and clarifications. My concerns have been addressed, I believe this paper should be accepted and have updated my score.
The paper investigates the minimum width required for deep, narrow Multi-Layer Perceptrons (MLPs) to achieve the universal approximation property (UAP) under the uniform norm. The authors propose a framework that reduces the problem of determining the minimum width to a purely geometric function w(d_x,d_y), dependent solely on the input and output dimensions d_x and d_y.
The work bridges gaps in existing literature by providing tighter bounds and exact values for the minimum width, particularly in cases where the output dimension is roughly twice the input dimension.
Strengths and Weaknesses
Strengths:
1. The paper is technically rigorous, with well-constructed proofs and a clear methodology. The use of geometric and topological tools to derive bounds is innovative and solidifies the theoretical foundations of the work.
2. The authors provide intuitive explanations for their results, such as the role of diffeomorphisms in approximation, which aids understanding.
3. The introduction of the geometric function w(d_x, d_y) and the framework for analyzing minimum width through diffeomorphisms are novel contributions.
Weaknesses:
1. The paper focuses on the uniform norm, which may limit its applicability to other types of norms commonly used in machine learning.
2. The assumptions made about the activation functions (e.g., C^1 near a point with non-zero derivative) may not hold for all types of activation functions used in practice.
3. The paper does not provide extensive empirical validation of the theoretical results, which could be beneficial for demonstrating the practical implications of the findings.
Questions
See weaknesses.
Limitations
yes
Final Justification
I keep my original rating due to the unsatisfactory response and the possibility of duplicate submission.
Formatting Issues
n.a.
Question 1) It is true that our paper focuses on the uniform norm. However, the uniform norm lies at the core of the robustness of neural networks and provides a meaningful setting in which to study the universal approximation property. Most existing works on the universal approximation property (UAP) primarily study the uniform norm or the L^p norm.
Question 2) It is not accurate to claim that the assumptions made about the activation functions may not hold. All commonly used activation functions in practice satisfy our assumptions. In fact, the assumption is so mild that it is actually difficult to construct an activation function that violates it. We would appreciate it if you could provide an example of an activation function that does not satisfy the assumption.
Question 3) Our study is aimed at providing a theoretical validation of the minimum width required for universal approximation, and it does not attempt to provide empirical validation.
The authors solved most of my concerns. However, 1) It is exaggerated to say "All commonly used activation functions in practice satisfy our assumptions." Although I cannot provide an example of an activation function not satisfying the assumption, that does not mean one does not exist. This is also an exaggerated statement. 2) I know this paper aims to provide a theoretical validation of the minimum width w.r.t. universal approximation; however, the theory is under the MLP neural network scenario. Since neural networks (NNs), deep neural networks (DNNs), and Transformers—like most machine learning models—typically require empirical validation to demonstrate the practical efficacy of their underlying theories, experimental verification remains essential. Therefore, these responses alone are insufficient to improve my score.
Regarding the first question) We acknowledge that the statement may appear somewhat strong, as proving the existence of an activation function that violates the condition could indeed be seen as analogous to demonstrating the existence of a black swan—an inherently challenging task.
That said, we respectfully believe that raising the possibility of such an activation function, without providing a concrete example, offers limited practical value in this context. We would also like to note that the assumption in question has been widely accepted in the literature, particularly following the work of Kidger and Lyons [1]. For further details, please refer to Lemma 4.1 in their paper.
Regarding the second question) As the reviewer suggested, we have conducted experiments to empirically validate our theoretical findings. Specifically, we constructed a dataset by sampling input vectors in two dimensions and generated corresponding outputs as defined in Equation (89). We then trained two narrow neural networks with hidden widths of two and three, respectively.
We evaluated both networks on the cycle input described in Equation (92). The results show that the width-three network sometimes produces an output curve that winds around the origin twice, as expected, while the width-two network consistently fails to do so.
We should note, however, that empirically proving that a function cannot be approximated by a neural network is inherently difficult, as it is challenging to distinguish whether the failure stems from the network's approximation capacity or from limitations of the training procedure.
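For concreteness, a minimal sketch of such an experiment (the exact Eqs. (89) and (92) are not reproduced here, so a hypothetical angle-doubling target stands in; the widths, depth, and training details are illustrative only):

```python
# Minimal sketch (illustrative, not the paper's exact setup): train narrow
# Leaky-ReLU MLPs of hidden width 2 and 3 on 2-D inputs whose target curve
# winds around the origin twice, then measure the winding number of the output.
import numpy as np
import torch
import torch.nn as nn

def make_data(n=4096, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, size=n)
    x = np.stack([np.cos(theta), np.sin(theta)], axis=1)          # inputs on the unit circle
    y = np.stack([np.cos(2 * theta), np.sin(2 * theta)], axis=1)  # target winds twice around the origin
    return torch.tensor(x, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

def narrow_mlp(width, depth=8):
    layers, d_in = [], 2
    for _ in range(depth):
        layers += [nn.Linear(d_in, width), nn.LeakyReLU(0.1)]
        d_in = width
    return nn.Sequential(*layers, nn.Linear(d_in, 2))

def winding_number(pts):
    # Total signed angle swept by the output curve, in units of 2*pi.
    ang = np.unwrap(np.arctan2(pts[:, 1], pts[:, 0]))
    return (ang[-1] - ang[0]) / (2 * np.pi)

x, y = make_data()
t = torch.linspace(0, 2 * np.pi, 1000)
cycle = torch.stack([torch.cos(t), torch.sin(t)], dim=1)  # ordered cycle of inputs
for width in (2, 3):
    torch.manual_seed(0)
    net = narrow_mlp(width)
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(2000):
        opt.zero_grad()
        nn.functional.mse_loss(net(x), y).backward()
        opt.step()
    with torch.no_grad():
        out = net(cycle).numpy()
    print(f"hidden width {width}: winding number ~ {winding_number(out):.2f}")
```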
It seems that the same paper may have been submitted to Elsevier. For your reference, the link is provided below: [https://arxiv.org/pdf/2308.15873]
Could you kindly verify whether this constitutes a duplicate submission and clarify the status?
We confirm that this submission is not a duplicate and is not under consideration elsewhere.
Please prove that the paper [https://arxiv.org/pdf/2308.15873] is not under another submission. Also, why is the paper missed? Please explain.
Unfortunately, there is no formal mechanism available to verify the submission status of a paper on arXiv.
Could the reviewer clarify what is meant by "the paper is missed"?
The meaning of this paper being "missed" is that I could download and view it from the link [https://arxiv.org/pdf/2308.15873] a few days ago, but now the link is empty—indicating the author recently removed it from arXiv. Notably, the author only responded after the deletion, having ignored my earlier inquiries, which suggests undisclosed motives. I have no way of knowing if the paper is currently under submission elsewhere.
The confusion may stem from the fact that your link appears as https://arxiv.org/pdf/2308.15873] (with an extra closing bracket) rather than the correct https://arxiv.org/pdf/2308.15873.
We respectfully request that the evaluation be based solely on the submitted manuscript, in accordance with the double-blind review policy.
Again, this paper is not under submission to any other venue.
In order to adhere to the double-blind review policy, we are unable to provide further comments on this matter.
If you would like to adhere to the double-blind review policy, why did you submit it to arXiv?
First, we would like to clarify that, in accordance with the double-blind review policy, we are not in a position to confirm or deny any authorship of the arXiv paper in question.
That said, even if we were the authors of the preprint referenced, we note that the NeurIPS 2025 Call for Papers explicitly states:
“The existence of non-anonymous preprints (on arXiv or other online repositories, personal websites, social media) will not result in rejection.”
In my understanding, the statement "The existence of non-anonymous preprints (on arXiv or other online repositories, personal websites, social media) will not result in rejection" simply means that NeurIPS will not reject a paper just because it appears as a non-anonymous preprint online.
This should NOT be interpreted to mean that:
- The arXiv paper itself is double-blind compliant, or
- The submission automatically satisfies double-blind review requirements.
While the policy prevents automatic rejection due to preprint visibility, it does NOT address whether reviewers might be influenced by prior knowledge of the non-anonymous work. The double-blind review process still requires authors to properly anonymize their NeurIPS submission, regardless of any pre-existing non-anonymous versions.
Please do not use the double-blind review process as grounds to influence reviewers' assessment of the paper.
We respectfully note that the NeurIPS 2025 Reviewer Guidelines explicitly state:
“Please do not attempt to find out the identities of the authors for any of your assigned submissions (e.g., by searching on arXiv). This would constitute an active violation of the double-blind reviewing policy.”
(https://neurips.cc/Conferences/2025/ReviewerGuidelines)
We understand that the existence of a non-anonymous preprint does not automatically make a submission double-blind compliant, and we have taken care to fully anonymize our submitted manuscript in accordance with the NeurIPS policy. However, any assessment of our work should be based solely on the anonymized submission, without attempts to associate it with external, non-anonymous materials.
I just reviewed the discussion about the arXiv paper. To avoid any potential information leaks, I decided not to open it. My understanding is that the reviewer found a paper with the same title under a certain submission template. This alone does not prove whether or not you have submitted the paper to another venue or journal—many people use templates simply because they like the style.
However, if the paper is the same but the authors are different, then we are dealing with plagiarism, and the paper should be rejected at a later stage.
My main question to the authors is the following:
Has any part, or the entirety, of this work been submitted to another venue or journal? This includes submissions in the pre-review and revision phases.
If the authors can open the link provided by the reviewer, do they believe they should cite it?
If it is an unpublished version of their own paper, they should not cite it.
If it is a timestamped paper by someone else that includes their results, they must provide proof of concurrent work—ideally published in a journal, which is better suited for such cases.
If it is work produced afterward, then it is unethical to claim novelty.
I have tried to adhere to traditional principles here.
Would the authors be subject to desk rejection if it is clear they have violated the rules? Please be aware that the reviewer may be criticized by the AC if this turns out to be their own mistake.
Let us keep everyone accountable to their corresponding responsibilities.
Thank you for the careful and thoughtful comment.
We would like to address the reviewer’s concerns as follows:
1. Has any part, or the entirety, of this work been submitted to another venue or journal? This includes submissions in the pre-review and revision phases.
At present, apart from the NeurIPS submission, it is not under submission in any venue, nor is it in any pre-review or revision phase. Specifically, it was previously submitted to another journal, where it was rejected, and has since been revised for the NeurIPS submission.
2. If the authors can open the link provided by the reviewer, do they believe they should cite it?
We believe that the linked paper should not be cited in our submission.
3. Would the authors be subject to desk rejection if it is clear they have violated the rules?
To the best of our knowledge, our submission fully complies with the NeurIPS submission and review policies. We are not aware of any violations that would warrant desk rejection.
We appreciate your understanding that the double-blind review policy limits our ability to convey certain details with complete accuracy. We would like to provide the Senior Area Chair with more precise information.
Emergency Review
Summary
The paper establishes the minimum hidden-layer width required for a deep, narrow MLP with a given activation function to be a universal approximator for continuous functions under the uniform norm. The main result bounds this width in terms of a topological term w(d_x, d_y) plus an overhead that depends on the smoothness of the activation function.
The authors prove that this width is both sufficient and necessary. The proof is constructive for Leaky-ReLU networks, involving approximations of diffeomorphisms via affine coupling flows and known geometric theorems. Extensions to ReLU and other activations are made using results by Hanin (2019) and Kidger–Lyons (COLT 2020). Lower bounds are proven using tools from differential topology, and a detailed topological construction is provided using winding numbers and the Hurewicz theorem.
Strengths and Weaknesses
Pros
- Closes a known theoretical gap by identifying the minimum width required for universal approximation with deep, narrow MLPs.
- Covers a broad class of activation functions, including ReLU and other activations.
- The upper bound is constructive and leverages elegant geometric and topological tools.
- The lower bounds use rich machinery (immersion theorems, algebraic topology), giving strong evidence of necessity.
- Clear modular structure in the proof pipeline (diffeomorphisms → composition → universal approximation).
Cons
- The key assumptions are tailored to cases where the output dimension is roughly twice the input dimension (e.g., d_y = 2d_x); this limits generality.
- The exposition could benefit from more explicit motivation for the specific techniques used (e.g., affine coupling flows) and a better connection to the broader literature on UAP.
- The topological lower bounds, especially in Section 4.5, are difficult to follow and would benefit from an appendix with extended intuition and proof sketches.
- The role of the activation-dependent overhead as a correction term is technically derived but lacks a strong intuitive justification in the main text.
- The manuscript does not deeply discuss recent concurrent or slightly older related works on minimal-width UAP.
Overall Assessment
This is a strong theoretical paper that makes meaningful progress on a long-standing question regarding the width needed for universal approximation by deep narrow networks. The constructive nature of the proof and use of geometry/topology is elegant and technically non-trivial. However, the exposition could improve, particularly in motivating the approach, situating it within the broader UAP literature, and expanding on difficult sections.
Overall Recommendation
Score: (Weak Accept)
Confidence: 4 (Expert in this area)
Rationale: The result is solid and interesting, though the work is incremental relative to other recent contributions. A stronger contextualization and more reader-friendly explanations would elevate the impact.
Questions
Questions for the Authors
- Could you clarify why affine coupling flows and single-coordinate diffeomorphisms are particularly suited for universal approximation proofs? What’s the intuition behind choosing these building blocks (as also used in Teshima et al., NeurIPS'20)?
- In what technical sense do your constructions depart from Teshima et al.? Do you rely on fundamentally different structural decompositions or just a different universal target?
- The overhead term appears due to Hanin (2019) and Kidger–Lyons (2020). Are these corrections (for ReLU and for other activations) provably unavoidable, or could improved constructions close this gap?
- Could the results extend to architectures that include skip connections, convolutional layers, or residual blocks? Do you expect the minimum width bound to improve or worsen?
- Could you provide intuition for what the function w(d_x, d_y) captures beyond a topological obstruction? Does it have a geometric interpretation in network function space?
- Are there technical obstacles to extending the lower bound arguments to the odd-dimensional case? Could methods from obstruction theory or embeddings help?
Suggestions for Related Work to Cite
The following works seem highly relevant but are not cited or discussed:
- ReLU Network with Width Can Achieve Optimal Approximation Rate (ICML 2024)
- New Advances in Universal Approximation with Neural Networks of Minimal Width (arXiv 2024)
- Minimal Width for Universal Property of Deep RNN (JMLR 2023)
- Approximating Continuous Functions by ReLU Nets of Minimal Width (Yarotsky, 2017)
- Constructive Universal Approximation and Finite Sample Memorization by Narrow Deep ReLU Networks (arXiv 2025)
Including a discussion of how your results compare with these would clarify your contribution.
Typos and Clarity
- Line 383, Eq. 44 (Definition A.3): Does the construction ensure invertibility?
- Line 232: "than that both" → "than both"
- Line 242: "To provided" → "To provide"
Limitations
yes
Formatting Issues
None
Question 1 and 2) While Teshima et al. use affine coupling flows and single-coordinate transformations as fundamental building blocks, our approach relies solely on MLPs, which highlights the additional difficulty of our setting.
Also, the choice of affine coupling flows and single-coordinate diffeomorphisms is motivated by technical considerations, allowing us to build upon the existing work of Teshima et al. This approach reduces the problem of approximating an entire diffeomorphism to approximating specific functions in which only one coordinate is modified. This structure facilitates approximation by deep narrow neural networks.
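For illustration, here is a minimal sketch of such a single-coordinate affine coupling layer (RealNVP-style; the conditioner networks s_net and t_net are hypothetical and not the paper's construction):

```python
# A single-coordinate affine coupling layer: only coordinate `idx` is changed,
# and the change is affine in that coordinate, so the map is invertible by construction.
import torch
import torch.nn as nn

class SingleCoordinateACF(nn.Module):
    def __init__(self, dim, idx, hidden=32):
        super().__init__()
        self.idx = idx
        self.s_net = nn.Sequential(nn.Linear(dim - 1, hidden), nn.LeakyReLU(0.1), nn.Linear(hidden, 1))
        self.t_net = nn.Sequential(nn.Linear(dim - 1, hidden), nn.LeakyReLU(0.1), nn.Linear(hidden, 1))

    def _rest(self, z):
        # All coordinates except the one being modified.
        mask = torch.ones(z.shape[-1], dtype=torch.bool)
        mask[self.idx] = False
        return z[:, mask]

    def forward(self, x):
        rest = self._rest(x)
        s = self.s_net(rest).squeeze(-1)
        t = self.t_net(rest).squeeze(-1)
        y = x.clone()
        y[:, self.idx] = x[:, self.idx] * torch.exp(s) + t
        return y

    def inverse(self, y):
        rest = self._rest(y)  # unchanged coordinates, so s and t are recomputed exactly
        s = self.s_net(rest).squeeze(-1)
        t = self.t_net(rest).squeeze(-1)
        x = y.clone()
        x[:, self.idx] = (y[:, self.idx] - t) * torch.exp(-s)
        return x

# Round-trip check: the layer acts as a diffeomorphism of R^3 touching one coordinate.
layer = SingleCoordinateACF(dim=3, idx=1)
z = torch.randn(5, 3)
with torch.no_grad():
    print(torch.allclose(layer.inverse(layer(z)), z, atol=1e-5))
```

Each such layer is invertible regardless of how the conditioner networks are chosen, which is what makes these single-coordinate maps convenient building blocks for composing diffeomorphisms.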
Question 3 (overhead term)) We believe that the overhead term can be reduced. There is no particular reason to believe that the Leaky-ReLU activation function is inherently more powerful than other commonly used activation functions. In fact, for ReLU networks, a smaller width has been shown to be sufficient, whereas our method currently requires an additional overhead term.
Question 4 (Other network architectures)) We believe that skip connections can improve the minimum width requirement. Consider a skip-connection structure of the form x + f(x), where f is realized by a deep narrow MLP. Suppose we aim to approximate an arbitrary continuous function g, and let h be a function close to g. Then a map of the form x + εh(x) can be a diffeomorphism for sufficiently small ε, and such a map can be approximated by the structure x + f(x). Therefore, the summation can be approximated by deep narrow MLPs with skip connections. When composed with suitable affine transformations at the final layer, this construction enables the approximation of arbitrary continuous functions. This suggests that skip connections help universal approximation while potentially reducing the required network width.
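A standard argument (not taken from the paper) for why such a small perturbation of the identity is a diffeomorphism, assuming h is C^1 with derivative bounded by L and ε < 1/L:

```latex
% F(x) = x + \varepsilon h(x) is injective and has invertible differential:
\[
  \|F(x) - F(y)\| \;\ge\; \|x - y\| - \varepsilon\|h(x) - h(y)\|
  \;\ge\; (1 - \varepsilon L)\,\|x - y\| \;>\; 0 \qquad (x \neq y),
\]
\[
  DF(x) \;=\; I + \varepsilon\, Dh(x) \ \text{ is invertible, since } \ \varepsilon\,\|Dh(x)\| \le \varepsilon L < 1,
\]
```

so F is injective with everywhere-invertible differential and is therefore a diffeomorphism onto its image by the inverse function theorem.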
For ResNet blocks, we believe skip connections lead to a similar reduction effect. For instance, under the L^p norm, the optimal minimum width for approximating a function with d-dimensional input and output is d+1 for MLPs [3], whereas for ResNet blocks, it can be as low as one [2].
For convolutional neural networks, defining the notion of minimum width is more subtle. In the CNN literature (e.g., [1]), the number of channels is often used as a proxy for width. In that context, an affine transformation is constructed from the input shape to the output shape, and the problem effectively reduces to computing the minimum width for the corresponding input and output dimensions, as the representational power of deep narrow CNNs is almost the same as that of deep narrow MLPs.
Assuming the optimal minimum width for MLPs follows the asymptotic form suggested by our results, the corresponding width for CNNs would follow the same form, implying a similar minimum width for CNNs.
We expect analogous improvements in representational power for RNNs using similar reasoning.
Question 5) The quantity w(d_x, d_y) represents the minimum intermediate dimension required to preserve the information of an arbitrary continuous function with d_x-dimensional input and d_y-dimensional output.
For example, consider the identity function on R^2 and suppose it is decomposed as g ∘ f, where f maps R^2 into R and g maps R back into R^2.
Since the identity function is injective, f must also be injective.
This leads to a contradiction, because the intermediate dimension (in this case, 1) is too small to preserve the complexity of the original function.
More specifically, the difficulty of disentangling the manifold structure of a self-intersecting function is captured by w(d_x, d_y). The more self-intersections the target function has, the harder it becomes to disentangle it in the intermediate dimension.
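For completeness, a short standard argument (not from the paper) for why the injective factor f in the example above cannot exist: restrict f to the unit circle S^1 with a parametrization γ and compare antipodal points.

```latex
% If f : \mathbb{R}^2 \to \mathbb{R} is continuous, set
\[
  g(\theta) \;=\; f(\gamma(\theta)) - f(\gamma(\theta+\pi)), \qquad g(\theta+\pi) \;=\; -\,g(\theta),
\]
% so by the intermediate value theorem g vanishes at some \theta^\ast, i.e.
% f(\gamma(\theta^\ast)) = f(\gamma(\theta^\ast+\pi)) at two distinct points,
% contradicting injectivity of f.
```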
Question 5 & 6) Yes, the issue is that intersections are not robust under perturbations. We believe that the corresponding bound on the optimal minimum width also holds in the odd-dimensional case, because there exists a 2-to-1 self-intersection structure of cycles. However, we are currently unaware of any tools that can preserve such intersection structures under small perturbations in the uniform norm. We are also uncertain whether obstruction theory provides a suitable framework for handling this issue. If the reviewer is aware of any relevant concepts or techniques, we would greatly appreciate any suggestions.
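For context, the standard fact underlying the winding-number obstruction (general topology, not specific to the paper): if a closed curve γ stays at distance at least δ from the origin, then any curve within uniform distance less than δ of γ has the same winding number, because the straight-line homotopy between them never passes through the origin.

```latex
\[
  H(\theta, t) \;=\; (1-t)\,\gamma(\theta) + t\,\tilde{\gamma}(\theta) \;\neq\; 0
  \quad \text{whenever } \ \sup_\theta \|\tilde{\gamma}(\theta) - \gamma(\theta)\| \;<\; \inf_\theta \|\gamma(\theta)\|,
  \qquad \text{so } \ \mathrm{wind}(\tilde{\gamma}, 0) = \mathrm{wind}(\gamma, 0).
\]
```

This is why winding-number arguments survive uniform approximation, whereas self-intersection patterns need not.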
Related Works) Thank you for the valuable references. We will add them to the related works section.
Typos) Thank you for pointing out the typos. We will correct them. Regarding Definition A.3, the relevant parameter should be chosen such that the resulting element is a diffeomorphism.
References
[1] Hwang, Geonho, and Myungjoo Kang. "Universal Approximation Property of Fully Convolutional Neural Networks with Zero Padding." arXiv preprint arXiv:2211.09983 (2022).
[2] Lin, Hongzhou, and Stefanie Jegelka. "Resnet with one-neuron hidden layers is a universal approximator." Advances in neural information processing systems 31 (2018).
[3] Shin, Jonghyun, et al. "Minimum width for universal approximation using squashable activation functions." arXiv preprint arXiv:2504.07371 (2025).
Thanks for the response. I will keep my score
This paper establishes new upper and lower bounds on the minimum width required for deep, narrow MLPs to achieve universal approximation within continuous function spaces. The new upper bounds now depend on w(d_x, d_y), sharpening earlier results. The argument is built on diffeomorphism-based constructions, bridging geometric topology and neural approximation theory.
Strengths and Weaknesses
Strengths
- The new upper bound improves prior results: it now scales with max(2d_x + 1, d_y) rather than d_x + d_y.
- The proof is built on diffeomorphism theory, which is original (to my knowledge) in the UAP study and provides new insights.
- The technical development seems solid. I did not verify every detail, but I checked the main ideas and several key lemmas.
Overall, I am impressed by the work.
Weaknesses / Questions
- Lines 153–154 introduce the notation w(d_x, d_y) for the first time, and seem to define it as the "required width for approximation" by neural networks. However, Eq. (12) formally defines w(d_x, d_y) as the minimum width needed to approximate arbitrary continuous functions using diffeomorphisms. This mismatch is confusing. Please clarify.
- In Lemma 4.3, does the embedding need to be piecewise linear, or is smoothness sufficient? Please state explicitly what regularity is required.
- Lemma 4.1 is crucial. Once it is established, the main result follows fairly directly from known results in topological algebra (if I understood correctly). However, its full proof is scattered throughout the appendix, which makes it difficult to follow. Could you provide a more organized proof outline or a concise roadmap in the main text?
Questions
Please see "Weakness/Questions"
Limitations
Yes.
Final Justification
My comments are mostly minor. After discussion with the authors in the rebuttal phase, I decide to keep my scores.
Formatting Issues
NA
Question 1) Sorry for the confusion. The statement “required width for approximation” in Lines 153–154 refers to the minimal width required to approximate any arbitrary continuous function using diffeomorphisms, as formulated in Equation (12), not the width of neural networks. We will revise the statement from “Now, we quantify the required width for approximation” to “Now, we quantify the required geometric width for approximation” to clarify this point.
Question 2) Lemma 4.3 requires only smoothness.
Question 3) Thank you for pointing this out. While we appreciate your observation, we respectfully note that the topological and algebraic components of the proof are not entirely straightforward, as they involve several subtle challenges. We will add the following proof outline to clarify our approach:
In [1], it was shown that any diffeomorphism can be approximated by a composition of single-coordinate transformations. Therefore, it suffices to prove that deep narrow MLPs can approximate any such single-coordinate transformation (see Definition A.3 for the formal definition). This is established in Lemma B.1. With the exception of the Leaky-ReLU case, the proof is relatively direct.
For the Leaky-ReLU case, we progressively extend the class of functions that can be approximated. Using Lemma B.3, we show that any increasing scalar function can be approximated by width-1 Leaky-ReLU networks. This result implies that any neural network with increasing activation functions can be approximated by a Leaky-ReLU network of the same width (see Corollary B.4).
Building on this, we prove in Lemma B.5 that any ACF (see Definition A.4) can be approximated by deep narrow MLPs. Finally, Lemma B.6 shows that any single coordinate transformation can, in turn, be approximated by an ACF.
[1] Teshima, Takeshi, et al. "Coupling-based invertible neural networks are universal diffeomorphism approximators." Advances in Neural Information Processing Systems 33 (2020): 3362-3373.
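As a small numerical illustration of the width-1 claim in the outline above (Lemma B.3), here is a sketch that fits a chain of width-1 Leaky-ReLU layers to an increasing scalar target; the depth, target function, and optimizer are illustrative, and gradient-based training may not reach the approximation quality guaranteed by the theory.

```python
# Fit a width-1 Leaky-ReLU chain (Linear(1,1) + LeakyReLU, repeated) to an
# increasing scalar function on [-1, 1]. Illustrates the parameterization only.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth = 12
layers = []
for _ in range(depth):
    layers += [nn.Linear(1, 1), nn.LeakyReLU(0.1)]
net = nn.Sequential(*layers, nn.Linear(1, 1))

x = torch.linspace(-1.0, 1.0, 512).unsqueeze(1)
y = torch.exp(x)  # an increasing target on [-1, 1]
opt = torch.optim.Adam(net.parameters(), lr=1e-2)
for _ in range(3000):
    opt.zero_grad()
    nn.functional.mse_loss(net(x), y).backward()
    opt.step()
print("max abs error:", (net(x) - y).abs().max().item())
```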
Thank you for the response. I will maintain my positive scores.
This paper studies the minimal width required for universal approximation using narrow but deep networks. The paper introduces a relatively new proof technique relying on the ability of DNNs to represent diffeomorphisms, allowing to describe a width that is both necessary and sufficient for universal approximation.
We are therefore happy to accept this paper at NeurIPS, in line with the recommendations of all but one of the reviewers. The only negative review had little concrete criticism and mostly focused on the possibility of a double submission. As pointed out by the SAC, this question was settled by the SAC/PC, who, having found no evidence of a double submission, decided to disregard the comment.