PaperHub
Overall rating: 7.8/10 · Poster · 4 reviewers
Reviewer scores: 5, 4, 5, 5 (lowest 4, highest 5, standard deviation 0.4)
Confidence: 3.3 · Novelty: 2.8 · Quality: 3.0 · Clarity: 2.8 · Significance: 2.5
NeurIPS 2025

Optimal Minimum Width for the Universal Approximation of Continuously Differentiable Functions by Deep Narrow MLPs

OpenReview · PDF
Submitted: 2025-05-04 · Updated: 2025-10-29
TL;DR

In this paper, we investigate the universal approximation property of deep, narrow multilayer perceptrons (MLPs) for $C^1$ functions under the Sobolev norm, specifically the $W^{1, \infty}$ norm.

Abstract

In this paper, we investigate the universal approximation property of deep, narrow multilayer perceptrons (MLPs) for $C^1$ functions under the Sobolev norm, specifically the $W^{1, \infty}$ norm. Although the optimal width of deep, narrow MLPs for approximating continuous functions has been extensively studied, significantly less is known about the corresponding optimal width for $C^1$ functions. We demonstrate that the optimal width can be determined in a wide range of cases within the $C^1$ setting. Our approach consists of two main steps. First, leveraging control theory, we show that any diffeomorphism can be approximated by deep, narrow MLPs. Second, using the Borsuk-Ulam theorem and various results from differential geometry, we prove that the optimal width for approximating arbitrary $C^1$ functions via diffeomorphisms is $\min(n + m, \max(2n + 1, m))$ in certain cases, including $(n,m) = (8,8)$ and $(16,8)$, where $n$ and $m$ denote the input and output dimensions, respectively. Our results apply to a broad class of activation functions.
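For concreteness (our arithmetic, not text from the abstract), the claimed width evaluates as follows in the two highlighted cases:

$$
(n,m)=(8,8):\ \min(8+8,\ \max(2\cdot 8+1,\ 8)) = \min(16,\,17) = 16, \qquad (n,m)=(16,8):\ \min(16+8,\ \max(2\cdot 16+1,\ 8)) = \min(24,\,33) = 24.
$$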
Keywords
Universal Approximation, Sobolev norm, Continuously differentiable functions, deep narrow MLPs, deep learning

Reviews and Discussion

Review (Score: 5)

The paper attempts to find the minimum width necessary such that a set of neural networks of arbitrary depth are still universal approximators. Instead of being universal in the set of continuous functions, only continuously differentiable functions are required to be approximable. The norm used for approximation is the uniform Sobolev norm of first order.

The first central result is Theorem 4.1, in which it is demonstrated that arbitrary diffeomorphisms can be approximated in the aforementioned sense with a width of $d$ or $d+1$. It is then shown that compositions of diffeomorphisms and affine embeddings can be approximated based on a certain embedding number. This yields an abstract approximation result for general $C^1$ functions in Theorem 4.6. The embedding numbers are then computed in Section 5. Corresponding lower bounds are also identified, albeit only in special cases (Theorem 5.9).

Strengths and Weaknesses

The results are certainly interesting, the proof idea based on first approximating diffeomorphisms is insightful, and the arguments are mathematically sound.

On the other hand, I see a couple of weaknesses:

  1. The novelty could be explained a bit better. To what extent is the argument given here new? Has it never been used in similar approaches, e.g., when proving universality statements in other norms? There is also quite some literature on showing universality of ResNets, where similar ideas are likely developed. I think these should be discussed in the related work section.
  2. The significance is not that strongly supported. In some sense, this only extends known universality results to a more obscure norm on a more obscure function space. Why should one care about this? Are $C^1$ functions particularly important in some application? Is the $W^{1,\infty}$ norm used to assess the quality in prominent learning problems?
  3. Some statements seem intentionally vague when they do not need to be. In particular, the claim "we prove that the optimal width for approximating arbitrary $C^1$ functions is [...] in infinitely many cases." This use of "infinitely many" is misleading, because these are actually very few cases for which a certain combinatorial condition is satisfied. A more honest statement would be more appropriate.
  4. The proof of Lemma A.1 is slightly inaccurate. The composition of $W^{1,\infty}$ functions is generally not well defined, because they are only defined almost everywhere. The issue is that, for $f \circ g$, the function $g$ could map a non-zero set to a zero set. This can easily be remedied by assuming that in any composition one takes a continuous representative of the outer function, which is possible since $f \in W^{1,\infty}$. Hence, Lemma A.1 works perfectly well; it just needs a footnote somewhere explaining how to define the composition.
  5. Some minor issues:
  • In the formula under line 143, $f(x)$ should be replaced by $f$.
  • In line 162 the "a" before Lipschitz should be removed.
  • Some brackets are wrong in line 276.
  • Line 309-310 seem to be copy-pasted from a different paper, because they seem hardly fitting for this manuscript.

Questions

  1. Could the authors add an overview of results in the literature on ResNets and explain the novelty of their proof idea?
  2. Can the significance of the results be better described?
  3. Can they improve the misleading statements?

Limitations

yes

Final Justification

The authors have very precisely addressed my concerns. I think the paper is very good, but not ground-breaking. Therefore, I think the score 5 is appropriate.

Formatting Issues

none

Author Response

Novelty) Thank you for your comment. The idea of approximating diffeomorphisms using deep narrow MLPs and control theory was first introduced in [1]. Later, in [2], the authors applied this property to study universal approximation under the uniform norm. However, their conclusion was partially incorrect due to a flawed treatment of diffeomorphisms. In [9], the concept of approximating continuous functions using diffeomorphisms is rigorously established. However, the authors do not provide lower bounds that are applicable in practical settings. Moreover, because their proof technique is tailored to the uniform norm, the result cannot be directly generalized to the Sobolev norm setting.

In our work, we adopt a framework based on control theory and extend the results to the Sobolev norm setting. We believe that introducing the Sobolev norm into the minimum width literature is itself a novel contribution, as it offers a useful tool, such as the robustness of topological structures, for establishing lower bounds. Demonstrating this property is also one of our key contributions. Moreover, we provide a rigorous characterization of the relationship between the optimal minimum width and $\Omega(n,m)$ in the Sobolev norm.

Most importantly, by employing concepts from real projective spaces and the Borsuk–Ulam theorem, we establish, for the first time to the best of our knowledge, a nontrivial optimal lower bound in cases that had not been previously proven. This is the first instance in which such a bound is derived using these geometric tools, and we believe this is where the novelty of our work lies.

Regarding the ResNet case, we will add relevant references [3–5] to the related work section. In particular, [4] employs a similar control-theoretic technique for approximation. If there are additional references we have missed, we would appreciate your suggestions.

Significance) Thank you for pointing this out. We will add the following sentences to the introduction:

However, only a few papers have studied norms involving derivatives of functions in the setting of deep narrow MLPs. Many deep learning techniques directly penalize the difference between the derivative of the target function and that of the network. These include Sobolev Training [6], Physics-Informed Neural Networks (PINNs) [7], and Generative Adversarial Networks with Gradient Penalty [8].
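For intuition, here is a minimal, hypothetical sketch (ours, not from the paper or the rebuttal) of a Sobolev-style training objective in PyTorch: it penalizes both the value error and the first-derivative error against a known scalar target, with sup norms over a finite sample as a rough surrogate for the $W^{1,\infty}$ distance. The target $f = \sin$ and the network widths are arbitrary illustrations.

```python
import torch
import torch.nn as nn

# Hypothetical target: f(x) = sin(x), with derivative f'(x) = cos(x).
f, df = torch.sin, torch.cos

# A small deep-narrow MLP (widths chosen arbitrarily for illustration).
model = nn.Sequential(nn.Linear(1, 3), nn.Tanh(),
                      nn.Linear(3, 3), nn.Tanh(),
                      nn.Linear(3, 1))

def sobolev_loss(model, x):
    """Penalize sup |model - f| and sup |model' - f'| over the sample x."""
    x = x.clone().requires_grad_(True)
    y = model(x)
    # d(model)/dx for each sample, via the sum trick (valid for scalar outputs).
    dy = torch.autograd.grad(y.sum(), x, create_graph=True)[0]
    value_err = (y - f(x)).abs().max()
    deriv_err = (dy - df(x)).abs().max()
    return torch.maximum(value_err, deriv_err)

x = torch.linspace(-3.0, 3.0, 256).unsqueeze(1)
loss = sobolev_loss(model, x)
loss.backward()  # gradients flow through both error terms
```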

Vague Conditions) We will revise the statement as follows:

In some cases, the lower bound coincides with the upper bound, thus identifying the optimal minimum width. This includes interesting cases such as (8,8) and (16,8). The exact pairs to which our result applies can be found in Theorem 5.9.

The proof of Lemma A.1) Thank you for identifying the error. We will revise the proof and add a footnote as suggested by the reviewer.

Minor Issues) We will correct these as noted.

References

[1] Duan, Yifei, et al. "Vanilla Feedforward Neural Networks as a Discretization of Dynamical Systems." arXiv preprint arXiv:2209.10909 (2022).

[2] Li, Li’Ang, et al. "Minimum width of leaky-ReLU neural networks for uniform universal approximation." International Conference on Machine Learning. PMLR, 2023.

[3] Lin, Hongzhou, and Stefanie Jegelka. "Resnet with one-neuron hidden layers is a universal approximator." Advances in neural information processing systems 31 (2018).

[4] Aizawa, Yuto, Masato Kimura, and Kazunori Matsui. "Universal approximation properties for an ODENet and a ResNet: Mathematical analysis and numerical experiments." arXiv preprint arXiv:2101.10229 (2020).

[5] Tabuada, Paulo, and Bahman Gharesifard. "Universal approximation power of deep residual neural networks through the lens of control." IEEE Transactions on Automatic Control 68.5 (2022): 2715-2728.

[6] Czarnecki, Wojciech M., et al. "Sobolev training for neural networks." Advances in neural information processing systems 30 (2017).

[7] Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations." Journal of Computational physics 378 (2019): 686-707.

[8] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International conference on machine learning. PMLR, 2017.

[9] Hwang, Geonho. "Minimum width for deep, narrow mlp: A diffeomorphism approach." arXiv preprint arXiv:2308.15873 (2023).

Comment

Thank you very much for clearly addressing all my points. I have updated my score. I don't think it could be further increased at this point without fundamentally changing the paper though.

Review (Score: 4)

The paper studies the minimum width requirement for deep narrow MLPs to approximate continuously differentiable ($C^1$) functions. It shows (Theorem 5.1) that when the width is at least $\min(d_x+d_y, \max(2d_x+1, d_y))$, deep MLPs can approximate any $C^1$ function under the Sobolev norm. The quantity is also shown to be a tight lower bound when the input and output dimensions $d_x$ and $d_y$ satisfy certain conditions (Theorem 5.9). These bounds together characterize the optimal minimum width for the universal approximation of deep narrow MLPs for a wide variety of input/output dimension configurations.

Strengths and Weaknesses

Strengths:

  1. The paper provides sufficient technical background for readers to digest the main results.

  2. The proof based on diffeomorphism is well presented. Converting the problem of approximating continuously differentiable functions to discretizing an ODE by MLP layers is interesting (see the sketch after this list).

  3. Showing a tight lower bound matching the upper bound has great theoretical significance.
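As a rough schematic of the ODE-discretization viewpoint in point 2 (our illustration, not the paper's exact construction): for a smooth vector field $v$, the time-$T$ flow $\phi_T$ is approximated by composed Euler steps,

$$
\phi_T \;\approx\; \underbrace{(\mathrm{id} + h\,v) \circ \cdots \circ (\mathrm{id} + h\,v)}_{N \text{ steps}}, \qquad h = T/N,
$$

and each step $x \mapsto x + h\,v(x)$ has the shape of a single residual-style block, so stacking many such steps yields a deep network approximating the diffeomorphism $\phi_T$.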

Weaknesses:

  1. The results are asymptotic in depth and do not provide explicit dependence on the property of the target function, making the result less significant. See Limitations.

  2. There are too many different notations, which could be confusing. For example, $n, m, d, d_x, d_y, d_0, d_N$ are all used for dimensions, and sometimes different notations refer to the same thing. I understand that some notations are for general background definitions, but at least Section 1 ($d_x, d_y$) and Section 6 ($n, m$) can be made consistent.

Typo:

line 276: redundant parentheses in $\Omega(n, m)$.

Questions

  1. The tight lower bound applies to continuously differentiable functions with finite $W^{1,\infty}$ norm, which is more restrictive than continuous functions with finite Lipschitz norm. Is this additional assumption (differentiability) necessary for the lower bound, or is it just a proof artifact so that you can invoke the results from control theory/differential topology?

  2. If we assume higher-order smoothness, i.e., $W^{k,\infty}$ with $k>1$, is it possible to get a better bound that depends on $k$?

Limitations

Partially addressed. The authors have mentioned several limitations, e.g., the lower bound only applies to input/output dimensions satisfying the conditions in Theorem 5.9. However, there are some other limitations that are not mentioned:

  1. The result is non-constructive and asymptotic in depth, and there is no explicit dependence on the precision, so we do not know the rate of approximation, i.e., how deep we require the network to be to approximate a function, which is important for understanding the extent of universal approximation. There are works on constructive approximation that provide more explicit dependence, while they typically require wider networks [1].

  2. Similar to 1, the result does not characterize the dependence on the smoothness of target functions, which is again crucial for understanding the fine-grained behavior of universal approximation. In principle, one would expect that smoother functions are easier to approximate by MLPs, which is shown e.g. in [2] by considering the order of smoothness (e.g., $k$ in the Sobolev space $W^{k,\infty}$).

[1] Yarotsky, Dmitry. "Error bounds for approximations with deep ReLU networks." Neural networks 94 (2017): 103-114.

[2] Schmidt-Hieber, Johannes. "Nonparametric regression using deep neural networks with ReLU activation function." The Annals of Statistics 48.4 (2020): 1875-1897.

Final Justification

I have no additional questions and the authors' response well addresses my previous questions/concerns. I will keep my score for recommendation.

Formatting Issues

I did not notice any major formatting issues.

Author Response

Notations) Thank you for pointing this out. We will ensure that the notation $d_x, d_y$ is used consistently throughout the paper.

Typo) Thank you for pointing out the typo. We will fix it.

Essentialness of $C^1$) We believe the assumption is essential. Since $L^{\infty}$ is not separable (i.e., it does not have a countable dense subset), neural networks with continuous activation functions (which form only a countable-dimensional function class) cannot approximate all functions in this space. Similarly, the space of continuous functions with bounded derivatives is also non-separable. As a result, neural networks with $C^1$ activation functions cannot approximate certain functions in that space. Therefore, it is necessary to restrict the function space to a $C^1$-function space with a countable basis.
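For readers unfamiliar with the separability argument, a standard witness (our addition, not part of the rebuttal) is the uncountable family of indicators in $L^{\infty}([0,1])$:

$$
f_t = \mathbf{1}_{[0,t]}, \quad t \in (0,1), \qquad \|f_t - f_s\|_{L^{\infty}} = 1 \ \text{ for all } t \neq s,
$$

so no countable subset can be dense in $L^{\infty}([0,1])$.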

Extension to Higher-Order) Thank you for the insightful question. We believe that if $k \geq 2$, the same optimal minimum width as in the case $k = 1$ (our setting) still holds. Since the topology of $W^{k,\infty}$ becomes finer for $k \geq 2$, the lower bound remains valid. For the upper bound, we expect that Lemma B.7 can be generalized to the $k \geq 2$ case, albeit with some tedious calculations.

We believe that the main obstruction to proving the upper bound lies in Lemma B.2. To approximate affine transformations using nonlinear activation functions in the $W^{k,\infty}$ norm, additional arguments are needed. For instance, in the case $k = 2$, if there exists a point $x$ such that $\sigma'(x) \neq 0$ and $\sigma''(x) = 0$, a result analogous to Lemma B.2 can be established. Such an activation function may potentially be constructed by composing simpler activation functions.
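As a concrete candidate for the condition mentioned above (our observation, for illustration only; whether it suffices for the intended generalization of Lemma B.2 is for the authors to verify), $\sigma = \tanh$ already satisfies it at the origin:

$$
\tanh'(x) = \operatorname{sech}^2(x), \qquad \tanh''(x) = -2\tanh(x)\operatorname{sech}^2(x), \qquad \text{so } \tanh'(0) = 1 \neq 0 \ \text{and}\ \tanh''(0) = 0.
$$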

Extending these results to the case $k > 2$ would be an interesting direction for future research.

Limitations) Thank you for pointing out the limitations. We will add a note to the paper stating that our result is non-constructive and asymptotic in nature.

Comment

I thank the authors for their response, which well addresses my concerns and questions. I keep my recommendation for acceptance.

Review (Score: 5)

This paper studies the minimum width $w_{\min}$ for which the class of deep, narrow networks (i.e., neural networks consisting of any number of hidden layers, each of width at most $w_{\min}$) can universally approximate $C^1$ functions in the $W^{1,\infty}$ Sobolev norm. Using tools from control theory and differential geometry, the authors demonstrate an upper bound on $w_{\min}$. This upper bound is also shown to be optimal (i.e., a matching lower bound exists) for some combinations of input dimension $n$ and output dimension $m$.

Strengths and Weaknesses

Strengths:

  • This result is interesting, nontrivial and, to my knowledge, novel. The role of depth in neural network expressiveness remains an important open question, and a more complete understanding of the approximation power of fixed-width, infinite-depth networks seems relevant in addressing this. The consideration of universal approximation of the functions themselves and their derivatives also seems novel and I can see that this could be relevant for many applications.

Weaknesses:

  • The problem should be motivated more explicitly. At least a few more sentences of the introduction should be devoted to making the case of why the fixed-width, infinite-depth regime is interesting to study, and why controlling approximation of derivatives is interesting (are there particular applications where this is of interest? Does derivative control imply anything about generalization? etc.)
  • The presentation/exposition in the paper could be improved. Some specific comments:
    • Some of the notation throughout seems cumbersome and not especially necessary. For instance, it would be clearer and easier if $\nabla_{d_x, d_y, n}^{\Sigma}$ were just defined in words as "the set of MLPs with activation function(s) $\Sigma$ of arbitrary depth and hidden layer width at most $m$." Then the symbolic definition in equation (16) and the definition of $\mathcal{N}_{d_0, \dots, d_N}^\Sigma$ can just be dropped.
    • Regarding notation, it would also be clearer to express the problem statement and the meaning of universal approximation in words rather than using the containment/closure notations. For instance, instead of, or in addition to, defining $w_{\min}^{W^{1,\infty}}$ symbolically as in (25), it should be explicitly stated in words that this is the minimum width for which MLPs of this width and arbitrary depth can approximate $C^1$ functions to any accuracy in the $W^{1,\infty}$ norm. This makes the paper's main result much easier to digest.
    • The authors should not assume that the reader has much familiarity with concepts from differential geometry. Terms like immersion, submersion, embedding and transversality should be defined in-text in the most elementary way possible (partials of all orders are continuous, Jacobian is full rank, etc.). The formal definition of vector field flow probably doesn't need to go in the main paper; it would be better to motivate it intuitively when describing the proof strategy for Theorem 4.1 and save the formal definition for the appendix. If any of the concepts can be illustrated with a figure, this would be helpful.

Questions

  • I am confused about the domains of approximation on which the result holds. What does it mean for MLPs to be able to universally approximate $C^1(\mathbb{R}^n; \mathbb{R}^m)$ functions with respect to $W_{\textrm{loc}}^{1,\infty}(\mathbb{R}^n; \mathbb{R}^m)$? The local Sobolev space $W_{\textrm{loc}}^{1,\infty}(U; \mathbb{R}^m)$ in the paper is only defined for compact domains $U \subset \mathbb{R}^n$. Is the idea that any $C^1(\mathbb{R}^n; \mathbb{R}^m)$ function can be universally approximated in the $W_{\textrm{loc}}^{1,\infty}(U; \mathbb{R}^m)$ norm on any compact domain $U \subset \mathbb{R}^n$?

Limitations

Yes

Final Justification

I thank the authors for their response and for their answers to my question about the approximation domain. I maintain my recommendation for acceptance, encouraging the authors to make the revisions/clarifications that we discussed.

Formatting Issues

None

Author Response

Regarding the motivation) Thank you for the valuable advice. We will add the following paragraphs to the introduction:

  1. The choice of neural network architecture plays a crucial role in determining performance. However, in practice, architectural decisions are often made through trial and error. It is therefore important to provide theoretical guidance on what should be avoided and how to select appropriate width and depth based on the input space, target function, and specific tasks. In particular, from the perspective of approximation power, identifying the minimum width required for universal approximation is a key question.

  2. However, only a few papers have studied norms involving derivatives of functions in the setting of deep narrow MLPs. Many deep learning techniques directly penalize the difference between the derivative of the target function and that of the network. These include Sobolev Training [1], Physics-Informed Neural Networks (PINNs) [2], and Generative Adversarial Networks with Gradient Penalty [3].

Regarding the presentation) Thank you for the detailed advice.

  • We will defer the equational definition of deep narrow MLPs and the definition of the set of MLPs to the appendix.
  • We will also incorporate a verbal explanation of $w$, as suggested by the reviewer.
  • Additionally, we will include definitions of immersion, submersion, diffeomorphism, embedding, and transversality in the appendix.
  • The formal definition of the flow of a vector field will be deferred to the appendix.
  • We will focus mainly on explaining the concepts or strategies of the proofs, rather than including all detailed proofs.

Domain of the approximation) Thank you for the insightful question. In our setting, $U$ is not assumed to be compact; it is typically an open set, such as $\mathbb{R}^n$. The space $W_{\textrm{loc}}^{1,\infty}(U; \mathbb{R}^m)$ is defined via local behavior: for a precompact set $V \Subset U$, we consider the restriction of functions to $V$ and the $W^{1,\infty}(V; \mathbb{R}^m)$ norm.

In other words, the two statements are equivalent:

  • $C^1(\mathbb{R}^n; \mathbb{R}^m)$ can be universally approximated in the $W_{\textrm{loc}}^{1,\infty}(\mathbb{R}^n; \mathbb{R}^m)$ norm.
  • For any precompact set $V \Subset \mathbb{R}^n$, $C^1(V; \mathbb{R}^m)$ can be universally approximated in the $W^{1,\infty}(V; \mathbb{R}^m)$ norm.

In short, this is simply a precise way of stating that we can approximate any $C^1$ function in the Sobolev norm on any compact subset.
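For completeness, the compact-set norm referred to here is, assuming the usual definition (the paper may use an equivalent variant),

$$
\|f\|_{W^{1,\infty}(V;\mathbb{R}^m)} \;=\; \operatorname*{ess\,sup}_{x \in V} \|f(x)\| \;+\; \operatorname*{ess\,sup}_{x \in V} \|Df(x)\|,
$$

and $f_k \to f$ in $W_{\textrm{loc}}^{1,\infty}(U;\mathbb{R}^m)$ means $\|f_k - f\|_{W^{1,\infty}(V;\mathbb{R}^m)} \to 0$ for every precompact $V \Subset U$.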

We will clarify this point in the revised version of the paper to avoid confusion and to make the domain of approximation more explicit.

References

[1] Czarnecki, Wojciech M., et al. "Sobolev training for neural networks." Advances in neural information processing systems 30 (2017).

[2] Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations." Journal of Computational physics 378 (2019): 686-707.

[3] Arjovsky, Martin, Soumith Chintala, and Léon Bottou. "Wasserstein generative adversarial networks." International conference on machine learning. PMLR, 2017.

Comment

I thank the authors for their response and for their answers to my question about the approximation domain. I maintain my recommendation for acceptance, encouraging the authors to make the revisions/clarifications that we discussed.

Review (Score: 5)

This paper studies the minimal width required for an MLP to be universal for functions $\mathbb{R}^n \rightarrow \mathbb{R}^m$. The authors improve upon existing results in two ways. First, they consider universality with respect to the $C^1$-norm. Second, they use tools from algebraic topology to obtain new upper and lower bounds for a variety of values of $n$ and $m$.

Strengths and Weaknesses

Strengths: The problem of determining the minimal width required for universality is an interesting and important theoretical problem. I believe the authors have made significant progress on this problem by relating it to a quantity, which they call $\Omega(n,m)$, that appears to be a natural object of study in algebraic topology. The authors also extend existing results by considering universality with respect to the $C^1$-norm instead of just $L_p$-norms.

Weaknesses: Some of the writing is a bit unclear. For example, the authors consider neural networks where the activation function in each layer can change within a class of activation functions (such as the leaky ReLU). I personally think it would be clearer if the results focused on the case where the activation function is fixed. The varying activation function considered does not seem to be essential, and it makes the paper a bit harder to follow.

Questions

In Theorem 4.1, when you say that $\sigma = \mathrm{LR}$, you mean that the activation function can be chosen to be a Leaky ReLU with a different value of $\beta$ at each layer, right? Otherwise, if $\sigma$ is a fixed Leaky ReLU activation function, this bound appears to contradict the lower bound proved in [Johnson 2018]. This led to a bit of confusion while reading.

Has the quantity $\Omega(n,m)$ been directly studied in the algebraic topology literature before? I see that numerous related problems have been extensively studied. I think it might be nice to include a table with all of the known results and open problems related to $\Omega(n,m)$.

Limitations

yes

Final Justification

This is an interesting theoretical paper. The authors have answered all my questions and I have no concerns. I stick by my assessment of accept.

Formatting Issues

No paper formatting concerns.

Author Response

Regarding the varying activation functions) Thank you for the valuable advice. In fact, the use of varying activation functions was primarily motivated by the elegance of Equation (33) in Theorem 4.6.

We believe that, for certain activation functions, a set of varying activation functions could potentially be replaced by a single fixed activation function. However, we have not been able to find a concrete example demonstrating this. We consider this an interesting direction for future research. We will also improve the writing related to the activation function to better convey the content.

Regarding the contradiction to [Johnson 2018]) Yes, in the case of $\sigma = \mathrm{LR}$, the activation function can be a Leaky ReLU with a different value of $\beta$ at each layer. However, this does not imply a contradiction with the result of [Johnson 2018], even when the activation function is fixed. This is because Theorem 4.1 concerns the approximation of diffeomorphisms specifically, rather than arbitrary continuous functions.
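For readers outside this literature, the parameterization in question is, assuming the standard leaky ReLU convention (the class in the paper may be defined slightly more generally),

$$
\mathrm{LR}_{\beta}(x) \;=\; \begin{cases} x, & x \ge 0, \\ \beta x, & x < 0, \end{cases} \qquad \beta > 0,\ \beta \neq 1,
$$

with the slope $\beta$ allowed to differ from layer to layer in Theorem 4.1.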

Algebraic Topology Literature) To the best of our knowledge, there is no existing reference that explicitly defines the quantity $\Omega$. However, several studies offer valuable insights that contribute to understanding or computing the value of $\Omega$. We have made an effort to compile and list all known values of $\Omega$ in the paper, based on our current understanding of geometry and topology.

Comment

I thank the authors for their response. I don't have any further questions.

Final Decision

This paper studies the minimum width of deep narrow networks for approximating $C^1$ functions under the Sobolev $W^{1,\infty}$ norm. The novelty lies in studying the minimal width for $C^1$ functions, as continuous functions have been studied previously.

The authors leverage control theory and differential geometry to provide a minimum width in this setup.

All reviewers agreed on the good quality of the work:

  • Reviewer LpbM and other reviewers suggested adding motivation for why $C^1$ functions are important; in the rebuttal, the authors gave motivations such as physics-informed neural networks, Sobolev training, and gradient regularizers in GANs. Note that citation [3] provided by the authors does not use any gradient regularizer; the appropriate citations for GANs with gradient penalty are rather "Improved Training of Wasserstein GANs" (Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, Aaron Courville), "On gradient regularizers for MMD GANs" (Michael Arbel, Danica J. Sutherland, Mikołaj Bińkowski, Arthur Gretton), and many other works such as Sobolev GAN.
  • Reviewer 5C4H raised points on presentation and clarity; to the authors: please incorporate this feedback in the final manuscript.
  • Reviewer g2vW raised similar concerns about vague statements and found an error in the proof of Lemma A.1, which does not compromise the lemma. Please fix these issues.

Overall this is a good paper that contributes to the understanding of deep narrow networks.