Optimal Neural Network Approximation for High-Dimensional Continuous Functions
Abstract
Reviews and Discussion
This paper proposes a universal approximation theorem in which a neural network of width $\mathcal{O}(d)$ uniformly approximates, to arbitrary accuracy, any continuous function on a $d$-dimensional compact set. It is also shown that a single-layer neural network needs a width of at least $d$ to be a universal approximator on such a set.
Strengths
- The paper makes the width dependency explicit instead of asymptotic, which is missing from many UAT papers.
- The paper contains a sharpness test. While I believe the test is weak (see the ''Weaknesses'' section), it is good practice for a theoretical paper.
Weaknesses
- The construction of the universal approximator has only theoretical meaning. While this is the case for many UAT results, the particular construction in this paper is based on extending a continuous function on a Cantor set to the interval $[0,1]$. This can never be done stably in practice (see the sketch after this list), which impairs the significance of the paper to the ICLR community.
- The work is very incremental. It is a straightforward assembly of Yarotsky's result and Shen's result, and the paper does not make clear what the original contribution of this particular work is. While the paper contains in-text proofs, some of the proofs are not self-contained and are hard to interpret without referring to Shen's result.
- The sharpness part of the paper (section 3) is questionable. In particular, if I understood correctly, it is shown that a single-layer NN needs a width of at least $d$ to do universal approximation. However, as far as I understood, the UAT result (Theorem 3) is for multi-layer NNs. Hence, sharpness is actually not shown. In particular, in the single-layer case, the statement made in section 3 is almost obvious, because a single-layer NN of width less than $d$ only sees the data through a projection onto a subspace of dimension at most $d-1$ (its level sets contain entire affine subspaces), which cannot be used to do universal approximation.
- The activation functions used in the UAT are not standard in the deep learning community.
- The paper has not been carefully proofread by the author(s). The statements of the theorems are not always clear, and there are a lot of typos. To name a few:
- The definition in section 1.4 does not make sense without variables on the right-hand side.
- In the second paragraph of section 2, the uniform norm is not properly denoted and the family $\mathcal{A}$ is not in math mode.
- In the third paragraph of section 3, ''most of all standard activation functions'' is not a standard expression.
- In the statement of Theorem 2, it is unclear what $N_1$ and $N_2$ are without referring to the paragraph before. Then, in the proof, it is said that ''It will be enough for us to prove that for any $\epsilon > 0$ there exist functions $N_1, N_2$.'' The functions $N_1$ and $N_2$ are prescribed and used in the statement of Theorem 2. Why should you find those functions in your proof?
- In Theorem 3, it is unclear what the quantifier on ''with a combination of super expressive activation functions and EUAF'' is. Do you fix a set of possible activation functions in advance, or does it depend on the function you will approximate? Can they be used arbitrarily in the NN you construct, or do you have to use a certain activation function for a specific neuron?
- In Theorem 4, it is never mentioned what the activation functions are. ''Therefore'' is misspelled on page 5. The condition is not in math mode in the proof of Theorem 4. In the same paragraph, ''$g$ a continuous'' needs to be corrected. The word ''cannot'' should not be spelled as ''can not.''
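To make the instability concern concrete, here is a minimal numerical sketch (my own illustration; the digit-interleaving encode/decode functions below are hypothetical stand-ins for the Cantor-set / space-filling coding used in such constructions, not the paper's actual map). The point is that recovering $d$ coordinates from a single real number amplifies perturbations of the code roughly like its $d$-th root, so the decoding step cannot be carried out stably in floating-point arithmetic.

```python
import numpy as np

DIGITS = 20  # binary digits kept per coordinate

def encode(x1, x2):
    """Interleave the binary digits of x1, x2 in [0, 1) into one real number
    (a stand-in for the Cantor-set / space-filling coding used in such proofs)."""
    z = 0.0
    for k in range(DIGITS):
        b1 = int(np.floor(x1 * 2 ** (k + 1))) % 2
        b2 = int(np.floor(x2 * 2 ** (k + 1))) % 2
        z += b1 * 2.0 ** -(2 * k + 1) + b2 * 2.0 ** -(2 * k + 2)
    return z

def decode(z):
    """Read the interleaved binary digits of z back off to recover (x1, x2)."""
    x1 = x2 = 0.0
    for k in range(2 * DIGITS):
        bit = int(np.floor(z * 2 ** (k + 1))) % 2
        if k % 2 == 0:
            x1 += bit * 2.0 ** -(k // 2 + 1)
        else:
            x2 += bit * 2.0 ** -(k // 2 + 1)
    return x1, x2

z = encode(0.123456, 0.654321)
x_clean = np.array(decode(z))
x_noisy = np.array(decode(z + 1e-12))   # perturb the 1-D code by only 1e-12
print(np.abs(x_clean - x_noisy).max())  # ~1e-6: the coordinate error is roughly the
                                        # square root of the perturbation (d = 2 here)
```

For $d$ coordinates the amplification becomes the $d$-th root, i.e., achieving accuracy $\varepsilon$ in the coordinates requires the one-dimensional code to be computed to accuracy about $\varepsilon^d$, which is exactly why such constructions are not usable in finite precision.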
Questions
- What is the depth of your NN in each of your theorems?
- What is the particular NN architecture that you have in mind? I assume it is fully-connected feed-forward, but a theoretical paper should state a mathematical formula for the NN (see the display after this list for the kind of formula I mean). In addition, do you have bias terms in the NN, and does the last layer have an activation function?
- In the second paragraph of section 2, you wrote ''each hidden neuron equipped with some fixed activation from the family $\mathcal{A}$.'' Do you mean the activation function is fixed throughout the NN? What about the activation functions in your UAT theorem?
- Not a question but a remark: it is not safe to claim that the UAT result in the paper ''alleviates the curse of dimensionality,'' because the curse of dimensionality is more about the difficulty of sampling the function than of approximating it in theory.
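For concreteness, the kind of formula I have in mind is that of a generic fully-connected feed-forward network with bias terms and no activation on the output layer (the notation below is my own shorthand, not taken from the paper):

$$
\phi(x) \;=\; W_L\,\sigma\bigl(W_{L-1}\,\sigma(\cdots\,\sigma(W_1 x + b_1)\,\cdots) + b_{L-1}\bigr) + b_L,
\qquad W_\ell \in \mathbb{R}^{n_\ell \times n_{\ell-1}},\ b_\ell \in \mathbb{R}^{n_\ell},
$$

with $n_0 = d$, $n_L = 1$, and $\sigma$ applied componentwise (possibly a different activation per neuron, which is precisely what the theorem statements should make explicit).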
In this paper, the authors study the universal approximation problem for deep neural networks. The main contribution of this paper is to show that, when the input domain is $d$-dimensional, $\mathcal{O}(d)$ neurons are enough to approximate any continuous function with arbitrary precision for DNNs with a combination of super-expressive activation functions and EUAF (elementary universal activation function). The authors construct neural networks with $\mathcal{O}(d)$ neurons to achieve this and show that the bound is optimal.
Strengths
Originality: The related works are adequately cited. The authors construct neural networks with $\mathcal{O}(d)$ neurons to approximate any continuous function on a $d$-dimensional input domain and show that the bound is optimal. This is an interesting result, which will certainly help us better understand the universal approximation property of deep neural networks from a theoretical perspective. I have checked the technical parts and found the proofs to be solid.
Quality: This paper is technically sound.
Clarity: This paper is clearly written. I find it is easy to follow.
Significance: I think the results in this paper are not very significant, as explained below.
Weaknesses
Although the bound is optimal and a construction achieving $\mathcal{O}(d)$ neurons is given, I find the choice of activation functions artificial in some sense. For example, in the main Theorem 3, the authors require a combination of super-expressive activation functions and EUAF, which makes the DNN not very practical. It would be interesting to derive the results for a single fixed activation function and for more architectures used in practice.
Questions
It would be interesting to derive the results for one fixed activation function and for more architectures used in practice.
Existing studies have shown that if we use the elementary universal activation function (EUAF) as the activation function, there exist a fixed depth and width such that for any $\varepsilon > 0$, there exists an NN with that depth and width that approximates the true function with error at most $\varepsilon$. This paper shows that the width can be improved to $\mathcal{O}(d)$ by using a super-expressive activation function as the activation function. Furthermore, this paper argues that this width is optimal for a particular class of true functions.
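Schematically, and in my own notation (taking $[0,1]^d$ as the domain for concreteness and writing $\mathcal{N}_{L,W}$ for networks of depth $L$ and width $W$), the fixed-size universal approximation property discussed here reads

$$
\exists\, L, W \text{ depending only on } d:\quad \forall f \in C([0,1]^d),\ \forall \varepsilon > 0,\ \exists\, \phi \in \mathcal{N}_{L,W} \ \text{ with } \ \sup_{x \in [0,1]^d} |f(x) - \phi(x)| \le \varepsilon,
$$

that is, only the weights of $\phi$ depend on $f$ and $\varepsilon$. As I understand it, the existing EUAF result achieves this with $W = \mathcal{O}(d^2)$, and the present paper claims $W = \mathcal{O}(d)$.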
Strengths
- The improvement of the width required to approximate the true function from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$ is significant.
Weaknesses
- There is significant room for improvement in the organization and writing of the paper.
- Only part of the claim in Theorem 4 is proved. Specifically, in the general case, the proof is only given for target functions of a particular form.
- I have a question about the proof of Theorem 4, as I have a counterexample for it. Let , , and . Then, the solution of the system is . However, when we have .
Questions
- P2, Section 1.4: the inverse of the space-filling function is introduced. However, since the space-filling function is only guaranteed to be surjective, its inverse may not exist.
- In the paragraph between Theorems 1 and 2, Yarotsky's result is cited to establish the existence statement used there. On the other hand, Shen's result is used in the proof of Theorem 2. I would like to clarify which is correct.
- For the proof of Theorem 2, only a proof sketch is given. The proof should be made mathematically rigorous.
[Minor Comments]
- Abstract: ''It is proved that function space [...] with 11 hidden layers.'' When I read this part for the first time, I thought this was the paper's contribution. However, this is a result from an existing study and should be clearly stated as such.
- Section 1.2: I suggest that the section title be aligned with the content (e.g., ''Known UAP Results'').
- Section 1.4: It is unclear what the expression ''inside/outside function'' refers to. I suggest reconsidering the wording.
- Section 2: ->
- Section 2: A ->
- Section 2.1: ''Therefore'' is misspelled.
- Section 2.2: ->
- Section 2.2: EUAf -> EUAF
- Section 4: Section titles are duplicated.
- Some references do not mention the publication year (e.g., Hannin and Sellke in Section 1.2, Kupers in Section 1.4).
Details of Ethics Concerns
N.A.
In this paper, the authors consider computing the optimal width for neural networks of bounded depth to approximate continuous functions on a $d$-dimensional domain. Considering a special class of activation functions, the set of super-expressive activation functions defined by Yarotsky, they construct a neural network of bounded depth and width $\mathcal{O}(d)$ that can approximate any continuous function to arbitrary error in the sup norm. They further provide examples of multivariate functions for which this order of width is minimal for neural networks to approximate them.
Strengths
- The construction is simple to follow and the statements of the theorems are somewhat easy to understand.
Weaknesses
- It seems that the theorems are obtained by combining previously obtained results by Shen et al., and Yarotsky, and there are not many innovations. Per se, this is not necessarily bad, but combined with the other weaknesses, it makes this paper much less appealing.
- Is a minimum width of $d$ not trivial? If we only have access to ridge functions, then a neural network of width less than $d$ only depends on an (at most) $(d-1)$-dimensional projection of the data (see the numerical sketch after this list), and none of section 3 seems to be necessary.
- The problem is very niche and seems to be a purely mathematical curiosity rather than anything relevant for practice. Non-standard activations and small-width neural networks are hard to train. Hence, I wonder how much the ICLR community will enjoy this paper.
- The paper seems to have been written in a rush. There are many typos, both in the text and in the equations. The introduction, discussion, and related work are poorly written and very hard to follow. Many statements are imprecise, which makes the motivations, difficulties, and achievements hard to parse for someone who is not already familiar with this literature. Overall, the paper is extremely difficult to read because of all these points. I encourage the authors to dedicate significant time to improving the writing and clarity of the text, possibly with the help of writing-assistance software.
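As a minimal numerical sketch of the width argument mentioned above (my own illustration, not the paper's construction): with fewer hidden neurons than input dimensions, a one-hidden-layer network is constant along the null space of its first-layer weight matrix, whatever the activation is, so it cannot approximate any target that varies in that direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width = 5, 4                      # width < d: fewer hidden neurons than inputs

W = rng.standard_normal((width, d))  # first-layer weights of a one-hidden-layer net
b = rng.standard_normal(width)
a = rng.standard_normal(width)       # output weights

def net(x, act=np.tanh):
    """Generic one-hidden-layer network; the activation is a placeholder."""
    return a @ act(W @ x + b)

# Any direction v in the null space of W is invisible to the network:
# moving the input along v leaves the output unchanged.
_, _, Vt = np.linalg.svd(W)
v = Vt[-1]                           # null-space direction (exists since width < d)
x = rng.standard_normal(d)
print(np.allclose(net(x), net(x + 3.0 * v)))   # True, regardless of the activation
```

Hence no choice of the weights can make such a network uniformly approximate, say, $f(x) = \langle v, x \rangle$, which is the sense in which a width of at least $d$ for a single hidden layer looks unavoidable.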
Questions
See weaknesses:
- Improve the text and clarity.
- Is Section 3 not trivial?
- Is there a substantial technical contribution?
This paper is a clear reject. The paper had been reviewed before, and the reviewer comments have not been addressed. In particular, one of the former reviewers provided counterexamples to claims in the paper, and these have not been commented on or addressed. I suggest that the authors take reviewer comments into account for future submissions and do not resubmit the paper with disproven claims in it.
Why not a higher score
n/a
Why not a lower score
n/a
Reject