PaperHub
6.1 / 10
Poster · 4 reviewers (ratings: 4, 3, 3, 3; min 3, max 4, std 0.4)
ICML 2025

Floating-Point Neural Networks Can Represent Almost All Floating-Point Functions

OpenReview · PDF
Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
Universal Approximation · Floating-Point Arithmetic

Reviews and Discussion

Official Review
Rating: 4

This paper explores the classes of functions over various floating-point arithmetics that can be represented using feedforward neural networks (FNNs) employing different activation functions, constrained by the corresponding floating-point arithmetic.

In particular, the authors present a necessary condition, referred to as distinguishability (Lemma 3.2), and a sufficient condition (Condition 1) for an activation function, under which there exists an FNN representing the same function (Theorem 3.4). Furthermore, they provide sufficient conditions for activation functions to meet distinguishability (Section 3.3) and demonstrate that, for several common floating-point arithmetics and activation functions constrained by said arithmetics, they satisfy distinguishability and Condition 1 over the entire range of floating-point numbers (Corollaries 3.9 and 3.11). To provide a comprehensive view, they include an example of an activation function, the rounded cosine, which is not generally expressive when limited by floating-point arithmetic (Lemma 3.3).

Update after rebuttal

No changes; I stand by my initial review.

Questions for Authors

  • As I am not familiar with the directly related work, can you make clear how the arguments of Park et al. differ from yours? I assume they rely on similar "constructive arguments"?

Claims and Evidence

All claims are provided with rigorous definitions and proofs.

Methods and Evaluation Criteria

As the contributions are purely theoretical, the methods by which they are proven make sense.

Theoretical Claims

While I did not examine the core proofs in granular detail, there is no reason to doubt their correctness given the precise definitions and claims the paper provides.

Experimental Design and Analysis

Not applicable.

Supplementary Material

I did not check the supplementary material carefully.

Relation to Existing Literature

As stated by the authors themselves (p. 1, right column, last paragraph), there are two works that are very similar to this one, specifically Hwang et al. (2024) and Park et al. (2024). However, I believe the contribution of this submission is substantial as it represents a natural continuation and generalisation.

Missing Important References

None that I am aware of.

Other Strengths and Weaknesses

Strengths:

  • The paper presents clear and precise arguments, making its case convincingly.
  • It is very rigorous in defining the floating-point arithmetic considered, making the results well-defined.
  • It provides a relatively comprehensive picture of the expressive capabilities of FNNs within floating-point frameworks.

Weaknesses:

  • The paper might benefit from offering more intuitions or informal explanations for its results. Overall, it is somewhat challenging to read for those not deeply engaged with the formal or theoretical aspects of machine learning.

Other Comments or Suggestions

The abstract is slightly unclear because it does not explicitly specify what "almost all" pertains to. While this level of ambiguity might be acceptable for the title, I believe that the abstract should offer more clarification in this regard for better comprehension.

Author Response

We thank the reviewer for their positive evaluation and thoughtful feedback. We address all comments of the reviewer below.

The paper might benefit from offering more intuitions or informal explanations for its results. Overall, it is somewhat challenging to read for those not deeply engaged with the formal or theoretical aspects of machine learning.

Following the reviewer’s comment, we will include the following informal explanation, describing the intuition behind our proof, at the beginning of Section 4.

“We construct the target four-layer network as a linear combination of indicator functions of points (Lemma 4.1). Given $z$ in the domain, our construction of the indicator function of $z$ consists of three parts. In the first part, we create an injective layer consisting of an affine transformation followed by an activation function, based on the distinguishability of the activation function. Due to the injectivity, all inputs in the domain have distinct output vectors after passing this part. Here, we use $y=(y_1,\dots,y_k)$ to denote the output of the first part of $z$. In the second part, we construct $2n$ binary step functions $f_{1,1}, f_{1,2}, \dots, f_{n,1}, f_{n,2}$ where $f_{i,1}(x)=(C_1-C_0)1[x\ge y_i]+C_0$ and $f_{i,2}(x)=(C_1-C_0)1[x\le y_i]+C_0$. Namely, all $f_{i,j}$ are $C_1$ if and only if the input to the network is $z$. To implement such $f_{i,j}$, we exploit the round-off error in floating-point operations and show that we can increase the gap between two arbitrary numbers by sequentially adding the same floating-point numbers to those two numbers (see Lemma 4.6). In the third part, we construct a function that outputs $C_1$ or $C_2$ if all $f_{i,j}$ are $C_1$ and outputs $C_0$ otherwise. We then apply the activation function after this function so that our indicator function is either zero (i.e., $\sigma(C_0)$) if the input is $z$ or some non-zero value (i.e., $\sigma(C_1)$ or $\sigma(C_2)$) otherwise.”
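As a toy illustration of the round-off-error mechanism described above, the following float16 sketch (ours, simplified; the paper's Lemma 4.6 uses a sequence of additions rather than a single large addend) shows how adding the same float to two nearby numbers can increase their gap:

    import numpy as np

    # Two adjacent float16 values: 1.0 and its successor 1 + 2^-10.
    x = np.float16(1.0)
    y = np.float16(1.0009765625)
    c = np.float16(2048.0)  # near 2048 the float16 spacing is 2, so tiny differences get amplified

    print(float(y - x))              # 0.0009765625: gap of one ulp at 1.0
    print(float((y + c) - (x + c)))  # 2.0: x + c rounds down to 2048, y + c rounds up to 2050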

The abstract is slightly unclear because it does not explicitly specify what "almost all" pertains to. While this level of ambiguity might be acceptable for the title, I believe that the abstract should offer more clarification in this regard for better comprehension.

We thank the reviewer for pointing out a confusing expression. Following the reviewer’s comment, we will change the last sentence in the abstract from

“Our result shows that with practical activation functions, floating-point neural networks can represent almost all floating-point functions.”

to

“Our result shows that with practical activation functions, floating-point neural networks can represent floating-point functions from a wide domain to all finite or infinite floats, where the domain is all finite floats for the activations sigmoid and tanh, and it is all floats of magnitude less than $1/8$ times the largest float for the activations ReLU, ELU, SeLU, GELU, Swish, Mish, and sin.”

As I am not familiar with the directly related work, can you make clear how the arguments of Park et al. differ from yours? I assume they rely on similar "constructive arguments"?

The network construction of Park et al. (2024) and that of our work are both based on a linear combination of indicator functions, as in our Lemma 4.1. However, Park et al.’s construction of indicator functions is specialized to ReLU and binary step activation functions (since they consider only these two activations), whereas our construction covers general activation functions satisfying Condition 1.

  • For example, in (Park et al., 2024), they use the vanilla binary step function as an indicator function; for ReLU, they construct an indicator function using the following intuition: $1[x\ge z] \approx \text{ReLU}\left(\frac{1}{z-z^-}(x-z^-)\right) - \text{ReLU}\left(\frac{1}{z-z^-}(x-z)\right)$ (a small numerical sketch of this step construction appears after this list).

  • On the other hand, our construction of indicator functions is completely different: we mainly use the properties of floating-point summation and the round-off errors from floating-point additions (Lemma 4.6).
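Here is a small numerical sketch of the difference-of-ReLUs step from the first bullet (our illustration in ordinary float64 arithmetic, with an arbitrary z = 0.5; it is not Park et al.'s code):

    import numpy as np

    def relu(t):
        return np.maximum(t, 0.0)

    z = np.float64(0.5)
    z_minus = np.nextafter(z, -np.inf)  # z^-: the largest float below z

    def step(x):
        # Difference of two ReLU ramps; the ramp rises only on (z^-, z),
        # where no float input exists, so on floats it acts as 1[x >= z].
        return relu((x - z_minus) / (z - z_minus)) - relu((x - z) / (z - z_minus))

    for x in [np.float64(0.25), z_minus, z, np.float64(0.75)]:
        print(float(x), float(step(x)))  # step values: 0.0, 0.0, 1.0, 1.0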

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Official Review
Rating: 3

This paper investigates the expressive power of floating-point neural networks, focusing on their ability to represent floating-point functions under practical constraints such as finite precision and inexact arithmetic operations. The authors establish necessary and sufficient conditions for activation functions to ensure that floating-point neural networks can represent almost all floating-point functions. The main algorithmic contribution lies in the construction of a four-layer network that can represent any floating-point function under the given conditions, leveraging properties of floating-point arithmetic and activation functions.

Questions for Authors

None.

Claims and Evidence

The claims made in the submission are generally well-supported by clear and convincing evidence. The authors provide rigorous mathematical proofs and detailed analyses to substantiate their claims regarding the necessary and sufficient conditions for activation functions in floating-point neural networks. Overall, the evidence is strong, but empirical validation would enhance the paper's impact.

Methods and Evaluation Criteria

The proposed methods and evaluation criteria are well-suited for the problem at hand, focusing on the theoretical underpinnings of floating-point neural networks and their ability to represent floating-point functions. The authors employ a rigorous mathematical approach, leveraging properties of floating-point arithmetic and activation functions to establish necessary and sufficient conditions for universal representation.

Theoretical Claims

The authors provide detailed and rigorous arguments to support their claims, particularly in establishing the necessary and sufficient conditions for activation functions in floating-point neural networks.

Experimental Design and Analysis

The paper primarily focuses on theoretical contributions and does not include experimental designs or empirical analyses.

Supplementary Material

No supplementary material.

Relation to Existing Literature

The key contributions of this paper are closely related to and build upon several important strands of research in the broader scientific literature on neural networks and their expressive power. The authors extend the classical universal approximation theorems, which traditionally assume real-valued parameters and exact mathematical operations, to the more practical context of floating-point arithmetic with finite precision and inexact operations.

Missing Important References

None.

Other Strengths and Weaknesses

This paper demonstrates significant strengths in its originality and theoretical depth. However, the paper also has some weaknesses that could be addressed to further strengthen its contribution.

  • Lack of empirical validation. While the theoretical results are robust, the absence of experimental results or case studies makes it difficult to assess the practical implications and real-world performance of the proposed methods.

  • While the paper covers a wide range of activation functions, it does not explore the potential trade-offs or practical challenges associated with implementing these functions in real-world systems, such as computational efficiency or hardware constraints.

Other Comments or Suggestions

None.

Author Response

We thank the reviewer for their positive evaluation and thoughtful feedback. We address all comments of the reviewer below.

Lack of empirical validation. While the theoretical results are robust, the absence of experimental results or case studies makes it difficult to assess the practical implications and real-world performance of the proposed methods.

Since our universal approximator construction consists of an affine combination of indicator functions (Section 4), following the reviewer’s comment, we have empirically verified our indicator function constructions (Lemma 4.1) using GELU networks and 16-bit floating-point arithmetic (float16). Specifically, we constructed indicator functions $f_z(x) := 1[x=z]$ of points $z \in [-1,1]_{\mathbb F}$ (whose domain is $[-1,1]_{\mathbb F}$), where $\mathbb F$ denotes the set of 16-bit floats. We sampled $120$ random $z$ from $[-1,1]_{\mathbb F}$ (i.e., $120$ indicator functions), and constructed the indicator functions $f_z$ using GELU networks. For each $z$, we checked whether our construction generates the same output as $1[x=z]$ for $1000$ random inputs $x$ drawn from $[-1,1]_{\mathbb F}$. According to our experimental results, the outputs of our constructions coincided with those of the true indicator functions, confirming that our construction is correct.

The code is available at the following anonymous link:

https://anonymous.4open.science/r/floating-points-indicators-A112/

To run the code, use the following command (trials: the number of indicator constructions, verification_steps: the number of verification steps for each constructed indicator) with NumPy version 1.24.0:

python indicator_test.py --activation gelu --trials 120 --verification_steps 1000

While the paper covers a wide range of activation functions, it does not explore the potential trade-offs or practical challenges associated with implementing these functions in real-world systems, such as computational efficiency or hardware constraints.

Theorem 3.4 (our main result) holds for all floating-point activation functions that satisfy Condition 1 and a distinguishability condition. Corollaries 3.9 and 3.11 show that the following are examples of such activation functions: the correctly-rounded versions of many popular real-valued activation functions, such as id (the identity function), sin, tanh, ReLU, ELU, SeLU, GELU, Swish, Mish, and Sigmoid. We believe there would be no practical challenges in implementing these functions in real-world systems:

  • The following functions are already implemented in real-world systems and widely used: $\lceil \text{id} \rfloor$ and $\lceil \text{ReLU} \rfloor$ are available in deep learning frameworks (e.g., PyTorch, TensorFlow, JAX) as torch.nn.Identity()(x), torch.nn.ReLU()(x), etc.; $\lceil \text{sin} \rfloor$ and $\lceil \text{tanh} \rfloor$ are available in correctly-rounded math libraries (e.g., RLibm, CORE-MATH, CRLibm) as sin(x) and tanh(x).

  • To our knowledge, the other functions listed above (i.e., the correctly-rounded versions of ELU, SeLU, etc.) are not yet implemented in real-world systems. Nevertheless, we believe these functions will also be implemented in the near future, especially given recent active research on correctly-rounded math libraries in various areas such as computer arithmetic (e.g., [Sibidanov+ 2022]) and programming languages (e.g., [Lim+ 2021]).

[Sibidanov+ 2022] The CORE-MATH Project, ARITH 2022.
[Lim+ 2021] High performance correctly rounded math libraries for 32-bit floating point representations, PLDI 2021.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Reviewer Comment

The authors' response partially addressed my previous concerns. Therefore, I will maintain my weak accept score.

Official Review
Rating: 3

The paper describes a condition under which discrete neural networks (all NNs running on digital computers with finite-precision representations of real numbers) are universal approximators. I think this is an important line of work, because the proofs from the 1990s assume real numbers, but computers cannot store real numbers. Given the recent popularity of low-bit representations (16-bit floats are common, and we even see 4-bit floats), the importance is even more pronounced. While these proofs do not have an immediate impact on practical applications, I believe it is important to know the limits and properties of the tools we use, so that we know whether we are doing the right thing.

Questions for Authors

N/A

Claims and Evidence

The paper claims that in order for a neural network to approximate all functions, it needs to be able to separate all points. This condition is very similar to the condition used in the proof of the universal approximation theorem based on the Stone-Weierstrass theorem, and is therefore expected and understandable.

The main claim is Theorem 3.4, which states that if the conditions are satisfied, then any function can be represented by a four-layer neural network. The proof is based on expressing the function as an affine combination of indicator functions, which is plausible.

Methods and Evaluation Criteria

No experiments.

Theoretical Claims

I confess that I did not have enough time to go through the proofs. All I can say is that, based on my knowledge of math, they look very plausible. The tools seem appropriate.

Experimental Design and Analysis

N/A

Supplementary Material

The supplementary material is absolutely necessary for understanding. In my opinion, the authors could help the reader by repeating the relevant definitions from the main text, such that the reader is spared from going back and forth.

Relation to Existing Literature

While the universal approximation theorems do not have an immediate impact on practice, I personally consider them important. What matters to me is knowing that the problem I am solving lies within the solution space, even though the precise number of units is not known.

Missing Important References

N/A

Other Strengths and Weaknesses

It is very hard to read. Sometimes I have a feeling that symbols are not well defined and explained. For example, in Definition 3.5, I do not know what $\eta^+$ and $\eta^-$ are and where they come from.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for their positive evaluation and thoughtful feedback. We address all comments of the reviewer below.

It is very hard to read. Sometimes I have a feeling that symbols are not well defined and explained. For example, in Definition 3.5, I do not know what $\eta^+$ and $\eta^-$ are and where they come from.

For better readability, we will add a table describing notations and definitions in the Appendix of the final version. $\eta^+$ and $\eta^-$ denote the smallest float that is larger than $\eta$ and the largest float that is smaller than $\eta$, respectively. Although we defined these notations in Section 2.1, we will recall their definitions after Definition 3.5. We will also recall seldom-used notations in the final draft.
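Concretely, $\eta^+$ and $\eta^-$ correspond to the usual "next float up / next float down" operations. A one-line illustration with Python's binary64 floats (our sketch; math.nextafter requires Python 3.9+):

    import math

    eta = 1.0
    eta_plus = math.nextafter(eta, math.inf)    # eta^+: smallest float larger than eta
    eta_minus = math.nextafter(eta, -math.inf)  # eta^-: largest float smaller than eta

    print(eta_plus - eta)    # 2.220446049250313e-16  (one ulp above 1.0)
    print(eta - eta_minus)   # 1.1102230246251565e-16 (one ulp below 1.0)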

The supplementary material is absolutely necessary for understanding. In my opinion, the authors could help the reader by repeating the relevant definitions from the main text, such that the reader is spared from going back and forth.

Following the reviewer’s comment, we will make the Appendix self-contained in the final version, by redefining notations and definitions at the beginning of the Appendix along with a table summarizing these notations and definitions.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Official Review
Rating: 3

This article explores the representation of arbitrary floating-point functions using neural networks under floating-point arithmetic. It proves that, given certain conditions on the activation function, such representation is feasible for many floating-point functions. Throughout the proof, the article emphasizes the importance of the distinguishability of the activation function.

Questions for Authors

  1. In real-world applications, computations may be performed in a multithreaded environment. How does the non-associativity of floating-point arithmetic affect the results in such cases?
  2. In practice, the cos function is not periodic in computer arithmetic. Could this impact the conclusion of Lemma 3.3?

Claims and Evidence

The claims made in the submission are supported by clear and convincing evidence.

Methods and Evaluation Criteria

The proposed methods make sense for the problem at hand.

Theoretical Claims

I only checked the statements but not the proofs. I believe the claims are correct.

Experimental Design and Analysis

N/A

Supplementary Material

I didn't review the supplementary material.

Relation to Existing Literature

This paper extends the results of Hwang et al. (2024) and Park et al. (2024). However, I don't see how the generalizations of this paper are helpful for understanding neural networks in real-world applications. In particular, the discussion of network width is lacking in this paper.

Missing Important References

From a broader perspective, there has been extensive research on floating-point weights, including studies on binary networks and related topics.

e.g. Yuan, C., & Agaian, S. S. (2023). A Comprehensive Review of Binary Neural Network. Artificial Intelligence Review, 56(11), 12949-13013.

From a proof-based standpoint, the networks considered in this paper need to be extremely wide, which is closely related to the concept of "network in network." When the network is sufficiently wide, a subnetwork that achieves universal approximation can be selected.

e.g. Ramanujan et al. (2020). What’s Hidden in a Randomly Weighted Neural Network? CVPR 2020.

Other Strengths and Weaknesses

  1. The floating-point functions and floating-point weights considered in this paper use the same precision, which differs from real-world applications. In practical neural networks, some layers use 8-bit weights, while others use 16-bit. Can this study be extended to mixed-precision scenarios?

  2. In Theorem 3.4, the activation function and the domain $M$ of the floating-point function are coupled, making the conclusions somewhat ambiguous. What happens if the domain is the entire $\mathbb F$? Additionally, the abstract mentions the concept of “almost all,” which is confusing: in what mathematical sense is this term being used?

  3. The main result of this paper proves the existence of a four-layer network. Is this conclusion too specific? Can it be extended to networks with arbitrary depth, such as two-layer or five-layer networks?

  4. The paper does not provide an estimate for network width, which is crucial for practical applications. If the required width is extremely large, then even if the network has function representation capabilities, it may not be practically useful.

Other Comments or Suggestions

To achieve a given level of precision, is it better to use a wide network with low precision or a narrow network with high precision? It would be valuable for the authors to explore this question in the paper.

Author Response

We thank the reviewer for their valuable comments. Our response is below.

I don't see how the generalizations of this paper are helpful for understanding neural networks in real-world applications.

Classical universal approximation theorems establish that sufficiently wide neural networks can approximate any continuous function. Although achieving universal approximation requires an impractically large number of parameters, these results remain important: as Reviewer ZVKU points out, they help researchers understand the theoretical limits and fundamental properties of the tools they use.

Our work extends these results by considering neural networks with general activation functions operating under floating-point arithmetic. In contrast, recent related works consider narrower or less practical settings: [Park+ 2024] focuses only on ReLU and Step activations, while [Hwang+ 2024] analyzes networks under fixed-point arithmetic. Given that modern networks use diverse activation functions (beyond ReLU and Step) and rely on floating-point arithmetic (as opposed to exact-real or fixed-point arithmetic), our results provide important new insights into the fundamental limits and capabilities of modern networks.

The paper does not provide an estimate for network width, which is crucial for practical applications.

As mentioned above, we focus on showing the existence of a floating-point network that can exactly represent a target floating-point function, as in existing universal approximation results (under real-valued parameters and operations). In this regard, the networks constructed in our paper are not for practical use. That said, we clarify the network size as follows. In the paper, we construct a $\sigma$-network of width $O(nd2^{(p+q)(d+1)})$, where $d$ is the dimension of an input and $n$ is a constant depending on the activation function $\sigma$ (e.g., $n \leq 10$ for ReLU, ELU, etc.). However, this size is to cover a general class of activations, and we believe a smaller width may suffice to obtain the same or similar results. Analyzing the minimum required width would be an important future direction.

On mixed-precision.

Our construction can be extended to mixed-precision setups. E.g., if the precision in the last layer can represent the target outputs and the domain is distinguishable under the precision of the first layer, then our construction can still represent any floating-point function even with low-precision intermediate layers, as long as they satisfy Condition 1 under the low precision.

In Theorem 3.4, the activation function and the domain of the floating-point function are coupled, making the conclusions somewhat ambiguous. What happens if the domain is the entire $\mathbb F$?

The coupling between the activation function $\sigma$ and the domain is necessary, as shown in Lemma 3.2: if the domain (e.g., $\mathbb F$) is not $\sigma$-distinguishable with range $\mathbb F$ (Definition 3.1), then universal approximation using $\sigma$ networks is provably impossible. If the domain is the entire $\mathbb F$, Theorem 3.4 states that universal approximation is possible whenever $\mathbb F$ is $\sigma$-distinguishable with range $[-2^{e_{\max}}, 2^{e_{\max}}]$. This condition is satisfied, e.g., for $\sigma = \lceil\text{Sigmoid}\rfloor, \lceil\tanh\rfloor$ (Corollary 3.11).

On the depth of universal approximators.

Our current construction can be extended to five or more layers, but not to two or three layers. This is because our indicator function uses more than two layers and we use an additional layer to represent a linear combination of indicator functions. Investigating the minimum number of layers for universal approximation under floating-point arithmetic is an interesting research direction.

On multithreaded environments.

In our construction, we control round-off errors from floating-point operations by considering a single, fixed order of these operations (Sections 2.2–2.3). We expect that our results can be extended to multithreaded setups if the order of the operations in each layer is fixed (i.e., it does not depend on input values, memory size, etc.). However, if the order is not fixed, extending our results might be non-trivial due to the non-associativity of floating-point arithmetic.
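A minimal example of the non-associativity in question (standard float64; the reduction order changes the result):

    a, b, c = 0.1, 0.2, 0.3
    print((a + b) + c)                 # 0.6000000000000001
    print(a + (b + c))                 # 0.6
    print((a + b) + c == a + (b + c))  # False: the sum depends on the order of additions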

The cos function is not periodic in computer arithmetic. Could this impact the conclusion of Lemma 3.3?

The $\lceil\cos\rfloor$ function in Lemma 3.3 is the rounded version of $\cos$. Thus, $\lceil\cos\rfloor$ is already aperiodic, and the conclusion of Lemma 3.3 is valid for $\lceil\cos\rfloor$.

On “almost all” in the abstract.

Due to the 5000-character limit, we refer to our response to the same question from Reviewer EifD.

On related works.

Following the reviewer’s comment, we will include suggested related works on binary networks and sparse networks.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Reviewer Comment

You didn’t respond to the question I raised earlier: “To achieve a given level of precision, is it better to use a wide network with low precision or a narrow network with high precision?” Let me add some context here.

Universal approximation is the foundation for many theoretical results, but it is still far from practical applications. I hope you can expand on this topic more thoroughly — for instance, the estimation of network width should be discussed in detail in the paper. From the bounds you provided, it can be seen that the required width grows exponentially, which is a significant contrast to the real-valued weight setting. See (Yarotsky 2017) and (Zhang et al. 2024). In the case of real-valued weights, researchers have also used constructive methods to prove universal approximation and provide error estimates. I think it's necessary for you to compare your method with those used in the real-valued setting.

Yarotsky, Dmitry. 2017. “Error Bounds for Approximations with Deep ReLU Networks.” Neural Networks.

Zhang, Shijun, Jianfeng Lu, and Hongkai Zhao. 2024. “Deep Network Approximation: Beyond Relu to Diverse Activation Functions.” Journal of Machine Learning Research 25(35): 1–39.

Author Comment

We thank the reviewer for their additional comments. In our first response, we were unable to answer the question about precision and network size due to the 5000-character limit. We respond to this question and the additional comments below.

Wide network with low precision vs. narrow network with high precision?

Assume that Condition 1 and the distinguishability of the domain hold for both high-precision and low-precision setups. Under this assumption and our network construction, a network using low-precision arithmetic uses smaller width than one using high-precision arithmetic:

  • In our setup, using high-precision arithmetic implies high-precision inputs and outputs, i.e., the constructed network must process more input values and return more accurate estimates of the target values. To achieve this, our construction uses larger width as arithmetic precision increases.

  • Thus, when the input/output precision is low, using low-precision arithmetic leads to smaller width under our proof strategy. Further, even when the output precision is high, a universal approximator can be built using the mixed-precision arithmetic discussed in our first response. Specifically, by using low-precision parameters for all layers but the last and high-precision parameters for the last layer, we can construct a network with smaller width than in the high-precision case.

On estimation of network width.

As we answered in our first response, our construction uses width $O(nd2^{(p+q)(d+1)})$, where $d$ is the input dimension, $p, q$ are the numbers of bits for the mantissa and exponent, and $n$ is a constant depending on the activation function. Here, we use width $O(nd2^{p+q})$ for each indicator function, and the entire network consists of $O(2^{(p+q)d})$ indicator functions, with one indicator per input point. This count results in width $O(nd2^{(p+q)(d+1)})$.

We also note that for a floating-point network of constantly bounded depth, width $2^{\Theta((p+q)d)}$ is necessary to represent all floating-point functions from a bounded domain (e.g., $[-1,1]_{\mathbb F}^d$) to $\mathbb F\cup\{-\infty,\infty\}$. Consider a floating-point network of width $W$ and constantly bounded depth. Then, it contains at most $O(W^2)$ parameters. Since each parameter can take at most $2^{p+q}$ distinct values, this network can represent at most $2^{O((p+q)W^2)}$ distinct functions. However, since the number of functions from $[-1,1]_{\mathbb F}^{d}$ to $\mathbb F\cup\{-\infty,\infty\}$ (i.e., $2^{\Theta((p+q)d)}$ inputs and $\Theta(2^{p+q})$ outputs) is $2^{(p+q)2^{\Theta((p+q)d)}}$, the network requires width at least $2^{\Theta((p+q)d)}$ to represent all those functions. We will include detailed discussion in the paper.
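For readability, the counting argument above can be summarized in one informal chain of bounds (constants suppressed):

$$\#\{\text{functions of a width-}W,\ O(1)\text{-depth network}\} \le \bigl(2^{p+q}\bigr)^{O(W^2)} = 2^{O((p+q)W^2)},$$
$$\#\{\text{functions } [-1,1]_{\mathbb F}^d \to \mathbb F\cup\{-\infty,\infty\}\} = \Theta(2^{p+q})^{2^{\Theta((p+q)d)}} = 2^{(p+q)2^{\Theta((p+q)d)}},$$
$$2^{O((p+q)W^2)} \ge 2^{(p+q)2^{\Theta((p+q)d)}} \;\Longrightarrow\; W \ge 2^{\Theta((p+q)d)}.$$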

On exponential width growth and real-valued weight setup.

We thank the reviewer for providing related papers. We believe that exponential growth in width also arises in the real-valued setting: under constantly bounded depth, the networks constructed in the papers you mentioned use width $\varepsilon^{-\Theta(d)}$ (Corollary 2 in [Zhang+ 2024] and Theorem 1 in [Yarotsky 2017]), where $\varepsilon$ denotes the target approximation error. Moreover, it has been shown that such exponential growth is necessary for ReLU networks [Yarotsky 2018]. Our results in the floating-point setting closely mirror these existing results in the real-valued setting.

Comparison with existing real-valued constructions.

In our method, we construct indicator functions and represent the target function as an affine combination of them. A similar indicator-function-based approach was used in [Jones 1990], where the binary step function is approximated using a sigmoidal activation function. Such a proof extends to activation functions that can easily approximate the step function, e.g., ReLU and its variants. In contrast, we implement indicator functions by exploiting round-off errors in floating-point operations (see the proof of Lemma 4.6), which enables us to handle general floating-point activation functions.
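A small numerical sketch of the classical sigmoidal-step idea referenced above ([Jones 1990]); this is our illustration, with arbitrary choices of z and the scaling factors k:

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    z = 0.5
    xs = np.array([0.4, 0.49, 0.5, 0.51, 0.6])
    for k in [10.0, 100.0, 1000.0]:
        print(k, np.round(sigmoid(k * (xs - z)), 3))
    # As k grows, sigmoid(k*(x - z)) approaches the binary step 1[x >= z]
    # (with value 0.5 exactly at x = z).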

Many recent constructions leverage the benefits of depth in ReLU networks: constant-width networks of depth $L$ can represent highly oscillatory functions that cannot be approximated by constant-depth networks of width $\mathrm{poly}(L)$. Using this, [Yarotsky 2018] showed that deep narrow architectures are more parameter-efficient than shallow wide ones. We believe analyzing such benefits of depth under floating-point arithmetic is an important research direction. We will include detailed discussion in the paper.

[Jones 1990] Constructive approximations for neural networks by sigmoidal functions, Proceedings of the IEEE.

[Yarotsky 2017] Error Bounds for Approximations with Deep ReLU Networks, Neural Networks.

[Yarotsky 2018] Optimal approximation of continuous functions by very deep ReLU networks, COLT.

[Zhang+ 2024] Deep Network Approximation: Beyond Relu to Diverse Activation Functions, JMLR.

Final Decision

This paper contributes to the understanding of the universal approximation properties of neural networks under floating-point arithmetic. The reviewers largely agreed that the paper is mathematically sound, well-written, and on a timely topic, namely bringing theoretical properties (universal approximation) together with practical restrictions (floating-point arithmetic). However, there was some discussion about the novelty and significance of the results, with mainly one reviewer objecting to acceptance because they see the contribution as too narrow. In the end, however, we could reach consensus that this work constitutes a meaningful contribution to ICML, by proving fundamental results which will hopefully, in the long run, trigger more research on the topic. I urge the authors to take the comments by the reviewers seriously. Among other things, they should incorporate a decent overview of quantitative UA theorems in the real-valued case and mention analogous results for the floating-point case as an open question.