PaperHub
Score: 6.1 / 10
Poster · 4 reviewers (min 3, max 4, std 0.4)
Individual ratings: 4, 3, 3, 3
ICML 2025

Minimum Width for Universal Approximation using Squashable Activation Functions

OpenReview · PDF
Submitted: 2025-01-20 · Updated: 2025-07-24

Abstract

The exact minimum width that allows for universal approximation of unbounded-depth networks is known only for ReLU and its variants. In this work, we study the minimum width of networks using general activation functions. Specifically, we focus on squashable functions that can approximate the identity function and binary step function by alternatively composing with affine transformations. We show that for networks using a squashable activation function to universally approximate $L^p$ functions from $[0,1]^{d_x}$ to $\mathbb R^{d_y}$, the minimum width is $\max\{d_x,d_y,2\}$ unless $d_x=d_y=1$; the same bound holds for $d_x=d_y=1$ if the activation function is monotone. We then provide sufficient conditions for squashability and show that all non-affine analytic functions and a class of piecewise functions are squashable, i.e., our minimum width result holds for those general classes of activation functions.
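For intuition, here is a minimal numerical sketch of what squashability means for a concrete activation such as the sigmoid: composing it with suitable affine maps gives arbitrarily good approximations of the identity (by zooming into the near-linear region around 0) and of the binary step (by steepening). This is an informal illustration only, not the paper's formal Definition 1; the scaling constants below are chosen purely for demonstration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-1.0, 1.0, 5)

# Approximate the identity: affine -> sigmoid -> affine.
# Near 0, sigmoid(eps * x) ~= 1/2 + (eps * x) / 4, so an affine rescaling recovers x.
eps = 1e-3
approx_id = (sigmoid(eps * x) - 0.5) / (0.25 * eps)
print(np.max(np.abs(approx_id - x)))  # ~1e-7; shrinks further as eps -> 0

# Approximate the binary step: pre-scale the input by a large factor.
k = 50.0
approx_step = sigmoid(k * x)
print(np.round(approx_step, 3))  # ~0 for x < 0, 0.5 at x = 0, ~1 for x > 0
```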
Keywords
universal approximation · minimum width

Reviews and Discussion

Official Review
Rating: 4

The paper investigates the exact minimum network width required for universal approximation of $L^p$ functions on a cube $[0,1]^{d_x}$ by neural networks utilizing squashable activation functions. Squashable functions are defined in this paper as activation functions that, when composed alternatively with affine transformations, can approximate both the identity function and binary step function. The paper proves that the exact minimum width required is $\max\{d_x, d_y, 2\}$, with special conditions when $d_x = d_y = 1$. It provides conditions under which common activation functions (non-affine analytic and certain piecewise functions) are squashable, significantly extending prior results.

Questions for the Authors

In practice, both $L^p$ and $\sup$ norms are too weak because derivatives are not guaranteed to converge. For example, can we state the universality in a Sobolev space?

Claims and Evidence

  • Claim: For squashable activation functions, the exact minimum width for universal approximation is $\max\{d_x, d_y, 2\}$.

    • Evidence: The claim is rigorously supported by explicit constructions and proofs (Theorem 2 as a corollary of Lemmas 6, 7, 8, 9, and 10).
  • Claim: Non-affine analytic functions and certain piecewise functions (e.g., leaky-ReLU, HARDSWISH) are squashable.

    • Evidence: Detailed proofs provided for these classes (Lemmas 4, 5).

Methods and Evaluation Criteria

The paper’s methodological framework, which includes defining the concept of squashable activation functions and establishing theoretical criteria, is appropriate and sensible for addressing the minimum width problem for universal approximation. Evaluation criteria through formal proofs are rigorous and clear.

Theoretical Claims

The main theoretical claim is the precise characterization of the minimum width for universal approximation with squashable activation functions. The proofs appear correct and well-structured.

Experimental Design and Analysis

n/a

Supplementary Material

Yes, I reviewed the proofs in the appendix, including Lemmas 3, 4, and 5, and the detailed constructions for the lemmas.

Relation to Existing Literature

The paper builds significantly upon prior literature about universal approximation theory, particularly works focusing on ReLU-like activations. It substantially generalizes the known exact minimum width from ReLU-specific results (Park et al., 2021b; Kim et al., 2024; Cai, 2023) to a broader class including analytic and piecewise continuously differentiable activations.

Essential References Not Discussed

The paper appears thorough in citing relevant literature. No obvious missing essential references were identified.

Other Strengths and Weaknesses

Strength: Figures 2-4 help in understanding the idea behind the proof.

Other Comments or Suggestions

Good job.

Author Response

We thank the reviewer for their positive evaluation and thoughtful feedback. We answer the reviewer’s question below.

In practice, both $L^p$ and $\sup$ norms are too weak because derivatives are not guaranteed to converge. For example, can we state the universality in a Sobolev space?

We thank the reviewer for this interesting question. For the Sobolev $W^{1,q}$-norm with small $q<d_x$, by the Sobolev embedding theorem, the Sobolev space $W^{1,q}(\mathbb{R}^{d_x})$ can be embedded into $L^p(\mathbb{R}^{d_x})$ with $q=pd_x/(d_x-p)$. Hence, in this case, universal approximation under the $W^{1,q}$-norm reduces to our result. However, we do not know whether our minimum width bounds extend to $q\ge d_x$. We think finding the minimum width under general Sobolev spaces is a very interesting problem and would like to pursue it as a future research direction.
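For reference, the standard Sobolev embedding invoked in this reply is usually stated as follows (this is the textbook form for $1 \le q < d_x$, not a quotation from the paper or the reply):

```latex
% Gagliardo–Nirenberg–Sobolev embedding, for 1 <= q < d_x:
\[
  W^{1,q}(\mathbb{R}^{d_x}) \hookrightarrow L^{q^{*}}(\mathbb{R}^{d_x}),
  \qquad q^{*} = \frac{q\, d_x}{d_x - q}.
\]
```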

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Official Review
Rating: 3

The paper under review proves universal approximation of neural networks with a fixed squashable activation function and minimal width equal to the maximum of input and output dimension on compact sets with respect to the L^p-norm; hence any L^p-function with values in R^{d_y} can be approximated in this norm up to an arbitrarily small error. This result extends prior results, e.g., of Cai (ICLR 2023) that derived the same width, however with a family of activation functions (leaky ReLU with variable slope) from which one can choose one in each activation. The paper applies an encoding-decoding scheme similar to Kim, Min and Park (ICLR 2024), however in one dimension less. To this aim the authors show the existence of approximately space-filling curves implemented by neural networks as the decoder. Furthermore, an approximate encoder network is designed that maps regions in the input space to the portion of the curve that is decoded to an approximation of the function value of a continuous function on this input region. Therefore, the composition of encoder and decoder provides a suitable approximation of the given function in L^p. As continuous functions are dense in L^p, this result suffices to prove the claim. The authors also state sufficient conditions for an activation function to be squashable.

# Will lower the score by one as I reassess novelty, but I still give a weak accept because this paper still has something to tell in terms of generalizing activations.

Questions for the Authors

How does one see that ReLU is not squashable, in order to explicitly show that there is no contradiction to Lu et al. (NeurIPS 2017)? A formal argument in the appendix would be appreciated.

Claims and Evidence

All claims are supported by detailed mathematical proofs. In addition, the authors explain and illustrate the underlying ideas in a transparent manner.

Methods and Evaluation Criteria

The methods are from mathematical analysis and approximation theory.

Theoretical Claims

The theoretical claims are credible and supported by detailed proofs, in which I did not find any errors.

Experimental Design and Analysis

Does not apply.

Supplementary Material

Several proofs are given in the appendix.

Relation to Existing Literature

The paper deals with the classical topic of universal approximation.

Essential References Not Discussed

This is only a preprint but should nevertheless be mentioned: [1] Dennis Rochau et al., New advances in universal approximation with neural networks of minimal width, https://arxiv.org/abs/2411.08735. Namely, this paper also deals with minimal width and minimal internal width for leaky ReLU activations with variable slope. The results of the paper under review, however, are stronger as they work with a fixed activation function.

Other Strengths and Weaknesses

The paper more or less settles a classical topic on universal approximation with minimal width in a very satisfactory manner, namely that neural networks with a fixed squashable activation are universal approximators in L^p at the theoretically minimal width w = max{d_x, d_y}. This is attractive as it covers almost all activation functions in use, up to ReLU, and the standard FCNN. The ideas are expressed very clearly and concisely and are supported by nice illustrations. I therefore think that the paper represents a natural endpoint of recent developments in this direction.

I don't see any major weaknesses.

Other Comments or Suggestions

  • I do not see the point of the first assertion in Lemma 7. Isn't that obvious?
  • Lemma 10, the symbol for the ball should be introduced.
  • The authors should comment briefly on the minimal internal width of 1 that they achieve, as this has implications for the universal approximation theory of autoencoders, cf. [1].

Author Response

We thank the reviewer for their positive evaluation and thoughtful feedback. We address all comments of the reviewer below.

I do not see the point of the first assertion in Lemma 7. Isn't that obvious?

As the reviewer pointed out, $w_{\sigma}\ge\max\{d_x,d_y\}$ is a trivial lower bound. We wrote it to formally show that our upper bound on the minimum width is tight. We will note in the final draft that this bound can be easily derived.

Lemma 10, the symbol for the ball should be introduced.

The notation $\mathcal B$ for the ball is defined in the first paragraph of Section 2 (Line 118). Nevertheless, for better readability, we will add a pointer to this definition after Lemma 10 in the final draft.

How does one see that ReLU is not squashable, in order to explicitly show that there is no contradiction to Lu et al. (NeurIPS 2017)? A formal argument in the appendix would be appreciated.

Here, we clarify the following claim: “Our minimum width $\max\{d_x,d_y,2\}$ for squashable functions does not contradict the lower bound $d_x+1$ for ReLU networks in (Lu et al., 2017) since ReLU is not squashable.” If this is not what the reviewer had in mind, please let us know.

We note that the lower bound $d_x+1$ in (Lu et al., 2017) considers the entire Euclidean space $\mathbb R^{d_x}$ while we consider a compact subset of $\mathbb R^{d_x}$. In fact, the tight minimum width for ReLU on a compact domain is known to be $\max\{d_x,d_y,2\}$ (Kim et al., 2024), which is smaller than the minimum width $\max\{d_x+1,d_y\}$ for ReLU on $\mathbb R^{d_x}$ (Park et al., 2021). Namely, we can have the minimum width $\max\{d_x,d_y,2\}$ for ReLU regardless of its non-squashability.
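For quick reference, the three minimum-width values compared in this reply are collected below (all values as stated above; nothing new is derived here):

```latex
\begin{align*}
&\text{ReLU, } L^p \text{ on all of } \mathbb{R}^{d_x}:
    && w_{\min} = \max\{d_x + 1,\ d_y\} && \text{(Park et al., 2021)}\\
&\text{ReLU, } L^p \text{ on a compact domain}:
    && w_{\min} = \max\{d_x,\ d_y,\ 2\} && \text{(Kim et al., 2024)}\\
&\text{squashable } \sigma,\ L^p \text{ on } [0,1]^{d_x}:
    && w_{\min} = \max\{d_x,\ d_y,\ 2\} && \text{(this paper)}
\end{align*}
```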

This is only a preprint but should nevertheless be mentioned: [1] Dennis Rochau et al., New advances in universal approximation with neural networks of minimal width. The authors should comment briefly on the minimal internal width of 1 that they achieve, as this has implications for the universal approximation theory of autoencoders, cf. [1].

We thank the reviewer for suggesting an interesting connection between our results and autoencoders, and for providing an intriguing paper. In light of (Rochau et al.), we also found that our construction has an autoencoder structure (i.e., minimal internal width 1). We will cite this paper and add a related discussion, including a new lower bound for the supremum norm in (Rochau et al.), to the final draft.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Official Review
Rating: 3

This paper considers a class of activation functions and provides the minimum width required for the corresponding neural network to have the universal approximation property.

Questions for the Authors

None.

Claims and Evidence

I think the conclusion of this paper is reliable.

Methods and Evaluation Criteria

I think the method in this paper is credible. However, there have been many discussions of such problems in the literature. In particular, the idea of this paper is obviously borrowed from the existing literature, such as Park (2021) and Kim (2024), but these are never compared with or discussed in detail in the paper.

Theoretical Claims

I did not check the proof of this paper.

Experimental Design and Analysis

This paper is a theoretical article and does not require experiments.

Supplementary Material

I didn't review the supplementary material.

Relation to Existing Literature

I feel that this paper is an incremental generalization of existing research.

Essential References Not Discussed

The cited papers are not discussed in detail in this paper, especially Park (2021), Cai (2023), and Kim (2024).

Other Strengths and Weaknesses

I was hastily assigned to review this paper. I did not verify the details of the proofs, but I believe the conclusions are correct. However, the results of this paper did not impress me. The minimum width under the $L^p$-norm has already been thoroughly studied in excellent works, such as Park (2021), Cai (2023), and Kim (2024), which have analyzed many different cases. While previous papers have not exhaustively covered all activation functions, it is straightforward to deduce results for many activation functions from existing findings. The proof strategy in this paper is essentially no different from those in earlier works, especially the encode-memorize-decode techniques considered in Park (2021) or the earlier bit extraction techniques. Therefore, I believe this paper is an incremental contribution and is not suitable for publication in a top conference like ICML.

Other Comments or Suggestions

  1. The activation functions considered in this paper are related to the Step function, yet the paper does not mention the study on Step activation functions in Park (2021).

  2. The first time I encountered the equation $w_{\min} = \max(d_x, d_y, 2)$ was in Cai (2023), but it is not listed in Table 1.

Author Response

We thank the reviewer for their time and effort in providing valuable comments. We address all comments of the reviewer below.

While previous papers have not exhaustively covered all activation functions, it is straightforward to deduce results for many activation functions from existing findings.

We do not think that tight minimum width results for general activation functions (e.g., nonaffine analytic) can be straightforwardly deduced from existing results that only consider ReLU and its variants (Park et al., 2021b; Cai, 2023; Kim et al., 2024). For example, Cai (2023) showed the minimum width for Leaky-ReLU networks using the following two results: (1) Neural ODEs can approximate any $L^p$ functions [1], (2) Leaky-ReLU networks can approximate any neural ODEs [2]. Here, (2) requires an activation to either have a strictly monotone breakpoint or have two asymptotic lines with different nonzero slopes. Namely, their proof technique does not directly extend to general activation functions such as nonaffine analytic functions.

In addition, Park et al. (2021b) and Kim et al. (2024) proved the minimum width for networks using ReLU or ReLU-like activation functions using the properties of ReLU (and its variants): it maps the negative inputs to (approximately) zero while the positive input is preserved by an (approximate) identity map. However, general activation functions do not have such properties, and hence, how to extend their constructions to general activation functions is unclear.

[1] Li et al., "Deep learning via dynamical systems: An approximation perspective." Journal of the European Mathematical Society, 2022.

[2] Duan et al., "Vanilla Feedforward Neural Networks as a Discretization of Dynamical Systems." Journal of Scientific Computing 101, no. 3 (2024): 82.

The proof strategy in this paper is essentially no different from those in earlier works, especially the encode-memorize-decode techniques considered in Park (2021) or the earlier bit extraction techniques. Therefore, I believe this paper is an incremental contribution and is not suitable for publication in a top conference like ICML.

We strongly disagree that our proof strategy is not different from those in earlier works. As we mentioned in the previous response, the existing constructions by Park et al. (2021b) rely heavily on specific properties of ReLU and its variants, which cannot be found in general activation functions. To show the tight minimum width for a general class of activation functions, we propose a completely different condition for activation functions, called squashable (Definition 1), and our proofs are based on this condition.

Since we are using different properties of activation functions, our constructions of the encoder and decoder significantly differ from prior ones. Our decoder construction is based on a novel filling-curve argument (Definition 2) that did not appear in prior constructions. Our encoder explicitly slices the input domain along each coordinate via a width-$d_x$ network using squashability, while prior constructions either use a larger width $d_x+1$ (Park et al., 2021) or only prove the existence of a proper slicing (Kim et al., 2024) based on the properties of ReLU.

The cited papers are not discussed in detail in this paper, especially Park (2021), Cai (2023), and Kim (2024).

While we compared our bound with the bounds in (Park et al., 2021b; Cai, 2023; Kim et al., 2024) in Sections 1.1–1.2, following the reviewer’s comment, we will add the above discussions and thoroughly compare our proof techniques with prior ones in the final draft.

The activation functions considered in this paper are related to the Step function, yet the paper does not mention the study on Step activation functions in Park (2021).

Park et al. (2021b) showed that if a network can use both ReLU and Step, the minimum width for universal approximation under the supremum norm is $\max\{d_x+1,d_y\}$. Here, they use the discontinuity of the Step function, which enables them to derive the tight minimum width under the supremum norm. On the other hand, our activation functions of interest (i.e., those satisfying Conditions 1 and 2) are continuous and do not have an additional ReLU; hence, the proof completely differs from the existing ReLU+Step network result. We will add this discussion to the final draft.

The first time I encountered the equation $\max\{d_x,d_y,2\}$ was in Cai (2023), but it is not listed in Table 1.

Following the reviewer’s comment, we will add the bound $\max\{d_x,d_y,2\}$ by Cai (2023) to Table 1 in the final draft.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Reviewer Comment

Thank you for your response. Unfortunately, my concerns have not been addressed. On the contrary, I am now even more convinced that this paper represents an incremental contribution.

The research on minimal width can be categorized into two main approaches. The first approach originates from bit extraction techniques, which were formally summarized by Park et al. (2021) as the encode-memorize-decode technique (or simply the encode-decode technique if "memorize" is merged into either "encode" or "decode"). The second approach is based on the perspective of flow, to which the works you mentioned, Li et al. (2022) and Duan et al. (2024), belong. Cai (2023) used both approaches for various cases. Specifically, as you noted, the second approach can be used to determine the minimal width of leaky-ReLU neural networks. However, for some reason, you did not mention that Cai (2023) also applied the first approach to determine the minimal width of ReLU+Floor neural networks, which is also $\max(d_x, d_y, 2)$.

By combining these two approaches, the study of minimal width can be extended to more general activation functions. However, I do not believe such an extension is suitable for publication in a top-tier conference like ICML.

The reason I consider your technique to be fundamentally similar to previous works is based on the following observation: The overall construction method in this paper follows the encode-decode technique, where the encoder is constructed using the flow perspective. In other words, your encoder and decoder constructions fall entirely within the two existing approaches mentioned above. Therefore, when you claim that “our constructions of the encoder and decoder significantly differ from prior ones,” I find this statement rather superficial.

Of course, from your perspective, you may be able to list numerous differences. However, that is merely a matter of viewpoint. Your encoder is similar to that in Cai (2023), while your decoder is similar to that in Park et al. (2021). Naturally, you can argue that your encoder differs from Park et al. (2021) and your decoder differs from Cai (2023), but that would merely be a play on words.

For example, regarding the construction of the decoder in this paper, Cai (2023) has already shown that a Floor NN can serve as a decoder. It is not difficult to see that the Floor NN decoder can be replaced by an (id, step)-NN decoder. Therefore, the decoder you examine is merely a corollary of an existing result. While your connection between the decoder and filling curves introduces some novelty, your claim that “this did not appear in prior constructions” is highly misleading. Prior research may not have explicitly used the term “filling-curve,” but in essence, they describe the same type of decoder.

In summary, while this paper does contain some new results, it offers little novelty from a technical perspective. In particular, the authors lack an objective understanding and discussion of existing results. Therefore, I maintain my previous assessment that this paper is not suitable for publication in a top-tier conference like ICML.

Author Comment

We thank the reviewer for their response. Our responses to the reviewer’s concerns are listed below.

By combining these two approaches, the study of minimal width can be extended to more general activation functions. However, I do not believe such an extension is suitable for publication in a top-tier conference like ICML.

As we answered in our first response, our work is not a simple extension of (Park et al., 2021; Cai, 2023). We believe that such existing proofs do not directly extend to general activation functions such as non-affine analytic functions, unless concrete evidence to the contrary is provided.

The reason I consider your technique to be fundamentally similar to previous [...] when you claim that “our constructions of the encoder and decoder significantly differ from prior ones,” I find this statement rather superficial.

We would like to clarify our contribution. As the reviewer mentioned, our overall construction follows the encoder-decoder framework proposed by Park et al. (2021) in the sense that the encoder maps an input vector to a scalar value and the decoder maps the scalar value to the target output. Here, our main contribution is to construct networks implementing such an encoder and decoder using the tight minimum width and general activation functions. For the reasons given in our first response, we think our network constructions of the encoder and decoder (in the proofs of Lemmas 9-10), as opposed to the encoder-decoder framework itself, significantly differ from prior ones and are a non-trivial and important contribution. We will clarify this in the final draft.
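To make the framework concrete, the following toy sketch implements an encode-decode pair in the above sense with plain NumPy arithmetic: the encoder quantizes each coordinate and packs the digits into a single scalar, and the decoder unpacks them. It only illustrates the framework; it is not the paper's construction, which realizes both maps with networks of the tight width, and all function names below are ours.

```python
import numpy as np

def encode(x, bits=4):
    """Pack a quantized version of x in [0,1]^d into a single scalar in [0,1)."""
    q = np.minimum((x * 2**bits).astype(int), 2**bits - 1)  # per-coordinate digits
    code = 0
    for qi in q:
        code = (code << bits) | int(qi)
    return code / float(2 ** (bits * len(x)))

def decode(c, d, bits=4):
    """Unpack the scalar code back to the corner of the quantization cell."""
    code = int(round(c * 2 ** (bits * d)))
    digits = []
    for _ in range(d):
        digits.append(code & (2**bits - 1))
        code >>= bits
    return np.array(digits[::-1]) / float(2**bits)

x = np.array([0.3, 0.7])
c = encode(x)
print(c, decode(c, d=2))  # decoded corner is within 2**-4 of x in each coordinate
```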

The encoder is constructed using the flow perspective.

We clarify that in our encoder network construction, we do not use the flow perspective in (Li et al., 2022; Duan et al., 2024, Cai, 2023). Both our encoder network and the underlying idea behind it (e.g., see Section 4.3 and Figure 4) are completely different from those in (Section 4, Cai, 2023): our encoder slices small pieces in the input domain as illustrated in Lemma 10 and Figure 4 while the flow perspective in (Cai, 2023) approximates a neural ODE.

Your encoder is similar to that in Cai (2023), while your decoder is similar to that in Park et al. (2021).

As we described in our previous answer, we do not use the flow argument in our encoder network and we could not find any notable similarity between our encoder network and that in (Cai, 2023). Furthermore, as we answered in our first response, our decoder network is completely different from that in (Park et al., 2021). Specifically, our decoder is a filling curve constructed by exploiting the continuous transition between 0 and 1 that arises when we approximate the binary step function using a squashable activation function (see Section 4.2 and Figure 3 for more details). On the other hand, the decoder in (Park et al., 2021) relies heavily on the properties of ReLU: they construct a piecewise linear function that maps the input scalar values to target vectors (Lemma 10, Park et al., 2021).
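As an informal picture of what an approximately space-filling decoder does, the sketch below traces a piecewise-linear curve through the cell centers of an n x n grid in [0,1]^2, so that a single scalar parameter reaches (approximately) every region of the square. It is only a cartoon of the notion; the paper instead realizes such a curve with a width-$d_y$ network built from a squashable activation, and all names below are ours.

```python
import numpy as np

def snake_grid(n):
    """Centers of an n x n grid in [0,1]^2, visited in boustrophedon order."""
    pts = []
    for i in range(n):
        cols = range(n) if i % 2 == 0 else range(n - 1, -1, -1)
        pts.extend([(i + 0.5) / n, (j + 0.5) / n] for j in cols)
    return np.array(pts)

def decoder(t, pts):
    """Piecewise-linear curve through pts, parameterized by t in [0, 1]."""
    s = t * (len(pts) - 1)
    k = min(int(s), len(pts) - 2)
    lam = s - k
    return (1 - lam) * pts[k] + lam * pts[k + 1]

pts = snake_grid(8)
curve = np.array([decoder(t, pts) for t in np.linspace(0.0, 1.0, 2000)])
# Every grid-cell center is hit (up to sampling resolution) by the curve.
gaps = np.linalg.norm(curve[:, None, :] - pts[None, :, :], axis=-1).min(axis=0)
print(gaps.max())  # small: the scalar parameter reaches every cell of the grid
```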

Cai (2023) has already shown that a Floor NN can serve as a decoder. It is not difficult to see that the Floor NN decoder can be replaced by an (id, step)-NN decoder. Therefore, the decoder you examine is merely a corollary of an existing result.

We do not think the decoder of width $d_y$ in (Cai, 2023) can be easily replaced by (id, step) networks of the same width. For example, if $d_y=1$, then a width-$1$ (id, step)-NN is either an affine map or a scaled/shifted version of the binary step function, since once Step is applied the output can take at most two values. Hence, (id, step) networks of width $1$ cannot represent the floor function. This also implies that one cannot simply replace the floor functions in the Floor decoder by width-$1$ (id, step) networks to construct the (id, step) decoder. We do not think our decoder of width $d_y$ is a direct corollary of the Floor decoder in (Cai, 2023) unless concrete proof is provided.

Naturally, you can argue that your encoder differs from Park et al. (2021) and your decoder differs from Cai (2023), but that would merely be a play on words.

We did not argue that our network constructions differ from only a subset of existing ones. As we responded in our first response, we believe our encoder and decoder networks are significantly different from all those in (Park et al., 2021; Cai, 2023; Kim et al., 2024).

“this did not appear in prior constructions” is highly misleading.

Following the reviewer’s comment, we will not include “this did not appear in prior constructions” in our final draft (it is not written in our current submission). Instead, we will clarify in the final draft that we construct a filling curve via width-$d_y$ networks using the squashability of activation functions and the tight width, while prior works utilize properties of specific activation functions such as ReLU and Floor (Park et al., 2021; Cai, 2023).

On ReLU+Floor networks result.

We will add the ReLU+Floor network result by Cai (2023) to the final draft.

Official Review
Rating: 3

The paper extends a previously established result that two neurons per layer are sufficient for universal function approximation in an unbounded-depth neural network; the extension establishes the known result for a wider family of activation functions, the so-called "squashable" functions, which includes sigmoid, exponential, and other functions.

Questions for the Authors

No questions for authors.

Claims and Evidence

The paper is very dense mathematically. Having not gone through the proofs in detail, I cannot say for certain, but at a high-level glance it looks like a rigorous mathematical treatment.

Methods and Evaluation Criteria

The paper is theoretical, no empirical results.

Theoretical Claims

No, I did not check the proofs in detail.

Experimental Design and Analysis

The paper is theoretical, no experiments.

Supplementary Material

No, I did not go through the proofs in the supplementary material.

Relation to Existing Literature

Assuming the math (which I didn't verify) is correct, it's a nice theoretical result to extend the proof of the minimum width to a greater set of activation functions. Practically it's probably of little use, since (as far as I know) these super-thin networks are terrible generalizers (in general). Also, given that the existing proof already covers a wide variety of ReLU-like functions, the extension is not that surprising: sure, if a ReLU-like function can do with 2 neurons per layer, it seems reasonable that sigmoid would too. I am unable to appreciate the mathematical aspects, so it is possible that the technique of the proofs has something interesting to offer; that is, it might be of more interest how this has been proven, rather than what has been proven.

Essential References Not Discussed

I don't have any additional references to offer.

Other Strengths and Weaknesses

I can't comment on the originality of the proofs provided. The result itself is an incremental contribution, extending known results that already covered the most popular activation functions to a wider set of activation functions. As far as a very mathematical treatment goes, the work is well presented and the paper well written.

Other Comments or Suggestions

No other comments or suggestions.

Author Response

We thank the reviewer for their positive evaluation and valuable comments. We address the reviewer’s concern and clarify our contribution below.

The result itself is an incremental contribution, extending known results that already covered most popular activation functions, to wider set of activation functions.

While prior works derived the tight minimum width for ReLU and its variants, we believe that studying general activation functions is not incremental since (1) prior results do not cover popular non-ReLU-like activation functions such as sigmoid, tanh, and sin, and (2) minimum width results for general activation functions can be applied to any new activation functions that may be used in the future, as long as they satisfy our conditions. We believe that our result can be used to better understand the fundamental limit of narrow networks.

In addition, we would like to note that our result is not a simple extension of prior works that only consider ReLU and its variants (Park et al., 2021b; Cai, 2023; Kim et al., 2024). For example, the proof of the Leaky-ReLU result by Cai (2023) heavily relies on the property of Leaky-ReLU: it contains at least one break point and is strictly monotone near that point. Namely, their proof technique cannot be directly extended to general activation functions such as nonaffine analytic functions. Likewise, Park et al. (2021b) and Kim et al. (2024) rely on the properties of ReLU (and its variants): it maps the negative inputs to (approximately) zero while the positive input is preserved by an (approximate) identity map. However, general activation functions do not have such properties, and how to extend their constructions to general activation functions was unclear.

To show tight bounds for general activation functions, we propose a novel condition on activation functions (Condition 2) that is satisfied by all nonaffine analytic functions and a class of piecewise functions, and explicitly construct a universal approximator of the minimum width based on that condition. We will add this discussion to the final draft.

We would be happy to clarify any concerns or answer any questions that may come up during the discussion period.

Final Decision

This paper provides novel insights on the universal approximation properties of neural networks with certain activation functions. All reviewers seem to agree that the paper is mathematically sound. However, there was extensive discussion about the paper's relation to previous work and the way the previous work is discussed in the submission. Three out of four reviewers see a significant contribution in the submission and are fine with the way the previous work is presented. The fourth reviewer initially expressed strong disagreement with that assessment. After a long discussion, the reviewer also raised their score to "weak accept", while still voicing doubts. In the end, I follow the opinion of the majority of the reviewers, recommending acceptance. However, I urge the authors to take the criticism seriously, even the parts that initially sound extreme. They should definitely improve the discussion of the relation to previous work, including the proof strategies in previous work, in the revised version.