PaperHub
5.8 / 10
Poster · 4 reviewers
Ratings: 6, 7, 6, 4 (min 4, max 7, std 1.1)
Confidence: 2.8 · Correctness: 2.8 · Contribution: 2.8 · Presentation: 2.5
NeurIPS 2024

Can neural operators always be continuously discretized?

OpenReview · PDF
Submitted: 2024-05-10 · Updated: 2024-11-06
TL;DR

Discretizing injective neural operators may cause them to lose injectivity, even with very flexible discretization methods.

Abstract

Keywords
Neural Operators · Invertibility · Category Theory

Reviews and Discussion

Review
Rating: 6

This paper studies the question of whether a neural operator (or a general diffeomorphism on an infinite dimensional Hilbert space) can be continuously discretized through the lens of category theory. It first proves that there does not exist a continuous approximation scheme for all diffeomorphisms. Then, it shows that neural operators with strongly monotone layers can be continuously discretized, followed by a proof that a bilipschitz neural operator can be approximated by a deep one with strongly monotone layers. Some further consequences are discussed.

Strengths

  1. The paper studies the discretization of a continuous operator, which is a source of error and an important issue in operator learning if one is not careful.
  2. The paper is fairly comprehensive, encompassing both positive and negative results. The study of positive results contains theorems of different flavors.
  3. Although the theory is based on category theory, the presentation and explanation of the results are relatively clear and accessible to readers who are unfamiliar with it.

Weaknesses

  1. Most results presented in the paper are purely theoretical and lack a quantitative or asymptotic estimate. For example,
    • the notion of continuous approximation functor in Definition 8 does not care about the rate of convergence, and
    • there is no estimate for the number of layers $J$ in Theorem 4.
  2. There are no empirical results to support the theory. While this is a theoretical paper, some toy experiments that exemplify the theory would be very helpful.

Questions

  1. In general, is there any assumption of the Hilbert space studied in this paper? For example, are the Hilbert spaces assumed to be separable?
  2. What is the definition of the convergence of finite-dimensional subspaces used in this paper? Strong convergence of the projection operator? I do not think there is a standard definition for this in elementary functional analysis so it would be helpful to say it explicitly in the paper.
  3. Have you studied the role of the bilipschitz constant in your theorems? For example, how does the number of layers $J$ in Theorem 4 depend on it?

Limitations

None.

Author Response

We would like to thank the reviewer for the detailed comments and fair criticisms. We address all of these below:

  1. "Most results presented in the paper are purely theoretical and lack a quantitative or asymptotic estimate."

The proof of Theorem 4 makes it possible to estimate the number $J$ of layers as a function of $\epsilon$, the Lipschitz constants of the maps $F$ and $F^{-1}$, and the $C^2$-norm of $F:B(0,2r_1)\to X$ on a ball having double the radius $r_1$ in which we consider the approximation of $F$. More precisely, $J\leq C\epsilon^{-2}$, where $C$ depends on the radius $r_1$ of the ball where $F$ is approximated and on the $C^2$-norm of the map $F$ in a ball of radius $2r_1$. We will add an explicit formula for $C_0(r_1,\|F\|_{C^2(B(0,2r_1);X)})$ and its proof in the final version of the paper.

The main steps of the proof are the following: First, we use spectral theory of the compact operators $T_1$ and $T_2$ to find a finite dimensional subspace $W$ and the projection $P_W$ onto it so that $\hat F(x)=(I-P_W)x+P_WF(P_Wx)$ is a diffeomorphism that is close to the operator $F:X\to X$. Let $f:W\to W$ be the restriction of $\hat F$ to $W$. After this, we deform $f$ to the invertible linear map $A_1:W\to W$, which is the derivative of $f$ at $x=0$, along the path $t\to f_t$,
$$f_t(x):=\frac{1}{t}\big(f(tx)-f(0)\big)+tf(0),\quad\text{for } t>0,\qquad f_t(x):=A_1x,\quad\text{for } t=0.$$
All operators $f_t:W\to W$ are bi-Lipschitz maps, $f_1(x)=f(x)$ and $f_0(x)=A_1x$. We consider the values $t_1=c_1\epsilon$ and $t_j=t_1+jh$, for $j=2,\dots,J_1$, and the operators
$$Id+B_j=f_{t_{j+1}}\circ f_{t_j}^{-1}: W\to W,\quad j=0,1,2,\dots,J_1.$$
We show that in the ball $B_W(0,2r_1)$,
$$\mathrm{Lip}(B_0)=\mathrm{Lip}(f_{t_1}\circ A_1^{-1}-Id)\le C_2t_1=C_2c_1\epsilon.$$
Here, $C_2$ depends on $r_1$ and the $C^2$-norm of $F$, as well as on the Lipschitz constants of $F$ and $F^{-1}$, and $c_1$ is chosen to be sufficiently small.

Moreover, we show that for $j\ge 1$,
$$\mathrm{Lip}(B_j)\leq C\frac{1}{t_j}(t_{j+1}-t_j)=C\frac{1}{t_j}h,$$
where the factor $\frac{1}{t_j}$ appears due to the multiplier $\frac{1}{t}$ in the definition of $f_t$. To obtain $\mathrm{Lip}(B_j)\le \epsilon$, we choose $h\le c_2\epsilon^2$. This causes the factor $\epsilon^{-2}$ in the bound for $J$.

In addition to this, we consider paths on the Lie group $O(n)$, $n=\dim(W)$, and show that there is a sequence of invertible matrices $A_j\in O(n)$, $j=1,2,\dots,J_2$, such that
$$Id+B_{J_1+j}=A_{j+1}\circ A_j^{-1}:W\to W,$$
where $\mathrm{Lip}(B_{J_1+j})<\epsilon$ and $A_{J_2}$ is either the identity operator $Id:W\to W$ or a reflection operator $B_e:x\to x-2\langle x,e\rangle e$. Here, we can choose the number of steps to be $J_2\leq c_3\epsilon^{-1}$. Combining the operators $Id+B_j$ with $j=1,\dots,J_1+J_2$, we see that $F$ can be deformed to a linear operator or a reflection operator by composing $J=J_1+J_2$ operators of the form $Id+B_j$. This yields the bound for $J$.
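
To make the $\epsilon^{-2}$ scaling concrete, below is a small numerical sketch (our own toy construction on $W=\mathbb{R}^2$, not code from the paper): it builds the path $f_t$ for a simple bi-Lipschitz map $f$ with $f(0)=0$, forms the factors $Id+B_j=f_{t_{j+1}}\circ f_{t_j}^{-1}$ with step size $h\sim\epsilon^2$, and estimates $\mathrm{Lip}(B_j)$ by random sampling. The map $f$, the constants, and the sampling-based Lipschitz estimate are all illustrative assumptions.

```python
import numpy as np

# Toy numerical sketch (ours, not the paper's code): the deformation path f_t on W = R^2
# for a bi-Lipschitz map f with f(0) = 0, and empirical Lipschitz constants of the
# factors B_j defined by Id + B_j = f_{t_{j+1}} o f_{t_j}^{-1}.
rng = np.random.default_rng(0)
A = np.array([[1.0, 0.3], [-0.2, 1.1]])            # invertible linear part
f = lambda x: A @ x + 0.3 * np.tanh(x)             # toy bi-Lipschitz map with f(0) = 0
A1 = A + 0.3 * np.eye(2)                           # A_1 = Df(0), since tanh'(0) = 1

def f_t(x, t):
    # f_t(x) = (f(tx) - f(0))/t + t f(0); here f(0) = 0
    return f(t * x) / t if t > 0 else A1 @ x

def f_t_inv(y, t, iters=25):
    if t == 0:
        return np.linalg.solve(A1, y)
    x = np.linalg.solve(A1, y)                     # Newton iteration, finite-difference Jacobian
    for _ in range(iters):
        J = np.column_stack([(f_t(x + 1e-6 * e, t) - f_t(x, t)) / 1e-6 for e in np.eye(2)])
        x = x + np.linalg.solve(J, y - f_t(x, t))
    return x

def lip_estimate(g, n=500, r=2.0):
    best = 0.0                                     # crude sampled Lipschitz constant on [-r, r]^2
    for _ in range(n):
        x, y = rng.uniform(-r, r, 2), rng.uniform(-r, r, 2)
        best = max(best, np.linalg.norm(g(x) - g(y)) / (np.linalg.norm(x - y) + 1e-12))
    return best

eps, c1 = 0.25, 0.5
h = 0.5 * eps ** 2                                 # step size h ~ eps^2, as in the J <= C eps^{-2} bound
ts = [0.0] + [c1 * eps + j * h for j in range(5)]
for tj, tj1 in zip(ts[:-1], ts[1:]):
    B = lambda x: f_t(f_t_inv(x, tj), tj1) - x     # B_j = f_{t_{j+1}} o f_{t_j}^{-1} - Id
    print(f"t_j = {tj:.3f}  empirical Lip(B_j) ~ {lip_estimate(B):.3f}")
```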

  1. "While this is a theoretical paper, some toy experiments that exemplify the theory would be very helpful."

We much appreciate the reviewer's comment. In Appendix A.1 we had given an example where we approximate the solution operator $x(t)\to u(t)$ of the nonlinear elliptic equation
$$\partial_t^2 u(t)-g(u(t))=x(t),\quad t\in (0,1),$$
with $u=0$ on the boundary, using a discretization based on the Finite Element Method. When $g$ is convex and the source term $x(t)$ is represented in the form $x(t)=\frac{d^2}{dt^2}h(t)$, the map $F:h\to u$ is a diffeomorphism in the Sobolev space $H^2$ with Dirichlet boundary values. The approximation $F\to F_V$ can be obtained by the Galerkin method. We will expand upon this example in the final version of the manuscript. We will also give an example of the no-go theorem using the elliptic (but not strongly elliptic) problem
$$B_su:=-\frac{d}{dt}\bigg(\mathrm{sign}(t-s)\frac{d}{dt} u(t)\bigg)=f(t),\quad t\in [0,1],\qquad u(0)=0,\ \ \frac{d}{dt}u(1)=0.$$
For all $0\le s\le 1$ these equations are uniquely solvable, but we show that when we use the FEM to approximate them, some of the resulting finite dimensional problems have a zero eigenvalue and are not solvable.
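
As a concrete illustration of the first example (our own sketch under simplifying assumptions, not the code of Appendix A.1), the nonlinear problem $\partial_t^2 u-g(u)=x$ on $(0,1)$ with homogeneous Dirichlet data can be discretized by a P1 Galerkin/FEM scheme and the resulting finite dimensional system solved with Newton's method; the convex choice $g(u)=e^u$ and the source term below are arbitrary toy data.

```python
import numpy as np

# Our own toy sketch (hypothetical, not the code of Appendix A.1): P1 Galerkin/FEM for
#   u'' - g(u) = x  on (0,1),  u(0) = u(1) = 0,  with the convex choice g(u) = exp(u).
n = 100                                   # interior nodes
hm = 1.0 / (n + 1)                        # mesh width
t = np.linspace(hm, 1.0 - hm, n)          # interior grid points

# P1 stiffness matrix for -d^2/dt^2; the nonlinear term uses a lumped mass approximation
K = (2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)) / hm

g, dg = np.exp, np.exp                    # convex nonlinearity and its derivative
x = np.sin(np.pi * t)                     # toy source term x(t)
b = hm * x                                # lumped load vector, approximating (x, phi_i)_{L^2}

# the weak form of u'' - g(u) = x reads  -K u - hm*g(u) = b ; solve it with Newton's method
u = np.zeros(n)
for _ in range(30):
    R = -K @ u - hm * g(u) - b
    J = -K - hm * np.diag(dg(u))
    u = u + np.linalg.solve(J, -R)
print("final residual norm:", np.linalg.norm(-K @ u - hm * g(u) - b))
```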

  1. "Do the Hilbert spaces need to be separable?"

The no-go theorem applies also to non-separable Hilbert spaces. Naturally, for such spaces the partially ordered set $S(X)$ of finite dimensional linear subspaces of $X$, which is used as an index set, is huge. In most of our positive results on the existence of approximation operations we have assumed that the Hilbert space is separable, as we use finite rank neural operators as approximators; we have used this in the orthogonal projectors $P_n$ from $X$ to $\mathrm{span}(\phi_1,\dots,\phi_n)$, where $\phi_j$, $j \in \mathbb Z_+$, is an enumerable orthonormal basis. However, it seems to us that our results can be generalized to non-separable Hilbert spaces $X$ that have a non-enumerable orthonormal basis. We will check carefully whether this generalization is possible.

  1. "What is the definition of the convergence of finite-dimensional subspaces used in this paper?"

After Definition 7, line 223, we defined the limit $\lim_{V\to X} y_V=y$. This limit can also be defined by endowing the set $S_0(X)\cup \{X\}$ with the topology associated to the partial ordering of $S_0(X)\cup \{X\}$, that is, the topology generated by the sets $U_V:=\{W\in S_0(X):\ W\supset V\}\cup \{X\}$.
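
For completeness, one explicit way to unfold this net limit (our paraphrase of the same definition, with $S_0(X)$ ordered by inclusion) is
$$\lim_{V\to X} y_V = y \quad\Longleftrightarrow\quad \forall\,\epsilon>0\ \ \exists\, V_0\in S_0(X)\ \text{such that}\ \ \|y_V-y\|_X<\epsilon\ \text{ for all } V\in S_0(X)\ \text{ with } V\supset V_0.$$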

  1. "Have you studied the role of the bi-Lipschitz constant in your theorems? For example, how does the number of layers JJ in Theorem 4 depend on it?"

The bi-Lipschitz constraint (and the form of the neural operator layers) enables us to decompose the map into strongly monotone neural operator layers $G_k=Id+B_k$ (Theorem 4), where $\mathrm{Lip}(B_k)<\epsilon$. Here, $\epsilon \in (0,1)$ is arbitrary. The bi-Lipschitz constant appears in the proof of the estimate of $J$ (see point 1 above). We will mention this observation in the final version of the manuscript.

Comment

Thank you for the detailed response. Since I am not very familiar with category theory and the other related work, I am unable to raise my score further, but I acknowledge that I have read through the rebuttal, and it appears to be a nice paper overall.

Comment

Well, I feel sorry for the authors: a reviewer at NeurIPS, once well known as a top theory-oriented machine learning conference, cannot raise his/her score simply because he/she is not familiar with category theory. I even feel this is symbolic of the current state of theory-oriented machine learning conferences. It is not the fault of the individual reviewer, but of the conference's matching system, which mistakenly assigned a complete amateur in category theory to review category-theoretic research.

This is a suggestion to the chairs for avoiding such mismatches in the future: reviewers should be screened for fundamental knowledge/background/understanding of the field. I am an expert in expressive power analysis, but not at all in category theory or tropical geometry. Unfortunately, this kind of mismatch happens every year, so I am usually skeptical of any mathematical ''theorems'' published in machine learning conferences.

Review
Rating: 7

This paper focuses on the continuous discretization in operator learning. This is a very important question since it involves reducing the infinite-dimensional space to a finite-dimensional space in operator learning. The authors present cases where discretization is continuous and cases where it is not. The results are interesting and can be applied to design methods in operator learning.

Strengths

The proof is solid, and the paper is well-written and organized. I appreciate the results presented in this paper.

Weaknesses

Since this paper is submitted to NeurIPS and not a mathematical journal, I hope the authors can provide some practical examples, such as solving the Poisson equation $\Delta u = f$ to learn the operator relationship between $f$ and $u$. By using methods like DeepONet and FNO, it would be beneficial to determine whether the discretization in these methods is continuous or not. I believe this could make the paper more accessible to a broader audience.

Questions

Mentioned in the Weakness.

Limitations

All right.

Author Response

We thank the reviewer for the suggestion to include an example based on the discretization of simple differential equations as it surely helps the readers to quickly understand the essential features of the no-go theorem of the approximation of invertible operators.

Regarding the positive results, in Appendix A of our paper we have considered nonlinear discretization of the operators $u\to -\Delta u+G(u)$; see the reply to reviewer mJde under point 3. To exemplify the negative result, we will add in the appendix of the paper the following example on the solution operators of differential equations and the non-existence of approximation by diffeomorphic maps: We consider the elliptic (but not strongly elliptic) problem (below called "PDE1")

$$B_su:=-\frac{d}{dt}\bigg((1+t)\,\mathrm{sign}(t-s)\frac{d}{dt} u(t)\bigg)=f(t),\quad t\in [0,1],$$

with the Dirichlet and Neumann boundary conditions

$$u(0)=0,\quad \frac{d}{dt} u(1)=0.$$

Here, $0\le s\le 1$ is a parameter of the coefficient function, with $\mathrm{sign}(t-s)=1$ if $t>s$ and $\mathrm{sign}(t-s)=-1$ if $t<s$. We consider the weak solutions of PDE1 in the space

$$u\in H^1_{D,N}(0,1):=\{v\in H^1(0,1):\ v(0)=0\}.$$

We can write

$$B_su=-D_t^{(2)}A_sD_t^{(1)}u,$$

where

$$A_sv(t)=(1+t)\,\mathrm{sign}(t-s)\,v(t),$$

parametrized by $0\le s\leq 1$, are multiplication operators that are invertible, $A_s:L^2(0,1)\to L^2(0,1)$ (this invertibility makes the equation PDE1 elliptic). Moreover, $D_t^{(1)}$ and $D_t^{(2)}$ are the operators $v\to \frac{d}{dt}v$ with the Dirichlet boundary condition $v(0)=0$ and $v(1)=0$, respectively. We consider the Hilbert space $X=H^1_{D,N}(0,1)$; to generate an invertible operator $G_s:X\to X$ related to PDE1, we write the source term using an auxiliary function $g$,

$$f(t)=Qg:=-\frac{d^2}{dt^2}g(t)+g(t).$$

Then the equation,

$$B_su=Qg,$$

defines a continuous and invertible operator,

$$G_s:X\to X,\quad G_s:g\to u.$$

In fact, $G_s=B_s^{-1}\circ Q$ when the domains of $B_s$ and $Q$ are chosen in a suitable way. The Galerkin method (that is, the standard approximation based on the Finite Element Method) for approximating the equation PDE1 involves introducing a complete basis $\chi_j(t)$, $j=1,2,\dots$, of the Hilbert space $X$, the orthogonal projection

$$P_n:X\to X_n:=\mathrm{span}\{\chi_j:\ j=1,2,\dots,n\},$$

and approximating solutions of PDE1 by solving

$$P_nB_sP_nu_n=P_nQP_ng_n,\quad u_n\in X_n,\ \ g_n=P_ng.$$

This means that the operator $B_s^{-1}Q:g\to u$ is approximated by $(P_nB_sP_n)^{-1}P_nQP_n:g_n\to u_n$, when $P_nB_sP_n:X_n\to X_n$ is invertible.

The above corresponds to the Finite Element Method where the matrix defined by the operator $P_nB_sP_n$ is $m(s)=[b_{jk}(s)]_{j,k=1}^n\in \mathbb R^{n\times n}$, where

$$b_{jk}(s)=\int_0^1 (1+t)\,\mathrm{sign}(t-s)\,\frac{d}{dt}\chi_j(t)\cdot\frac{d}{dt}\chi_k(t)\,dt,\quad j,k=1,\dots,n.$$

Since we used the mixed Dirichlet and Neumann boundary conditions in the above boundary value problem, we see that for $s=0$ all eigenvalues of the matrix $m(s)$ are strictly positive, and when $s=1$ all eigenvalues are strictly negative. As the function $s\to m(s)$ is a continuous matrix-valued function, we see that there exists $s\in (0,1)$ such that the matrix $m(s)$ has a zero eigenvalue and is not invertible. Thus, we have a situation where all operators $B_s^{-1}Q:g\to u$, $s\in [0,1]$, are invertible (and thus define diffeomorphisms $X\to X$), but for any basis $\chi_j(t)$ and any $n$ there exists $s\in (0,1)$ such that the finite dimensional approximation $m(s):\mathbb R^n\to \mathbb R^n$ is not invertible. This example shows that there is no FEM-based discretization method for which the finite dimensional approximations of all operators $B_s^{-1}Q$, $s\in (0,1)$, are invertible.

The above example also shows a key difference between finite and infinite dimensional spaces. The operator $A_s:L^2(0,1)\to L^2(0,1)$ has only continuous spectrum and no eigenvalues or eigenfunctions, whereas finite dimensional matrices have only point spectrum (that is, eigenvalues). The continuous spectrum makes it possible to deform the positive operator $A_s$ with $s=0$ to a negative operator $A_s$ with $s=1$ in such a way that all operators $A_s$, $0\le s\le 1$, are invertible, but this is not possible for finite dimensional matrices. We point out that the map $s\to A_s$ is not continuous in the operator norm topology but only in the strong operator topology, and the fact that $A_0$ can be deformed to $A_1$ in the norm topology by a path that lies in the set of invertible operators is a deeper result. However, the strong operator topology is enough to make the FEM matrix $m(s)$ depend continuously on $s$.
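
To make the sign change tangible, the following toy sketch (our own illustrative code, not the paper's) assembles the FEM matrix $m(s)$ with P1 hat functions and midpoint quadrature and prints its extreme eigenvalues: at $s=0$ they are all positive, at $s=1$ all negative, so for some intermediate $s$ the smallest eigenvalue passes through zero.

```python
import numpy as np

# Our own toy check (not the paper's code): assemble the FEM matrix m(s) for
#   B_s u = -d/dt( (1+t) sign(t-s) du/dt ),  u(0) = 0,  u'(1) = 0,
# with P1 hat functions and midpoint quadrature, and track its eigenvalues in s.
n = 50                                     # number of elements on [0, 1]
h = 1.0 / n
nodes = np.linspace(0.0, 1.0, n + 1)

def m_of_s(s):
    M = np.zeros((n + 1, n + 1))
    for e in range(n):                     # element [t_e, t_{e+1}]
        tm = 0.5 * (nodes[e] + nodes[e + 1])            # midpoint quadrature point
        a = (1.0 + tm) * np.sign(tm - s)                # coefficient (1+t) sign(t-s)
        M[e:e + 2, e:e + 2] += (a / h) * np.array([[1.0, -1.0], [-1.0, 1.0]])
    return M[1:, 1:]                       # drop the node t=0 constrained by u(0)=0

for s in np.linspace(0.0, 1.0, 11):
    lam = np.linalg.eigvalsh(m_of_s(s))
    print(f"s = {s:.2f}   min eig = {lam.min():+.3f}   max eig = {lam.max():+.3f}")
# m(0) is positive definite and m(1) is negative definite; as s sweeps from 0 to 1 the
# smallest eigenvalue changes sign, so some intermediate m(s) is (nearly) singular.
```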

Comment

Thanks for the reply. I will keep my score.

Review
Rating: 6

This paper investigates theoretical limitations of discretizing neural operators on infinite-dimensional Hilbert spaces. The authors first prove a "no-go theorem" (Theorems 1, 2) showing that diffeomorphisms between infinite-dimensional Hilbert spaces cannot generally be continuously approximated by finite-dimensional diffeomorphisms. Then, they provide positive results for certain classes of operators such as strongly monotone (Theorem 3) and bilipschitz neural operators (Theorem 4). They finally provide a concrete example of approximation by finite residual ReLU networks (Theorem 5).

Strengths

  • The universality of neural networks has been demonstrated in various settings. However, research on the approximation abilities of operators is relatively scarce. Particularly, the characterization of classes that cannot be approximated is intriguing. This study is important as it succinctly demonstrates the differences between finite-dimensional and infinite-dimensional properties in the manageable setting of Hilbert spaces.

  • Moreover, the novel approach of expressing approximation sequences in terms of category theory is noteworthy.

Weaknesses

  • On the other hand, the proofs are based on conventional analytical arguments rather than category-theoretic arguments. Therefore, the "category theory" framework might be somewhat exaggerated. It is expected that with refinement of notation and sentence structure, the description could become more perspicuous in the future.

  • There is concern that the categorical description may have obscured the contributions typically seen in traditional approximation theory papers. As the authors likely recognize, various topologies are used in function approximation, and this study focuses only on approximation in the norm topology of Hilbert spaces, and does not negate "all considerable approximation sequences". So, the impossibility theorem presented here might simply be due to the norm topology being too strong. While the Hilbert structure sounds natural as a generalization of Euclidean structure, in reality, concepts like L2 convergence of Fourier series are quite technical and not necessarily an inevitable notion of convergence. It seems that in pursuit of an elegant categorical description, the diversity of function approximation may have been compromised.

Questions

In Definition 3, why is $\sigma$ imposed besides $G$?

Limitations

The authors did not discuss the validity of assumptions.

Author Response

We appreciate the detailed suggestions, criticisms and endorsement of the reviewer. We address all of these below:

  1. "On the other hand, the proofs are based on conventional analytical arguments rather than category-theoretic arguments."

The proofs are indeed based on analytical arguments. We used category theory as a formalism (similar to the one for object-oriented programming) to describe approximation operations on all Hilbert spaces (including non-separable ones). As the collection of all Hilbert spaces cannot be considered as a set (cf. Russell's paradox) but can as a category, we chose to use the language of category theory. In the beginning of the paper, we considered an "approximation operation" to avoid difficulties related to formal category theory.

  1. "The impossibility theorem presented here might simply be due to the norm topology being too strong."

We much appreciate the issue raised by the reviewer. We will include the analysis and discussion below in the revised manuscript, in an appendix on generalizations.

We formulated the approximation functor using the norm topology, as uniform convergence on compact sets is extensively studied in the theory of neural networks. The norm topology also makes it possible to consider quantitative error estimates. However, we agree with the referee that it is important to understand no-go results in weaker topologies. It turns out that our results can be generalized to a setting where the norm topology is partially replaced by the weak topology. Definition 7 is replaced by the following:

Definition [Weak Approximation Functor]. When $S_0(X)=S(X)$, we define the weak approximation functor, denoted by $\mathcal A: \mathcal D\to\mathcal B$, as the functor that maps each $(X,F)\in\mathcal O_{\mathcal D}$ to some $(X,S(X),(F_V)_{V \in S(X)})$ and has the following properties:

(A') For all $r>0$, all $(X,F)\in \mathcal O_{\mathcal D}$ and all $y\in X$, it holds that
$$\lim_{V\to X}\ \sup_{x\in B(0,r)\cap V}\ \langle F_V(x)-F(x),y\rangle_X=0.$$
Moreover, when $F:X\to X$ is the operator $Id:X\to X$ or $-Id:X\to X$, then $F_V$ is the operator $Id_V:V\to V$ or $-Id_V:V\to V$, respectively.

In (A') we added conditions on the approximation of the operators $Id$ and $-Id$. Similarly, the continuity of the approximation functors can be generalized to the case where convergence in the norm topology is replaced by convergence in the weak topology. The proof of the no-go theorem also generalizes to this setting; we will add in the Appendix a theorem stating that there are no weak approximation functors that are continuous in the weak topology.

  3. "In Definition 3, why is $\sigma$ imposed besides $G$?"

This definition needs to be interpreted with care, which we will clarify in the revised manuscript. The appearance of the compact operators $T_1$ and $T_2$ makes the discretization of the activation function $\sigma$ and of the activation functions inside $G$ in Definition 3 different, and this is one reason why we have introduced both $\sigma$ and $G$. To consider invertible neural operators, we assume below that $\sigma$ is an invertible function, for example, the leaky ReLU function. In the operation
$$N:u\to u +T_2(G(T_1u)),$$
the nonlinear function $G$ is sandwiched between the compact operators $T_1$ and $T_2$. Compact operators map weakly converging sequences to norm converging sequences. This is essential in the proofs of the positive results for approximation functors, as discussed in the paper. However, we do not have general results on how the operation $u\to \sigma \circ u$ can be approximated by finite dimensional operators in the norm topology, but only in the weak topology in the sense of the Definition [Weak Approximation Functor] above. Nonetheless, one can overcome this difficulty, for example, by using the explicit form of the activation function and choosing different finite dimensional spaces $V_j$ in each layer of the neural operator.

We now address the question of whether the activation function $\sigma$ is relevant for universal approximation results. If the activation function $\sigma$ is removed, the operator $F$ becomes a sum of a (local) linear operator and a compact (nonlocal) nonlinear integral operator. Moreover, if we compose operators of the above form, the resulting operator, say $H$, is also a sum of a (local) linear operator, $W$, and a compact (nonlocal) operator, $K$. The Fréchet derivative of $H$ at $u_0$ equals $W$ plus a compact linear operator. This means that the Fredholm index of the derivative of $H$ at $u_0$, which equals the index of $W$, is constant, that is, independent of the point $u_0$ where the derivative is computed. In particular, this means that one cannot approximate an arbitrary $C^1$-function $X\to X$ on compact subsets of $X$ by such neural operators. Indeed, for a general $C^1$-function, the Fredholm index may vary with $u_0$. Thus, $\sigma$ appears to be relevant for obtaining universal approximation theorems for neural operators. Again, we will add this analysis to the final version of the manuscript.

  1. "The authors did not discuss the validity of assumptions."

We appreciate this criticism and will address it in the final version of the manuscript. The key assumption is that the neural operator is bilipschitz while being of the general form (4). The expressivity properties and the applicability to designing generative models are discussed in the global comments. We also point out that the strong monotonicity used as an assumption in several lemmas and theorems is an intermediate assumption that is absorbed in Theorem 4, where we consider approximation of bi-Lipschitz neural operators.

In Theorem 5, finite-rank residual neural operators appear as explicit natural approximators of bilipschitz neural operators. Such a perspective has been studied empirically in [Behrmann et al., PMLR 2019, pp. 573-582], although in the finite-dimensional case.

Comment

Thank you for detailed clarifications. I would like to keep my score as is.

"The proofs are indeed based on analytical arguments."

If so, I recommend the authors to reconsider the following phrases in the abstract and conclusion:

"Using category theory, we give a no-go theorem." "We used tools from category theory to produce a no-go theorem."

It would be much impactful and significant if the authors could more directly point out any incorrectness of the proof or inappropriateness of the assumption in the previous studies.

Comment

Thank you for your response. We will address your points in the following way.

  • If so, I recommend the authors to reconsider the following phrases in the abstract and conclusion: "Using category theory, we give a no-go theorem." "We used tools from category theory to produce a no-go theorem"

We appreciate the advice, and will follow it. We will replace

"Using category theory, we give a no-go theorem"

with

"Using analytical arguments, we give a no-go theorem framed with category theory."

and replace

"We used tools from category theory to produce a no-go theorem"

with

"We give a no-go theorem framed with category theory"

in the abstract and conclusion.

  • It would be much impactful and significant if the authors could more directly point out any incorrectness of the proof or inappropriateness of the assumption in the previous studies.

There are several papers which use continuous functions (either as elements of infinite dimensional function spaces or metric spaces) to model images or signals and apply statistical methods and invertible neural networks or maps modeling diffeomorphisms. Often these papers derive theoretical results in the continuous models and present numerical results using finite dimensional approximations. In this process, errors are caused by the discretization and by the effect of changing the dimension of the approximate models. We believe that our work meaningfully addresses these questions as applied to injective/bijective neural operators, an important architecture. We hope that our paper inspires further study of these points. We can include citations to the following papers related to these issues.

The papers below combine neural networks and the approximation of diffeomorphisms, as applied to imaging.

  • Elena Celledoni, Helge Glöckner, Jørgen N. Riseth, Alexander Schmeding. Deep neural networks on diffeomorphism groups for optimal shape reparametrization. BIT Numerical Mathematics (2023) 63:50.

  • Lin Tian, Hastings Greer, François-Xavier Vialard, Roland Kwitt, Raúl San José Estépar, Richard Jarrett Rushmore, Nikolaos Makris, Sylvain Bouix, Marc Niethammer. GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency. CVPR 2023.

The papers below combine invertible neural networks and statistical models, especially for solving inverse problems (including imaging problems).

  • Alexander Denker, Maximilian Schmidt, Johannes Leuschner, Peter Maass. Conditional Invertible Neural Networks for Medical Imaging. Journal of Imaging 2021, 7(11), 243.

  • Ardizzone, L.; Kruse, J.; Rother, C.; Köthe, U. Analyzing Inverse Problems with Invertible Neural Networks. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019.

  • Anantha Padmanabha, G.; Zabaras, N. Solving inverse problems using conditional invertible neural networks. J. Comput. Phys. 2021, 433, 110194

  • Denker, A.; Schmidt, M.; Leuschner, J.; Maass, P.; Behrmann, J. Conditional Normalizing Flows for Low-Dose Computed Tomography Image Reconstruction. In Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, Vienna, Austria, 18 July 2020.

  • Hagemann, P.; Hertrich, J.; Steidl, G. Stochastic Normalizing Flows for Inverse Problems: A Markov Chains Viewpoint. SIAM/ASA Journal on Uncertainty Quantification, Vol. 10, Iss. 3 (2022).

  • Papamakarios, G.; Nalisnick, E.T.; Rezende, D.J.; Mohamed, S.; Lakshminarayanan, B. Normalizing Flows for Probabilistic Modeling and Inference. Journal of Machine Learning Research 22 (2021) 1-64

Review
Rating: 4

The paper addresses the problem of discretizing neural operators, maps between infinite dimensional Hilbert spaces that are trained on finite-dimensional discretizations. Using tools from category theory, the authors provide a no-go theorem showing that diffeomorphisms between Hilbert spaces may not admit continuous approximations by diffeomorphisms on finite-dimensional spaces. This highlights the fundamental differences between infinite-dimensional Hilbert spaces and finite-dimensional vector spaces. Despite these challenges, the authors provide positive results, showing that strongly monotone diffeomorphism operators can be approximated in finite dimensions and that bilipschitz neural operators can be decomposed into strongly monotone operators and invertible linear maps. Finally, they observe how such operators can be locally inverted through an iteration scheme.

Strengths

  • The paper provides theoretical results addressing the challenging problem of discretizing inherently infinite-dimensional objects (neural operators)

Weaknesses

  • The text and presentation require significant polishing. It contains numerous typos, poorly formulated sentences, and instances of missing or repeated words
  • While the paper's theoretical focus is valuable, it lacks examples of specific neural operator structures that meet the theorems or remarks
  • A more detailed discussion on the practical impact of this work, accompanied by examples, would be beneficial for the audience

Please note that my review should be taken with caution, as I am not familiar with category theory and did not thoroughly check the mathematical details. My feedback primarily focuses on the presentation and potential impact of the results rather than a rigorous validation of the theoretical content.

Questions

  • Neural operators are typically defined between Banach spaces. Why does your theory focus on maps between Hilbert spaces instead?
  • Comment: The work in [1] might have been relevant to cite as well.
  • The main neural operator paper [2] develops theoretical results on the universal approximation theory of neural operators. How do your results relate to the ones in that paper?

[1] F. Bartolucci, E. de Bézenac, B. Raonić, R. Molinaro, S. Mishra, R. Alaifari, Representation Equivalent Neural Operators: a Framework for Alias-free Operator Learning, NeurIPS 2023.

[2] Nikola B. Kovachki, Zongyi Li, Burigede Liu, Kamyar Azizzadenesheli, Kaushik Bhattacharya, Andrew M. Stuart, and Anima Anandkumar. "Neural operator: Learning maps between function spaces with applications to PDEs," J. Mach. Learn. Res., 24(89):1–97, 2023.

Limitations

The paper lacks examples of applications of the theorems to specific neural operator structures, and some further discussion on the practical impact of the results with examples

Author Response

We appreciate the valuable comments and constructive feedback of the reviewer. We are pleased to address all of these below.

  1. "The text and presentation require significant polishing."

We agree and sincerely regret this, and have already made many corrections to the manuscript.

  1. "While the paper's theoretical focus is valuable, it lacks examples of specific neural operator structures that meet the theorems or remarks."

In our approximation results (Theorem 5 and Corollary 1), while we consider a large class of bijective neural operators, approximators can be obtained through finite rank neural operators (see Def. 10). Finite-rank neural operators are encountered, e.g., as FNOs [37], wavelet neural operators [Tripura and Chakraborty, Wavelet Neural Operator for solving parametric partial differential equations in computational mechanics problems, Comp. Meth. Appl. Mech. 2023], and Laplace neural operators [Chen et al, arXiv:2302.08166v2, 2023].

The neural operators $F:u\to \sigma\circ (u+T_2(G(T_1u)))$ studied in our paper include neural operators that are close to those introduced in Kovachki-Lanthaler-Mishra (KLM) [28, 31]. This is discussed in the comments for all reviewers.

  1. "A more detailed discussion on the practical impact of this work, accompanied by examples, would be beneficial for the audience."

We much appreciate this suggestion. In Appendix A.1 we had given an example where we approximate the solution operator $x(t)\to u(t)$ of the nonlinear elliptic equation
$$\partial_t^2 u(t)-g(u(t))=x(t),\quad t\in \Omega=(0,1),$$
with $u=0$ on $\partial \Omega$, using a discretization based on the Finite Element Method. In the case when $g$ is a convex function and the source term $x(t)$ is represented in the form $x(t)=\frac{d^2}{dt^2}h(t)$, the map $F:h\to u$ is a diffeomorphism in the Sobolev space $H^2$ with the Dirichlet boundary values $u(0)=0$ and $u(1)=0$. The approximation $F\to F_V$ can be obtained by the Galerkin method. We will expand upon this example in the final version of the manuscript.

  1. "Neural operators are typically defined between Banach spaces. Why does your theory focus on maps between Hilbert spaces instead?"

Via our general framework, we found that strong monotonicity is one of the key ingredients for obtaining a "positive" result, that is, a discretization that preserves invertibility. Strong monotonicity is defined using inner products, which is why we have focused on Hilbert spaces.

However, the no-go theorem which states that diffeomorphisms of Hilbert spaces cannot be continuously approximated by finite dimensional diffeomorphisms implies directly that the same "negative" result holds for general Banach spaces.

The main challenge in using general Banach spaces for the "positive" result is that a map $P_Y : X \to Y$, which maps a point to the closest point in the subspace $Y \subset X$, may be set-valued, that is, there may be several nearest points. Nonetheless, several of our results can be generalized to uniformly convex Banach spaces $X$. For these,
$$\|x\| = \|y\| = 1\ \text{ and }\ x\neq y\quad \implies\quad \|(x+y)/2\|<1.$$
In such a space, for a closed subspace $Y\subset X$ and $x \in X$, there is a unique closest point $y\in Y$ to $x$. (In fact, uniformly convex Banach spaces are strictly convex Banach spaces in which the above inequality holds in a quantitative form.) This makes, e.g., the linear discretization $F \to F_V = P_V F|_V$ well defined. We will include in the revision a detailed discussion of generalizations to strictly convex spaces.

  1. "Comment: The work in [1] might have been relevant to cite as well.

We thank the reviewer for bringing this paper to our attention. We will add [1] to our references.

  1. "The main neural operator paper [2] develops theoretical results on the universal approximation theory of neural operators. How do your results relate to the ones in that paper?"

We agree with the reviewer that this is an important point. The approximation result in [2] is the universality of neural operators, i.e., the ability to approximate any continuous map by a neural operator. Diffeomorphisms are contained in this result, which holds in the function space setting. However, even though a general diffeomorphism $F : X \to X$ can be approximated by a neural operator $F^{NO} : X \to X$, and the neural operator can be approximated by a finite dimensional operator $F^{NO}_V : V \to V$, the proof of the no-go theorem implies that either the approximating infinite dimensional neural operators $F^{NO} : X \to X$ are not diffeomorphisms or the approximation of neural operators by finite dimensional operators, that is, the operation $F^{NO}\to F^{NO}_V$, is not continuous. Note that the present universal approximation results for neural operators have mainly analyzed the approximation of functions $F : X \to X$ by neural operators in the norms of the spaces $C(K)$, where $K\subset X$ is compact, but not in $C^1$-norms.

We will add this discussion to the Introduction.

Comment

Thank you for your detailed response. As I mentioned in my initial review, my understanding of category theory is somewhat limited. My feedback has mainly focused on the presentation and potential impact of the results rather than an in-depth validation of the theoretical content. I am not in a position to increase my score.

Author Response

We thank the reviewers for their valuable comments and detailed questions. We will provide replies to the individual reviewers below, but first would like to make some general statements addressing a few issues raised by all the reviewers.

Common questions: Practical impact/examples of this work?

A practical implication of our result is the description of bi-Lipschitz neural operators as a composition of discretization-invariant layers of invertible finite dimensional neural operators (i.e. neural networks). Such neural operators are useful in generative models, where a probability distribution $\mu_0$ supported on a given model manifold $M_0$ is pushed forward by a map $F_\theta$ to a distribution that one would like to be close to an empirical target distribution $\mu_{data}$ supported on some submanifold $M_{data}$ of the Hilbert space $X$. (Here $\mu_{data}$ and $M_{data}$ are unknown and $\theta$ are optimized with samples from $\mu_{data}$.) Suppose that we know a priori the topology of the data manifold $M_{data}$ and that there is a diffeomorphism $f_0:M_0\to M_{data}$. As all smooth finite dimensional submanifolds of a Hilbert space are close to some finite dimensional subspace $V$, one can start by assuming that there exists an embedding $f_1:M_0\to P_V(M_{data})$ that is close to $f_0$, where $P_V$ is a finite dimensional orthoprojection onto $V$. By considering the model manifold $M_0$ as a subset of $V$, we can extend the embedding $f_1:M_0\to V$ to a diffeomorphism $F_0:V\to V$. This can be done when the dimension of $V$ is sufficiently large [Puthawala et al., ICML 2022].

Furthermore, $F_0$ can be extended to a diffeomorphism
$$F_{ext}=F_0\times Id_{V^\perp}:X\to X,$$
where $F_{ext}$ maps $x=v+w\in V\oplus V^\perp$ to
$$F_{ext}(v+w)=F_0(v)+w.$$
The map $F_{ext}$ can be written as
$$F_{ext}=Id+P_V\circ G\circ P_V,$$
where $P_V$ is a compact linear operator and $G=F_{ext}-Id$. By definition, the map $F_{ext}$ is a neural operator diffeomorphism. Thus, diffeomorphic neural operators can be used to obtain generative models. As the finite dimensional subspace $V$ is not a priori known, and its dimension depends on the accuracy required for the generative model, it is natural to consider infinite dimensional neural operators $F:X\to X$ and study their approximation properties.
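
A minimal finite dimensional sketch of this construction (our own illustrative code, with $X$ truncated to $\mathbb{R}^N$ and $V$ the span of the first $d$ coordinates): $F_{ext}=Id+P_V\circ G\circ P_V$ acts as $F_0$ on $V$ and as the identity on $V^\perp$, and is inverted by inverting $F_0$ on $V$. The specific $F_0$ below is an arbitrary toy diffeomorphism.

```python
import numpy as np

# Illustrative sketch (ours): extending a toy diffeomorphism F0 of V = span(e_1,...,e_d)
# to F_ext = Id + P_V o G o P_V on a truncated Hilbert space X ~ R^N.
N, d = 12, 3
A = np.eye(d) + 0.2 * np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)

def F0(v):                                      # toy diffeomorphism of V ~ R^d
    return A @ v + 0.1 * np.tanh(v)

def F0_inv(w, iters=50):                        # invert F0 by a contraction iteration
    v = np.linalg.solve(A, w)
    for _ in range(iters):
        v = np.linalg.solve(A, w - 0.1 * np.tanh(v))
    return v

P = np.zeros((N, N))                            # P_V, a finite-rank (hence compact) projection
P[:d, :d] = np.eye(d)

def F_ext(x):                                   # F_ext(v + w) = F0(v) + w
    v = P @ x
    return x - v + np.concatenate([F0(v[:d]), np.zeros(N - d)])

def F_ext_inv(y):
    v = P @ y
    return y - v + np.concatenate([F0_inv(v[:d]), np.zeros(N - d)])

x = np.random.default_rng(1).normal(size=N)
print(np.linalg.norm(F_ext_inv(F_ext(x)) - x))  # ~ 0: F_ext is a bijection of R^N
```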

In our paper we show, in Theorem 3, that strongly monotone neural operators can be approximated continuously by finite dimensional neural operators that are diffeomorphisms (hence invertible). In Theorem 4 we show that any bi-Lipschitz neural operator (not necessarily strongly monotone) can locally be represented as a composition of strongly monotone neural operator layers. This implies that bi-Lipschitz neural operators can be approximated by a composition of invertible, finite dimensional neural networks in a continuous way. This makes invertible neural operators a class that behaves well under finite dimensional approximations. Our results can also be summarized by stating that neural operators conditionally serve as a class of diffeomorphisms of function spaces that are simple enough for well-working approximations but still sufficiently expressive (and may model a rich variety of deformations).

The neural operators
$$F:u\to \sigma\circ (u+T_2(G(T_1u)))$$
we study include neural operators that are close to those introduced in Kovachki-Lanthaler-Mishra (KLM) [28, 31]. We have assumed that the operators $T_1$ and $T_2$ are compact linear operators; in several cases these can be chosen to be identity embeddings between different function spaces, chosen so that the embeddings are compact.

Consider a KLM neural operator $F:X\to X$ of the form
$$F:u\to \sigma\circ (u+S_2(H(S_1u))),$$
where $X=H^m(D)$ and $D\subset \mathbb R^d$ is a bounded set. Moreover, let $Y = C(\overline D)$ and $Z = C^{m+1}(\overline D)$, where $m>d/2$. Let $H:Y \to Z$ be a nonlinear (integral) operator,
$$H(u)(x)=\int_D k_\theta(x,y,u(y))u(y)\,dy,$$
where $k_\theta$ is a kernel given by a neural network with sufficiently smooth activation functions $\sigma_j$ of the form
$$k_\theta(x,y,t)=\sum_{j=1}^{J} c_{j}(x,y,\theta)\sigma_j(a_{j}(x,y,\theta)t+b_j(x,y,\theta)),$$
and let $S_1 : X \to Y$ and $S_2 : Z \to X$ be the identity embedding operators mapping between function spaces, $S_j(u)=u$. Thus,
$$F(u)(x)=\sigma\Big(u(x)+\int_{D}k_\theta(x,y,u(y))u(y)\,dy\Big).$$
The spaces $X$, $Y$, and $Z$ are isomorphic, and by writing, e.g.,
$$S_2\circ H=T_2\circ G,\quad\text{where}\quad T_2=S_2\circ J_Z^{-1},\ \ G=J_Z\circ H,$$
and $J_Z : Z \to X$ is an isomorphism, we can write $F$ in the form
$$F(u)=\sigma\circ (u+T_2(G(T_1u))),$$
in which $T_1:X\to X$ and $T_2:X\to X$ are compact linear operators and $G:X\to X$ is a continuous nonlinear operator. In this way, the KLM-operator $F$ can be written in the form studied in our paper.

Furthermore, by choosing $k_{\theta}(x,y,u(y)) = k_{\theta}(x - y)$ (a convolutional kernel) and $D = \mathbb{T}^d$ (the torus), the map $F$ takes the form of an FNO [37].
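
For concreteness, here is a hedged discretization sketch (our own, not code from [28, 31, 37]) of one such kernel integral layer $F(u)(x)=\sigma\big(u(x)+\int_D k_\theta(x,y,u(y))u(y)\,dy\big)$ on $D=(0,1)$, using a fixed grid and equal quadrature weights; the particular kernel $k_\theta$, the leaky-ReLU $\sigma$, and the input $u$ are placeholder choices.

```python
import numpy as np

# Hedged sketch (ours, not code from [28, 31, 37]): a KLM-style kernel integral layer
#   F(u)(x) = sigma( u(x) + \int_D k_theta(x, y, u(y)) u(y) dy ),   D = (0, 1),
# discretized on a uniform grid with equal quadrature weights.
n = 128
xs = np.linspace(0.0, 1.0, n)
w = np.full(n, 1.0 / n)                        # quadrature weights

def sigma(z):
    return np.where(z > 0, z, 0.01 * z)        # leaky ReLU (invertible)

def k_theta(x, y, t):
    # toy smooth kernel "network" k_theta(x, y, t); any parametric choice would do here
    return 0.5 * np.tanh(np.outer(np.cos(np.pi * x), np.sin(np.pi * y)) * t[None, :] + 0.1)

def F(u):
    K = k_theta(xs, xs, u)                     # K[i, j] = k_theta(x_i, y_j, u(y_j))
    integral = K @ (w * u)                     # quadrature of the integral term
    return sigma(u + integral)

u = np.sin(2.0 * np.pi * xs)
print(F(u)[:4])
# With k_theta(x, y, t) = k_theta(x - y) and D the torus, K @ (w * u) becomes a
# discrete convolution, which is the FNO special case mentioned above.
```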

Final Decision

The reviewers find that this work studies an important problem and encourage the authors to incorporate the suggestions and developments from the review process.