PaperHub
Rating: 6.8/10 (Poster, 4 reviewers; individual ratings 4, 4, 4, 5; min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 2.5, Quality: 2.8, Clarity: 2.5, Significance: 2.5
NeurIPS 2025

Infinite Neural Operators: Gaussian processes on functions

OpenReview | PDF
Submitted: 2025-05-10, Updated: 2025-10-29

Abstract

Keywords
infinite-width neural networks, Gaussian process, neural operator

Reviews and Discussion

Official Review
Rating: 4

The authors show that infinite-width neural operators converge to function-valued Gaussian processes. The resulting covariance functions are computed for two neural operators, including the Fourier neural operator. This result is validated numerically.

Strengths and Weaknesses

The fundamental result that neural operators induce Gaussian processes over their outputs is interesting. As the authors note, the theoretical properties of neural operators are still largely unexplored, and this result suggests a natural means of uncertainty quantification for neural operators with Gaussian parameters/kernels.

One important limitation of this result (which is also applicable to similar known infinite-width NN/GP equivalence results) is that the characterization only applies to neural operators with i.i.d. Gaussian parameters/kernels. Although neural networks in practical scenarios often begin with randomly initialized parameters, how these are affected by training in practical regimes is still largely an open question, which means that any connection between these infinite-width Gaussian process/NN equivalence type results and neural networks (or operators) actually used in practice is still largely unclear.

I have some specific points of confusion about Section 3.1 that seem fundamental to the paper's result (see Questions below). If these can be answered and clarified in the paper's presentation, I will consider raising my score.

Questions

  • Most importantly, in Section 3.1, what is meant by the covariance function associated with a particular operator? From the description in the paper it is not at all clear what $C_{B_1}$ or $c_{B_1}$ are supposed to be. Is the idea that $B_1: L^2(\mathcal{X}, \mathbb{R}^{J_1}) \to L^2(\mathcal{X}, \mathbb{R}^{J_2})$ is itself a Hilbert space-valued Gaussian process and $C_{B_1}$ is its covariance function? (This seems to be what is implied by the following discussion about covariance functions of neural operator layers under Gaussian assumptions.) If so, this needs to be clarified, as it would only make sense specifically if $B_1$ is a Gaussian process rather than an arbitrary operator. I am also not clear on what the distinction between $C_{B_1}$ and $c_{B_1}$ is supposed to be.
  • Is the closed form of the limiting covariance $c_{\infty}$ in Theorem 3.1 only known for certain activation functions $\sigma$? The discussion in Section 3.1 seems to imply that the Gaussian process result only holds for Lipschitz $\sigma$ with $\sigma(0) = 0$ and that a closed-form covariance is only known for some activation functions such as ReLU and sigmoid. These assumptions should be explicitly stated in Theorem 3.1.
  • What is the purpose of the regression experiment in section 5.2? It doesn't seem like it has anything to do with the infinite-width GP result, but rather it just seems to be a comparison of the predictive performance of FNOs and standard finite input/output neural networks on a particular regression task. I don't see how this contributes to the paper.

Limitations

Yes

Final Justification

I thank the authors for their detailed response; they have satisfactorily answered my questions and I have raised my score accordingly. I do encourage them to make some of the points in their rebuttal more explicit in the paper. As they have noted, it should be clarified in Lemma 3.1 that $B_1$ and $B_2$ are Hilbert space-valued stochastic and Gaussian processes, respectively. They should also more explicitly discuss which assumptions on the activation functions are necessary for the various results to hold. I have no problem in principle with assumptions on activation functions; it should just be stated clearly what assumptions are required for an NO to be well-defined, what assumptions are necessary for a closed-form covariance function to be computed, etc. I also encourage them to specifically state that $C_{B_1}[f,g] = c_{B_1}[f,g] I_j$ means that the covariance is proportional to the $j \times j$ identity matrix (I didn't grok on my initial read that $c_{B_1}[f,g]$ is supposed to be a constant depending on $f, g$) and to note explicitly that this is equivalent to the outputs of $B_1$ being i.i.d. across dimensions (I understand that this can be inferred from all the definitions, but because the setup is already heavy on definitions and notation, this makes it more digestible).

Formatting Issues

None

Author Response

Thanks for your thoughtful feedback and contribution towards sharpening our definitions and fleshing out the assumptions behind our theoretical results. Below, we provide clarifications regarding the content of Section 3, as well as our experiments in Section 5.2.

We hope our answers sufficiently address your concerns and elevate your appraisal of our work. Otherwise, we would be happy to engage further.

What is meant by the covariance function associated with a particular operator?

We apologize; the statement of Lemma 3.1 in the main text is missing some details that are only present in Appendix B.1. This will be corrected and improved in the main text. In summary, $B_1$ is a Hilbert space-valued stochastic process, which is not necessarily Gaussian, while $B_2$ must be a Hilbert space-valued Gaussian process. By the covariance function associated with a particular operator, we mean the covariance function of these stochastic processes/random operators.

Distinction between $C_{B_1}$ and $c_{B_1}$

First, it is important to note that $B_1$ maps functions landing in $\mathbb{R}^{J_1}$ to functions landing in $\mathbb{R}^{J_2}$. Consequently, as detailed in Section 2.2, lines 89-91, applying its covariance function to two input functions, $\mathbf{f}$ and $\mathbf{g}$, yields $\mathbf{C}_{B_1}[\mathbf{f},\mathbf{g}]$. This result is a function that takes two domain elements and outputs a $J_2 \times J_2$ matrix. This matrix structure accounts for potential cross-covariance between the output dimensions of $B_1[\mathbf{f}]$ and $B_1[\mathbf{g}]$. Therefore, when we state that $\mathbf{C}_{B_1}[\mathbf{f},\mathbf{g}] = \mathrm{c}_{B_1}[\mathbf{f},\mathbf{g}]\,\boldsymbol{I}_{J_2}$, we are indicating that this matrix is proportional to the identity matrix. This condition is equivalent to requiring that the output dimensions of $B_1[\cdot]$ are independently and identically distributed (i.i.d.) for all input functions.
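
For concreteness, the relation described above can be written out as a displayed equation (our paraphrase, assuming a centered process so that covariances reduce to expectations of products):

```latex
% Covariance function of the Hilbert space-valued process B_1, evaluated at input
% functions f, g and domain points x, y (a J_2 x J_2 cross-covariance matrix):
\mathbf{C}_{B_1}[\mathbf{f},\mathbf{g}](x,y)
  = \mathbb{E}\!\left[ B_1[\mathbf{f}](x)\, B_1[\mathbf{g}](y)^{\top} \right]
  \in \mathbb{R}^{J_2 \times J_2},
% and the identity-proportional (i.i.d.-across-output-dimensions) condition:
\qquad
\mathbf{C}_{B_1}[\mathbf{f},\mathbf{g}]
  = \mathrm{c}_{B_1}[\mathbf{f},\mathbf{g}]\,\boldsymbol{I}_{J_2}.
```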

Assumptions on the activation function.

We would like to clarify that some assumptions on the activation function are necessary for any neural operator to be mathematically well-defined, including finite-width ones, and are thus included when we assume $Z_J$ is an NO. For example, Kovachki et al. (2023) assume measurable, linearly bounded activation functions, a condition slightly weaker than Lipschitz continuity, which includes activations like ReLU, ELU, tanh, and sigmoid. Nonetheless, our statement in line 150 (Section 3.1) is overly restrictive and could be misleading; it should say, "if $\sigma$ is Lipschitz [...] then this is a well-defined operator," rather than "only if." We will amend the main text and add to the supplementary material a proof that, for a toroidal domain, any measurable linearly bounded activation induces a well-defined $L^2$ operator.

As for the functional form of the activation function, closed-form solutions are only known for certain activation functions, but this does not invalidate the theorem, and the set of such activations is not limited to just ReLU and sigmoid. We cited these solutions because they are the most popular activations, but the infinite-width neural network literature has continued to develop closed-form solutions and numerical approximations to Eq. (16). For example, Han et al. (2022) [arxiv:2209.04121] provide ways to compute exact and approximate solutions for a wide class of activations, and these have been implemented in the neural-tangents Python package.
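
As an illustration of the kind of tooling referred to above (not part of the paper), here is a minimal sketch using the neural-tangents package to evaluate a closed-form infinite-width (NNGP) covariance for a ReLU network in the standard finite-dimensional setting; the architecture, widths, and inputs are arbitrary choices for the example.

```python
import jax.numpy as jnp
from neural_tangents import stax

# Infinite-width fully connected ReLU network; kernel_fn gives its limiting covariances.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

x1 = jnp.linspace(-1.0, 1.0, 20).reshape(-1, 1)  # 20 one-dimensional inputs
x2 = jnp.linspace(-1.0, 1.0, 10).reshape(-1, 1)  # 10 one-dimensional inputs

nngp = kernel_fn(x1, x2, 'nngp')  # closed-form 20 x 10 NNGP covariance matrix
print(nngp.shape)
```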

What is the purpose of the regression experiment in section 5.2?

It is true that Section 5.2 does not relate directly to our results in Section 3. However, it shows that, regardless of the model's width, the model's band-limit needs to at least cover the data's band-limit for accurate predictions, and that, even as the width increases, the predictive performance of finite-width models is not directly related to that of infinite-width ones.

Comment

I thank the authors for their detailed response; they have satisfactorily answered my questions and I have raised my score accordingly. I do encourage them to make some of the points in their rebuttal more explicit in the paper. As they have noted, it should be clarified in Lemma 3.1 that $B_1$ and $B_2$ are Hilbert space-valued stochastic and Gaussian processes, respectively. They should also more explicitly discuss which assumptions on the activation functions are necessary for the various results to hold. I have no problem in principle with assumptions on activation functions; it should just be stated clearly what assumptions are required for an NO to be well-defined, what assumptions are necessary for a closed-form covariance function to be computed, etc. I also encourage them to specifically state that $C_{B_1}[f,g] = c_{B_1}[f,g] I_j$ means that the covariance is proportional to the $j \times j$ identity matrix (I didn't grok on my initial read that $c_{B_1}$ is supposed to be a constant depending on $f, g$) and to note explicitly that this is equivalent to the outputs of $B_1$ being i.i.d. across dimensions (I understand that this can be inferred from all the definitions, but because the setup is already heavy on definitions and notation, this makes it more digestible).

Comment

Thank you for your comments, and we are happy that your questions have been answered. We agree these points need more clarity and will revise the statements and discussions accordingly. While we aren't aware of any explicit assumptions that guarantee closed-form solutions for the covariance functions of the activations, we will include the extended discussion of previous literature and numerical approximations based on our comments.

Official Review
Rating: 4

This paper studies when infinite neural operators behave like Gaussian processes. In particular, the authors find that, under certain conditions, arbitrary-depth neural operators with Gaussian-distributed convolution kernels converge to GPs. These conditions include infinite width and randomly initialized weights. This is demonstrated through empirical estimates of the variance converging to the theoretical value.

Strengths and Weaknesses

(+) Well written and clear paper.

(+) The proofs and proof sketches are also clear.

(+) The empirical results demonstrating the density estimation and variance converging are compelling.

(-) The definitions of both NO and infinite-width neural operator are unclear. It looks like the lift and projection layers are ignored, but it could also be interpreted as meaning that $A_{K_1} = A_{K_d} = 0$. Please state this more clearly and define the infinite-width NO.

(-) It is not clear exactly what conditions need to be met for the infinite-width neural operator to behave like a GP.

Questions

  • How is the infinite width neural operator implemented in practice, especially in the case of the results for Fig. 3?
  • Is the infinite-width definition applied to both the projection layers as well as the kernel $K$ and matrix $W$?
  • Does the theory still hold if the number of Fourier modes used in each neural operator layer is fewer than what is necessary to represent the data? Clarity on this and other limitations would potentially increase my score.

Limitations

Yes

Final Justification

This paper provides a strong theoretical framework for understanding neural operators, with empirical experiments that demonstrate it as well. The authors have addressed the concerns of most of the reviewers, including my main concerns. I will keep my score as is.

Formatting Issues

Minor typos:

  • line 135: $C_{B_2}[\frac{1}{J_2} f_2^T f_1]$ -> missing comma?
  • line 186: broken hyperlink

Author Response

Thank you for attentively reviewing and appreciating our work. We hope our answers below appropriately address your concerns. If this is the case, please consider raising your score. Otherwise, we will be more than happy to discuss our contributions in further detail.

How is the infinite width neural operator implemented in practice?

As discussed in Theorem 3.1, an infinite-width neural operator is equivalent to a Gaussian process with a specific kernel $c_\infty$. In Section 4, we show how to compute this kernel in two cases: Eq. (32) and Eq. (36). Given this, we can implement the infinite-width NO just like any kernel method. For instance, given some training and test data, we can compute the Gram matrix $\boldsymbol{C}$ of the kernel by evaluating it on every pair of training points. We can sample from the corresponding GP by multiplying independent Gaussian samples with the Cholesky decomposition of $\boldsymbol{C}$, and the inverse of $\boldsymbol{C}$ can be used to compute the posterior for prediction.

Additionally, our supplemental material includes an implementation of the infinite-width model in PyTorch.
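
To make the recipe above concrete, here is a minimal NumPy sketch (not the authors' supplemental PyTorch code) of using an infinite-width NO as a kernel method. `limiting_kernel` is a hypothetical placeholder standing in for the closed-form covariance $c_\infty$ of Theorem 3.1 (e.g., Eq. (32) or Eq. (36)), and outputs are simplified to one scalar per function.

```python
import numpy as np

def limiting_kernel(f, g):
    # Placeholder: a real implementation would evaluate c_infinity on the two
    # discretized input functions f and g.
    return np.exp(-0.5 * np.mean((f - g) ** 2))

def gram(F1, F2):
    # Gram matrix of the kernel evaluated on every pair of inputs.
    return np.array([[limiting_kernel(f, g) for g in F2] for f in F1])

rng = np.random.default_rng(0)
F_train = rng.standard_normal((20, 64))  # 20 discretized input functions on a 64-point grid
y_train = rng.standard_normal(20)        # toy observed outputs
F_test = rng.standard_normal((5, 64))

C = gram(F_train, F_train) + 1e-6 * np.eye(20)  # jitter for numerical stability
L = np.linalg.cholesky(C)

# Sample from the GP prior at the training inputs: multiply i.i.d. Gaussians by L.
prior_sample = L @ rng.standard_normal(20)

# Posterior mean at the test inputs: C_* C^{-1} y.
C_star = gram(F_test, F_train)
posterior_mean = C_star @ np.linalg.solve(C, y_train)
print(prior_sample.shape, posterior_mean.shape)
```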

Assumptions behind the theoretical results.

We apologize for the possible confusion regarding the lift and projection layers. Kovachki et al. (2023) separate the layers of their architecture into three sections: lift, NO layer, and projection components. However, as we discuss in Section 2.1, both the lift and projection operators can be represented by just the NO layer by simply setting $\boldsymbol{K}$ to zero. As such, in Theorem 3.1, when we defined $\boldsymbol{K}_\ell$ and $\boldsymbol{W}_\ell$ to map functions from $J_{\ell-1}$ channels to $J_\ell$ channels, this includes the possibility of either of them being zero. Our theorem requires that all $J_\ell$ values approach infinity, encompassing all layers, both $\boldsymbol{K}$ and $\boldsymbol{W}$; thus the hidden layers of the lift, NO layer, and projection components all go to infinity.

As for the number of Fourier modes, our theorem is applicable to any number of Fourier modes, regardless of whether they are sufficient to represent the data. As stated in Theorem 3.1, the only requirements are that the weights and kernel for each component operator are independent and identically distributed (i.i.d.) and that their covariance shrinks with increasing width. In the FNO model, Section 4.1 demonstrates that the number of Fourier modes (2B+1) acts as a configurable hyperparameter in kernel computation. Additionally, our theorem is general enough to encompass different NO architectures which might not even have a limited number of Fourier modes, such as the Matérn kernel of Section 4.2 which inherently possesses an infinite number of Fourier modes, though in practice, it is truncated by the input data's bandlimit.

Comment

Thank you for clarifying these questions and for the thoughtful responses. After reading through the other reviews, I will keep my score as is.

Comment

Thank you for reviewing our responses and the other reviews. If you have any additional comments for improvement or questions about our replies, please let us know as we would be happy to improve this work.

Official Review
Rating: 4

This paper extends the classic result that finite-width neural networks converge to Gaussian processes as the width of the network tends to infinity. The results are extended to the operator learning setting, where the input and output of a network are infinite-dimensional functions. Briefly, the result states that as the output dimension of intermediate functions in the operator tends to infinity, the output of the network converges to a function-valued Gaussian process with a computable kernel.

Strengths and Weaknesses

Strengths:

  • The contributions of this paper are reasonable, from the mathematics point of view.
  • The mathematics are well presented and clear.
  • The position in the literature and relation to other works is well discussed.

Weaknesses:

  • As a practical method, the ∞-FNO is not well evaluated, and not in the settings typical of neural operators.
  • As a method of guiding architecture design, it is not clear to me that the paper provides insight to guide FNO design.

Questions

  • Line 186 has a missing reference.
  • Do the authors think it worth discussing Deep Gaussian Processes [1]? They seem a related method that applies in the Bayesian inference setting, and therefore potentially a better comparison than finite-width FNOs.
  • Do the authors see any routes to applying these methods in practice, given they perform less well than finite width FNOs in this simple regression setting? Do they see any impact on finite width FNO design that could be gleaned from this paper?
  • Can the authors think of any classic FNO settings (PDE learning, weather modeling) where a more in detail experiment might be possible?
  • Can the authors explain why lower J seems to return better results for finite FNOs? This seems unexpected to me.

[1] https://arxiv.org/pdf/1211.0358

Limitations

None.

Formatting Issues

None

Author Response

Dear reviewer, thank you for your feedback. Below, we address your concerns. We hope our answer elevates your appraisal of our work. Please let us know if you have any other questions or need more clarification. We would be happy to engage further.

Do the authors think it worth discussing Deep Gaussian Processes [1]? They seem a related method that applies in the Bayesian inference setting, and therefore potentially a better comparison than finite-width FNOs.

To the best of our knowledge, DGPs have not yet been applied to operator learning, and it is not clear how this could be done. In terms of our theoretical results, previous work on neural networks indicates that taking either infinite-width limits or joint infinite-depth-and-width limits recovers Gaussian process models, not DGPs. Our current paper generalizes these results to the operator learning setting; thus, we did not find a link with DGP models.

Do the authors see any routes to applying these methods in practice, given they perform less well than finite width FNOs in this simple regression setting? Do they see any impact on finite width FNO design that could be gleaned from this paper?

The purpose of the regression experiment is not to evaluate the performance of FNOs; rather, it is to stress-test the FNOs when the true width and band-limit are violated in the model specification and to show that their performance is not correlated, as our paper is not focused on advocating for infinite-width FNOs as a modelling tool. A priori, this does not mean that infinite-width FNOs are not applicable in more realistic scenarios.

However, based on previous results on infinite-width neural networks, we see promising avenues for applying our work. For example, Adlam et al. (2023) [arxiv:2303.05420] successfully scaled infinite-width NNs to settings with millions of data points and achieved competitive results against SotA methods, and Meronen et al. (2020) [NeurIPS 2020, arxiv:2010.09494] derived activations for finite-width NNs based on infinite-width models, achieving better out-of-distribution detection and uncertainty quantification.

Can the authors explain why lower J seems to return better results for finite FNOs? This seems unexpected to me.

We speculate that a possible reason behind the performance decrease of FNO as the width (J) increases is that our training set has n = 100 functions, giving a total of (2*5+1)*100 = 1,100 data points, which is much less than the number of parameters.

Comment

I thank the authors for their rebuttal. I have read through the other reviews and rebuttals. I keep my score as is.

Comment

Thank you for going over the rebuttal and our other responses. If you have any further questions or require clarifications, please let us know. We are more than happy to elaborate on any points.

Official Review
Rating: 5

The paper studies neural operators in the infinitely wide limit. The paper provides a thorough background on (neural) operators alongside probability theory (including the strong law of large numbers and Gaussian processes).

The main contribution is a proof that the (a priori) infinite-width limit of neural operators is a function-valued Gaussian process under certain conditions. The paper thus extends results established for the infinite-width limit of Bayesian neural networks priors being Gaussian processes. It does so under similar assumptions of shrinking covariance with width and Gaussian assumptions (on the operators in each layer of the neural operator). The provided proof involves establishing each operator layer as a function-valued GP and that the covariance functions of these layers compose. Following this, the authors show closed-form representations for the covariance of an infinite-width Fourier neural operator and a Toroidal Matérn operator.

Finally, experimental validation shows that the empirical variance and output distribution match theoretical predictions as neural operators become wider in a simple, synthetic problem. An example regression task is studied, showing the difference between the infinitely wide FNOs and trained FNOs in approximating a simple, predefined ground truth FNO.

Strengths and Weaknesses

The paper is generally concise and very clearly communicates the contributions. The paper concisely motivates the study and provides a good, if relatively thorough and dense, review of necessary background material. Rather than devoting so much of Section 2 to establishing the setting, some of those details could be moved to the supplementary material, possibly allowing for more detail in, e.g., Section 3.2.

The theoretical contribution builds on well-established approaches to studying the infinite-width limits of neural networks and adds, to my knowledge, novel theoretical results applicable to, e.g., widely-adopted Fourier neural operators. While outside my immediate area of expertise, the provided proof appears correct and the claims on the theoretical results seem well-founded, and the results are presented clearly.

(a) On experimental results: In the introduction, the authors refer to the data that is analysed as realistic scenarios and real-world data (ll. 11 & 35). Given that the experimental validation in Sections 5.1 and 5.2 involves relatively simple "synthetic" settings, I find this misleading. I recognise that the contribution is theoretical and that limited experimental validation is not a big weakness, yet I would suggest softening the language in the introduction (simple, synthetic problems); alternatively, the present language necessitates adding an analysis of actual realistic/real-world data, for instance the smallest (or a simplified version of the) problems in PDEBench [1] or the simplest problems studied in Sec. 6 of Kovachki et al. (2024).

(b) For the theoretical results, I find that the implications of the results are under-explored in the paper’s presentation. The contributions can stand on their own, and as the authors conclude, their contributions "lay a foundation for future investigations". However, the consequences of the infinite-width view of Bayesian neural networks' priors, as highlighted early on by, e.g., Neal (1995), could be considered for the infinite-width neural operator setting by analogy. Neal, for one, considers the necessity of infinite variance priors and heavy-tailed assumptions (calling the Gaussian process priors "disappointing"). While the theoretical results are sufficiently novel and well-founded (and supported by some experimental results), the paper would, in my opinion, be much stronger if it also discussed current experimental findings on the limitations of, e.g., FNOs in the context of the paper's theoretical results (see question related to empirical studies below).

I find the paper to be a strong submission as it is, but I would strongly consider increasing my score if a valuable discussion of the points in (b) were added or if a more realistic setting, as discussed in (a), were added.

Questions

See points (a) and (b) above, primarily.

Do you think there is a connection between your results and the experimental results showing optimum widths (or diminishing returns) for FNOs in, for example, Section B.2 of George et al. (2024) [2], or the analogous results regarding modes in Section C of Qin et al. (2024) [3]? I am just curious, and you do not have to have or provide answers to this: do you think the initialization behaviour of the infinitely-wide FNO could be related in any way to the findings on "trainability" by Koshizuka et al. (2024) [4], who note that stable training requires initialization "at the edge of chaos"?

Why did you choose to investigate a J=1 ground truth FNO in Section 5.2, rather than, for example, a J=10 configuration? Again, out of curiosity, does the infinite-depth perspective, in your view, offer any insights into why increasing the depth of an FNO appears to worsen performance (disregarding the main issue of the discrepancy between the behaviour of newly initialized and trained models)? Similarly, out of curiosity, do you have any intuition as to whether, e.g., Eq. 26 might inform the limitations encountered by the infinite-depth FNO in the context of Figure 3?

  • l. 70: "matrix-valued"?
  • l. 164: "... whose image is an RKHS..." (?) and "a covariance function"
  • l. 174: "... we apply know..."
  • l. 186: undefined reference
  • l. 241: "…evaluate compre our model..."
  • l. 325: "...scales with cubically..."

[1] Takamoto, Makoto, et al. "PDEBench: An extensive benchmark for scientific machine learning." Advances in Neural Information Processing Systems 35 (2022): 1596-1611.
[2] George, Robert Joseph, et al. "Incremental Spatial and Spectral Learning of Neural Operators for Solving Large-Scale PDEs." Transactions on Machine Learning Research (2024).
[3] Qin, Shaoxiang, et al. "Toward a better understanding of Fourier neural operators: Analysis and improvement from a spectral perspective." CoRR (2024).
[4] Koshizuka, Takeshi, et al. "Understanding the Expressivity and Trainability of Fourier Neural Operator: A Mean-Field Perspective." Advances in Neural Information Processing Systems 37 (2024): 11021-11060.

Limitations

Largely, yes, the authors discussed the limitations.

Final Justification

The authors partially addressed one of my concerns by exploring results on the 1D Burgers' equation, and their added preliminary results do provide evidence towards their relatively strongly worded claims related to realistic scenarios and real-world data; I would still suggest some softening, but I appreciate and recognize the supporting evidence provided.

Similarly, while the authors provide a discussion of the implications of the theoretical work, and the theoretical contributions are strong in their own right, I still find part (b) in "Strengths And Weaknesses" of this review to be a relevant critique of the work, even if the weakness would be ameliorated by adding a discussion similar to the one provided by the authors in the rebuttal to my review.

I am raising my rating to a weak 5 in recognition of the fact that this work constitutes a necessary theoretical foundation for the further study of the infinite-width limit of neural operators.

Formatting Issues

No concerns.

Author Response

Thank you for your valuable input and for pointing out places where our text can be improved, as well as other typos. We hope our answers sufficiently address your concerns. If we have left blank spots or you would like additional clarifications, we will be happy to discuss further.

Connections with existing FNO literature, further theoretical results.

Thank you for highlighting these recent empirical and theoretical results on FNOs. Indeed, we believe that the conclusion of Neal (1995), that, compared to finite-width Bayesian neural networks, their infinite-width counterparts are "disappointing", has become the consensus in the community, as it has been shown in both the Bayesian setting and in the empirical risk minimization setting (e.g., neural tangent kernels) that such infinite-width networks do not learn features from data. This analysis has been expanded by the muP / mean-field perspective, where it has been shown that there exists an optimal way to initialize networks to induce feature learning at the so-called "edge of chaos".

As Neal (1995) laid the foundation for the analysis of infinite-width neural networks, our paper establishes the sufficient conditions for the existence of the infinite-width limit of NOs and characterizes their covariances at initialization. While we are very interested in continuing this analysis into infinite-width NOs trained by stochastic descent, similar to neural tangent kernels, we couldn't have done so without this foundation.

Regarding connections between our paper and the works of George et al. (2024) and Qin et al. (2024), we would not expect our theoretical results to shed light on questions about optimal width or band-limits of FNOs, in the same way that we do not expect NN-GPs to be very illuminating about the specific post-training behavior of finite-width NNs.

Why a J=1 ground truth FNO was investigated, whether an infinite-depth perspective explains why increasing FNO depth worsens performance, and if Eq. 26 sheds light on the limitations of the infinite-depth FNO in Figure 3.

We chose J = 1 as the ground truth to stress-test the infinite-width models, as we would expect this model to be the most different from the infinite-width neural operator, as shown in Section 5.1, compared to other values of J.

Additionally, we would like to clarify that our experiments only consider one-hidden-layer deep FNOs and infinite-width FNOs. That said, we speculate that a possible reason behind the performance decrease of FNO as the width (J) increases is that our training set has n = 100 functions, giving a total of (2*5+1)*100 = 1,100 data points, which is much less than the number of parameters. Nonetheless, we would like to recall that the main point of Figure 3 is showing that, regardless of the model's width, the model’s band-limit needs to at least cover the data’s band-limit for accurate predictions and that, even as the width increases, the predictive performance of finite-width models is not directly related to that of infinite-width ones.

Claims around experimental results

Thanks for pointing this out. We agree that we have oversold our experimental results. As suggested, we will soften the tone of lines 11 and 35 in our revised manuscript. To move our results toward a more realistic setting, we set up an additional experiment with the same set-up as in the paper but using the 1D Burgers' equation dataset with $\nu = 0.002$ from PDEBench. The task is to predict the end state (t=2) given the initial condition (t=0). Due to memory constraints while inverting the kernel matrix, we subsample the dataset to n = 100 functions and a grid size of 103. Our preliminary results are as follows:

| Model | B  | J   | Test L² norm |
|-------|----|-----|--------------|
| FNO   | 20 | 10  | 8.05 ± 0.63  |
| ∞-FNO | 20 | ∞   | 1.31 ± 0.85  |
| FNO   | 20 | 100 | 20.72 ± 4.00 |
| FNO   | 20 | 1   | 24.02 ± 3.73 |

At a glance, the finite-width FNO has a different optimal width for this experiment and still there is no correlation between results from models with increasing width and the test loss from the predictive posterior mean of the infinite-width model.

Comment

I would like to thank the authors for their response and the additional experimental results provided. I find that the authors have adequately addressed my main concerns here and in their responses to other reviewers, and I will improve my rating accordingly.

Comment

We're glad that our responses and additional experiments have addressed your concerns; please let us know if you have any additional comments. We will incorporate parts of our discussion into the paper to clarify our choices in the synthetic experiment, add the results from the 1D Burgers' equation experiment, and better contextualize in the conclusion the role of infinite-width limits and finite-width NNs.

Final Decision

This paper studies the infinite-width limit of randomly initialized deep neural operators and finds a (Hilbert space-valued) Gaussian process limit that is analogous to the limits of deep neural networks. A major justification for the paper is that the inductive biases and functional properties of neural operators are largely unexplored, and this analysis of the infinite-width limit provides a strong theoretical foundation for further analyses of this important class of models. The background and mathematical content are clearly explained, and the proofs themselves are nontrivial contributions of significance. The experiments, though limited, provide empirical justification for the main theoretical results.

Beyond minor points of confusion that the authors have promised to clarify, the key limitation of this paper is the lack of discussion on how to translate these theoretical results into practical insights. In particular, the authors don't offer suggestions for insights into future neural operator design, nor do they discuss how their theory relates to known empirical limitations like optimal width findings from other papers. I encourage the authors to include such a discussion in the camera-ready version. Nevertheless, this paper is a strong and important theoretical contribution that will be of interest to the NeurIPS community, and therefore I recommend acceptance.