PaperHub
Average rating: 4.8/10 · Decision: Rejected · 4 reviewers
Individual ratings: 6, 6, 1, 6 (min 1, max 6, std 2.2)
Average confidence: 3.8 · Correctness: 2.5 · Contribution: 2.3 · Presentation: 2.8
ICLR 2025

Controlling Statistical, Discretization, and Truncation Errors in Learning Fourier Linear Operators

Submitted: 2024-09-24 · Updated: 2025-02-05
TL;DR

We bound the statistical, discretization, and truncation errors that occur while learning the linear core of the Fourier Neural Operator architecture.

Abstract

Keywords
Operator Learning, Fourier Linear Operators

Reviews and Discussion

Review (Rating: 6)

The authors investigate Fourier operators by analyzing the statistical, discretization, and truncation errors within the Fourier Neural Operator framework. Error bounds are derived for a DFT-based least squares estimator, along with some insights into the efficient approximation of solution operators for PDEs.

Strengths

The error bounds for approximation power and training landscapes are poorly understood in the context of FNOs. I enjoyed the analysis of the three types of error: (1) finite training data (statistical error), (2) representing continuous input/output on a finite grid (discretization error), and (3) Fourier series truncation (truncation error).

In this paper, the discretization error and the truncation error are bounded very nicely. It is very welcome to have both upper and lower bounds on these errors so that one can attempt to balance efficiency with accuracy. The upper bound of the statistical error makes sense, as it is just a Monte Carlo rate of convergence, as expected. The lower bound is a little more mysterious because it comes from a particularly difficult distribution that the authors have designed.

Weaknesses

The statistical error decays only like $1/\sqrt{n}$ in the sample size $n$; however, much faster rates of convergence are usually observed in practice. Getting at this discrepancy is important, because the analysis right now suggests that a large number of training samples is needed for accurate FNOs. One may have to make much stronger assumptions about the PDE to make this work, or restrict the function space on which to learn the solution operator.

The paper’s insights into the model structure only apply to the FNO’s linear core. While that is a first step, the nonlinear layers in an FNO’s architecture are the most important part.

Questions

  • I worry that the suggestion of selecting $N \geq n^{1/2s}$ and $K \geq n^{1/4s}$ arises only because of the weaknesses of Theorem 1. The only serious difference between Theorem 1 and Theorem 2 is the $1/\sqrt{n}$ versus the $1/n$ rate.

  • In most PDE theory, we don’t assume that $V = W$. Instead, $W$ is a smoother space than $V$. Is it easy to adapt Theorems 1 and 2 so that $V \neq W$?

  • Due to the Sobolev embedding theorem, one would usually imagine that the discretization error goes down like $N^{-(s-d/2)}$, not $N^{-s}$. Similarly, the truncation error goes down like $K^{-(2s-d)}$, not $K^{-2s}$. What is your main idea for removing the dimension dependency in the error bounds in Theorem 1?

Comment

We are glad that the reviewer enjoyed reading our article. We address the reviewer's concerns below.

  1. We acknowledge that, in practice, rates faster than $1/\sqrt{n}$, and even faster than $1/n$, are often observed. This discrepancy likely arises from the careful selection of training samples in practice, which deviates from the i.i.d. sampling assumption commonly used in statistical models. In the discussion and future work section, we highlight that alternative learning frameworks, such as active learning, could provide a more appropriate theoretical foundation for analyzing operator learning and may offer insights into these faster convergence rates.

  2. Based on our lower bound of $1/n$, we require $K \geq n^{1/2s}$ and $N \geq n^{1/2s}$ to ensure that the statistical error does not dominate the truncation and discretization errors. However, as the reviewer correctly points out, the recommendation of $K \geq n^{1/4s}$ may not be optimal. We plan to revisit and refine these recommendations as we address the gap between the upper and lower bounds of the statistical rate.

  3. It is indeed possible for the input space and the output space to have different smoothness levels, $s_1$ and $s_2$, with $s_1 < s_2$. However, our analysis still requires $s_1 > d/2$, and the statistical and truncation errors will be dominated by the less smooth space. Specifically, the upper bounds will be $N^{-s_1}$ and $K^{-2s_1}$. Since the smoother space is contained in the less smooth space, we chose both spaces to have the lower smoothness level in our analysis.

  4. We believe that bounds of the form $N^{-(s-d/2)}$ are typically obtained when analyzing the discretization error of DFT-based interpolants under Sobolev norms. In our analysis, we quantify the error under the $L^2$-norm, resulting in a rate of $N^{-s}$ for $s > d/2$. A standard calculation illustrating why the choice of norm changes the exponent is sketched below. If the reviewer could provide specific references for comparison, we would be happy to review them and offer a more detailed discussion of how our bounds compare with existing results.
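
The following is a standard back-of-the-envelope calculation (not taken verbatim from the paper; constants omitted), assuming smoothness is measured in $H^s(\mathbb{T}^d)$. It shows why the $L^2$ truncation error carries no dimension factor, while a uniform-norm bound picks up the extra $d/2$:

```latex
% Truncation of v in H^s(T^d) at frequency K, measured in L^2 (Parseval):
\[
\|v - P_K v\|_{L^2}^2
  = \sum_{\|m\|_\infty > K} |\hat v(m)|^2
  \le K^{-2s} \sum_{\|m\|_\infty > K} \|m\|_\infty^{2s}\, |\hat v(m)|^2
  \le K^{-2s}\, \|v\|_{H^s}^2 .
\]
% Measured in the uniform norm, Cauchy--Schwarz introduces a mode-counting factor:
\[
\sup_x |v(x) - P_K v(x)|
  \le \sum_{\|m\|_\infty > K} |\hat v(m)|
  \le \Big(\sum_{\|m\|_\infty > K} \|m\|_\infty^{-2s}\Big)^{1/2} \|v\|_{H^s}
  \lesssim K^{-(s - d/2)}\, \|v\|_{H^s},
  \qquad 2s > d .
\]
```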

Comment

Thank you for your response. I have decided to maintain my score, as I believe that it is a fair assessment of the manuscript.

Review (Rating: 6)

The authors derive upper and lower bounds on the excess risk. The excess risk is formally defined on page 7; in a nutshell, it is the average difference between the error of the selected statistical estimator and that of the best model in the selected class. As the class of models, the authors consider the linear functional model $g(x) = \sum_{\|m\|_\infty \leq K} \lambda_m \phi_m(x) \langle \phi_{-m}, f \rangle_{L_2}$, where $\phi_m(x) = \exp(2\pi i\, m^T x)$, $\lambda_m$ are unknown parameters, and $f$ and $g$ are the input and output functions, respectively. As an estimator, the authors use least squares with functions discretised on the uniform grid and scalar products computed with the DFT.
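
For concreteness, here is a minimal, hypothetical numpy sketch of such a DFT-based least-squares fit (a 1-D toy setting with illustrative variable names, not the authors' code). Because the model is diagonal in the Fourier basis, the least-squares problem decouples into one scalar regression per retained mode $m$ with $|m| \leq K$:

```python
import numpy as np

# Hypothetical toy example of the setup described above: a linear operator that is
# diagonal in the Fourier basis, fitted by least squares with all inner products
# computed through the DFT on a uniform 1-D grid.
rng = np.random.default_rng(0)
N, n, K = 128, 500, 10                               # grid size, sample size, truncation level

freqs = np.fft.fftfreq(N, d=1.0 / N).astype(int)     # integer Fourier modes
lam_true = np.where(np.abs(freqs) <= K, 1.0 / (1.0 + np.abs(freqs)), 0.0)

# Random smooth periodic inputs, specified through decaying Fourier coefficients.
f_hat = (rng.standard_normal((n, N)) + 1j * rng.standard_normal((n, N))) / (1.0 + np.abs(freqs)) ** 2
g_hat = lam_true * f_hat                             # the true operator acts diagonally in Fourier

# Grid samples of the input/output functions (kept complex-valued for simplicity),
# with a little observation noise on the outputs.
f_grid = np.fft.ifft(f_hat, axis=1) * N
g_grid = np.fft.ifft(g_hat, axis=1) * N
g_grid = g_grid + 0.01 * (rng.standard_normal(g_grid.shape) + 1j * rng.standard_normal(g_grid.shape))

# DFT-based least squares: one scalar regression per mode m with |m| <= K.
F = np.fft.fft(f_grid, axis=1) / N                   # approximate Fourier coefficients of inputs
G = np.fft.fft(g_grid, axis=1) / N                   # ... and of outputs
keep = np.abs(freqs) <= K
lam_hat = np.zeros(N, dtype=complex)
lam_hat[keep] = (np.conj(F[:, keep]) * G[:, keep]).sum(axis=0) / (np.abs(F[:, keep]) ** 2).sum(axis=0)

print("max |lam_hat - lam_true| on retained modes:", np.abs(lam_hat[keep] - lam_true[keep]).max())
```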

Strengths

The article is easy to read, rigorously written, and provides many details about related topics in functional data analysis and neural operators. Although I only reviewed the appendices superficially, I have little doubt about the correctness of the theoretical results.

Weaknesses

The main weakness is that the problem setup chosen by the authors is too simple (or too convenient) and not directly related to the operator learning problem. Somewhat subjectively, I do not find the results presented by the authors surprising. Given the simplicity of the setup, I am also not entirely sure that related results are not already available in the literature on functional data analysis.

Questions

Major concerns:

  1. As I understand it, the problem the authors consider is how to estimate the error of a linear functional model with an integral kernel diagonal in a truncated Fourier basis, fitted with ordinary least squares and discretized on the uniform grid.

    This problem allows a succinct formulation on a single page. Can the authors please discuss why they decided to provide excessive details on operator learning, the Fourier neural operator, and straightforward results like Proposition 1 and Proposition 2? Would it not be easier (for the authors and the reader) to formulate the problem right away?

  2. It does not seem to me that there is a direct relation between the statistical or approximation properties of FNO and the model the authors consider. For example, the authors claim "The inductive bias in FNOs is that the functions are sufficiently smooth so that the higher Fourier modes can be safely ignored." Although it is correct that FNO has some bias toward smoother functions, the situation is not as simple as the authors frame it. FNO contains nonlinear activation functions that are merely continuous. The resulting operator can transform frequencies well above the chosen truncation frequency of the integral kernel. In contrast, the linear model considered by the authors cannot possibly process high frequencies.

    This distinction already compromises the relation between FNO and the linear model. Given that, it is unclear why one should introduce FNO (a vaguely related model with completely different properties) in such great detail.

  3. The result seems straightforward in that it contains a well-known statistical error, a well-known discretization error, and a well-known truncation error that essentially estimates the convergence rate of the Fourier series for functions from a Sobolev space. I cannot point to a paper or monograph that contains the results proven by the authors, but it looks very unlikely that related results are not already available in the literature. The authors provide some discussion explaining why their results are different from already available ones. For example, they mention https://arxiv.org/abs/1208.2895 and claim that the differences are: (i) they consider an "agnostic setting" rather than an additive noise model; (ii) the parameter $K$ is not learned from the observed data; (iii) the basis is not learned from the observed data; (iv) the functional data is given in a discretized form. I would like the authors to extend the discussion of the differences. For example, why is it significant that data is not discretized? Is it not the case that discretization errors (or interpolation errors) are well understood once a particular function space is fixed?

    In my opinion, the setup given in https://arxiv.org/abs/1208.2895 is more natural and more general than the one considered by the authors of the current contribution. The reason, in my opinion, is that https://arxiv.org/abs/1208.2895 is set in standard functional data analysis, while the setup of the present contribution is in between operator learning and functional data analysis. On the one hand, the authors want to keep the structure of the linear operator from FNO, which seems too restrictive from a functional data analysis point of view. On the other hand, the authors consider a single linear layer, which is again too restrictive from a neural operator point of view.

Minor issues:

  1. The authors work with functions on $\mathbb{T}$. Intermediate layers of the Fourier neural operator do not produce exclusively periodic functions. Can the authors comment on that discrepancy?
  2. On lines 503-504 the authors make the strange claim that the exponential dependence on $d$ does not matter for $d=3$. This statement is trivial. Perhaps more importantly, FNO cannot be applied for large $d$ because it suffers from exponential increases in memory and computing.
  3. Line 423: $v_1$ contains an incorrect subscript.
  4. It seems that the result obtained by the authors can be extended to many families of functions; e.g., $\phi_m$ can be replaced with Chebyshev or Legendre polynomials. Can the authors please comment on that?

All in all, I cannot point to any significant technical flaw in the article, but I do not find the results surprising or useful for operator learning and functional data analysis. These points can be considered subjective, so I do not think they should significantly affect the decision to accept or reject the paper. I hope this explains the chosen "overall score".

Details of Ethics Concerns

na

Comment

We thank the reviewer for noting that our article is easy to read, rigorously written, and provides many details about related topics in functional data analysis and neural operators. We address the reviewer's concerns below.

  1. We agree with the reviewer that our problem involves estimating the error of a linear functional model with an integral kernel that is diagonal in the Fourier basis (though the true model itself need not be truncated—only our estimator is). We chose to motivate this problem using FNO because our target audience is the operator learning community. We are happy to move some of the details to the Appendix in the final version if necessary.

  2. We acknowledge that the approximation capabilities of FNOs and our linear model are not directly comparable, as FNOs incorporate nonlinear activation functions capable of processing higher frequencies. Our analysis, however, focuses on understanding specific error components, such as truncation and discretization, which are fundamental to both linear and nonlinear models. While FNOs may effectively handle high frequencies in practice, it is not yet clear whether their discretization error can be controlled for such functions. The only available discretization error analysis for the forward pass through FNO (see https://arxiv.org/pdf/2405.02221) is in the regime $s > d/2$, where functions are dominated by low frequencies. As a result, any approximation advantage of FNOs may be offset by non-vanishing discretization errors when dealing with non-smooth functions. Since the lower bound on the discretization error for our model also applies to FNO, our model can be a valuable framework for exploring FNO’s limitations in this regard.

  3. We understand the reviewer's concern regarding the simplicity of our setup. However, this was partly intentional. More importantly, our contributions extend beyond standard results by addressing the novel problem of multiresolution generalization, where operators trained on lower resolutions generalize effectively to higher resolutions, as observed in practice. In particular, we establish the generalization error of operators trained on lower-resolution grids ($N^d$) that are evaluated at full resolution ($N \to \infty$). Since this is a novel problem, we do not believe that results addressing it currently exist in the literature. Additionally, while classical results often depend on the effective dimension $K^d$, our analysis establishes a $1/\sqrt{n}$ rate that is independent of $K$. This distinction is significant because such $K^d$-dependence can lead to a statistical sample complexity that is worse than the Monte Carlo rate when the truncation error is made arbitrarily small by taking $K \to \infty$. While some existing works achieve a $1/\sqrt{n}$ rate under specific assumptions, such as decay in the operator's spectrum (e.g., bounded Hilbert-Schmidt norm), to the best of our knowledge, no results establish this rate without such assumptions. In our case, this rate is achieved by using the smoothness of the input functions. Since most FDA literature assumes only $L^2$-integrability of input functions, it is unlikely that these results are present in the existing FDA literature.

  4. ``I would like the authors to extend the discussion of the differences. For example, why is it significant that data is not discretized?" This is an excellent question. We believe the key factor that conceptually distinguishes operator learning from standard function-to-function regression in FDA is the role of discretization. In FDA, the grid size is typically predetermined and not within the user’s control. Even in rare cases where measurements can be taken at specific intervals, it is often difficult to make reliable assumptions about the smoothness of the functions. In contrast, operator learning uses training data generated by PDE solvers, enabling the user to determine both the grid size and the smoothness properties of the training functions. This control provides a unique learning-theoretic framework where discretization error plays a critical role in understanding the trade-offs between grid size, smoothness, and generalization. Our work emphasizes the importance of quantifying discretization error, particularly in multiresolution settings. Importantly, this error is not merely the error of the DFT for functions in a specific function space. While the lower bound on the DFT error is likely a lower bound for the discretization error, the upper bound on the DFT error does not necessarily provide an upper bound on the discretization error of the trained operator.

Comment
  1. ``Authors work with functions on $\mathbb{T}$. Intermediate layers of the Fourier Neural Operator do not produce exclusively periodic functions. Can the authors comment on that discrepancy?" We agree with this observation. There are two potential approaches to address this discrepancy. First, quantify the error associated with learning non-periodic functions in the Fourier domain. Second, replace $\varphi_m$ with Chebyshev polynomials or Legendre polynomials, which are better suited for non-periodic functions, and extend our analysis accordingly. While we have not fully worked out the details, our analysis should generalize to other bases as suggested by the reviewer.

  2. ``Perhaps, more importantly, FNO cannot be applied for large $d$ because it suffers from exponential increases in memory and computing." This is correct and is one of the potential limitations of FNO.

  3. While our results may appear unsurprising, we believe they address foundational questions that are critical for bridging operator learning practice, FDA, and learning theory. By highlighting the conceptual parallels between these fields, we hope to stimulate interest from the latter communities. Furthermore, our focus on learning-theoretic aspects, such as $K^d$-free statistical bounds and multiresolution generalization, contributes to understanding operator learning beyond specific architectures like FNOs. These insights may inform future developments in operator learning and its applications.

Comment

I thank the authors for their comments on my questions. After reading other reviews and the discussion, I decided to maintain my original score. In my opinion, the article has no significant errors, but I am still uncertain about the originality of the research and the significance of the problem posed by the authors. The review by fKwW, and the references provided therein, also supports this view.

Review (Rating: 1)

Fourier neural operators (FNOs) have been widely used for learning nonlinear operators between function spaces. Motivated by FNO, this paper studies a simple case in which the FNO is reduced to the composition of a Fourier transform, a learnable linear transform, and an inverse Fourier transform. The goal is to learn a linear operator using this simplified trainable linear operator network. This paper studies the error analysis of this simplified linear operator learning.
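
As a point of reference, the simplified architecture described here (a DFT, learnable diagonal multipliers truncated at frequency $K$, and an inverse DFT) could be sketched as follows. This is our own illustrative code under those assumptions, not the paper's implementation:

```python
import numpy as np

def fourier_linear_core(v, lam, K):
    """Apply F^{-1}(Lambda * F(v)), keeping only modes with max(|m1|, |m2|) <= K.

    v   : (N, N) real array, an input function sampled on a uniform 2-D grid
    lam : (N, N) complex array of learnable spectral multipliers (fft2 ordering)
    K   : truncation frequency
    """
    N = v.shape[0]
    freqs = np.fft.fftfreq(N, d=1.0 / N).astype(int)
    keep = (np.abs(freqs)[:, None] <= K) & (np.abs(freqs)[None, :] <= K)
    v_hat = np.fft.fft2(v)
    return np.fft.ifft2(np.where(keep, lam * v_hat, 0.0)).real

# Example: identity multipliers on the retained modes act as a low-pass filter.
N, K = 64, 8
x = np.arange(N) / N
v = np.sin(2 * np.pi * x)[:, None] * np.cos(4 * np.pi * x)[None, :]
out = fourier_linear_core(v, np.ones((N, N), dtype=complex), K)
```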

Strengths

This paper is very well written.

Weaknesses

  1. The trainable operator in this paper is too simple to be useful for more general analysis. This is a linear operator, and the problem is essentially a linear regression problem in the Fourier domain. This is a well-studied problem. The discretization error related to the DFT is a textbook subject in applied harmonic analysis. The generalization error analysis of a linear regression problem is also classical. The only difference is that this paper considers linear regression with an infinite coefficient vector instead of the finite coefficient vector in textbooks. But the statistical learning error is trivial after the finite truncation with the parameter $K$ in this paper, because after a finite truncation, the infinite coefficient vector is reduced to a finite one and the classical statistical analysis can be directly applied to analyze the error. I don't see any technical challenge in finishing the analysis in this paper. It is very straightforward.

  2. The proposed method cannot be directly extended to learning nonlinear operators because the main idea is a simple linear regression in the Fourier domain.

  3. The authors missed many important recent developments in the error analysis of operator learning. A recent JMLR paper (https://jmlr.org/papers/v25/22-0719.html) has studied a very general error analysis for learning nonlinear operators. When the deep neural network in the JMLR paper is simplified to a linear model, and the encoder and decoder in the JMLR paper are the DFT and inverse DFT, then the analysis there can be directly applied to analyze the problem considered in this paper.

  4. There are other important generalization analyses of operator learning that should be discussed. For example: [1] Generalization error guaranteed auto-encoder-based nonlinear model reduction for operator learning, Hao Liu, Biraj Dahal, Rongjie Lai, Wenjing Liao, Applied and Computational Harmonic Analysis, Volume 74, January 2025, 101717. [2] Deep Operator Learning Lessens the Curse of Dimensionality for PDEs, Ke Chen, Chunmei Wang, Haizhao Yang, Transactions on Machine Learning Research, 2023. [3] On the Training and Generalization of Deep Operator Networks, Sanghyun Lee and Yeonjong Shin, SIAM Journal on Scientific Computing, Vol. 46, Iss. 4 (2024).

Questions

I don't have specific questions. I rank this paper mainly based on the problem and setting it considers.

Comment

We thank the reviewer for their comments.

  1. We agree that our problem can be framed as regression in the Fourier domain. However, quantifying the discretization error requires analysis beyond standard DFT error analysis. Specifically, our focus is on quantifying the generalization error of an operator trained on a grid of size $N^d$ but evaluated at full resolution ($N \to \infty$). This formalizes the concept of multiresolution generalization, a phenomenon commonly observed in practice. To address this, we first established a non-trivial result showing that the discretization error of the estimator is bounded by the uniform deviation between the empirical risk at resolution $N^d$ and the empirical risk at full resolution ($N \to \infty$) (see the beginning of Appendix D). While DFT error bounds are a component of our analysis, they do not fully determine the trained operator’s discretization error. Specifically, while the lower bound on the DFT error is likely a lower bound for the discretization error, the upper bound on the DFT error may not necessarily translate to an upper bound on the discretization error of the trained operator. To the best of our knowledge, quantifying multiresolution generalization error remains an open question, particularly for general nonlinear operators.

  2. We agree with the reviewer that classical statistical bounds apply after truncation. However, these bounds depend on the effective dimension $K^d$, leading to a statistical risk of $\sqrt{\frac{K^d}{n}}$, which becomes vacuous as $K \to \infty$. Our key contribution is establishing a $\frac{1}{\sqrt{n}}$ bound independent of $K^d$, a significant improvement that allows us to analyze the problem for arbitrary $K$. Achieving this required careful Rademacher analysis, as standard techniques such as covering numbers and parameter counting fail to eliminate the $K^d$-dependence.

  3. The $K$-free bound is significant because $K$ in FNOs is analogous to the width in standard neural networks. While generalization bounds for FNOs currently depend on $K$, it is well established that such bounds for standard MLP networks do not depend on the width (see https://arxiv.org/pdf/1712.06541). Our work provides preliminary evidence that a similar $K$-free generalization bound is possible for FNOs, contributing to the theoretical understanding of this widely used model.

  4. ``The proposed method cannot be directly extended to learning nonlinear operators because the main idea is a simple linear regression in the Fourier domain.” We agree that our method is limited to linear operators and cannot be directly extended to nonlinear cases. However, this simplification is intentional, as it allows us to analyze the core component of Fourier Neural Operators (FNOs) in depth.

  5. We thank the reviewer for bringing the JMLR paper (https://jmlr.org/papers/volume25/22-0719/22-0719.pdf) and other references to our attention, which we will cite in the final version. While this JMLR paper achieves impressive results, its analysis does not directly apply to our context. Specifically, the generalization bound in Corollary 11 (where the encoder/decoder are trigonometric polynomials) is $\lesssim n^{-\frac{2}{2+d_X}} \log^2 n + d_X^{-\frac{2s}{D}} + d_Y^{-\frac{2s}{D}}$. In particular, the bound depends on the encoding dimension $d_X$, resulting in a slower decay rate, $n^{-\frac{2}{2+d_X}}$, compared to our $n^{-1/2}$ bound. This bound is only tight when $d_X \leq 4$, but even in this regime, the term $d_X^{-2s/D}$ does not vanish, making the overall bound vacuous. For larger $d_X$, the decay rate deteriorates further and becomes arbitrarily slower than the Monte Carlo rate.

The dependence on $d_X$ in their analysis arises because the authors work with non-parametric operator classes and rely on covering numbers. Such covering number analysis inherently ties the statistical rate to the encoding dimension and cannot establish a bound independent of $d_X$, even for linear operator classes.

In our setting, $d_X$ corresponds to $K^d$, and one of our primary contributions is decoupling $K^d$ from the statistical error through a careful Rademacher analysis. This allows us to establish an $n^{-1/2}$ bound independent of $K^d$, a distinction that is not addressed in the JMLR paper. While we appreciate the depth of their analysis, we believe the goals of these two works differ slightly.

  6. We acknowledge that the immediate practical implications of our results are limited. However, our work addresses foundational questions in operator learning, such as multiresolution generalization and $K$-free bounds, which are critical for bridging gaps between theory and practice. By highlighting the conceptual similarities with FDA and identifying unique learning-theoretic challenges, we also hope to generate interest from both the functional data analysis and learning theory communities.

Review (Rating: 6)

This paper is focused on establishing error estimates for operator learning with Fourier neural operators (FNOs). The authors specifically focus on the linear integral operator inherent to the FNO architecture, and establish error estimates for statistical errors (due to a finite number of samples), truncation errors (from the fact that higher frequency modes are thrown away in the FNO), and a discretization error (arising from the fact that the input functions are sampled on a finite grid). The authors also use the discrete Fourier transform (DFT) to develop a least squares type estimator whose errors can be bounded both from above and below.

Strengths

This paper represents an important step in the direction of certifiable and trustworthy operator learning by establishing different error bounds for the FNOs. The authors' exposition is mostly clear, and the connection to other works such as functional data analysis (FDA) is well-highlighted. The authors also make a number of other interesting and insightful observations on the way to their main results; for instance, they use the Mercer decomposition of the kernel to suggest that the FNO's Fourier layers simply parametrize the eigenvalues of a kernel while fixing the eigenfunctions to be complex exponentials/ trigonometric polynomials.

Weaknesses

  1. FNOs use lifting and projection operators (via MLPs). Lifted input functions are fed to the Fourier layers. To the best of my knowledge, the authors' error estimates don't take this lifting into account at all; they are agnostic to it. However, as far as I am aware, it is unclear how the (nonlinear) lifting procedure interacts with the smoothness assumptions the authors place on the input functions to the operator learning problem. If it in fact violates the assumptions (since the lift is learned via an MLP), I suspect the authors' results will break down. This needs to be addressed in the work.

  2. The authors assume the input and output spaces to the operator learning problem are both in L^2(T^d,R), rather than the general Banach space setting; the choice of R here is to simplify analysis. However, practical operator learning problems are not usually in the convenient setting of periodic domains like T^d. Typically, the problems of interest are in L^2(R^d, R). Then, for non-periodic input functions, one would expect a fourth error term to come into play with the Fourier layer: a Gibbs-type error term that potentially "pollutes" both the truncation and discretization error terms. This also needs to be addressed. I think the setting in the work is too restrictive.

  3. I'd like to see at least preliminary results with input functions that have limited smoothness. This will be vital in order to use operator learning for standard problems containing shocks or discontinuities.

  4. FNOs are certainly not just used with the FNO integral operators studied in this work. Assume an input function in T^d (should be R^d as in the previous comment, but let's assume T^d!) is lifted by the input MLP to R^p, i.e., there are p channels. Each linear operator acts on p vector-valued input functions, and is augmented with a pointwise convolution operator that mixes information across those p channels. The analysis has completely ignored this, as far as I can tell, but those pointwise convolutions appear to be key in the approximation power of FNOs.

  5. The authors should present at least one or two numerical experiments verifying their error bounds even in these restrictive settings (T^d, no pointwise convolution, arbitrary lift and projection). Currently, this work has none, and that is concerning. These numerical experiments would also give practitioners guidance on how to leverage the results from this paper in settings where a practitioner has control over the sampling and discretization procedures.

Questions

  1. See weaknesses above. I think this work is an important step in the right direction, but has certain critical weaknesses and assumptions that are not clearly stated in the work. It would be extremely useful to know whether those hidden assumptions drastically alter the results.

  2. Another smaller quibble is the notation in the paper. Could we have a table in the appendix showing all the symbols and their meaning, at least if those symbols are FNO hyperparameters?

  3. There is also a minor presentation issue. The three different error terms are all mixed together in the text. The authors should clearly separate them out and show us which is which.

  4. The title of the paper says "controlling [various error terms]", but the authors do not specify a practical numerical procedure to control those errors nor do they demonstrate the efficacy of such a procedure. The title probably needs to be changed, or the numerical results I asked for (above) should contain an example demonstrating how to control these errors.

Details of Ethics Concerns

N/A

Comment
  1. Yes, the practical implementation of the linear core of the FNO includes a pointwise transformation $Wv(\cdot)$. If $v$ is an $\mathbb{R}^p$-valued function, $W$ is generally a $p \times p$ matrix (or $q \times p$ if the channel dimension changes across layers). However, for mappings between scalar-valued functions, $W$ reduces to a scalar constant (denoted as $c$), and the transformation $Wv$ becomes a scaled identity operator, $c\mathbf{I}$. Since the identity operator is diagonalizable in the Fourier basis, we have $c\mathbf{I} = c\sum_{m} \varphi_m \otimes \varphi_m$. In light of Proposition 2, this operator shares a similar structure with the Fourier layer $\mathcal{F}^{-1}(\Lambda \mathcal{F}(v))$. Thus, all our analysis should extend easily, with minimal adjustments, in the scalar-valued case. However, as the reviewer notes, the situation becomes more involved for vector-valued functions, as the matrix $W$ mixes information across channels. Addressing these challenges for vector-valued cases is an important direction for future work.

  2. We have included experiments demonstrating that our estimator achieves a vanishing error at a reasonable rate. These results are presented in the final section of the revision, located in the appendix, and highlighted in blue for emphasis.

  3. We will add a notation table in the appendix to clarify the symbols and their meanings. Additionally, we will separate the three error terms into distinct subsections or bullet points to address each error clearly.

Comment
  1. "conceptually separate the paradigm of operator learning from its commonly used instantiation using neural network architectures". In this case, why restrict analysis to the FNO integral operator and not more general basis functions as you mentioned in your point 2 above?

  2. I still have concerns. While periodic extensions are possible even on non-rectangular domains (via analogues of the double Fourier sphere mapping), I'd like to see an explicit proof that the boundary errors won't pollute the interior also. I also agree your analysis can be extended, but see 1 above.

  3. Yes. This discussion would be helpful.

  4. If it was that easy to extend the analysis, I'm a little surprised the authors didn't do so already. It would have significantly enhanced the utility of this work.

Based on the author responses, I increased my score to a 6.

Comment

We thank the reviewer for noting that our work is an important step in the direction of certifiable and trustworthy operator learning. We address the reviewer's concerns below.

  1. We agree with the reviewer that FNOs use lifting and projection operators (via MLPs), which we do not explicitly account for. However, as stated in Section 1.3, our goal is to “conceptually separate the paradigm of operator learning from its commonly used instantiation using neural network architectures.” Thus, we omitted lifting and projection operators from our model. Although nonlinear projection/lifting is often used in practice, Remark 2.2 of (https://arxiv.org/pdf/2107.07562) notes that FNO retains universal approximation properties even when projection/lifting operators act linearly pointwise. In our case, since both input and output functions are scalar-valued, these lifting and projection operators reduce to simple scaling by constants $r$ and $q$. Because there is no nonlinearity in our model, these constants can be absorbed into the model parameters $\lambda$, and our results remain valid. However, we acknowledge that our results may not hold under nonlinear projection or lifting operations.

  2. The reviewer is correct that many practical operator learning problems involve functions in $L^2(\mathbb{R}^d, \mathbb{R})$. However, since current operator learning methods are grid-based, they are effectively constrained to handling functions on a bounded subset $\Omega \subset \mathbb{R}^d$, mostly rectangular. For bounded rectangular domains, a periodic extension of the function can be performed, and for sufficiently smooth functions, any potential discontinuities introduced by this extension are confined to the boundary of $\Omega$. Since we quantify error using the $L^2$-norm (rather than pointwise deviation or uniform norms) and the boundary is a measure-zero set, such discontinuities are unlikely to affect our theoretical results. However, Gibbs-type errors may still arise in practice when computations are performed on a discrete grid. Thus, for non-periodic functions, alternative bases such as orthonormal polynomials or wavelet bases may be more suitable than the Fourier basis, as they are better equipped to handle non-periodicity. Given that our operator of interest has the general form $\sum_{m} \lambda_m \, \varphi_m \otimes \varphi_{-m}$, we could replace $\varphi_m$ with a different orthonormal basis and extend our analysis accordingly.

  3. ``I’d like to see at least preliminary results with input functions that have limited smoothness." The truncation error bound of $\lesssim K^{-2s}$ holds for any $s > 0$ and does not require $s > d/2$. As discussed at the beginning of Section 4.3, $s > 0$ is necessary, as we can establish a non-vanishing lower bound when $s = 0$. Regarding discretization error, the DFT-based method is generally ineffective when $s \leq \frac{d}{2}$, as the DFT cannot reliably approximate the true Fourier transform in this regime. Specifically, for any mode $m$, it is possible to construct a function $v$ such that the error $\left| \mathrm{DFT}(v^N)(m) - (\mathcal{F}v)(m) \right|$ does not vanish as $N \to \infty$. Finally, for the statistical error, when $s \leq d/2$, we can establish a bound of the type $\sqrt{\frac{K^d}{n}}$. Then, to achieve a desired accuracy $\varepsilon$, choosing $K$ such that $K^{-2s} \leq \varepsilon$ implies the sample size must satisfy $n \geq \varepsilon^{-2 - \frac{d}{2s}}$ to make the statistical error $\leq \varepsilon$; this calculation is spelled out below. However, this is worse than our sample complexity of $n \geq \varepsilon^{-2}$ and can become arbitrarily worse when $d/2 \gg s$. If a $\frac{1}{\sqrt{n}}$ rate for the statistical error is desired without any $K$ dependence, it can be achieved under a mild assumption on the spectrum of the operator. If we assume the spectrum sequence $\lambda$ lies in $\ell^2$ rather than $\ell^{\infty}$, the operator becomes Hilbert-Schmidt. Then, we can use results from (https://arxiv.org/abs/1901.10076) and (https://arxiv.org/abs/2309.06548) to show a $1/\sqrt{n}$ upper bound on the statistical error, even when $s = 0$. We are happy to include a discussion of these results in the paper if the reviewer would like.
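
Spelling out the sample-size arithmetic from the paragraph above (constants suppressed, assuming the $\sqrt{K^d/n}$-type statistical bound mentioned there):

```latex
\[
K^{-2s} \le \varepsilon \;\Longrightarrow\; K \ge \varepsilon^{-\frac{1}{2s}},
\qquad
\sqrt{\frac{K^d}{n}} \le \varepsilon \;\Longrightarrow\;
n \ \ge\ \frac{K^d}{\varepsilon^{2}}
  \ \ge\ \varepsilon^{-\frac{d}{2s}} \cdot \varepsilon^{-2}
  \ =\ \varepsilon^{-2-\frac{d}{2s}} .
\]
% By contrast, a K-free bound of order 1/sqrt(n) only requires n >= eps^{-2}.
```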

Comment

We thank the reviewer for engaging with us and increasing the score. We address their questions below.

  1. Our goal was to remove the complexities associated with neural networks while retaining the focus on the linear core of one of the most prominent architectures in the field. By focusing on a single basis rather than a more general one, we aimed to keep the discussion centered on the interesting learning-theoretic aspects of the problem, particularly the interplay between various error types. While the analysis of statistical and truncation errors would remain largely similar for other bases, the discretization error would require a separate analysis, as a different basis would require data on a different grid. For example, the estimator based on the Discrete Chebyshev Transform for Chebyshev polynomials requires access to functions over a Chebyshev grid instead of the uniform grid.

  2. We understand the reviewer's concern that the boundary terms may pollute the interior for non-periodic functions when DFT-based techniques are used. So, to ensure rigor and clarity, we will refrain from making any formal claims about non-periodic functions in this work and instead leave this analysis for future work.

  3. We will make sure to include these discussions in the final version.

  4. Including the transformation by a constant operator $c\mathbf{I}$ introduces only a minor adjustment to our analysis. Specifically, we can express $c\mathbf{I}$ as $$c\mathbf{I} = c \sum_{m \in \mathbb{Z}^d} \varphi_m \otimes \varphi_m.$$ Thus, the operator of interest takes the form $$T = \sum_{m \in \mathbb{Z}^d} c\, \varphi_m \otimes \varphi_m + \sum_{m \in \mathbb{Z}^d} \lambda_m\, \varphi_m \otimes \varphi_{-m}.$$

Let us call the first term $T_1$ and the second $T_2$. For the upper bound, as our analysis relies on controlling deviations between two types of risks, the terms involving $T_1$ and $T_2$ can be separated using the triangle inequality. The same steps applied to $T_2$ in the paper can be repeated verbatim for $T_1$, yielding at most an additional factor of $c$ times the previously established error. While this modification is straightforward, implementing it would result in changes throughout the paper. Thus, we will include it as a remark in the final version of the paper.

AC Meta-Review

(a) Summarize the scientific claims and findings of the paper based on your own reading and characterizations from the reviewers.

This paper addresses the problem of learning linear operators in the context of Fourier Neural Operators (FNOs). The authors focus on three sources of error: statistical error due to finite sample size, discretization error arising from sampling on a finite grid, and truncation error from approximating operators in a finite Fourier basis. The main contributions are theoretical error bounds (upper and lower) for these components, framed within a DFT-based least squares estimator. The work situates itself within operator learning and functional data analysis (FDA), emphasizing foundational questions about multiresolution generalization.

(b) What are the strengths of the paper?

  • Rigorous and well-written presentation with detailed mathematical analysis.
  • Clear contributions to understanding error components in operator learning.
  • Offers theoretical insights into the limitations and behavior of FNO’s linear core.
  • Makes connections to FDA, particularly in highlighting multiresolution generalization.

(c) What are the weaknesses of the paper? What might be missing in the submission?

  • Relevance to Practical Operator Learning: The setup is overly simplistic, focusing on linear models that do not adequately reflect the nonlinear and more general cases typically encountered in practice. This limits the paper's impact on the operator learning community.
  • Lack of Novelty: Many results appear to be straightforward applications of existing statistical and harmonic analysis tools. Reviewers questioned whether the contributions significantly advance the state of the art.
  • Omission of Key Components: Critical elements of FNOs, such as lifting/projection operators and pointwise convolutions, are excluded. This restricts the applicability of the results.
  • Missing Numerical Verification: The absence of empirical experiments to validate the theoretical findings makes it difficult to assess the practical relevance of the results.
  • Restrictive Assumptions: The focus on periodic domains and smooth functions overlooks practical scenarios involving non-periodic or non-smooth data, which are common in real-world problems.

(d) Provide the most important reasons for your decision to accept/reject.

While the paper is rigorous and provides valuable theoretical insights, the overall contributions are incremental and lack sufficient novelty to warrant acceptance at ICLR. The assumptions and simplified setup significantly limit the paper's practical relevance, and the omission of numerical experiments undermines its potential utility for practitioners. Encouragingly, the authors have demonstrated a clear interest in exploring foundational questions, which could serve as a basis for future, more impactful work.

Additional Comments on Reviewer Discussion

The reviewers raised several critical points during the discussion, particularly regarding the paper's limited scope and practical relevance. Key issues included:

  • Simplistic Setup: Reviewers (e.g., fKwW and HGAJ) noted that the linear operator model and its connection to FNO are overly restrictive and lack alignment with real-world operator learning tasks. The authors defended this simplification as necessary for a foundational analysis but acknowledged its limitations.
  • Relation to Prior Work: Multiple reviewers questioned the novelty of the contributions, with specific concerns about overlap with existing FDA results. The authors clarified distinctions but did not fully dispel doubts about originality.
  • Omitted Components: Reviewers pointed out the omission of lifting/projection and pointwise convolutions. While the authors argued that these omissions simplified the analysis, they agreed to highlight these limitations in the paper.
  • Numerical Validation: The absence of experiments was a recurring concern. Although the authors added theoretical clarifications, no empirical validation was included in the rebuttal.

Based on these discussions, I agree with the reviewers' consensus that while the paper is mathematically sound, its contributions are not substantial or novel enough for acceptance at ICLR. However, the work poses interesting foundational questions and could serve as a stepping stone for future research.

Final Decision

Reject