PaperHub
6.3 / 10
Poster · 3 reviewers
Reviewer ratings: 3, 3, 4 (min 3, max 4, std 0.5)
ICML 2025

A Bregman Proximal Viewpoint on Neural Operators

Submitted: 2025-01-24 · Updated: 2025-07-24

Abstract

Keywords
neural operators, proximal optimization, Bregman divergence, Fourier neural operator

Reviews and Discussion

Official Review
Rating: 3

This paper considers the problem of efficiently solving PDEs via operator learning. It shows that the neural operator architecture can be interpreted as the minimizer of a Bregman regularization problem, and further designs a novel architecture that includes an inverse activation function. The general framework is then applied to the Fourier neural operator and yields better empirical results. The paper also proves a universal approximation result for Bregman neural operators.

Update after rebuttal

Thank you for your response! I will keep my score.

Questions for Authors

N/A

Claims and Evidence

The claims look well-supported; however, I have no background in this area and cannot verify the claims and proofs.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

I didn't check the proofs.

Experimental Design and Analysis

This paper compares different operators on various PDE benchmarks and shows that the Bregman Fourier operator gives the best results on most benchmarks.

Supplementary Material

No.

Relation to Existing Literature

This paper gives a general framework to characterize neural operators as the minimizers of certain Bregman regularization problems. Classical neural operators can be interpreted as special cases of this framework.

Missing Important References

No.

Other Strengths and Weaknesses

N/A

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for the review and the positive feedback. Should the reviewer require any further clarification, we would be delighted to provide it.

Official Review
Rating: 3

This paper proposes a novel perspective on neural operators based on Bregman proximity operators, where the action of operator layers is interpreted as the minimizer of a Bregman-regularized optimization problem. By defining the Bregman distance through Legendre functions, activation operators are characterized as Bregman proximity operators mapping from the dual space to the primal space. Based on this framework, the authors design a novel operator termed BFNO, and numerical experiments show better performance compared to classical FNO and F-FNO.

Update after rebuttal

After the rebuttal, I opt to maintain my positive score.

Questions for Authors

1. In Section 2.2, the statement "we observe that this property allows training deeper and more accurate models" suggests that the architectural design resembles skip connections, which enable deeper neural networks. Could you clarify the specific source of the BFNO improvement?

2. The test error improvements in Fig. 6 appear marginal.

3. While BFNO demonstrates improved stability in deep architectures, does requiring invertible activation functions (e.g., SoftPlus) limit its applicability in scenarios where non-monotonic or non-smooth activations (e.g., ReLU, GELU) are preferred?

Claims and Evidence

The claim that Bregman Neural Operators (BFNOs) enable deeper architectures with improved performance (Abstract, Section 3.3) is partially supported by Figure 4 (Page 7), where BFNO errors decrease as depth increases up to $T=64$, unlike FNO/ResFNO. However, the paper does not rigorously explain why the Bregman formulation stabilizes deep networks. While Remark 3.6 suggests the identity-preserving property helps, no ablation studies isolate this effect from residual connections. Additionally, Figure 6 (Page 8) shows marginal improvements (e.g., BFNO vs. FNO on 2D Darcy: roughly a 1% relative error reduction), but statistical significance is not tested.

Methods and Evaluation Criteria

The evaluation uses standard PDE benchmarks (Navier-Stokes, Burgers, Darcy) from PDEBench (Section 5.1, Page 7), which are appropriate for operator learning. However, there remains room for improvement:

  • Computational costs (training time, memory) of BFNO are not compared to baselines.

  • The choice of SoftPlus as an invertible ReLU proxy (Section 5.1) is pragmatic but underdiscussed; no experiments validate whether this approximation introduces biases or artifacts.

Theoretical Claims

Theorem 4.1 (Page 6) asserts universal approximation for BFNOs with sigmoidal activations. The proof (Appendix B) adapts Cybenko's approach but assumes $\sigma$ is a homeomorphism (e.g., sigmoid). However, the experiments use SoftPlus (a smooth approximation of ReLU), which is not sigmoidal.

Experimental Design and Analysis

The experimental framework is broad and systematic, covering multiple PDE benchmarks (1D/2D Navier-Stokes, Burgers, Darcy) from PDEBench (Section 5.1, Page 7), which are widely accepted in operator learning literature. The comparison to strong baselines (FNO, ResFNO, F-FNO) and ablation studies (e.g., activation functions in Appendix D.4) demonstrates rigor.

Areas for Improvement

Layer-wise analysis: Figure 5 (Page 8) shows BFNO weights concentrate near zero, suggesting implicit regularization. However, no causal link is established between this distribution and improved generalization.

Activation choice: Table 5 (Appendix D.4) claims BFNO with ReLU matches SoftPlus performance, but this is only tested on 2D Navier-Stokes. ReLU violates Theorem 4.1’s assumptions, yet no theoretical justification is given for its empirical viability.

Supplementary Material

I have reviewed the supplementary material, mainly for the additional results.

Relation to Existing Literature

This work relates to numerical PDE solving with deep learning, such as Fourier Neural Operators and their variants.

Missing Important References

None.

Other Strengths and Weaknesses

None.

Other Comments or Suggestions

The writing of the abstract and contribution section is quite opaque; it seems difficult for a general conference audience to grasp the major contributions of this work. To improve readability, the authors should clarify whether the core contribution of this paper is a novel theoretical framework or the introduction of BFNO as a new state-of-the-art (SOTA) neural operator.

Author Response

We sincerely thank the reviewer for their time and thoughtful comments, and we appreciate the overall positive feedback. First, we would like to make it clear that the main objective of our contribution is to provide a novel theoretical framework that allows the development of well-founded and effective models, which we will clarify in the abstract and introduction.

Questions

Q1: Our theoretical framework formulates BFNO layers as solutions to Bregman-regularized optimization problems. When all weights are zero, the layers naturally reduce exactly to the identity mapping (Remark 3.6), creating an implicit regularization effect distinct from standard residual connections.

We created a novel architecture, ResFNO, precisely to perform an ablation study with ResNet-style connections, confirming this distinction. While ResFNO still degrades with increased depth, BFNO maintains or improves performance. This difference manifests in Figure 5, where BFNO exhibits a distinctive Laplace-like weight distribution sharply peaked at zero, contrasting with the Gaussian distributions in other models.

From an ODE perspective (see Additional insights), the difference becomes clearer: BFNO applies its linear operator outside the activation function, $\frac{dz(t)}{dt} = K(t)\,\sigma(z(t))$, while residual networks place it inside, $\frac{dv(t)}{dt} = \sigma(K(t)\,v(t))$, creating different gradient flow dynamics.
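To make this distinction concrete, below is a minimal numerical sketch (not the authors' implementation) of an explicit Euler discretization of the two flows above; the dense matrix W, the SoftPlus activation, and the function names are illustrative stand-ins for the Fourier integral operator $K(t)$ and the actual layers.

```python
# Illustrative Euler discretization of the two ODEs above (hypothetical helper names).
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def bregman_step(z, W, b, dt=1.0):
    # BFNO-style flow in pre-activation coordinates: dz/dt = K(t) sigma(z)
    return z + dt * (W @ softplus(z) + b)

def residual_step(v, W, b, dt=1.0):
    # ResNet-style flow: dv/dt = sigma(K(t) v)
    return v + dt * softplus(W @ v + b)

z = np.random.default_rng(0).normal(size=8)
W, b = np.zeros((8, 8)), np.zeros(8)
print(np.allclose(bregman_step(z, W, b), z))   # True: zero weights give the identity (cf. Remark 3.6)
print(np.allclose(residual_step(z, W, b), z))  # False: softplus(0) = log 2, so the block shifts its input
```

With zero weights the Bregman-style update is exactly the identity, while the SoftPlus residual update is not, which is one concrete face of the implicit regularization discussed above.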

Overall, BFNO's improvement can be analyzed as a combination of these two elements: an implicit regularization and a different gradient flow. While a rigorous layer-wise analysis establishing a causal link between weight distribution and improved generalization would be valuable, such an analysis remains an open challenge in the neural operator literature.

Q2: The primary goal of our experiments was to validate our theoretical framework and demonstrate that the properties derived from our Bregman proximal viewpoint translate into measurable performance advantages. While focusing on the theoretical foundations, we ensured rigorous experimental comparisons to establish practical relevance (even by considering WNO models in Appendix D.2).

To ensure a fair comparison, we conducted thorough validation through multiple runs with four fixed seeds and cross-validation of learning rates for all models. To assess statistical significance, we performed non-parametric one-tailed Wilcoxon signed-rank tests. Compared to FNO, BFNO is superior on 5 of the datasets in Figure 6 with a p-value of 0.0625, and with p = 0.125 for the last one, 2D NS ($10^{-3}$), which shows that improvements over FNO are consistent across diverse PDE types. Similarly, BFNO achieves the same level of superiority over ResFNO on 4 datasets (Burgers, Darcy, NS $10^{-4}$ and $10^{-8}$), and on 3 datasets over F-FNO (Burgers, NS $10^{-4}$ and $10^{-8}$).
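For reference, here is a minimal SciPy sketch of such a test (not the authors' script), assuming the test is run per dataset over one relative test error per seed; the error values below are hypothetical placeholders.

```python
# Paired one-tailed Wilcoxon signed-rank test on per-seed errors (placeholder values).
from scipy.stats import wilcoxon

fno_err  = [0.142, 0.139, 0.145, 0.141]   # hypothetical FNO relative errors, 4 seeds
bfno_err = [0.131, 0.127, 0.130, 0.136]   # hypothetical BFNO relative errors, 4 seeds

# alternative='greater': tests whether FNO errors tend to be larger than BFNO errors.
stat, p = wilcoxon(fno_err, bfno_err, alternative='greater')
print(p)  # with 4 seeds and every difference positive, the exact p-value is 1/2**4 = 0.0625
```

Note that 0.0625 is the smallest p-value attainable in this paired design with four seeds, so the reported values are as significant as the design allows.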

Q3: From a theoretical perspective, expressing activation functions as proximity operators imposes constraints: non-monotonic functions like GELU are incompatible, as they cannot be represented as proximity operators (page 5). For Classical Neural Operators (Sec 3.2), activation functions must be monotonic, making ReLU suitable despite its non-smoothness, as it can be expressed as the proximity operator $\mathrm{prox}_g$ of $g=\imath_{[0,+\infty[}$. In contrast, Bregman Neural Operators (Sec 3.3) impose strict monotonicity, theoretically excluding ReLU.
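For completeness, the standard one-line computation behind this claim, applied componentwise, is

$$\mathrm{prox}_{\imath_{[0,+\infty[}}(x) \;=\; \arg\min_{y \ge 0} \; \tfrac{1}{2}(y - x)^2 \;=\; \max(x, 0) \;=\; \mathrm{ReLU}(x).$$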

However, our implementation provides significant practical flexibility. As noted in Remark 3.7, our architecture doesn't require computing inverse activations when composing layers since they cancel out with previous activations (enabled by adding an activation before the first operator layer). Therefore, any activation function can be used in practice.

Results below, completing Appendix D.4, confirm this flexibility, showing that BFNO works effectively with both ReLU and GELU, performing comparably to SoftPlus, which serves as a common smooth approximation of ReLU in the literature when differentiability is needed.

Architecture | 4 layers | 8 layers | 16 layers
FNO (GELU) | 13.7 ± 0.1 | 12.9 ± 0.2 | 12.9 ± 0.1
BFNO (GELU) | 13.3 ± 0.1 | 12.4 ± 0.2 | 12.1 ± 0.1

As suggested by the results, extending our theoretical framework to non-invertible and non-monotonic activation functions is an interesting research direction.

Remarks

On computational costs: BFNO shares the same parameter count and memory footprint as FNO by design. Therefore, the training time per epoch is of the same order across all tasks. This will be clarified in the final version.

On universal approximation: We indeed acknowledge this limitation in our conclusion. The theorem serves as a first theoretical result for our framework and represents, to the best of our knowledge, one of the first universal approximation results for neural operators, alongside Kovachki et al. (2023).

As for the other remarks, we tried to address them in the answers above while staying within the allowed limit.

Reviewer Comment

Thanks to the authors for the detailed response, which addresses part of my concerns. I would like to maintain my positive score.

Official Review
Rating: 4

In this paper, the authors propose a new type of neural operator for solving PDE problems. The idea is to set the neural operator to be the solution of a functional optimization problem; in particular, they choose the operator to be a Bregman proximal operator with respect to some Legendre function. This formulation recovers many classical operators and extends them to the newly coined Bregman neural operators. The authors conducted experiments showcasing the power of the new operator formulation.

给作者的问题

  1. Normally, in the context of optimization, the Bregman divergence needs to be generated by a strongly convex function so that it represents a notion of distance. However, here it seems only strict convexity is required. I am wondering why strictly convex functions suffice?
  2. I am wondering what is the intuition behind choosing the neural operator to be the solution of an optimization problem? In the context of optimization, proximal operators are often used for non-smooth problems since they have an explicit smoothing effect. Is this one of the considerations for choosing a Bregman proximal operator?

Claims and Evidence

yes.

Methods and Evaluation Criteria

yes.

Theoretical Claims

I checked the proofs in the main body.

Experimental Design and Analysis

The authors conducted fair experiments on the benchmark datasets and compared them with many prior neural operators.

Supplementary Material

no

Relation to Existing Literature

This work extends the extensive research on designing neural operators and using neural nets to solve PDEs. It formulates a new type of neural operator based on prior art.

Missing Important References

N/A

Other Strengths and Weaknesses

I think this paper is clearly written with many useful intuitions. Unfortunately, I do not have extensive background knowledge of using neural nets to solve PDEs and designing neural operators, so I cannot provide a precise evaluation of the novelty and significance of this work compared to prior works. However, using Bregman divergences and Bregman operators seems to be an interesting idea, and the experiments showcased the superiority of this method.

Other Comments or Suggestions

N/A

Author Response

We thank the reviewer for his positive feedback. We provide below our answers to the two questions mentioned in the review.

  1. Normally, in the context of optimization, the Bregman divergence needs to be generated by a strongly convex function so that it represents a notion of distance. However, here it seems only strict convexity is required. I am wondering why strictly convex functions suffice?

We thank the reviewer for the thoughtful remark. Indeed, strict convexity of the Legendre function $\psi$ is sufficient to define a Bregman divergence $D_\psi(x, y)$, as it ensures that $D_\psi(x, y) \geq 0$ and that $D_\psi(x, y) = 0$ if and only if $x = y$. These properties hold as long as $\psi$ is strictly convex and differentiable on the interior of its domain, which is the standard requirement in the general theory of Bregman divergences.
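For the record, the quantities involved are the standard ones: for a Legendre function $\psi$, the Bregman divergence is

$$D_\psi(x, y) \;=\; \psi(x) - \psi(y) - \langle \nabla \psi(y),\, x - y \rangle,$$

and strict convexity states precisely that $\psi$ lies strictly above its tangent at $y$ except at the point of tangency, which gives $D_\psi(x, y) > 0$ for $x \neq y$ and $D_\psi(y, y) = 0$.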

That being said, we would like to emphasize that in our setting — as shown in Table 1 — all Legendre functions considered are actually 1-strongly convex on their respective domains. This ensures that the associated Bregman divergences are lower bounded by a quadratic form, i.e.,

$$D_\psi(x, y) \geq \frac{1}{2} \|x - y\|^2,$$

which provides metric-like behavior. We have clarified this point in the revised version.
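For completeness, this bound follows directly from the definition: 1-strong convexity of $\psi$ means

$$\psi(x) \;\geq\; \psi(y) + \langle \nabla \psi(y),\, x - y \rangle + \tfrac{1}{2}\|x - y\|^2,$$

and subtracting the first-order terms on the right-hand side is exactly the statement $D_\psi(x, y) \geq \tfrac{1}{2}\|x - y\|^2$.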

  2. I am wondering what is the intuition behind choosing the neural operator to be the solution of an optimization problem? In the context of optimization, proximal operators are often used for non-smooth problems since they have an explicit smoothing effect. Is this one of the considerations for choosing a Bregman proximal operator?

We appreciate the reviewer's insightful question.

While it is true that proximal operators are classically used to handle non-smooth terms, especially in composite minimization settings, they are not restricted to non-smooth functions. In fact, proximal maps of smooth (even strongly convex) functions are well-defined and have been studied in the context of regularization and smoothing.

In our setting, the motivation for casting the neural operator as the solution to an optimization problem is not solely driven by the smoothing property of proximal operators (although this can be beneficial), but more importantly by their connection to activation functions.

It follows that, from our perspective, each layer performs a structured update—balancing proximity to the previous state with the minimization of an energy functional. This interpretation introduces a variational inductive bias into the architecture, which can encode prior knowledge and improve interpretability and performance. Using Bregman divergences further allows for this trade-off to be geometry-aware.
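In symbols, and as a sketch of the general object rather than the paper's exact layer definition, the Bregman proximity operator underlying this reading is

$$\mathrm{prox}^{\psi}_{g}(y) \;=\; \arg\min_{x} \; g(x) + D_\psi(x, y),$$

where $g$ plays the role of the energy functional and $D_\psi(\cdot, y)$ penalizes departure from the previous state $y$; taking $\psi = \tfrac{1}{2}\|\cdot\|^2$ recovers the classical proximity operator.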

We hope this novel viewpoint will be leveraged by practitioners to encode structured prior knowledge into the architecture of neural operators through principled, optimization-based layers.

Final Decision

This is a paper whose overall appreciation was already positive before the rebuttal process, and stayed so during it.

I read the paper and the rebuttals and comments. The paper's approach of learning layers using Bregman divergences is sound, and I appreciated the good balance between theory and experiments. The loss in (6) makes sense. The authors could have done a better job of explaining a few things asked by reviewers: take, for example, the question on generalization (Cumi). Why is solving such a problem eventually good for generalization? One likely possibility is that the Bregman formulation allows one to control the Lipschitz constant of the model, which, for deep models, can be a guarantee for training behavior that translates into testing behavior as well.

I would also recommend the authors have a look “one level” higher. Deep architectures have links with exponential families. The Bregman formulation is equivalently a maximum likelihood formulation for a classical exponential family, fitting the parameters of a statistical model through (6). But one can also go “above”, using generalized / deformed exponential families, in which case I am confident the same maximum likelihood formulation should uncover some interesting dependencies with the activation function. And finally, integrating non-differentiable activations like ReLU is not a hurdle.
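As a hedged sketch of the connection alluded to here: for a regular exponential family with cumulant (log-partition) function $\psi$, convex conjugate $\psi^*$, and mean parameter $\mu = \nabla\psi(\theta)$, one has

$$p_\psi(x \mid \theta) = h(x)\, e^{\langle x, \theta\rangle - \psi(\theta)}
\quad\Longrightarrow\quad
-\log p_\psi(x \mid \theta) = D_{\psi^*}\!\big(x, \mu\big) - \psi^*(x) - \log h(x),$$

so maximizing the likelihood in $\theta$ amounts to minimizing the Bregman divergence $D_{\psi^*}(x, \mu)$, since the remaining terms do not depend on $\theta$.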

Overall, a very interesting paper, which both stands as a nice mark on the statistical optimization of deep architectures and offers serious potential for further investigations on the topic.

A clear accept.