PaperHub
Rating: 6.3/10
Poster · 3 reviewers
Ratings: lowest 6, highest 7, standard deviation 0.5
Individual ratings: 7, 6, 6
Confidence: 4.3
Soundness: 3.0
Contribution: 2.7
Presentation: 2.7
NeurIPS 2024

Poseidon: Efficient Foundation Models for PDEs

Submitted: 2024-05-15 · Updated: 2024-11-06

Abstract

Keywords
PDEs, operators, foundation models, transformers, sample efficiency

Reviews and Discussion

Review
Rating: 7

The paper introduces a new PDE foundation model, named Poseidon. The backbone of the model is a multiscale vision transformer. A data augmentation strategy based on the semi-group property of time-dependent PDEs is also proposed to scale up the amount of training data. After pretraining, Poseidon shows higher accuracy on a variety of fluid dynamics benchmark problems than task-specific neural operators, such as the Fourier Neural Operator. The advantage of the proposed model architecture is demonstrated by comparing its accuracy with baseline models of comparable size, such as Multiple Physics Pretraining (MPP). In addition, the performance of Poseidon improves as the model size increases, mirroring the scaling laws observed in the Large Language Model literature.

Strengths

The authors propose Poseidon, a new PDE foundation model based on a multiscale vision transformer. It also exploits the semi-group property of time-dependent PDEs for data augmentation, scaling up the training data. Extensive experiments and evaluations are conducted across a suite of 15 challenging downstream tasks. The authors show that Poseidon outperforms existing baselines, both in terms of sample efficiency and accuracy. The use of a large-scale, diverse dataset for pretraining further underscores the robustness and reliability of the model. The paper shows that model architecture does matter by comparing Poseidon's performance with baseline models of comparable size. In addition, Poseidon's accuracy improves as the model size increases. The framework shows potential as a general-purpose foundation model for PDEs, capable of generalizing to new and even unseen physics with minimal task-specific training.

Weaknesses

The robustness of POSEIDON to noisy or incomplete training data is not thoroughly examined. This is important for real-world applications where data can often be imperfect. The paper also does not provide a comprehensive analysis of the computational cost and memory usage of POSEIDON compared to baseline models.

Questions

  1. How long does it take to pre-train the model? Is it efficient compared to existing PDE foundation models?

Limitations

The authors did address the limitations.

Author Response

We start by thanking the reviewer for your appreciation of the merits of our paper and your welcome suggestions to improve it. We address your detailed concerns below.

[W1:] The reviewer's suggestion on evaluating the robustness of Poseidon to noisy data is excellent. We follow it up by considering one of our downstream tasks, CE-RPUI (SM B2.7 for a detailed description), and at inference time we add Gaussian noise to the inputs (initial conditions) at noise-to-signal ratios (NSRs) of 0.1%, 1% and 3%, respectively. The resulting errors, computed with respect to a Ground Truth where the outputs are not noisy, for varying numbers of training trajectories, are shown in Figure 3 of the 1-page rebuttal pdf. The errors in the zero-noise (clean) case are also shown in this figure. We observe that Poseidon-L's performance is robust to input noise and the error does not grow significantly even when the noise level is an appreciable 3%, demonstrating the robustness of this foundation model. We observe similar behavior with other Poseidon models and would include a discussion on this topic in the CRV, if accepted. We thank the reviewer for this suggestion that further highlights the robustness of our model.
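For concreteness, the input perturbation described above can be sketched as follows. This is our illustrative code, not the authors'; `add_noise_at_nsr` is a hypothetical helper name, and we assume the NSR is defined via the ratio of L2 norms.

```python
import numpy as np

def add_noise_at_nsr(u0, nsr, rng=None):
    """Perturb an input field with Gaussian noise scaled so that the
    noise-to-signal ratio ||noise|| / ||u0|| equals `nsr`."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.standard_normal(u0.shape)
    # Rescale the noise so its norm is exactly `nsr` times the signal norm.
    noise *= nsr * np.linalg.norm(u0) / np.linalg.norm(noise)
    return u0 + noise
```

With this helper, a 3% NSR test amounts to calling `add_noise_at_nsr(u0, 0.03)` on each initial condition before inference, while evaluating the error against the clean reference outputs.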

[W2:] Regarding the reviewer's point about a comprehensive analysis of the computational costs of Poseidon, we would like to point out that the training costs are provided in SM Table 10 (see SM Sec. E for further details). This table shows that training, even for the biggest Poseidon-L model, took place entirely on consumer-grade GPUs with 24GB of VRAM. Similarly, the inference costs are provided in Table 11, where all models barring Poseidon-L were inferred on an RTX 4090 GPU, whereas Poseidon-L was inferred on an RTX 3090. In the meantime, we have also timed inference runs for Poseidon-L on an RTX 4090 and found that the inference time is only 4 ms. Thus, all Poseidon models have inference times between 1.6 and 4 ms, which is comparable to the ML baselines and between 3 and 5 orders of magnitude faster than physics-based PDE solvers. We will provide the new inference time of Poseidon-L and expand on the discussion in SM Sec. E in a CRV, if accepted.

[Q1:] The pretraining times on consumer GPUs (RTX 4090s) are provided in Table 10. Regarding other foundation models, there are two models with a framework similar to ours in the literature (MPP of Ref. [50] and DPOT of Ref. [20]). To the best of our knowledge, the pretraining costs of these models have not been transparently disclosed. Refs. [20] and [50] state that their models were trained on professional GPUs (H100 or V100) but do not mention how much computational time the pretraining actually took.

We sincerely hope to have addressed all your concerns and would kindly request the reviewer to update their assessment accordingly.

Comment

Thanks for the reply. The model structure of Poseidon is the Swin-Unet proposed in this paper (https://arxiv.org/abs/2105.05537), where the only change is replacing the skip connections with ConvNeXt blocks (https://arxiv.org/abs/2201.03545). Is there any justification for this? It makes sense for Poseidon to achieve good results, since the Swin Transformer and ConvNeXt are both state-of-the-art vision models. Also, I believe the data augmentation part is the same as the technique introduced by this paper: https://proceedings.neurips.cc/paper_files/paper/2023/file/5c46ae130105fa012da0446126c01d1d-Paper-Conference.pdf. A reference would be helpful. Overall, Poseidon is a comprehensive paper with solid results, but I think it could make more references to these papers. I would like to maintain my score.

Comment

We start by thanking the reviewer for your response and take this opportunity to comment on the points raised by the reviewer.

[1.] The Scalable Operator Transformer (scOT, Fig. 2(a)) is the backbone of Poseidon. As we clearly state in l117, its encoder-decoder structure is based on the SWIN U-Net architecture of Ref. [12] of our paper, which is precisely arXiv:2105.05537 pointed out by the reviewer. The residual skip connections in [12] are replaced by ConvNeXt layers (l120 and Ref. [40], which is precisely arXiv:2201.03545 pointed out by the reviewer). We have studied the role of the ConvNeXt layers by replacing them with plain residual blocks inside scOT and comparing performance. This study is performed on two downstream tasks -- Poisson-Gauss (SM B2.14) and Helmholtz (SM B2.15) -- where we train the underlying scOT on 1024 samples to obtain the following test errors: Poisson-Gauss: 0.013 with ConvNeXt vs. 0.017 with plain residual connections; Helmholtz: 0.068 with ConvNeXt vs. 0.095 with plain residual blocks. Thus, in both cases, there is an advantage to using the ConvNeXt layers. We will highlight this aspect in a CRV, if accepted.
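As a rough sketch of the design under discussion, a ConvNeXt-style block of the kind that could stand in for a plain skip connection (following arXiv:2201.03545) might look like the following in PyTorch. This is an assumption-laden illustration, not the actual scOT code; the class name and hyperparameters are ours.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """Depthwise 7x7 conv -> LayerNorm -> pointwise MLP with GELU,
    wrapped in a residual connection (channels-first input)."""
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)          # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)          # back to (N, C, H, W)
        return residual + x
```

The "plain residual block" ablation mentioned above would replace the depthwise-conv/MLP body with ordinary convolutions while keeping the `residual + x` structure.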

[2.] That being said, we do not consider scOT to be the main contribution of our paper. As a standalone neural operator, scOT is no better than CNO, as we clearly show in SM Table 9, where their median EG and mean AG scores are very comparable. Thus, adapting state-of-the-art vision models to the neural operator setting is not sufficient to obtain better performance. Rather, it is the whole framework of foundation models based on pretraining with a very diverse dataset that underpins the superior performance of Poseidon. It is precisely when trained with the right amount and type of data that a state-of-the-art vision transformer is able to perform significantly better than a U-Net type model when fine-tuned on downstream tasks (compare Poseidon vs. CNO-FM in Table 1 and SM Table 9). This is a key point of our paper and we are happy to highlight it more explicitly in a CRV, if accepted.

[3.] We thank the reviewer for pointing out Mialon et al. As its title suggests, Mialon et al. focuses on leveraging symmetries in PDEs to augment training data. In our understanding, the basic premise of this paper is: given that $u(x,t)$ solves a PDE, then $v(x,t) = L_1 u(L_2 x, L_3 t)$ solves the same PDE with different inputs, with $L_{1,2,3}$ being the generators of Lie groups corresponding to the symmetries of the underlying PDE. Hence, $v(x,t)$ can be considered as a data point with the changed inputs, thus augmenting the training data. This paper is an interesting approach to increasing the amount of training data for operator learning and we will cite it in a CRV, if accepted.

However, it is not related to the setting we consider in our paper. Mialon et al. do consider time shifts as one of their symmetries (Eqn 4), which implies that if $u(x,t)$ solves a time-dependent PDE with initial data $u_0(x)$, then $u(x,t+\epsilon)$ solves the same PDE but with a different initial condition. On the other hand, our setting is summarized in l86-87, where the task is to learn the solution operator of the PDE, given initial conditions drawn from a given measure. Clearly, shifting the initial conditions will lead to a change in the underlying distribution. Hence, we are not sure how we could leverage time shifts to augment data in our setting.

On the other hand, our all2all training procedure (Fig. 2(d)) operates in a very different manner. It utilizes the existing trajectory data and leverages the semi-group property of the solution operator to regroup this data in order to obtain a quadratic scaling in the number of training samples (please see lines l130-138 on how this is done). We do not use any time-shift symmetries to generate new trajectories, as Mialon et al. do. Thus, our approach is very different from that of Mialon et al.
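The semi-group-based all2all pairing can be sketched as follows. This is a minimal illustration of the idea, not the authors' implementation; it assumes a trajectory stored as (time, snapshot) tuples, and uses the semi-group property $u(t_j) = S(t_j - t_i)\, u(t_i)$ to turn every ordered pair of snapshots into a training example.

```python
from itertools import combinations

def all2all_pairs(snapshots):
    """Expand one trajectory, sampled at times t_0 < ... < t_K, into all
    (input, target, lead_time) training triples allowed by the semi-group
    property of the solution operator. K+1 snapshots yield (K+1)*K/2 triples,
    i.e. a quadratic number of samples from a single trajectory."""
    return [
        (u_i, u_j, t_j - t_i)
        for (t_i, u_i), (t_j, u_j) in combinations(snapshots, 2)
    ]
```

Note that no new trajectories are generated: the quadratic gain comes purely from re-pairing snapshots already present in the data, which is what distinguishes this from symmetry-based augmentation.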

Finally, we would like to point out that using symmetries to augment data can be very tricky for nonlinear PDEs such as the compressible Euler and incompressible Navier-Stokes equations, as any kind of singularities, such as shock waves and turbulent mixing, will break these symmetries (see, for instance, Chapter 2 of U. Frisch's textbook on turbulence, Cambridge U. Press). Thus, it is unclear to us how these symmetries can be leveraged to train foundation models like Poseidon, which are meant to be general purpose and to deal with such singularities.

We sincerely hope that we have addressed the reviewer's remaining concerns to your satisfaction.

Review
Rating: 6

The paper introduces a foundation model for learning PDE solution operators, with a proposed architecture, training method and training dataset consisting of fluid dynamics PDEs. The foundation model was evaluated on various downstream tasks and was shown to outperform baselines in terms of sample efficiency and accuracy.

Strengths

  • The paper addresses an important problem of developing practical foundation models for dynamical systems.
  • The paper also highlighted results of several experiments that demonstrate Poseidon's strong empirical performance compared to baselines in the 15 downstream tasks.
  • The pre-trained models, when released, could serve as a good base for future work to be built upon.

Weaknesses

  • The paper could improve in clarity. For example, the approach section (2) needs to be made much clearer. Currently, many terms, operations and acronyms are left undefined, with too many references to the appendix for critical information, making it very hard to parse and follow. This is especially so for the model architecture section. It would be very useful to provide an overview of the key architecture components and how, intuitively, these differ from standard model architectures. Some of the other sections, e.g. pretraining/finetuning, could likely be significantly simplified.

  • It would help a lot if the main paper contained some summary of what the training dataset and downstream task PDEs roughly correspond to -- currently they are just acronyms. Some of the tasks seem highly correlated with one another, e.g. solutions of the same PDE but with different initial conditions.

  • The problem formulation seems not to explicitly consider learning solution operators that take into account different PDE parameters, only different initial conditions.

  • The Poseidon models have a significantly higher inference time compared to benchmarks. This is an important weakness that should be highlighted in the main paper rather than placed only in the appendix, especially for foundation models.

Questions

  • A fair comparison with FNO would involve fine-tuning an FNO that has already been trained on other PDEs (i.e. using meta-learning/transfer learning methods). What would the sample efficiency be in that case, and how would the evaluation compare with Poseidon? Especially for situations where the FNO has already been trained on a given PDE and is fine-tuned for another initial condition.

  • The authors chose to use the L1 error at the underlying final time, but in many settings the intermediate dynamics may be more important and challenging to model than the final state where $T \gg 0$. Please share empirical results where the evaluation is based on different time indices, and/or some averaged quantity. Ideally, there should be some charts/visualizations confirming that Poseidon's better performance is robust to the chosen time index.

  • Table 1 shows results comparing Poseidon-L against benchmarks and FNO. From the main paper, it is not clear whether the comparisons are fair, as the benchmark models' parameter counts are not indicated. Please provide an indication of the relative sizes of the various models, and whether it would be fairer to use Poseidon-L or a smaller version for comparison. Key results from the appendix should be moved to the main paper if they are needed to support a major claim.

  • For the SE-AF task, a model trained from scratch could even perform better than the largest Poseidon model. This is significant, as the CNO is much smaller and did not benefit from any amortized training at all. It would be useful for the authors to provide some analysis on why this is the case, and whether it reflects a significant weakness of the Poseidon model.

  • It is unclear what the key differences and contributions in terms of architecture innovation and training method are, compared to benchmarks. It would help if the main paper had a more explicit comparison of what are the new key components that this paper is contributing to the literature, as well as ablation results on how these components impact the final performance.

  • For the key results (e.g. Table 1), it would be useful to provide some clear indication of the statistical significance of the results explicitly (e.g. error bars)

  • Does Poseidon learn the PDE solution operator where PDE parameters (e.g. viscosity of the fluid) can vary? Some empirical evaluation of this would be useful.

Limitations

In the main paper, it would be useful to more clearly highlight some of the limitations of the model such as its longer inference time, as well as the paper's limited scope to mainly PDEs governing fluid dynamics.

Author Response

We start by thanking the reviewer for your appreciation of the merits of our paper and your welcome suggestions to improve it. We address your detailed concerns below.

[W1/W2:] The reviewer's concerns about clarity are well-taken, as are your suggestions to improve it. Given the page limit, we had to make choices regarding what to present in the main text and what to leave to the SM. We can certainly move more material on methods from the SM to the main text in a CRV, as we can add an extra page there. A short description of the downstream tasks is in lines l191-206, with a detailed description in SM B.2 and a summary in Table 4. We will expand the description in the main text and move Table 4 there to add clarity. We would also like to point out that only 6 of the 15 tasks involve PDEs seen during pretraining; the majority of tasks (9/15) consider new physics in the form of either new terms added to PDEs or new PDEs altogether.

[W3:] Regarding the reviewer's point about the tasks not considering operators with different PDE parameters, we have clearly explained how PDE parameters (coefficients/forcings etc) are included in our problem formulation (see l79-80 Main Text, l1274-1277 on forcing in Navier-Stokes, l1334-1336 on gravity in Euler and l1363-1367 on coefficients/material properties in the Wave Eqn). In addition, there are 3 steady state PDE tasks (steady Euler (SE-AF), Poisson, Helmholtz) where the operators again map PDE parameters (coefficients, sources) to the solution, see e.g. Eqn (69) and l1463-1470 for Helmholtz. Thus, almost half (7/15) of the downstream tasks actually involve operators which map PDE parameters, not just initial conditions.

[W4:] The inference times for all models (reported in SM Table 11) only show Poseidon-L (with 630M params) as having a larger inference time compared to baselines. This was because it was inferred on an RTX 3090 GPU, whereas the other models were evaluated on RTX 4090s, as is clearly mentioned in the caption of Table 11. We reran inference for the setup in Table 11 on an RTX 4090 and found that Poseidon-L has an inference time of only 4 ms, which is much more comparable to the baselines. We apologize for the possible confusion and reiterate that the inference time of Poseidon-L is only a factor of 2 over FNO, whereas Poseidon-T (with 21M params) is actually faster than FNO at inference (Table 11), while at the same time being much more accurate and sample efficient (see SM Tables 8 and 9). Moreover, all ML models are 3-5 orders of magnitude faster than physics-based PDE solvers at inference. Thus, we show that inference time is not a limitation for Poseidon models.

[Q1:] Our paper already contains an answer to the reviewer's question, with CNO in place of FNO in your suggested analysis. As CNO (with the same number of parameters) is shown to outperform FNO in 14 out of the 15 tasks (Table 1), it clearly constitutes a stronger baseline on this task set for your question. So, we also built a foundation model with a CNO backbone (CNO-FM) and pretrained it on exactly the same data as Poseidon. Yet, we find (Tables 1, 8 and 9) that CNO-FM is significantly inferior in performance to the Poseidon models. We have highlighted this point in l261-267 of the main text. Hence, Poseidon scores over CNO in your suggested comparison.

[Q2:] The reviewer has a valid point about comparing model performances at different time indices. On all 12 of our time-dependent tasks (3 are steady-state), we have found that the maximum error (for all models) occurs at the final time. Hence, we used the final-time error for comparison. Nevertheless, we follow your suggestion and plot errors at different times for Poseidon-B and FNO (see Fig. 1 in the 1-page pdf and pt. 1 in the reply to all reviewers) for 2 tasks, to illustrate that Poseidon's gain in performance is consistent over time. We will include figures like this for all 12 tasks and all models in the CRV, if accepted.
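The kind of per-time-index comparison discussed here could be computed with a helper along these lines. This is an illustrative sketch using a relative L1 metric over the spatial dimensions; it is not the paper's evaluation code, and the function name is ours.

```python
import numpy as np

def relative_l1_per_time(pred, truth):
    """Relative L1 error at each time index.
    pred, truth: arrays of shape (T, ...) holding a predicted and a
    reference trajectory. Returns an array of shape (T,)."""
    T = pred.shape[0]
    diff = np.abs(pred.reshape(T, -1) - truth.reshape(T, -1)).sum(axis=1)
    ref = np.abs(truth.reshape(T, -1)).sum(axis=1)
    return diff / ref
```

Plotting the returned vector against the time index for each model would directly visualize whether a model's advantage is confined to the final time or holds throughout the trajectory.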

[Q3:] The sizes of all models are shown in Table 5. Poseidon-L is an order of magnitude larger than FNO/CNO, Poseidon-B is comparable in size to the other foundation models (CNO-FM, MPP), while Poseidon-T is actually smaller than FNO/CNO. Yet, from Tables 8 and 9, we see that it still significantly outperforms FNO/CNO in terms of accuracy/sample efficiency. This issue of relative size is discussed in l278-286 of the main text and we are happy to highlight it further in a CRV.

[Q4:] SE-AF (SM B 2.13) is a very challenging downstream task for Poseidon as it has to generalize on multiple fronts (see l1436-1440), when compared to pretraining, namely i) to steady states ii) to an operator mapping Domain shape to solution field iii) to irregular grids and non-periodic BCs. In spite of this challenge, Poseidon models did very well (see SM Figure 15) and scaled better to catch up with CNO with more training samples. Hence, we do not consider performance on this task as a limitation but rather as highlighting their potential for generalization.

[Q5:] We can add further elements from SM D.4 and D.5 to the discussion in l261-312 on factors underpinning model performance.

[Q6:] Full error distributions (pdfs) for Poseidon-B are provided in SM Fig. 49 and we plan to add figures like Fig. 2 (left) of the 1-page pdf to compare the error distributions of all models in the CRV.

[Q7:] Please see pt. [W3] regarding PDE parameters. Following the reviewer's excellent suggestion, we have considered an additional task where the viscosity of the NS equations is changed (see pt. [3] in the reply to all reviewers) and Fig. 2 (right) of the 1-page pdf, which shows that Poseidon is able to readily learn the solution operator with a new viscosity. We will add this test to the CRV.

We sincerely hope to have addressed all your concerns, particularly about possible limitations for dealing with PDE parameters and inference times of Poseidon, and would kindly request the reviewer to update their assessment accordingly.

Comment

Thanks for the response. Please view below for follow-up comments/questions:

Regarding my point on taking into account PDE parameters, I was referring to situations where the PDE parameters change, like the new change-of-viscosity experiment that you have just provided. The quoted lines in your [W3] response did not help clarify the foundation model's capabilities in this regard -- it may be useful to adjust the paper accordingly.

On computation time, rather than inference time, what are the computational resources required to fine-tune each of the models and baselines? Are they comparable, or would the proposed models take longer to fine-tune?

Regarding comparisons to FNO, I was referring to the simpler situation where we just use an existing FNO that is meta-learned on a class of similar equations, or an existing FNO that has been trained on another equation. For example, given an FNO that has been trained on one (or multiple, via meta-learning methods) of the NS datasets, when fine-tuned on another NS dataset (e.g. NS-SL), how does it compare with the Poseidon models? Similarly for, e.g., the wave datasets (trained on Wave-Gauss, fine-tuned on Wave-Layer).

Regarding SE-AF, the other reviewer had raised a related concern on irregular grids/complex geometries, as you had pointed out. The results (a clearer comparison with benchmarks, especially comparisons with similar model sizes would be useful) seem to indicate limitations of Poseidon models to address these more realistic conditions. Some additional validation along this area would help support the claims made around Poseidon's generalizability.

Comment

We start by thanking the reviewer for your prompt response which provides us the opportunity to clarify your remaining concerns. We request the reviewer's patience in reading our detailed reply below.

[Q1:] We think that the loosely defined term "PDE parameters" might be the source of a possible misunderstanding regarding the scope of our proposed foundation models. Let us start with the Navier-Stokes momentum equations, $u_t + (u \cdot \nabla) u + \nabla p = \nu \Delta u$. The reviewer correctly observes that changing the viscosity coefficient $\nu$ is a clear example of changing PDE parameters in the underlying solution operator. In our rebuttal, we have provided the corresponding experiment to show that Poseidon works very well when $\nu$ is changed.

With this perspective, let's revisit the wave equation considered in our original paper -- it is given by Eqn (64) (l1343) and reads $u_{tt} = (c(x))^2 \Delta u$, with $c(x)$ a spatially varying coefficient that models wave speeds in a heterogeneous medium. By the same argument as in the Navier-Stokes case, changing $c$ amounts to varying the PDE parameter in this context. This is exactly what we do in both the Wave-Gauss (SM B 2.10) and Wave-Layer (SM B 2.11) datasets. The precise distributions from which the coefficient $c$ is drawn are given in l1353-1360 for the Wave-Gauss and l1376-1384 for the Wave-Layer experiment. Visualizations of a particular realization (sample) of this coefficient are provided in Fig. 65 (a) for Wave-Gauss and Fig. 66 (a) for Wave-Layer. As we clearly state in l1366, the underlying solution operator maps $(u(0), c)$ to $u(t)$ for any time $t$. Thus, both these experiments are unambiguous examples of varying PDE parameters as accepted by the reviewer.

Next, let's revisit the GCE-RT experiment (SM B 2.9), where the underlying PDE is the Euler equations with gravity (Eqn (57), l1317), whose momentum equation in the $x$-direction reads $\partial_t (\rho v_x) + \partial_x (\rho v_x^2 + p) + \partial_y (\rho v_x v_y) = - \rho \frac{\partial \phi}{\partial x}$ (analogously for the $y$-momentum and energy). Here, $\phi$ is the gravitational potential, a spatially varying radial function given in Eqn. (58), l1323. Again, by the same logic as for the NS equations, varying $\phi$ amounts to changing the PDE parameter. This is precisely what we do in this experiment, with the exact distribution from which the gravitational potential is drawn given in Eqn. (63), where $\rho_0, p_0$ need to be substituted into Eqn (58). A visualization of a sample of this gravitational potential is given in Fig. 64 (a) (right-most). As clearly stated in l1335, the underlying solution operator maps $(\rho(0), v_{x,y}(0), p(0), \phi)$ to $(\rho(t), v_{x,y}(t), p(t))$ for any $t$, implying that this experiment is also an unambiguous example of varying PDE parameters.

Next, we revisit the Helmholtz experiment (SM B 2.15), where the underlying PDE is the Helmholtz equation (Eqn (69), l1458), which reads $-\Delta u + \omega^2 a(x)^2 u = 0$. In addition to a fixed frequency $\omega$, we consider a spatially varying coefficient $a$ which models the material properties of the underlying heterogeneous medium. Again, changing $a$ amounts to varying the PDE parameter, and this is exactly what we do in this dataset. The exact distribution from which the coefficient $a$ is drawn is given in l1463-1470 and a visualization of a sample of $a$ is provided in Fig. 70 (a). As explicitly stated in l1470, the underlying solution operator maps $a$ to the solution $u$, making this experiment another unambiguous example of varying PDE parameters.

Now, in our understanding, changing the forcing term as we do in the Poisson-Gauss dataset (SM B 2.14, l1443-1447) and changing the domain shape as we do in the SE-AF dataset (SM B 2.13, l1421-1424) also amount to changing the PDE parameters. However, even under the reviewer's narrower interpretation of PDE parameters, we have provided 4 datasets, and will provide 1 more in a CRV, if accepted (thanks to the suggestions of the reviewer), where the PDE parameters are changed. Hence, we believe that providing 5 downstream tasks with changing PDE parameters and showing that the Poseidon models are able to readily learn the underlying solution operator is a sufficient demonstration of their ability to handle varying PDE parameters, especially in view of the fact that the pretraining data did not contain any dataset with varying PDE parameters. We hope that the reviewer agrees with our detailed argument in this regard. We will state more explicitly in the main text of a CRV, if accepted, that we consider varying PDE parameters in our downstream tasks.

(Reply Contd. in the next Comment Field)

Comment

[Q2:] In response to the reviewer's question about the computational resources required for fine-tuning, we have considered the NS-SL dataset (SM B 2.3 for a description) and used FNO, trained with 1024 trajectories, as the baseline. For a fair comparison, we have to factor in that the Poseidon models need far fewer trajectories for fine-tuning than FNO requires to reach the same accuracy. From Figure 10, Table 1 and Table 8, we observe that Poseidon-T requires 106 fine-tuning trajectories, Poseidon-B 54 trajectories and Poseidon-L 47 trajectories to reach the same error level as FNO with 1024 trajectories. For convenience in testing, we evaluated fine-tuning times by rounding all the Poseidon models up to the nearest power of 2 trajectories, leading to timings for Poseidon-T with 128 trajectories and for Poseidon-B and Poseidon-L with 64 trajectories. With the caveats that i) the runs were not performed in a controlled environment and ii) none of the models have been optimized for speed, the resulting approximate fine-tuning times on an RTX 4090 GPU are: FNO [16.6 hrs], Poseidon-T [1.1 hrs], Poseidon-B [1.2 hrs] and Poseidon-L [1.4 hrs]. Even with these caveats, the bottom line is very clear: Poseidon models (even if larger in size) require much less fine-tuning compute than a neural operator trained from scratch, as they are much more sample efficient. We will add a discussion of the fine-tuning times for Poseidon and the baselines in the SM of a CRV, if accepted.

[Q3:] We thank the reviewer for clarifying their intention about how to test an FNO, trained on a similar PDE, on a slightly different task. We followed your suggestion and tested the following: i) we pretrained FNO on the NS-PwC dataset (SM B 2.1) with 4096 trajectories and then fine-tuned this FNO model on the NS-SL dataset (SM B 2.3). Both tasks involve the Navier-Stokes equations, but with different initial conditions. The resulting fine-tuned FNO performed significantly worse than the baseline FNO trained from scratch in this case. For instance, the EG and AG scores of the fine-tuned FNO (defined in Eqn. 11, l229) were EG=0.2 and AG=0.7, showing that the fine-tuned FNO did not perform as well as the baseline; in contrast, Poseidon-L had scores of EG=21.9 and AG=5.5 (Table 1). ii) We also followed your second suggestion by pretraining FNO on Wave-Gauss (SM B 2.10) with 4096 trajectories and fine-tuning this pretrained model on Wave-Layer (SM B 2.11). Both tasks involve the wave equation, but with different coefficients (PDE parameters, as described in the answer to [Q1]). The resulting fine-tuned FNO was slightly better than the baseline FNO (trained from scratch), with EG=1.4 and AG=2.0. In contrast, Poseidon-L's scores are EG=46.5 and AG=6.1 (Table 1). Thus, both experiments showcase the power of the Poseidon models, particularly the Wave-Layer example, as neither the wave equation nor varying PDE parameters were encountered during their pretraining. We thank the reviewer again for pointing out a possible avenue for further strengthening the rationale for Poseidon and we will include these tests in a CRV, if accepted.

[Q4:] Regarding SE-AF, as we explicitly state in l361-364 of our main text, our focus in this paper is on Cartesian geometries, with SE-AF providing a first test of the potential of Poseidon to handle irregular geometries. We maintain that the Poseidon models show promise on such tasks, as even Poseidon-T (with 21M parameters) was comparable in performance to the best-performing CNO model (with 40M parameters) on the SE-AF task, with very similar errors for 512 and 1024 fine-tuning samples. Regarding our focus on regular grids, we follow the extensive literature on neural operators, where the first papers on new models such as FNO (Ref. [34]), CNO [62] and DeepONet [43] focus mostly on PDEs on regular grids. Similarly, recently proposed foundation models for PDEs [20,50,68,74] only consider PDEs on Cartesian domains. In contrast to these papers, we have a significantly larger set of tasks (15, vs. 2 for FNO or 8 for CNO). Moreover, our tasks involve 9 different PDEs of all types (linear, nonlinear, steady, time-dependent, elliptic, parabolic, hyperbolic, mixed), with underlying operators stemming from variations in initial data, PDE parameters, boundary conditions etc., constituting the most diverse task set for PDE operator learning, to the best of our knowledge. However, as we state explicitly in l362-364, we aim to add more tasks with PDEs on irregular grids in future work.

With this detailed reply, we sincerely hope to have addressed your remaining concerns and kindly request the reviewer to update their assessment accordingly.

Comment

Thanks for the additional details and results. I believe that the various additional results, clarifications and proposed paper modifications mentioned by the authors over this rebuttal period will strengthen the paper fairly significantly.

In light of this, I have raised my score.

Comment

We sincerely thank the reviewer for appreciating the merits of our paper, our clarifications during the discussion phase as well as your suggestions to improve our paper and for raising your score.

Review
6

This paper proposes a PDE foundation model based on a multiscale Swin Transformer backbone and a flexible pretraining strategy.

Strengths

  1. The paper is well-organized and clearly written.
  2. The experimental results are comprehensive and solid, which is a valuable contribution to the research.
  3. The studied topic is very interesting.

Weaknesses

  1. Recently, several other PDE foundation models have been proposed. Could you compare your model with [1] to further demonstrate the capability of the scOT backbone?
  2. I am interested in understanding the purpose of adding the ConvNeXt block; it seems somewhat over-designed. Could you conduct some ablation studies to justify its inclusion?
  3. The generalization ability to downstream tasks is crucial for foundation models. Aside from relying on dataset diversity, where does Poseidon's generalization ability come from? Do the scOT backbone and all2all pretraining method offer inherent advantages for enhancing generalization?
  4. PDEs are often solved on discretized meshes, which may be highly irregular [2]. How can Poseidon be extended to handle irregular mesh data?

[1] DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training

[2] Geometry-Informed Neural Operator for Large-Scale 3D PDEs

Questions

See Weaknesses

Limitations

See Weaknesses

Author Response

We start by thanking the reviewer for your appreciation of the merits of our paper and your welcome suggestions to improve it. We address your detailed concerns below.

[W1:] The reviewer's suggestion to compare with DPOT is very well-taken. As we had clearly stated in our paper (see lines l347-348), the DPOT model was not publicly available when we submitted our paper, making it impossible to compare with it then. In the meantime, DPOT has been released publicly and we have followed the reviewer's suggestion to compare with it. However, we would like to note that i) DPOT takes a sequence of time steps and outputs the next step, whereas our objective (l86-87) is to learn the entire trajectory of the operator from only the initial data. Hence, DPOT needs to be extended to our setup -- we do so by following exactly the same modifications that we introduced for extending the MPP foundation model in our paper (see SM C.6 and Figure 6); and ii) DPOT only allows an input of 4 channels, which precludes us from testing on the GCE-RT downstream task (SM B2.9) with 5 input channels. Also, given the tight timeframe of the rebuttal, we could only compare with the DPOT-M model (ca. 120M parameters), which is comparable in size to our Poseidon-B model. As fine-tuning DPOT takes considerable compute on our GPUs, we had time to fine-tune it on 3 representative tasks: NS-SL (see SM B2.3), CE-RPUI (SM B2.7) and Wave-Layer (SM B2.11). The corresponding EG and AG scores (see Eqn. 11 for the definition) for DPOT on these tasks are: NS-SL (EG = 3.9, AG = 2.1), CE-RPUI (EG = 40.8, AG = 3.2) and Wave-Layer (EG = 21.8, AG = 4.5). For your convenience, we reproduce the corresponding scores for the comparable (in size) Poseidon-B model from SM Table 8: NS-SL (EG = 19.1, AG = 4.7), CE-RPUI (EG = 370.8, AG = 6.2) and Wave-Layer (EG = 24.9, AG = 4.7). Comparing with the scores of other models in Table 1, we conclude that DPOT-M's performance lies between CNO-FM and Poseidon-B on these tasks.
It is a strong baseline, but these preliminary results suggest that it is not as accurate or efficient as Poseidon-B and Poseidon-L. We plan to include the results for this extended DPOT in the CRV, if accepted, and thank the reviewer for suggesting its inclusion.
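For context, the extension applied to DPOT (the same modification used for MPP, SM C.6) amounts to wrapping a next-step predictor in an autoregressive rollout so that a full trajectory can be produced from initial data alone. A minimal sketch of such a rollout, with a hypothetical `step_model` and scalar states standing in for full solution fields:

```python
from typing import Callable, List

def rollout(step_model: Callable[[List[float]], float],
            init_window: List[float], n_steps: int) -> List[float]:
    """Extend a next-step predictor to a full trajectory: the model maps a
    fixed-length window of past states to the next state; each prediction is
    appended and the context window slides forward."""
    window = list(init_window)
    trajectory = list(init_window)
    for _ in range(n_steps):
        nxt = step_model(window)
        trajectory.append(nxt)
        window = window[1:] + [nxt]  # drop oldest state, append prediction
    return trajectory

# Toy next-step model: advances the last state by one.
traj = rollout(lambda w: w[-1] + 1.0, [0.0, 1.0, 2.0, 3.0], n_steps=3)
# → [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Note that errors compound over such rollouts, which is one reason a direct trajectory-learning objective can behave differently from next-step pretraining.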

[W2:] Regarding the reviewer's question about ConvNeXt layers, we would like to point out that some form of residual connection between the encoder and decoder is necessary for stable training of scOT. To study the role of ConvNeXt in this context, we follow your suggestion and ablate scOT by replacing its ConvNeXt blocks with plain residual connections. This study is performed on 2 downstream tasks -- Poisson-Gauss (SM B2.14) and Helmholtz (SM B2.15) -- where we train the underlying scOT on 1024 samples and obtain the following test errors: Poisson-Gauss: with ConvNeXt (0.013) vs. with plain residual block (0.017); Helmholtz: with ConvNeXt (0.068) vs. with plain residual block (0.095). Thus, in both cases, there is an advantage to using the ConvNeXt layers. We will check this for other tasks as well and report the results in a CRV, while commenting on the utility of ConvNeXt, as the reviewer has rightly suggested.
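To make the structural difference in this ablation concrete, here is a toy single-channel, 1-D sketch (our own illustration, not the scOT implementation) contrasting a ConvNeXt-style block -- depthwise convolution followed by a position-wise GELU nonlinearity, inside a residual skip -- with a plain residual block that has the conv and skip only:

```python
import math

def gelu(v: float) -> float:
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def conv1d(x, kernel):
    """1-D convolution with zero padding (depthwise for a single channel)."""
    k, pad = len(kernel), len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def convnext_block(x, kernel, w_in=1.0, w_out=1.0):
    """ConvNeXt-style: depthwise conv -> position-wise GELU map -> skip.
    (In the multi-channel case this map is a pointwise-MLP with expansion.)"""
    h = conv1d(x, kernel)
    h = [w_out * gelu(w_in * v) for v in h]
    return [xi + hi for xi, hi in zip(x, h)]

def plain_residual_block(x, kernel):
    """Plain residual: conv + skip, no channel-mixing MLP."""
    h = conv1d(x, kernel)
    return [xi + hi for xi, hi in zip(x, h)]

signal = [0.0, 1.0, 2.0, 1.0, 0.0]
out = plain_residual_block(signal, [0.0, 1.0, 0.0])  # identity kernel -> 2*x
```

The extra position-wise nonlinear mixing is the capacity the ablation isolates; the skip connection itself is common to both variants.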

[W3:] The reviewer's suggestion to further highlight the generalization ability of Poseidon is excellent. In the paper, we have already presented several factors affecting generalization. To recall: i) Model architecture does matter (see l261-277; Table 1 and SM Table 11 clearly show that the scOT backbone generalizes much better than a CNO backbone, even when both models are trained on exactly the same pretraining data). ii) Data size is also key (see l297-303 and SM Figures 24-38), as we show that a larger pretraining dataset yields higher accuracy even on downstream tasks. It is in this context that all2all training plays a role, as it increases the training dataset size, leading to higher accuracy (see SM Figure 43). iii) Diversity of the pretraining data is absolutely crucial (l304-312) for generalization, as you correctly point out. iv) The choice of the pretraining dataset, which implicitly contains a rich variety of physics learnt by the foundation model, enables it to generalize better. We have highlighted this point in SM D.4 with 3 case studies. In particular, in D.4.2 (with the Allen-Cahn reaction-diffusion PDE) and in D.4.3 (with the elliptic Poisson PDE), we have shown that the latent space of Poseidon is rich enough to learn unseen physics (reaction-diffusion, steady-state diffusion, etc.) by fine-tuning with very few samples. These different factors affecting generalization will be further discussed and highlighted in the main text of a CRV, if accepted.
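To illustrate how all2all training enlarges the dataset: by the (approximate) semigroup property of time-dependent PDEs, any snapshot of a trajectory can serve as initial data for a later snapshot, so one trajectory yields quadratically many input-target pairs. A sketch (the exact pairing convention, e.g. whether equal-time pairs are included, follows the paper's SM; here we take strictly increasing pairs):

```python
def all2all_pairs(times):
    """All (input_time, target_time) pairs with input_time < target_time;
    each pair (t_i, t_j) yields one training sample mapping u(t_i) to u(t_j)
    with lead time t_j - t_i."""
    return [(ti, tj) for i, ti in enumerate(times) for tj in times[i + 1:]]

pairs = all2all_pairs([0.0, 0.25, 0.5, 0.75, 1.0])
# 5 snapshots yield 5*4/2 = 10 training pairs instead of 4 consecutive ones.
```

For a trajectory with K+1 snapshots this gives K(K+1)/2 samples rather than K, which is the data-scaling effect referenced above.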

[W4:] In response to the reviewer's question about Poseidon's ability to handle data on irregular grids coming from PDEs with complex geometries, we would like to point out that the SE-AF downstream task (SM B2.13), where the underlying data is on an irregular grid (Figure 3), was included precisely for this purpose. We follow the protocol of Ref. [62] to process the data in order to feed it into Poseidon (and all the baselines) and find that the results with Poseidon models, presented in SM Figure 15, are very good, even though this test case requires Poseidon to generalize on several fronts not encountered in the pretraining dataset, namely i) to steady states, ii) to an operator mapping domain shape to density field, and iii) to irregular grids and non-periodic boundary conditions. Needless to say, SE-AF is only one task with irregular grids, and further evaluation of Poseidon on other such tasks, like some mentioned in the GINO paper, could be interesting. However, the experience with SE-AF bodes well for Poseidon in this context.

We sincerely hope to have addressed all your concerns and would kindly request the reviewer to update their assessment accordingly.

Comment

Thank you for your response. I appreciate the solid experiments presented in this paper. Thus, I will raise my score to 6.

Comment

We sincerely thank the reviewer for appreciating our paper and the rebuttal and for raising their score.

Author Response

At the outset, we would like to thank all three reviewers for their thorough and patient reading of our article. Their criticism and constructive suggestions will enable us to improve the quality of our article. If our paper is accepted, we will incorporate all the changes that we outline below in a camera-ready version (CRV) of our article. As allowed by the conference, we are uploading a one-page pdf that contains figures with numerical results supporting our arguments. These figures are described below. With this context, we also proceed to answer the points raised by each reviewer individually in their respective rebuttal fields.

Yours Sincerely,

Authors of POSEIDON: Efficient Foundation Models for PDEs.

Detailed Description of the 1-page Rebuttal pdf

The 1-page pdf contains 5 figures arranged in 3 rows with the following content,

[1] Row 1: Figure 1: This figure shows how the test error grows with time when our foundation model (Poseidon-B) and the FNO baseline are evaluated on two downstream tasks: NS-PwC (Left) and NS-SL (Right) (See SM B.2.1 and B.2.3 for detailed description of the tasks). We observe from both figures that the gains in accuracy with Poseidon-B over FNO are consistently observed over time and are substantial for all the time indices that we consider. We plan to add similar figures for all time-dependent tasks and with all the baselines in a CRV, if accepted.

[2] Row 2: Figure 2 (Left): shows empirical histograms representing the entire test error distributions for Poseidon-B and FNO for the NS-SL downstream task (Detailed description in SM B.2.3) when 128 trajectories are used to train FNO (from scratch) and fine-tune Poseidon-B. We plan to add similar figures for all tasks and all baselines in a CRV, if accepted.

[3] Figure 2 (Right) shows how test errors scale for Poseidon-B and FNO on a new downstream task, suggested by Reviewer KcRd. In this task, we consider the Navier-Stokes equations (SM Eqn. 31 of the submitted paper) with a viscosity coefficient ν = 4 × 10⁻³. The ground truth data is generated using the Azeban spectral hyperviscosity solver (Ref. [64]). This viscosity coefficient is very different from the setup of the pretraining data and downstream tasks in our original paper, where only a hyperviscosity of 4 × 10⁻⁴ was applied to high-enough Fourier modes in order to model the incompressible Euler equations with zero viscosity (see SM lines l1082-l1086). In this task, the initial conditions are identical to the NS-PwC downstream task (SM B2.1 for details). We see from Figure 2 (right) that Poseidon-B generalizes very well to this new viscosity coefficient, which was not seen during pretraining, and readily outperforms FNO in terms of both sample efficiency and accuracy. In particular, the EG and AG scores of Poseidon-B (defined in Eqn. 11) are EG = 925.5 and AG = 47.5, which are completely comparable to (even better than) the scores of EG = 1024 and AG = 19.7 (see SM Table 8) for the NS-PwC task with much lower hyperviscosity reported in the main paper. We will add the results of other baselines to this figure in a CRV, if accepted.

[4] Row 3: Figure 3 presents how the test error scales for the Poseidon-L model when noise is injected into the input at inference time. To study this question, we consider the CE-RPUI task (SM B.2.7 for details) and add Gaussian noise to the inputs (initial conditions) at noise-to-signal ratios (NSRs) of 0.1%, 1% and 3%, respectively. The errors in the zero-noise (clean) case are also shown in this figure. The errors are computed with respect to a ground truth where the outputs are not noisy. We observe from this figure that Poseidon-L's performance is robust to input noise and the error does not grow significantly even when the noise level is an appreciable 3%, demonstrating the robustness of the foundation model.
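A sketch of how such a perturbation can be set up (our own illustration; the function name is hypothetical): Gaussian noise is drawn and then rescaled so that the realized l2 noise-to-signal ratio equals the prescribed NSR exactly:

```python
import math
import random

def add_noise_at_nsr(signal, nsr, seed=0):
    """Perturb an input field with Gaussian noise rescaled so that
    ||noise||_2 / ||signal||_2 equals `nsr` exactly."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    s_norm = math.sqrt(sum(v * v for v in signal))
    n_norm = math.sqrt(sum(e * e for e in noise))
    scale = nsr * s_norm / n_norm  # deterministic rescaling to the target NSR
    return [v + scale * e for v, e in zip(signal, noise)]

clean = [math.sin(0.1 * i) for i in range(256)]
noisy = add_noise_at_nsr(clean, nsr=0.03)  # 3% noise-to-signal ratio
```

Rescaling the noise (rather than sampling at a fixed variance) removes run-to-run variation in the effective NSR, making the robustness curves directly comparable across noise levels.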

Comment

Dear Reviewers,

This is your AC. The authors have provided a response to the comments. Please respond to the rebuttal actively.

Best, AC

Final Decision

This paper presents a PDE foundation model based on a multiscale Swin Transformer architecture. It presents a training method and a training dataset consisting of fluid dynamics PDEs. The evaluation on various types of PDEs demonstrates the effectiveness of the foundation model. Though the reviews are consistently positive and I recommend acceptance, various changes should be included in the final version, as promised during the rebuttal. In particular, the comparison with existing foundation models (e.g., DPOT) is too brief in the current Intro and experiments to fairly reflect existing progress. A new paragraph should be included in the Intro comparing with existing foundation models to provide a complete literature review.