PaperHub
5.2 / 10
Rejected · 5 reviewers
Lowest 3 · Highest 6 · Std. dev. 1.2
Ratings: 6, 3, 6, 5, 6
Average confidence: 3.8
TL;DR

We develop approaches to enable autoregressive pretraining on multiple physical systems and show it can improve transfer performance across wide domain gaps.

Abstract

Keywords
transfer learning, physics, pretraining, finetuning, surrogate models, spatiotemporal

Reviews and Discussion

Review
Rating: 6

This paper introduces a pretraining strategy for autoregressive modeling of physical systems. It proposes a model architecture and training regime to handle the heterogeneity of training data. Experiments on the PDEBench dataset demonstrate the ability of the method to model different systems with state-of-the-art accuracy. The authors further investigate the added value of the pretrained model in low-data settings and parameter-estimation tasks.

Strengths

Strength 1: The authors experimentally validate the hypothesis that a single model can learn from diverse fluid mechanics systems, which is a key step toward the development of general “foundation” models for physics tasks. This line of experiments is well explained and rigorously reported. The use of the PDEBench dataset tasks and baseline models will make it easier for all future work in the community to build on and improve the proposed approach.

Strength 2: The architecture is novel and constitutes a first use of transformer-based architectures for autoregressive modeling of physical systems. Given the exceptional scaling abilities of these models and their results in other scientific domains (chemical physics, biology), this is a promising direction for autoregressive tasks. Reporting the performance of the proposed architecture on PDEBench tasks, without multi-physics training, would already be valuable to the community.

Strength 3: The paper is easy to read and the questions it aims to answer are clearly formulated. Design choices and the training regime are well motivated by the specificities of multi-system learning, and clearly documented. This will ease the community's ramp-up on pretraining tasks and help others improve on the proposed architecture through alternate design choices.

Strength 4: Figures and tables are clear; metrics and units are clearly reported and, where taken from previous publications, consistent with them.

Weaknesses

Weakness 1: The authors frame their question on the low-data regime (Section 5.2) as “Does MPP provide a fine-tuning advantage over existing spatiotemporal foundation models for new autoregressive prediction tasks?”. Their experimental results indeed support the claim that MPP is superior. However, I believe that in order to develop useful foundation models, such models should be compared to the “best one can do” task-specific model in the low-data setting. Monitoring progress on tasks that are too hard for task-specific baselines (like those implemented in the PDEBench paper) will provide rich insights to the community. The first line of experiments (Section 5.1) does not enable this, as all tasks are used at training time, including the test task. A possible experiment to add to Section 5.2 would consist in training one of the PDEBench baseline models on the two low-data tasks.

Weakness 2: The model architecture is completely novel, and more experimental validation should be done to maximize the value for the community. Specifically, to separate the added value of the transformer architecture from the added value of multi-system training, it could have been interesting to train the MPP architecture on each of the PDEBench tasks individually and include it in Table 1 as a new single-task baseline. It is also quite common to include ablation studies to validate intuitions or design choices, and the claim that positional encodings help to learn boundary conditions (Section 4.2, last paragraph) should be validated either by such an experiment or by a proof in the supplementary material.

Weakness 3: The conclusion restates the motivation behind multi-system pretraining, but does not clearly recapitulate which questions have been answered and what is left to future studies. The two paragraphs are not well articulated and I feel they would benefit from a rework. Ideally, the reader should finish the paper with a clear view of the unanswered questions or limitations to be addressed.

Questions

The paper takes relevant first steps towards training general models of physical systems. I believe the authors have shown that their method of pretraining is functional, but more thorough experiments on the model architecture and the application to low-data tasks would be beneficial to the community and greatly increase the interest of the paper:

  1. Put the low-data results (Section 5.2) in perspective with task-specific baseline performance: are these tasks far too hard for the PDEBench baselines? Ideally, report the performance of one task-specific baseline in Figure 5.

  2. Rework the conclusion to simply recapitulate the questions answered and which questions should be explored in future studies.

  3. In section 4.2 last paragraph, clarify whether the claim on boundary conditions is justified theoretically, validated by an ablation study, or both. Ideally, add relevant information in the supplementary material.

Minor comments to improve readability:

a) Code clarity: Papers that propose pretrained models should be as easy as possible to finetune, so that the community can make quick progress towards more efficient pretraining approaches on challenging downstream tasks. I would recommend rewriting the train.py file to remove a lot of boilerplate code and make it easier to reuse and tweak.

b) Clarity of Figure 2 and its legend can be improved: the RevIN abbreviation for the normalization layer is not defined, and “physics metadata” is defined neither in the text nor in the figure legend.

c) Table 1: abbreviations for task names are not defined in the manuscript and make it hard to find the correspondence in the initial PDEBench paper. Please define these abbreviations in the table legend.

d) There is a typo in section 5.1 (the word “all” is written twice): our models (denoted by MPP-AViT-*) must handle all all systems and regimes without finetuning

Comment

Thank you for the thorough reading and extremely useful suggestions for improvement. We have implemented most of these and we feel they have made the paper much stronger.

W1

  1. The authors frame their question on the low-data regime... A possible experiment to add to section 5.2 would consist in training one of the PDEBench baseline models...

Thanks for the suggestion! We agree that this would be valuable. Since PDEBench does include results for these systems (albeit with access to the full training data), we plan on adding those results to Section 5.2 as dashed horizontal lines in Figure 5 for comparison, so that readers can see at which data level our approach surpasses these widely used baselines.

W2/Q1

... train the MPP architecture on each of the PDEBench tasks... It is also quite common to include ablation studies to validate intuitions or design choices, and the claim that positional encodings help to learn boundary conditions (Section 4.2, last paragraph) should be validated either by such experiment, or by a proof in the supplementary material.

Put the low-data results (Section 5.2) in perspective with task-specific baseline performances : are these tasks way too hard for the PDEBench baselines ? Ideally, report the performance of one task-specific baseline in Figure 5.

This is also a very good suggestion. We have now broken out the "B" family of models into "train from scratch", "pretraining only", and "finetuned" variants. Previously we showed only "pretraining only": since multi-task pretraining is new for physical dynamics, our goal was to show that our approach could at least match modern baselines. On the whole, though, the pretraining performance can be improved on most systems through finetuning. We note this in the paper, but did not include numerical evidence before.

W3 / Q2

  1. The conclusion reminds the motivation behind multi-systems pretraining, but does not clearly recapitulate which questions have been answered and what is left to future studies. The two paragraphs are not well articulated and I feel would benefit from a rework. Ideally the reader should finish the paper with a clear view of the unanswered questions or limitations to be addressed.
  1. Rework the conclusion to simply recapitulate the questions answered and which questions should be explored in future studies.

Thanks. We agree and appreciate the feedback here. We do discuss limitations, but these could certainly be less about problems faced by the entire field and more specific to our approach. Improving this within the space constraints is something we need to think a bit more about, but we will update this when we post a revised copy prior to the discussion deadline.

Q3/W2

  1. In section 4.2 last paragraph, clarify whether the claim on boundary conditions is justified theoretically, validated by an ablation study, or both. Ideally, add relevant information in the supplementary material.

This is a good point. The language is likely a bit strong in the current version, and we should tone it down to suggest that the encoding improves multi-task learning across boundary conditions rather than enabling zero-shot learning, as we only have experimental support for the former. We have performed a small supplementary study on a pair of 1D advection systems that differ only in boundary conditions (absorbing vs. periodic) and show that our approach has a significant advantage when trained on both systems compared to traditional encodings. This will be referenced in the main text and added to the supplement.
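For concreteness, the flavor of this supplementary study can be sketched as follows - a minimal illustrative 1D advection solver, not the exact code or parameters used in the supplement:

```python
import numpy as np

def advect_1d(u0, c=1.0, dx=0.01, dt=0.005, steps=100, bc="periodic"):
    """First-order upwind advection of u0 (c > 0); bc selects the boundary type."""
    u = u0.copy()
    frames = [u.copy()]
    for _ in range(steps):
        if bc == "periodic":
            u_left = np.roll(u, 1)                    # wrap-around neighbor
        else:  # "absorbing": zero inflow on the left, free outflow on the right
            u_left = np.concatenate([[0.0], u[:-1]])
        u = u - c * dt / dx * (u - u_left)            # upwind update
        frames.append(u.copy())
    return np.stack(frames)                           # (steps + 1, nx)

# Identical initial condition, two boundary treatments
u0 = np.exp(-((np.linspace(0, 1, 128) - 0.3) ** 2) / 0.002)
periodic = advect_1d(u0, bc="periodic")
absorbing = advect_1d(u0, bc="absorbing")
```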

Minor comments

a) Code clarity:

Thanks. We are making several changes to the format of the repo (especially the README). These should come online in the anonymous repo by the end of the discussion. We are still working on some API changes to facilitate finetuning, as some of the tensor shapes currently make that frustrating, but that is not something that will be completed by the deadlines.

b) Clarity of Figure 2

Thanks for this. We're still looking for a better word internally, but we will update this to mention "field indices", with the caption discussing what these are. On RevIN - yes, that is also something we need to define. We spell out the full name in the main text but didn't actually connect it to the acronym. We will do so both in the main text and in the caption of the figure.
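For readers encountering the acronym here first: RevIN stands for reversible instance normalization, which normalizes each input's statistics on the way in and inverts the transform on the output. A minimal sketch of the idea (the layer in our architecture may differ in detail):

```python
import torch

class RevIN(torch.nn.Module):
    """Reversible instance normalization: normalize per-sample, per-field
    statistics on input, then undo the transform on the model output."""
    def __init__(self, num_fields, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.weight = torch.nn.Parameter(torch.ones(num_fields))
        self.bias = torch.nn.Parameter(torch.zeros(num_fields))

    def forward(self, x):  # x: (batch, fields, *spatial)
        dims = tuple(range(2, x.dim()))
        self.mean = x.mean(dim=dims, keepdim=True).detach()
        self.std = (x.var(dim=dims, keepdim=True, unbiased=False) + self.eps).sqrt().detach()
        shape = (1, -1) + (1,) * (x.dim() - 2)
        x = (x - self.mean) / self.std
        return x * self.weight.view(shape) + self.bias.view(shape)

    def inverse(self, y):  # undo normalization on the prediction
        shape = (1, -1) + (1,) * (y.dim() - 2)
        y = (y - self.bias.view(shape)) / (self.weight.view(shape) + self.eps)
        return y * self.std + self.mean
```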

c) Table 1: abbreviations for task names are not defined in the manuscript and make it hard to find the correspondence in the initial PDEBench paper. Please define these abbreviations in the table legend.

Will do!

d) There is a typo in section 5.1

Thanks. We will fix this.


Again, we want to emphasize how grateful we are for the suggestions. We believe that addressing your concerns will make this a much stronger paper. If you believe that we have addressed your concerns, we would ask that you consider raising your evaluation. If you have further concerns, please let us know! Thank you!

Review
Rating: 3

The core of this paper is the proposal of a new transformer structure for pre-training on different neural operator datasets. Specifically, the paper employs several different PDE datasets from PDEBench for training, and the results can generalize to specific datasets in a zero-shot manner. Overall, the idea is novel, paving a new path for the application of neural operators in data-constrained scenarios. However, the learning of neural operators significantly differs from other data types like text and images, and whether a simple auto-regressive approach can be used for unified pre-training remains ambiguously addressed in the paper. Detailed concerns are noted in the drawbacks. All in all, I appreciate the idea presented in this paper, yet there is considerable room for improvement in problem definition, paper composition, experimental design, and comparisons. Thus I recommend the authors revise the paper and submit it to the next venue.

Strengths

  1. The idea presented in this paper is relatively novel. To the best of my knowledge, no published work has utilized an auto-regressive approach for pre-training on different types of PDEs to enhance performance on downstream tasks.
  2. The network structure proposed in the paper is efficient, and capable of handling datasets or PDE problems of varying sizes, resolutions, and channel counts within an acceptable level of complexity.
  3. The experimental results validate that even with differences in equation parameters or properties, pre-training can significantly save data, which is a valuable point. Additionally, the authors found that transformers pre-trained on video datasets are also beneficial for tasks of operator learning. This is an interesting fact.

Weaknesses

  1. Firstly, the paper forcibly combines different types of datasets for training without considering the potential conflicts in PDE solutions. For instance, one can construct two PDEs that have identical or minimally differing data within a certain frame count, yet due to the non-linearity of PDEs or inherent differences in the equations, they exhibit substantial differences in subsequent evolution. Unlike Ref [1] which unifies the form of PDEs, this paper's approach to mixed dataset training could lead to the model learning meaningless representations, especially in the presence of conflicting data. From the experiments and provided open-source code, it seems the paper selected equations with vastly different properties for pre-training, which doesn't address this challenge faced in PDE pre-training.

  2. Although the experimental section is logically well-structured, the descriptions of the experimental results and settings remain unclear. For example, PDEBench provides multiple datasets for CNS M1.0 and CNS M0.1, but the paper doesn't clearly state which dataset was used for training or testing. Besides, the source of the results for other baselines is not clarified, and the comparisons have too few baselines.

  3. I believe the paper's contribution of pre-training PDE representations is overstated. Upon careful examination of the appendices, I found that the paper utilized nearly only a few fluid dynamics datasets from PDEBench and a small diffusion-reaction equation dataset. Compared to the NLP or CV community, which employs almost all publicly available data on the internet for pre-training, the scale of pre-training in this paper is not large; it is more aptly termed transfer learning.

References

  1. Towards Foundation Models for Scientific Machine Learning: Characterizing Scaling and Transfer Behavior (https://arxiv.org/abs/2306.00258)

Questions

None

Details of Ethics Concerns

None

Comment

First off, we would like to thank the reviewer for the time and effort they put into the review. It is greatly appreciated and will certainly help us improve our submission.

  1. ...two PDEs that have identical or minimally differing data within a certain frame count...

There is an important distinction that must be made here: while the benchmark systems are derived from PDEs, our goal is to develop methods that are not dependent on prior knowledge of the equations. In this setting, it is not always possible to restrict the functional form to avoid chaotic or bifurcating behavior because the physical parameters are often outside of our control. This is a challenge faced by all methods in this space that are not simply trying to learn to solve a single PDE.

Knowing we are operating in spaces that can potentially be difficult to distinguish, we made three design choices to mitigate risk:

  1. We use a short trajectory of snapshots (16 in our experiments) rather than a single snapshot because of the need to differentiate between systems.
  2. Different sources of data are embedded into the shared space independently.
  3. We intended for the model to be finetuned for downstream tasks.

Pretraining is intended to learn to identify shared features that the model can refine for specific applications. For example, trajectories from two systems on either side of a bifurcation may follow a similar trajectory prior to diverging, but similar is not identical, and the trajectories do evolve differently. Design choice (1) ensures that this gap is visible, and the pretraining objective encourages the model to learn to differentiate between these small variations. It is not necessary for the model to perform perfect system identification, as it is expected that it will be finetuned on the downstream task, where system identification is no longer necessary.

The only case where the trajectories could realistically be identical is a system described by the weak solution of a PDE. In this case, due to discontinuity, certain terms may only be active in certain regions of phase space. If this region is never seen, then, as with any data-driven method, our approach will not be able to predict there. However, if this region is seen, then since each data source is embedded separately, the model is given a priori knowledge to differentiate between systems even if the underlying equations are quite similar.
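Schematically, "embedded separately" means each field is projected into the shared token space with its own learned weights, so source identity is available to the model even when snapshots look alike. An illustrative sketch, not our exact architecture:

```python
import torch

class FieldEmbed(torch.nn.Module):
    """Per-field projection vectors into a shared embedding space; data from
    different sources select different rows via their field indices."""
    def __init__(self, n_known_fields, embed_dim):
        super().__init__()
        self.proj = torch.nn.Parameter(torch.randn(n_known_fields, embed_dim) * 0.02)

    def forward(self, x, field_ids):
        # x: (batch, fields, H, W); field_ids: indices of this source's fields
        w = self.proj[field_ids]                    # (fields, embed_dim)
        return torch.einsum("bfhw,fd->bdhw", x, w)  # shared-space embedding
```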

If the reviewer's concern is something that we have not discussed here, please let us know.

  1. ...CNS M1.0 and CNS M0.1...

With regards to the comparisons, our goals in the experiments are two-fold. First, we want to show that pretraining is learning useful behavior by showing that it can perform on par with dedicated modern baselines. The goal is not to show that pretraining alone results in state-of-the-art results. We have made this clearer by breaking out the B family of models into pretrained and finetuned variants to indicate that the pretrained models are not the upper ceiling on performance with these tasks.

Second, we want to show that given the same architecture, pretraining does improve performance on the downstream task. Since our method is the first to demonstrate the ability to learn from multiple physics, we also need to show that learning in this way outperforms learning from arbitrary spatiotemporal data (VideoMAE) as well.

Thank you for mentioning the CNS labeling. These results are obtained by averaging across the subsets with these parameters. We will add the per-file results to the appendix. There is limited space in the main text, and the important distinction was the inclusion of compressible behavior, which was not seen in the training corpus, so we aggregated by Mach number. The pretrained models are still trained on the training partitions of all subsets and the baselines are still trained per-dataset. The split is performed per file as described in Appendix B.2.

  1. ...overstated. ...the scale of pre-training in this paper is not large; it's more aptly termed as transfer learning.

We are happy to address this concern. In our listed contributions on page 2, we do specifically mention we are exploring transfer capabilities, so it seems we are in agreement there. If the contention is with the term "large-scale", we would point out that the Pile, the dataset generally used for LLM training, is ~800GB. The subset of PDEBench we use is ~5TB.

We are cautious about our claims of scope - we note on pages 3 and 9 that we do not believe the dataset is sufficiently diverse to call our pretrained models "foundation models" - but the scale of training is still quite large, and the development of these methods is vital for building true foundation models as the data environment matures.


We sincerely appreciate the effort and careful reading by the reviewer. If we have been able to address your concerns, we would ask that you consider increasing your score. Also, please let us know of any other concerns. Thank you!

Review
Rating: 6

This paper introduces a multiple physics pretraining approach for surrogate modeling, which learns generally useful features across diverse physical tasks with a shared embedding and normalization strategy. The experimental results show that the proposed MPP-pretrained model outperforms task-specific baselines on all pretraining sub-tasks and also shows superior finetuning results on new physics tasks.

Strengths

  1. Constructing a physics-based foundation model and exploring multi-task pretraining for computational physics is both interesting and beneficial.
  2. The experiments demonstrate impressive results on the pretraining sub-tasks as well as substantial transfer potential for new tasks in low-data regimes.
  3. The authors employ several training strategies to perform the pretraining effectively.

Weaknesses

  1. More attempts could be made to address transferability between different types of physics equations. For instance, the Navier-Stokes (NS) equations, whether at high or low Reynolds numbers, have quite similar forms; this invites investigation into whether a single model still possesses robust merged learning capabilities across more distinct classes of equations, such as diffusion equations and wave equations.

  2. What are the advantages and disadvantages of this new approach compared to the Finite Element Method (FEM) for systems with explicit governing equations or empirical formulas? One of my biggest questions is whether the physics task should be treated as a purely data-driven problem, or if it ought to incorporate certain explicit priors or equation-based guidelines.

  3. For models without PDEs, how can we determine the similarity of multiple physics fields and whether they can be learned simultaneously? I'm concerned that the improved performance is due to the limited diversity of the physics tasks.

  4. In the appendix, experimental data from 1-step to 9-step show little change in the field, whether looking at ground truth or predicted solutions. If the selected time steps were longer, would the model still be able to accurately predict future changes?

Questions

Overall, the purposes behind this work are valuable. However, there remain many questions that need to be addressed. Please consider answering the questions above.

Comment

We'd like to thank the reviewer for their close reading and valuable insights. These critiques have helped us strengthen the paper.

  1. ...the transferability problems between different types of physics equations....

These are interesting suggestions! While we agree that more diverse exploration would be valuable, multi-task training is new to this space, so in this initial paper we wanted to use an established benchmark that has previously been used for analysis in this space. We agree that exploring the limits of transfer is an exciting area going forward.

Regarding the Reynolds number remark: the CNS data in PDEBench does actually include a range of Reynolds numbers. We do not report them because we believe the simulations have significant numerical diffusion due to insufficient resolution, which we cannot account for in the computation of the effective Reynolds numbers. We do want to make this range clearer, and we have updated the appendix to report the results per nominal viscosity coefficient with the numerical-diffusion caveat included in the text.

  1. ...advantages and disadvantages of this new approach compared to the Finite Element Method (FEM)...

This is an important distinction and one we will try to address in the discussion of neural PDE solvers vs. operators in the related work. We do consider this to be a data-driven method; in this setting, PDEs are simply convenient testbeds rather than the true target. Numerical methods like FEM are clearly superior in both accuracy and predictability for tasks where the physics can be fully resolved numerically. The advantage of data-driven methods is their applicability as faster surrogates for difficult-to-resolve systems, or for learning directly from observational data in less controlled settings. As we note in our intro, these tend to also be areas where we have limited training data, which is why one of our focuses is transfer in the low-data regime.

Explicit physical priors are another valuable approach in this regime. Recently, however, vision and language have found success with less constrained models pretrained on large amounts of data, and it has not been clear how to do the same in a dynamics setting. Our work is the first that has seen some success for nonlinear dynamics. We believe both paths are interesting research directions.

  1. For models without PDEs...

This is a very interesting question. It is true that the data in this benchmark are largely fluid transport, but this is not a small area: transport phenomena are ubiquitous in fluids, rheology, and plasma physics, which show up in a broad range of applied fields. PDEBench is one of the most diverse datasets in this domain today (~5TB, compared to the ~800GB Pile used for LLM training), with data from multiple equations, parameters, solvers, and boundary condition types, but answering this question will likely require much more diverse datasets than currently exist. Curating such a dataset would be a strong contribution to the field on its own.

The limits of transfer from multi-task pretraining are something that was not even possible to explore previously. The results with VideoMAE suggest that transfer benefits may extend further than we'd normally assume, and this is something that would be interesting to explore as data in this space matures. We believe that the fact that these questions are now open to researchers is one of the most exciting aspects of our work.

  1. In the appendix, experimental data from 1-step to 9-step show little change in the field, whether looking at ground truth or predicted solutions. If the selected time steps were longer, would the model still be able to accurately predict future changes?

The answer to this is likely task-specific. For our method, and most methods in this space, there is likely a trade-off between small steps that rapidly accumulate autoregressive error and large steps that render the problem too difficult. Our goal in experiment 1 was to show that our pretrained models learn the systems at least as well as modern baselines, so we used the benchmark steps, which vary considerably system-to-system. CNS, for example, has highly visible changes step-to-step and is one of the systems where we see some of the largest improvements by percentage. This does indicate that our method performs no worse in this regard than standard approaches.
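To make the trade-off concrete, here is a schematic of the autoregressive rollout loop; the `model` interface and 16-snapshot context are illustrative, not our exact code:

```python
import torch

@torch.no_grad()
def rollout(model, history, n_steps, context_len=16):
    """Autoregressive rollout: each prediction re-enters the context, so
    per-step error compounds; larger time steps mean fewer applications of
    the model but a harder single-step problem."""
    history = list(history)  # initial snapshots, each (batch, fields, H, W)
    preds = []
    for _ in range(n_steps):
        context = torch.stack(history[-context_len:], dim=1)  # (batch, T, ...)
        next_state = model(context)                           # (batch, fields, H, W)
        preds.append(next_state)
        history.append(next_state)
    return torch.stack(preds, dim=1)
```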


We greatly appreciate the reviewer's time and effort that went into helping us improve our paper. If we were able to address your concerns, please consider increasing your score. Thank you!

Review
Rating: 5

The authors present Multiple Physics Pretraining (MPP), a methodology for task-agnostic pretraining of surrogate models in physics. This technique facilitates pretraining at a large scale, allowing for knowledge transfer across a variety of physical domains. To handle the heterogeneity of physical tasks, a shared embedding and normalization strategy is introduced to project the fields of multiple systems into a single shared embedding space. The experiments show that a single MPP-pretrained transformer can match or outperform task-specific baselines (on a fluid mechanics-oriented benchmark) on all pretraining sub-tasks without finetuning.

Strengths

  • The idea of constructing a large pretrained base model for physical simulations is promising. Current learning-based surrogate models usually have limited generalization ability and require re-training from scratch for different governing equations. A pretrained model has the potential to improve generalization or make it possible to only fine-tune on specific tasks without training from scratch.
  • In the experiments, the proposed model achieves errors one order of magnitude smaller compared to existing methods.

Weaknesses

  • The experiments validating the proposed model are currently limited to 2D cases (incompressible and compressible Navier-Stokes equations, the shallow-water equations, and a 2D diffusion-reaction equation). It is unclear whether the proposed model can achieve the same level of performance and accuracy when applied to more realistic 3D physical simulations.
  • It seems that the current model mainly focuses on predicting the solution one time step ahead. The video results (diffre, incompNS, mpp_swe) shown in the supplemental materials exhibit strong checkerboard artifacts as the timesteps grow.
  • The proposed model can only handle simulations on structured meshes. However, unstructured meshes (e.g., triangular, tetrahedral) are a more common choice in real-world simulations. In addition, multi-resolution meshes, i.e., meshes with adaptive resolution, which are also common in large-scale simulations, have not been taken into consideration.

Questions

  • It is mentioned in the appendix that the model is trained for 500 epochs per task. How long does the training process take?
  • Is there any intuitive explanation for the checkerboard artifacts that grow with the timesteps in the videos? Does it mean the proposed method is numerically unstable to some extent?
  • Does the proposed method have the potential to be applied to 3D simulations while keeping the efficiency and accuracy (as the max memory usage of the current model is ~60GB)?
Comment

We'd like to thank the reviewer for the time and energy they put into a close reading of our paper and greatly appreciate the reviewer's perspective on issues in the field. We have used your feedback to strengthen the paper and hope we've been able to address your concerns.

limited to 2-D cases... unclear whether the proposed model... more realistic 3D physical simulations.

...applied to 3D simulations while keeping the efficiency and accuracy?

For introducing this method, we felt it was best to evaluate on a pre-existing benchmark within the community. 2D simulations are conventionally used for this as they elicit complex behavior without the extreme cost of 3D methods. We do choose our base architecture with scaling in mind, but since we're introducing fundamentally new capabilities to deep learning for physical systems (our approach is the first to demonstrate the ability to learn multiple nonlinear time-dependent dynamics simultaneously), it was important to place it in context.

We do agree that 3D fluids are a more realistic setting and did account for this future goal in our architectural choices. Using 3D patching, we are able to train AViT-B on the (5x128x128x128) compressible Navier-Stokes (CNS) data in mixed precision at a memory cost of 43.90 GB per sample. Training from scratch is still enough to surpass the PDEBench baselines:

| Model | NRMSE CNS (Turb) | NRMSE CNS (Rand) |
| --- | --- | --- |
| UNet | 1.0 | 0.23 |
| FNO | 0.37 | 0.24 |
| AViT-B | 0.32 | 0.18 |
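For reference, NRMSE here is the L2 error normalized by the L2 norm of the target, a common convention for these benchmarks; a sketch (the benchmark's own evaluation code is authoritative):

```python
import torch

def nrmse(pred, target):
    """Per-sample relative L2 error, averaged over the batch."""
    err = (pred - target).flatten(1).norm(dim=1)  # ||pred - true||_2
    ref = target.flatten(1).norm(dim=1)           # ||true||_2
    return (err / ref).mean()
```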

However, our goal here is exploring transfer, and there are not yet accepted benchmarks that cover a diverse set of 3D problems. Transfer from 2D to 3D is non-trivial because the downsampling assumes a fixed patch dimension, but dimension-agnostic "patching" is a promising future research direction for further expanding multiple physics pretraining, and our design choices do enable this type of exploration.
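To illustrate why the transfer is non-trivial: the patch-embedding weights are dimension-specific, so pretrained 2D weights cannot simply be loaded into a 3D model. A sketch with illustrative shapes, not our exact configuration:

```python
import torch

# 2D patching: weight shape (embed_dim, fields, p, p)
patch2d = torch.nn.Conv2d(in_channels=5, out_channels=768, kernel_size=16, stride=16)
tokens2d = patch2d(torch.randn(1, 5, 128, 128)).flatten(2).transpose(1, 2)      # (1, 64, 768)

# 3D patching: weight shape (embed_dim, fields, p, p, p) -- a different tensor
patch3d = torch.nn.Conv3d(in_channels=5, out_channels=768, kernel_size=16, stride=16)
tokens3d = patch3d(torch.randn(1, 5, 128, 128, 128)).flatten(2).transpose(1, 2)  # (1, 512, 768)
```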

one time step further...

...checkboard artifacts... numerical unstable...

We're grateful that you examined our supplementary material! We will improve the labeling to make this clear, but the listed videos are actually all pretraining-only rollouts; the CNS video is of a fine-tuned model. We've found that these checkerboard artifacts are addressed for many systems through finetuning, as can be observed there, though not for all.

We suspect these are linked to well-studied stability issues [1, 2, 3, 4] that have been connected to aliasing behavior, though we would note that, as you observed in the videos, the model does not seem to completely diverge, while the linked works have established that most of the baselines do in similar settings. Long-run stability of these models is an important research direction, but is out of scope here.

References:

[1] Learned Coarse Models for Efficient Turbulence Simulation (https://arxiv.org/abs/2112.15275).

[2] Towards Stability of Autoregressive Neural Operators (https://arxiv.org/abs/2306.10619).

[3] PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers (https://arxiv.org/abs/2308.05732).

[4] Are Neural Operators Really Neural Operators? Frame Theory Meets Operator Learning (https://arxiv.org/abs/2305.19913).

...unstructured mesh (e.g., triangular, tetrahedra) ... adaptive resolution...

Thank you for pointing this out! Natively handling arbitrary meshes is a very exciting problem both in machine learning and numerics and certainly is not solved in either space. Outside of GNNs, most deep learning methods, including all of the baselines discussed, assume uniform grids either natively or via interpolation. We fully agree that handling arbitrary meshes would be a valuable contribution, but our work is focused specifically on the pretraining task.

How long does the training process take?

We are planning on updating the language to switch from "epochs", which we arbitrarily define as 2000 optimization steps since we sample with replacement, to listing the number of steps directly for clarity.
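Schematically, one such "epoch" looks like the following - an illustrative sketch of sampling with replacement across per-system dataloaders, where `training_step` is a hypothetical helper, not our actual API:

```python
import random

def pretraining_epoch(dataloaders, model, optimizer, steps_per_epoch=2000):
    """One 'epoch' = a fixed number of optimization steps, each drawn with
    replacement from one of the per-system dataloaders (no full pass implied)."""
    iters = {name: iter(dl) for name, dl in dataloaders.items()}
    for _ in range(steps_per_epoch):
        name = random.choice(list(iters))
        try:
            batch = next(iters[name])
        except StopIteration:              # restart an exhausted loader
            iters[name] = iter(dataloaders[name])
            batch = next(iters[name])
        loss = model.training_step(batch)  # hypothetical loss computation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```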

The base model (AViT-B) is pretrained using ~960 GPU hours on H100s, though finetuning cost is comparable to the UNet and ORCA baselines. The inference time of each model on an A6000 GPU is listed below (we plan on adding this to an appendix, since it is important information to convey):

| Model | Time (ms) |
| --- | --- |
| UNet | 67.7 |
| FNO | 7.2 |
| AViT-Ti | 83.5 |
| AViT-B | 105.6 |
| ORCA | 98.5 |


Again, we just want to express our gratitude to the reviewer for their valuable feedback. We expect that incorporating it will lead to a stronger paper. If we were able to address your concerns, please consider revising your score. Please let us know if you have further insights to share!

Review
Rating: 6

The paper introduces an autoregressive, task-agnostic pretraining approach for physical surrogate modeling. As with foundation models, the proposed method trains large surrogate models to predict the dynamics of multiple heterogeneous physical systems simultaneously by learning features that are broadly useful across diverse physical tasks.

Strengths

This paper builds on recent advances in deep learning for physical simulation to enable surrogate models to work for diverse physical systems with a few fine-tuning iterations. The exposition of the motivation is crisp and convincing, and the paper is generally easy to follow. The experiments are performed on diverse physical systems. The results show that large surrogate models outperform strong baselines, and that even models with relatively few parameters can learn such diverse physics evolutions and perform competitively.

Weaknesses

Although the generalization capability is significant, the computational cost of the proposed models is still concerning. The following are the associated questions:

  • What computational cost do the authors expect for the proposed models, and how does it compare against the other baselines in Table 1?
  • It is still a bit unclear if the comparison reported in Table 1 is fair, since the parameters of the first 4 models do not look comparable.
  • Is the model applicable to 3D simulations?

Many details of the inverse problems seem to be missing, which makes assessing the performance of the proposed model a bit difficult in this regard. The following are some of the questions:

  • How do you define the inverse problem and what model was used for the tasks?
  • How does the performance compare to other baselines?
  • What is the running time of the proposed model and how is it comparable to the other baselines?
  • Could the authors provide insights or results on the inverse problem for boundary conditions?
  • Providing qualitative results would be helpful.

Minor comments

Typo in section 5.1: must handle all all systems and regimes without finetuning

Questions

Please have a look at the weaknesses above.

Details of Ethics Concerns

Nothing particular.

Comment

We'd like to thank the reviewer for their time and efforts in helping us strengthen our paper. We've added additional detail in response to your concerns.

...the computational cost of the proposed models is still concerning.

...computational cost... compare...?

...running time of the proposed model...

These are good points. While the purpose of pretraining methods is to develop multi-use models where the pretraining cost is borne by the pretrainers, this is important information, and we have added an appendix section showing it. For reference, the pretraining for AViT-B currently takes ~960 GPU hours on H100s with single-precision training (we plan on releasing the weights to amortize this cost for users). During finetuning, the cost is more comparable to the baselines. The time per forward pass on an A6000 GPU is listed below:

| Model | Time (ms) |
| --- | --- |
| UNet | 67.7 |
| FNO | 7.2 |
| AViT-Ti | 83.5 |
| AViT-B | 105.6 |
| ORCA | 98.5 |

Our AViT code is currently unoptimized - it is, for instance, possible to fuse the axial attentions into a single CUDA kernel launch. Despite that, the runtime of AViT-B is comparable to ORCA, while AViT-Ti is comparable to the UNet. The FNO from the benchmark is quite a bit faster, but its performance is competitive only on SWE.
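For context on that remark: axial attention applies standard self-attention along each spatial axis in turn, so each axis currently incurs its own kernel launches. A sketch of the pattern (illustrative, not our implementation):

```python
import torch

def axial_attention(x, attn_h, attn_w):
    """x: (batch, H, W, dim). Attention runs along H with W folded into the
    batch, then along W with H folded in -- each call is a separate launch."""
    b, h, w, d = x.shape
    xh = x.permute(0, 2, 1, 3).reshape(b * w, h, d)
    xh, _ = attn_h(xh, xh, xh)
    x = xh.reshape(b, w, h, d).permute(0, 2, 1, 3)
    xw = x.reshape(b * h, w, d)
    xw, _ = attn_w(xw, xw, xw)
    return xw.reshape(b, h, w, d)

make_attn = lambda: torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
out = axial_attention(torch.randn(2, 16, 16, 64), make_attn(), make_attn())
```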

...unclear if the comparison reported in Table 1 is fair...

Our goal in this experiment is to show that despite the difficulty of multi-task optimization, we are able to match established baselines. In vision and language, it is common to report only finetuned results, but due to the novelty of our approach within the domain, we felt showing that pretraining alone can reach currently acceptable accuracy was valuable. We are expanding our B family in the table to show the difference between pretraining and finetuning for emphasis.

While we agree that more precise parity across all models would be valuable, Ti is about as small as vision transformers have been successfully trained, and our research goal here is not development on that front. Ti is in line with the larger of the two comparable baselines (PINNs are included for completeness, but these are fit per example rather than trained for generalization).

Is the model applicable to 3D simulations?

This is something we are excited to explore in future work. The architecture as a whole is selected with scalability in mind and can be run on 3D data with minor changes, but established benchmarks for evaluating cross-system transfer in 3D do not currently exist. Using mixed precision, we can fit an AViT-B on the largest 3D system in PDEBench (5x128x128x128) with ~43 GB of VRAM. Transfer from 2D to 3D has additional complications, like the dimension-specific downsampling (patching) process.

How do you define the inverse problem...

Thanks, we agree with these criticisms and will update the manuscript with a more detailed description.

There are two parameter inference problems discussed here. Both are tackled using the pretrained AViT-B model. The first is the forcing identification in INS. In equation 9 in the appendix, the INS equations are described with a forcing f. This f is randomly sampled per trajectory and constant through the simulation. In this task, we try to predict what this forcing is by finetuning (or training from scratch) the AViT-B model from Experiment 1.

The second is identifying the buoyancy parameter, a single scalar serving a similar role, in an outside simulation entirely separate from pretraining. We will update the appendix to include this precise equation as well for reference in the main text.
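Schematically, both inference tasks reuse the pretrained trunk with a small regression head in place of the autoregressive decoder - a sketch with hypothetical names, not our exact code:

```python
import torch

class ParamRegressor(torch.nn.Module):
    """Pool the backbone's spatiotemporal tokens and regress the target
    parameter(s), e.g. a scalar buoyancy or a forcing field's coefficients."""
    def __init__(self, backbone, embed_dim, out_dim=1):
        super().__init__()
        self.backbone = backbone                  # pretrained AViT trunk
        self.head = torch.nn.Linear(embed_dim, out_dim)

    def forward(self, x):
        tokens = self.backbone(x)                 # (batch, n_tokens, embed_dim)
        return self.head(tokens.mean(dim=1))      # mean-pool, then regress
```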

  • How does the performance compare to other baselines?

This task only makes sense in the context of multi-task training where the model must perform implicit system identification. Most of the models from Exp 1 are developed and trained for use on only one system and therefore do not need to internally identify these parameters. Mialon et al., 2023 [1] is the only other "pretraining" paper that has tackled one of these problems (buoyancy). We will add these results to the table.

The scalar inverse problem was not a strength of our approach, but we felt this was important information to share. In NLP, autoregressive models typically perform worse than other approaches on non-generation tasks. The fact that the contrastive approach of [1] outperformed the mean prediction suggests that this could be the case here as well, but these are early efforts, so we cannot say anything absolute.

[1] Grégoire Mialon et al. Self-Supervised Learning with Lie Symmetries for Partial Differential Equations, 2023.

Typo in section 5.1: must handle all all systems and regimes without finetuning

This has been fixed, thanks!


Once again, we appreciate the time you put into reviewing our paper. If we were able to satisfy your concerns, please consider increasing your score. We'd love to hear any additional concerns as well.

Comment

Thank you to all of our reviewers. You have provided us a great deal of valuable feedback which we are actively using to strengthen our submission. We've replied to you all individually for specific concerns, but we also wanted to centrally list the changes we've made to the document in response to reviewer feedback:

Extended Results

  • We've extended the results for AViT-B in Table 1 to highlight that pretraining is not the upper ceiling on performance. We've added finetuned results to those tables as well as results from training from scratch.
  • We added a small supplementary study on a pair of 1D advection systems that differ only in boundary conditions and show that our small periodic adjustment provides an advantage for multi-task training.
  • We've added a note to the main text explaining that the CNS experiments in Table 1 are aggregated based on the Mach number. We've added a full breakout of the various 2D CNS datasets in PDEBench along with analysis to the supplementary material.
  • We've added timings for inference for the size groups containing baselines to give a sense of the cost of training/finetuning.

Clarifications and corrections

  • We've added horizontal lines indicating the results for the PDEBench baselines to Figure 5 (Transfer results).
  • We've added a note to the full CNS results in the supplement explaining the range and why we describe them by the nominal viscosities rather than by dimensionless numbers.
  • Parameter inference section
    • We specified which models precisely we are training.
    • We've updated the contents of Table 3 to include the mentioned results from Mialon et al., 2023.
    • The "best constant" field has been updated to specify that this the error from predicting the mean.
    • We corrected a misplaced decimal.
    • The text has been updated to explicitly refer to the equations in the appendix.
  • We've further clarified when we are speaking about groups of optimization steps from pretraining versus true epochs (full passes through the training set).
  • Figure 2 - RevIN is now explicitly defined in the caption. Physics metadata has been clarified to specifically be field indices and the caption is updated to explain their role.
  • The videos in the supplementary material are now labeled to make it clear which are pretraining and which are finetuned.
  • Fixed the double "all" in 5.1
  • Defined the abbreviations in Table 1

Writing

  • Conclusion - we've partially re-written the conclusion to make clearer what we have done and what questions remain to be addressed, but this is an area we will aim to improve further.
  • Code clarity - We have made updates, but will need to thoroughly examine them for anonymity violations prior to pushing to the repository, so they are not yet available.

We greatly appreciate the time you all have put into your reviews and we hope we have satisfied many of your concerns. If you feel we have addressed your concerns, we'd ask that you increase your score. Thank you!

AC Meta-Review

The paper introduces a strategy for pretraining auto-regressive transformer models to model dynamical physical systems, aiming to extend successful ideas from text/vision to physics. The proposed model is simultaneously trained on various fluid dynamics transport equations, with the objective of generalizing to different contexts, and to equations with similar forms. Experiments are conducted using data from the PDEBench repository, and additional investigations analyze the benefits of the approach for generalizing to low-data settings and for a downstream parameter estimation task.

Reviewers agree that the setting is new and largely unexplored, opening avenues for data-based approaches to modeling dynamics and solving PDEs. The model adopts an efficient transformer architecture, maintaining a reasonable level of complexity and enabling training on large amounts of data. Extensive experiments demonstrate that the model can compete with task-specific training using less training data. Overall, this paper is interesting, representing a first step in exploring large models in physics and offering new perspectives. However, as with any first exploration, there are areas where the contribution could be improved, including the problem setting, the experiment design and the clear statement of the questions that have been answered through this analysis. Questions also persist about the fairness of comparisons with baselines. Notably, the evaluation loss is the same as the training loss for the proposed system, whereas this is not the case for the reported results of the baselines, potentially biasing the comparisons. Considering these remaining weaknesses, the paper would benefit from a revision and resubmission.

Why Not a Higher Score

The paper requires clarifications, notably regarding the experimental setting, the field of application, and the questions answered by these experiments.

Why Not a Lower Score

a

Final Decision

Reject