PaperHub
7.8/10 · Spotlight · 4 reviewers
Ratings: 5, 5, 5, 4 (min 4, max 5, std 0.4)
Confidence: 3.3
Novelty: 3.3 · Quality: 2.8 · Clarity: 3.0 · Significance: 3.0
NeurIPS 2025

Axial Neural Networks for Dimension-Free Foundation Models

OpenReview · PDF
Submitted: 2025-05-11 · Updated: 2025-10-29

Abstract

Keywords
Parameter Sharing · Dimension Agnostic · Axis Permutation · Foundation Model · Message Passing · PDE · Architecture

Reviews and Discussion

Review
5

The paper introduces axial neural networks (XNNs), which represent a generalization of deep sets and are essentially networks equivariant with respect to permutations of the axes of tensors. This concept is interesting and allows the creation of more flexible graph networks which can combine different modalities with different numbers of dimensions.

Strengths and Weaknesses

Strengths

  • the exposition and core motivation are clearly written
  • simple method with a limited scope: make existing architectures for physics foundation models more flexible, in particular w.r.t. dimensionality
  • the authors demonstrate this effectively on the PDEArena dataset by improving upon 2 existing foundation models

Weaknesses

  • some clearly related work, such as factorized convolutions/attention, is not referenced
  • the description of the experimental method and baselines is somewhat lacking; it is not immediately clear how the efficacy of this method is tested
  • more baselines would be helpful in quantifying the performance of the method, i.e. did this method improve upon a reasonable architecture with reasonable results?

Questions

  • for (4) the connection to de Finetti's theorem is not entirely clear. Is P(\theta) a prior on the parameters, and how does it enter the picture? A better explanation here would help
  • this method is clearly linked to some commutative properties, which are also satisfied by physical systems. This has been exploited by factorized methods such as factorized attention or convolution. Some acknowledgement/reference to these works would be appropriate
  • can the approach of generalizing from less to more dimensions help address the bottleneck posed by the availability of training data in SciML? An experiment in this regard would go a long way to convince readers about the usefulness of the method.

Limitations

Yes

Final Justification

The paper is clearly written and the authors were able to address my initial concerns. I trust that with the suggested improvements and modifications the authors mentioned in the rebuttal, this manuscript should be published.

Formatting Issues

None

Author Response

Thank you for the constructive comments. Below are our responses to the remaining concerns.


some work that is clearly related such as factorized convolutions/attention are not referenced

this method is clearly linked to some commutative properties, which are also satisfied by physical systems. This has been exploited by factorized methods such as factorized attention or convolution. Some acknowledgement/reference to these works would be appropriate

These are good suggestions. There is a similarity between this work and factorized convolutions/attention in terms of low-rank approximation of a single operation. We will state this clearly in the final version.


description of experimental method and baselines are somewhat lacking, it is not immediately clear how the efficacy of this method is tested.

We will add details of the experimental procedures in the final version. In a nutshell, we followed the basic training procedures of baselines such as CViT and MPP, which take a few timesteps as input and predict the next timestep of a PDE solution. These models are all based on Vision Transformers with some revisions optimized for PDE learning. We evaluated architectural expressivity in Table 2 in terms of from-scratch training, pretraining, and finetuning using only 2D PDE solutions. In Figure 4, we evaluated few-shot learning capability on data from unseen dimensions, that is, the models were pretrained on 2D PDEs only and finetuned with a few samples from 1D or 3D PDEs, demonstrating the benefit of multi-dimensional pretraining in XNN-based foundation models.


more baselines would be helpful in quantifying the performance of the method, i.e. did this method improve upon a reasonable architecture with reasonable results?

We have added one more baseline, FNO, a widely used architecture for PDE solutions. It is also dimension-agnostic and is typically used for equation-specific learning. The test NRMSE comparison with X-MPP on the 1D few-shot learning task is shown below. As shown in the tables, X-MPP significantly outperforms FNO in the few-shot setting regardless of the size of the downstream dataset.

Table 1. Test NRMSE in Burgers’ (BS) equation

| #Data | 295 | 736 | 1471 |
|---|---|---|---|
| FNO | 0.0485 | 0.0319 | 0.0214 |
| X-MPP | 0.0337 | 0.0146 | 0.0038 |

Table 2. Test NRMSE in Advection (ADV) equation

| #Data | 197 | 491 | 981 |
|---|---|---|---|
| FNO | 0.2516 | 0.2371 | 0.2323 |
| X-MPP | 0.0795 | 0.0484 | 0.0202 |

Table 3. Test NRMSE in compressed fluid dynamics (CFD) equation

| #Data | 58 | 145 | 290 |
|---|---|---|---|
| FNO | 0.0679 | 0.0522 | 0.0355 |
| X-MPP | 0.0289 | 0.0125 | 0.0094 |

Table 4. Test NRMSE in diffusion sorption (DS) equation

| #Data | 12 | 30 | 60 |
|---|---|---|---|
| FNO | 0.0350 | 0.0163 | 0.0078 |
| X-MPP | 0.0098 | 0.0042 | 0.0025 |

for (4) the connection to de Finetti's theorem is not entirely clear. Is P(\theta) a prior on the parameters, and how does it enter the picture? A better explanation here would help

Yes, $P(\theta)$ is the prior distribution of the parameter $\theta$. De Finetti's theorem tells us that a permutation-invariant (exchangeable) neural network should take the form of a Deep Sets-style architecture, as shown in Eq. 2. Specifically, if $P(X_i \mid \theta) = \exp(\langle \psi(X_i), \theta \rangle - g(\theta))$ and $P(\theta) = \exp(\langle \theta, \alpha \rangle - m\,g(\theta) - \phi(\alpha, m))$ in terms of exponential families, then Eq. 4 can be written as
$$P(X_1,\ldots,X_n)=\exp\left(\phi\Big(\alpha+\sum_{i=1}^n\psi(X_i),\; m+n\Big)-\phi(\alpha,m)\right),$$
which shares the same structure as Eq. 2. That is, an axis-permutation-invariant neural network should also share such an architecture by treating each axis as an element $X_i$. This statement is also described in Eq. 2 of the Deep Sets paper [1]. The original description was insufficient for building intuition, so we will elaborate further in the final version.

[1] Zaheer et al., 2018, Deep Sets
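As a concrete illustration of the sum-decomposition structure discussed above (our own toy sketch with random weights, not the paper's model), a Deep Sets-style function $\rho(\sum_i \psi(x_i))$ is permutation invariant by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative random weights; W_psi / w_rho are NOT the paper's parameters.
W_psi = rng.normal(size=(4, 8))
w_rho = rng.normal(size=8)

def psi(x):
    return np.tanh(x @ W_psi)            # per-element embedding

def deep_sets(X):
    pooled = psi(X).sum(axis=0)          # sum over set elements -> order-free
    return float(pooled @ w_rho)         # final map rho

X = rng.normal(size=(5, 4))              # a "set" of 5 elements
perm = rng.permutation(5)
assert np.isclose(deep_sets(X), deep_sets(X[perm]))  # permutation invariance
```

The same structure underlies the axial case: treating each axis as a set element makes the aggregation independent of how many axes there are.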


can the approach of generalizing from less to more dimensions help address the bottleneck posed by the availability of training data in SciML? An experiment in this regard would go a long way to convince readers about the usefulness of the method.

Generalization from lower to higher dimensions indeed addresses the bottleneck, as the model trained on 1D and 2D data can perform well on 3D data after fine-tuning with a small amount of 3D data. We provided evidence for this in the last plot of Figure 4: with only 500 data points (trajectories), the model was able to generalize to 3D data.

Review
5

This paper proposes Axial Neural Networks (XNNs), a dimension-agnostic architecture that treats tensor axes as exchangeable elements to achieve axis-permutation equivariance. The authors develop two variants: set-based XNNs using Deep Sets aggregation and graph-based XNNs with message-passing between axes. They demonstrate the approach by adapting existing PDE foundation models (CViT, MPP) and show competitive performance on 2D tasks with superior generalization to unseen 1D/3D dimensions.

Strengths and Weaknesses

Strengths:

  • The paper is very clearly written and well-structured.

  • It addresses a real and important challenge in PDE foundation models—dimension mismatch. Existing solutions such as padding and separate encoders are indeed suboptimal, which makes the proposed approach valuable for the scientific machine learning community.

  • The connection to De Finetti's theorem provides a compelling justification for why axis-permutation equivariance enables generalization across dimensions. The mathematical framework is rigorous, with formal theorems and proofs that clearly establish the equivariance properties.

Weakness:

  • I do not see any major weaknesses in the paper.

  • The experiments are conducted with relatively small networks (7M–12M parameters), which somewhat limits the strength of the claims regarding foundation model applicability. It would be helpful if the authors could demonstrate that XNNs can also benefit large-scale models in experiments.

Questions

  1. I am not fully convinced by the assertion that the "patchify operation...is inherently limited to 2D inputs." (page 2, line 41-42). In practice, patchification can often be extended to 1D or 3D by using the appropriate convolution kernels (e.g., 1D or 3D convolutions). This point may somewhat overstate the limitations of existing approaches, and it would be helpful for the authors to clarify why such generalizations are considered inadequate in this context.

  2. I’m curious why CViT performs significantly worse on the SWE dataset in this paper. According to the original CViT paper, its performance on SWE (from PDEArena) is comparable to MPP. I noticed that this paper uses the SWE dataset from PDEBench—could this dataset difference be the reason for the discrepancy? Some clarification on this point would be appreciated.

Limitations

Yes

Final Justification

The authors adequately addressed the main concerns in the rebuttal. They clarified the scalability setup, provided justification for not using large models, and explained differences in CViT performance and patchification claims. While large-scale model validation remains missing, the core contributions are sound, and the method shows strong performance and generality. I recommend acceptance.

Formatting Issues

N/A

Author Response

Thank you for the positive reviews. Below are our responses to the remaining concerns.


The experiments are conducted with relatively small networks (7M–12M parameters), which somewhat limits the strength of the claims regarding foundation model applicability. It would be helpful if the authors could demonstrate that XNNs can also benefit large-scale models in experiments.

That is a valid concern about this paper. We actually tried such a setting to demonstrate the benefits in large models, but our setup was not sufficient to test large-scale transformers due to the heavy memory requirements of processing PDE data. Instead, we leveraged a relatively large amount of data to demonstrate scalability. Below are the results on scalability with respect to the size of the pretraining dataset, in comparison to the Fourier Neural Operator (FNO), on 1D compressible fluid dynamics (CFD). FNO is trained only on CFD, whereas X-MPP is pretrained on multiple 2D PDEs. As shown in the table, X-MPP also scales well as the size of the pretraining dataset increases and outperforms FNO when provided with a sufficient amount of pretraining data.

Table 1. Finetuning performance on 1D CFD.

| Model | X-MPP | X-MPP | X-MPP | FNO |
|---|---|---|---|---|
| #PretrainData/#TotalData | 0.01 | 0.1 | 1 | 1 |
| NRMSE | 0.3469 | 0.0198 | 0.0052 | 0.0232 |

I am not fully convinced by the assertion that the "patchify operation...is inherently limited to 2D inputs." (page 2, line 41-42). In practice, patchification can often be extended to 1D or 3D by using the appropriate convolution kernels (e.g., 1D or 3D convolutions). This point may somewhat overstate the limitations of existing approaches, and it would be helpful for the authors to clarify why such generalizations are considered inadequate in this context.

As you pointed out, designing patchification is not restricted to specific dimensions. What we meant by “limited to 2D inputs” is that 2D patchification is only applicable to 2D inputs, not 1D or 3D, and requires modifying parameters to process 1D or 3D data. In contrast, XNN patchification is applicable to inputs of any dimension with the same parameters. We will clarify this in the final version.
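To make the dimension-agnostic point concrete (our own toy sketch, not the paper's XNN patchify operation), a single set of 1D filter parameters can process a tensor of any rank by being applied along every axis and aggregated, so 1D, 2D, and 3D inputs go through the same weights:

```python
import numpy as np

kernel = np.array([0.25, 0.5, 0.25])   # one shared 1D filter (illustrative)

def axial_filter(x):
    """Apply the same 1D kernel along every axis of x and average the
    results. The parameters are identical regardless of the tensor's rank."""
    outs = [
        np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="same"), ax, x)
        for ax in range(x.ndim)
    ]
    return np.mean(outs, axis=0)

# The same parameters process 1D, 2D, and 3D inputs:
for shape in [(16,), (16, 16), (8, 8, 8)]:
    y = axial_filter(np.random.rand(*shape))
    assert y.shape == shape
```

A fixed 2D convolution, by contrast, would need reshaped or newly initialized weights to accept the 1D and 3D inputs above.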


I’m curious why CViT performs significantly worse on the SWE dataset in this paper. According to the original CViT paper, its performance on SWE (from PDEArena) is comparable to MPP. I noticed that this paper uses the SWE dataset from PDEBench—could this dataset difference be the reason for the discrepancy? Some clarification on this point would be appreciated.

We trained CViT on SWE from PDEArena, while MPP was trained on SWE from PDEBench. The main difference between the CViT paper and ours lies in the evaluation metric. PDE solutions typically involve multiple variables represented as a feature vector in the data. CViT used relative L2 error, which is the average of normalized MSE across variables, whereas we used NRMSE, which is the MSE normalized by the norm of the feature vector. Additionally, the CViT paper reported 5-timestep prediction error, while we reported 1-timestep prediction error.
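A minimal sketch of the two metrics as we read the description above (our interpretation for illustration, not the authors' evaluation code):

```python
import numpy as np

def nrmse(pred, target):
    """Error normalized by the norm of the full target feature tensor."""
    return float(np.linalg.norm(pred - target) / np.linalg.norm(target))

def relative_l2(pred, target):
    """CViT-style metric: average of per-variable normalized L2 errors,
    where the last axis indexes the PDE variables."""
    errs = [np.linalg.norm(pred[..., v] - target[..., v])
            / np.linalg.norm(target[..., v])
            for v in range(target.shape[-1])]
    return float(np.mean(errs))

t = np.random.rand(32, 32, 3)               # toy field with 3 variables
p = t + 0.01 * np.random.randn(*t.shape)    # slightly perturbed prediction
assert 0 < nrmse(p, t) < 1
```

The two metrics weight the variables differently, which can make scores from the two papers incomparable even on the same dataset.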

Comment

I thank the authors for their detailed response. My concerns have been fully addressed, and I am happy to maintain my score.

Review
5

This paper proposes Axial Neural Networks (XNNs), a class of models that process tensors of arbitrary rank by treating each axis as a node in a graph and applying message passing to capture cross-axis dependencies. Two variants are introduced: SXNNs (using Deep Sets) and GXNNs (using GNN-style updates with convolutions or attention). The approach is designed to be equivariant to axis permutations and agnostic to axis semantics. XNNs are used to replace dimension-specific components in PDE surrogates (CViT and MPP), resulting in new variants (X-CViT, X-MPP) that show good performance in low-data and cross-dimensional generalization settings.

Strengths and Weaknesses

Originality: Modeling tensor axes as graph nodes is a clean and general abstraction. The formulation unifies and extends Deep Sets and GNNs under a common interface. The idea is intuitive but nontrivial, and the paper develops it clearly. Related work is generally cited appropriately. The paper may benefit from a more diverse set of baselines (e.g., FNOs) in the experiments section.

Quality: Theoretical claims (e.g., permutation equivariance) are sound and well formalized. Experimental results support the claim that XNNs improve parameter sharing and cross-dimensional generalization. However, several claims such as axis permutation robustness and generalization to new boundary conditions are not empirically validated. The ablation between SXNN and GXNN is performed only on a synthetic task. No error bars or standard deviations are reported, making it difficult to assess robustness or reproducibility.

Clarity: The paper is well structured and mostly easy to follow. Section 3.2 provides an easy to follow explanation of the GXNN update rule. Notation is consistent, but sometimes dense. The handling of domain-specific features (e.g., BCs, geometry, time) is not entirely clear.

Questions

  • Have you tested whether the model is empirically equivariant to axis permutations? For example, if you permute the input axes (e.g., swap spatial dimensions), does the output permute in the same way? This would verify that the symmetry proven in theory holds in practice.
  • Can the model generalize when the meaning of axes changes? For example, if trained on data where axis 0 is time and axis 1 is space, can it still perform well when these roles are reversed at test time? This would test the claim that XNNs are agnostic to axis semantics, not just axis positions.
  • How are geometry and boundary conditions encoded? Is the model tied to Cartesian grids, or can it handle irregular domains?
  • Why not compare against FNOs? These are natural baselines for cross-resolution and few-shot generalization.

Limitations

Equivariance to axis permutations: Theorems 3.1 and 3.2 prove that SXNNs and GXNNs are permutation-equivariant by construction, but there are no empirical tests. It’s unclear if the trained models preserve this symmetry under finite data and optimization noise.

Robustness to axis semantics: The abstract and introduction claim that the model is agnostic to the meaning of each axis. No experiments back this up. There is no test where axes are reordered or their semantics are changed. These trivial unit tests could be included in the appendix.

Generalization to new geometries or BCs: All tasks are defined on structured Cartesian grids with standard boundary conditions. It's unclear if the model works for irregular domains or non-grid geometries. The tensor structure and axis-based convolutions don't easily extend to curved or unstructured domains.

Final Justification

I thank the authors for their discussion, and maintain my score. This is a solid paper.

Formatting Issues

none

Author Response

Thank you for the positive comments. Below are our responses to the remaining concerns.


The paper may benefit from a more diverse set of baselines (e.g., FNOs) in the experiments section.

Why not compare against FNOs? These are natural baselines for cross-resolution and few-shot generalization.

As you understood, our method was designed as an add-on to existing PDE models, so we did not include many baselines and instead focused on the comparison between a PDE model and its XNN version. Nonetheless, FNO can serve as a comparable example for cross-dimension functionality, so we compared them in terms of few-shot learning. As shown in the tables, X-MPP significantly outperforms FNO in the few-shot setting.

Table 1. Test NRMSE in Burgers’ (BS) equation

| #Data | 295 | 736 | 1471 |
|---|---|---|---|
| FNO | 0.0485 | 0.0319 | 0.0214 |
| X-MPP | 0.0337 | 0.0146 | 0.0038 |

Table 2. Test NRMSE in Advection (ADV) equation

| #Data | 197 | 491 | 981 |
|---|---|---|---|
| FNO | 0.2516 | 0.2371 | 0.2323 |
| X-MPP | 0.0795 | 0.0484 | 0.0202 |

Table 3. Test NRMSE in compressed fluid dynamics (CFD) equation

| #Data | 58 | 145 | 290 |
|---|---|---|---|
| FNO | 0.0679 | 0.0522 | 0.0355 |
| X-MPP | 0.0289 | 0.0125 | 0.0094 |

Table 4. Test NRMSE in diffusion sorption (DS) equation

| #Data | 12 | 30 | 60 |
|---|---|---|---|
| FNO | 0.0350 | 0.0163 | 0.0078 |
| X-MPP | 0.0098 | 0.0042 | 0.0025 |

However, several claims such as axis permutation robustness and generalization to new boundary conditions are not empirically validated.

We did not include experiments on axis-permutation generalization because our focus is on number-of-axes generalization rather than axis-permutation generalization. In fact, in the PDE task, breaking axis-permutation equivariance using positional encoding performed better than enforcing equivariance. This is because retaining the semantics of each axis through positional encoding is beneficial.


The ablation between SXNN and GXNN is performed only on a synthetic task.

The ablation between SXNN and GXNN on real data is a good suggestion. However, on real data, the best performance was achieved by mixing SXNN and GXNN across different layers, which is why we compared them on synthetic data.


No error bars or standard deviations are reported, making it difficult to assess robustness or reproducibility.

As you pointed out, error bars or standard deviations are missing. Unfortunately, we were not able to run a sufficient number of experiments to measure the standard deviation because a single experiment (especially pretraining) takes too much time in our setting. We also mentioned this in the checklist. However, we partially observed that the results are consistent.


The handling of domain-specific features (e.g., BCs, geometry, time) is not entirely clear. How are geometry and boundary conditions encoded? Is the model tied to Cartesian grids, or can it handle irregular domains?

The handling of domain-specific features is determined by the baseline methods we modified for XNN; i.e., X-MPP follows the handling method of MPP. Additionally, the baseline methods target PDE solutions defined only on regular grids. In MPP, the boundary conditions, geometry, and time are not separately input to the model. Instead, MPP is pretrained on multiple PDE solutions with varying boundary conditions and equations, allowing it to learn general patterns of PDE solutions. Since we eventually finetune the model on a specific PDE with known boundary conditions and geometry, it is not necessary to encode them separately as inputs. The only thing MPP handles separately is the periodicity of the boundary condition: depending on it, MPP chooses between a sequential position bias and a periodic position bias in the attention layer. We will clarify this in the final version.


Have you tested whether the model is empirically equivariant to axis permutations? For example, if you permute the input axes (e.g., swap spatial dimensions), does the output permute in the same way? This would verify that the symmetry proven in theory holds in practice.

Equivariance to axis permutations: Theorems 3.1 and 3.2 prove that SXNNs and GXNNs are permutation-equivariant by construction, but there are no empirical tests. It’s unclear if the trained models preserve this symmetry under finite data and optimization noise.

After reviewing your comments, we realized that the GXNN used in the synthetic task was not implemented to be equivariant, although it was still dimension-agnostic. Therefore, we modified GXNN to be equivariant and re-evaluated its performance. As shown in Table 5, SXNN and the corrected GXNN satisfy equivariance, whereas CNN does not. In Table 6, the corrected GXNN still demonstrates strong performance compared to CNN on high-dimensional data.

Table 5. Equivariance error

| | 3D CNN | SXNN-L | Corrected GXNN |
|---|---|---|---|
| Equiv. Err. | 0.453423 | 0.000000 | 0.000000 |

Table 6. Test accuracy (%) in the synthetic task.

| | 3D CNN | GXNN | Corrected GXNN |
|---|---|---|---|
| 2D | 97.78 | 95.45 | 95.83 |
| 3D | 67.35 | 79.85 | 81.09 |
| 4D | 65.93 | 70.86 | 72.94 |
| 5D | 66.39 | 70.48 | 68.37 |
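For reference, an equivariance error of this kind can be measured with a simple unit test. The sketch below is our own illustration: `toy_axial_layer` is a stand-in equivariant layer (adding each axis's mean back to the tensor), not the authors' SXNN/GXNN implementation.

```python
import numpy as np

def toy_axial_layer(x):
    """Toy axis-permutation-equivariant layer: adds each axis's mean
    back to the input (a stand-in, NOT the paper's XNN layer)."""
    out = x.copy()
    for ax in range(x.ndim):
        out = out + x.mean(axis=ax, keepdims=True)
    return out

def equivariance_error(f, x, perm):
    """max |f(permute(x)) - permute(f(x))|; zero iff f is equivariant
    with respect to the axis permutation perm."""
    return float(np.abs(f(np.transpose(x, perm))
                        - np.transpose(f(x), perm)).max())

x = np.random.rand(4, 5, 6)
err = equivariance_error(toy_axial_layer, x, perm=(2, 0, 1))
assert err < 1e-9
```

A non-equivariant layer (e.g. one with axis-specific weights) would give a large error under the same check, as the 3D CNN column in Table 5 does.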

Can the model generalize when the meaning of axes changes? For example, if trained on data where axis 0 is time and axis 1 is space, can it still perform well when these roles are reversed at test time? This would test the claim that XNNs are agnostic to axis semantics, not just axis positions.

Robustness to axis semantics: The abstract and introduction claim that the model is agnostic to the meaning of each axis. No experiments back this up. There is no test where axes are reordered or their semantics are changed. These trivial unit tests can be included in the appendix.

By equivariance, we intended generalization over the number of axes rather than over axis permutations. Moreover, axis-permutation generalization is not preferred in an application like a PDE solver. We investigated whether axis-permutation equivariance would help PDE generalization. To this end, we used an axis-permutation-equivariant NN (XNN) with axis-permutation-invariant positional encoding to guarantee equivariance of the PDE solver. However, it showed lower performance than XNN with non-invariant positional encoding, which breaks equivariance. Thus, although the XNN layers should be axis-permutation equivariant to achieve agnosticism to the number of axes, the final model should not be equivariant, in order to preserve the semantics of each axis. We will elaborate on this perspective in the final version.


Generalization to new geometries or BCs: All tasks are defined on structured Cartesian grids with standard boundary conditions. It's unclear if the model works for irregular domains or non-grid geometries. The tensor structure and axis-based convolutions don't easily extend to curved or unstructured domains.

Generalization to new geometries or boundary conditions (BCs) strongly depends on the baseline PDE model used in XNN. Neural PDE solvers for irregular domains often introduce new encoders and objective functions [1,2]. Although irregular domains may break the axis-permutation equivariance of the PDE, XNN-based PDE solvers remain dimension-agnostic, and we are already breaking equivariance through non-invariant positional encodings. For example, MPP, one of our baseline PDE models, uses multiple position biases in the attention layers depending on the periodicity of the boundary conditions. The same approach is seamlessly applicable in X-MPP. On the other hand, non-grid approaches such as Neural Processes [3] take feature-coordinate pairs as input. Therefore, they do not require a dimension-agnostic architecture, as the coordinate information is already explicitly provided.

[1] Khara et al., 2022, Neural PDE Solvers for Irregular Domains

[2] Boussif et al., 2022, MAgNet: Mesh Agnostic Neural PDE Solver

[3] Garnelo et al., 2018, Neural Processes

Review
4

The paper introduces Axial Neural Networks, a dimension-agnostic architecture that ensures axis-permutation equivariance. When applied to PDE solvers, XNNs demonstrate strong generalization across 1D, 2D, and 3D tasks, outperforming some of the established baseline models in transfer learning and fine-tuning tasks.

Strengths and Weaknesses

Strengths

  • The paper presents a compelling deep learning framework that enforces equivariance to axis permutations. The authors carefully design each component of their architecture, resulting in expressive set-based and graph-based equivariant models.

  • They provide formal proofs to establish the equivariance properties of their proposed networks.

  • A well-designed toy example is introduced to demonstrate the expressiveness and dimension-agnostic capability of the model.

  • The authors conduct notable transfer learning experiments, particularly from 2D to 1D and 3D settings, highlighting the model’s ability to generalize effectively across unseen dimensions.

Weaknesses

  • Although the authors claim to evaluate their architecture on state-of-the-art PDE foundation models, this is misleading. In practice, models like Poseidon and DPOT significantly outperform MPP on downstream tasks (see comparisons in [1]).

  • With regard to the evaluation of the foundation model: most downstream tasks (except CFD-3D) are relatively trivial. On CFD-3D, both MPP and X-MPP show very large errors. It is therefore difficult to assess the value of incorporating equivariance. The tested problems are either too easy or too hard. The authors are encouraged to evaluate their model on a wider set of downstream tasks such as those in [1], where foundation models achieve much stronger performance.

  • In Table 2, it is unclear whether some downstream task distributions were also used during pretraining. If so, this undermines the evaluation: in real-world settings, it is unrealistic to encounter downstream data from the exact same distribution as pretraining. The true value of a foundation model lies in its ability to generalize to unseen tasks.

  • The authors should much better clarify the experimental setup and better isolate seen vs. unseen tasks.

  • The paper does not discuss how the model’s error scales with increasing model size or the size of the pretraining dataset. These scaling properties are crucial for understanding the behavior and practical usefulness of foundation models.


[1] Herde, M., Raonic, B., Rohner, T., Käppeli, R., Molinaro, R., de Bézenac, E., & Mishra, S. (2024). Poseidon: Efficient foundation models for PDEs. Advances in Neural Information Processing Systems, 37, 72525–72624.

[2] Hao, Z., Su, C., Liu, S., Berner, J., Ying, C., Su, H., ... & Zhu, J. (2024). DPOT: Auto-regressive denoising operator transformer for large-scale PDE pre-training. arXiv preprint arXiv:2403.03542.

Questions

  • Does the axis-permutation equivariance introduce any inductive bias that could limit performance on inherently asymmetric data?

  • How does the expressivity of GXNN compare to conventional ViT or CNN models when trained on high-dimensional data?

  • What effect would data augmentation techniques like rotations and transpositions during training have, compared to explicitly enforcing equivariance? Could such augmentations implicitly encourage the model to learn equivariant behavior?

Limitations

The authors did mention the limitations of their approach.

Final Justification

The authors have addressed most of my concerns, so I increase the score to 4.

Formatting Issues

/

Author Response

Thank you for your valuable feedback. Below are our responses to the remaining concerns.


Although the authors claim to evaluate their architecture on state-of-the-art PDE foundation models, this is misleading. In practice, models like Poseidon and DPOT significantly outperform MPP on downstream tasks (see comparisons in [1]).

That is a valid point. We followed CViT's claim of being SOTA, which is why we referred to our evaluation as SOTA. However, we later found that CViT was not the best model in our setting, and we will correct the claim. As future work, it would be interesting to apply XNN to Poseidon, a SOTA PDE model.


With regard to the evaluation of the foundation model: most downstream tasks (except CFD-3D) are relatively trivial. On CFD-3D, both MPP and X-MPP show very large errors. It is therefore difficult to assess the value of incorporating equivariance. The tested problems are either too easy or too hard. The authors are encouraged to evaluate their model on a wider set of downstream tasks such as those in [1], where foundation models achieve much stronger performance.

We included only 1D and 3D downstream tasks to demonstrate the unseen dimension generalization capability. For a comparable assessment, we evaluated our method on a new downstream task called 2D incompressible fluid dynamics (IFD), corresponding to seen dimensionality but unseen PDE data. As a result, the axial version of MPP, X-MPP, significantly outperformed MPP due to the embedded inductive bias in the architecture.

Table 1. Finetuning performance on unseen PDE, IFD

| #PretrainData | 11958 | 21912 |
|---|---|---|
| MPP | 0.0350 | 0.0163 |
| X-MPP | 0.0098 | 0.0042 |

In Table 2, it is unclear whether some downstream task distributions were also used during pretraining. If so, this undermines the evaluation: in real-world settings, it is unrealistic to encounter downstream data from the exact same distribution as pretraining. The true value of a foundation model lies in its ability to generalize to unseen tasks. The authors should much better clarify the experimental setup and better isolate seen vs. unseen tasks.

As you said, Table 2 includes an unrealistic setting, namely the inclusion of pretraining data in the downstream tasks. We would like to clarify that Table 2 is intended to show the expressivity of XNN, not its functionality as a foundation model. We tested the unseen-dimension generalization capability of our model in Fig. 4. In this case, the seen data are all 2D PDEs, and the unseen data are all 1D and 3D data (finetuned in the downstream task).


The paper does not discuss how the model’s error scales with increasing model size or the size of the pretraining dataset. These scaling properties are crucial for understanding the behavior and practical usefulness of foundation models.

We investigate the scaling properties with respect to the size of the pretraining dataset, in comparison to the Fourier Neural Operator (FNO), on 1D compressible fluid dynamics (CFD). FNO is trained only on CFD, whereas X-MPP is pretrained on multiple 2D PDEs. As shown in the table, X-MPP also scales well as the size of the pretraining dataset increases and outperforms FNO when provided with a sufficient amount of pretraining data.

Table 2. Finetuning performance on 1D CFD.

| Model | X-MPP | X-MPP | X-MPP | FNO |
|---|---|---|---|---|
| #PretrainData/#TotalData | 0.01 | 0.1 | 1 | 1 |
| NRMSE | 0.3469 | 0.0198 | 0.0052 | 0.0232 |

Does the axis-permutation equivariance introduce any inductive bias that could limit performance on inherently asymmetric data?

Actually, axis-permutation equivariance is used to design a dimension-agnostic architecture rather than to impose an inductive bias. Therefore, we do not need to guarantee equivariance for asymmetric data. Indeed, our PDE solver is not axis-permutation equivariant due to the non-invariant positional encodings, although each layer is equivariant. This is because the semantics of each axis are also important.


How does the expressivity of GXNN compare to conventional ViT or CNN models when trained on high-dimensional data?

Typically, equivariant models including XNNs sacrifice expressivity to guarantee equivariance. Hence, the expressivity of GXNN would also be lower than that of conventional ViT and CNN for the same number of model parameters. However, ViT or CNN requires an exponentially large number of parameters to process high-dimensional data, whereas GXNN can handle the same high-dimensional data with a relatively small number of parameters.
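The parameter-count argument above can be illustrated with a back-of-the-envelope calculation (our own numbers, not from the paper): a dense d-dimensional convolution kernel of width k has k^d weights per channel pair, while an axial/factorized design that reuses one shared 1D kernel on every axis needs only k:

```python
def dense_kernel_params(k, d):
    """Weights in a full d-dimensional convolution kernel of width k."""
    return k ** d

def axial_kernel_params(k, d):
    """Weights when one shared 1D kernel of width k is reused on every axis."""
    return k

for d in (1, 2, 3, 5):
    print(d, dense_kernel_params(5, d), axial_kernel_params(5, d))
# dense grows as 5^d (5, 25, 125, 3125); axial stays at 5
```

This is only the kernel term; real architectures add channel and layer factors, but the exponential-vs-constant gap in d is the point being made.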


What effect would data augmentation techniques like rotations and transpositions during training have, compared to explicitly enforcing equivariance? Could such augmentations implicitly encourage the model to learn equivariant behavior?

Although rotation is non-trivial in general PDEs, transposition of spatial dimensions has the potential to produce an effect similar to equivariance. However, equivariant models are theoretically more robust because they directly constrain the loss surface through the restricted architecture [1].

[1] Wang et al., 2022, Data Augmentation vs. Equivariant Networks: A Theory of Generalization on Dynamics Forecasting

Comment

I thank the authors for a detailed response. I will increase my score to 4.

Comment

Thank you for considering a positive decision. As mentioned in the score description, a score of 4 indicates a technically solid paper where the reasons to accept outweigh the reasons to reject. We kindly ask whether you have any remaining reasons to reject. If you have any additional questions, we would be glad to address them.

Final Decision

The paper addresses the challenge of building foundation models for physical dynamics when the input dimensionality may vary across different physical phenomena—for example, when training on multiple PDE simulations. Classical solutions include designing dedicated encoder-decoders for each case or fixing a maximum dimensionality and padding missing variables with zeros. The paper introduces a novel dimension-agnostic framework, called Axial Neural Networks (XNNs), which can handle tensors of arbitrary rank by treating axes as set elements or graph nodes, and applying message passing to model cross-axis dependencies. The approach is inspired by previous work on equivariant neural networks. Two variants of the method are proposed and formally shown to be equivariant under axis permutations. This framework is instantiated on two PDE foundation model backbones (MPP and CViT) and evaluated on a range of PDE simulation problems.

The reviewers agree on the relevance of the problem for developing foundation models for physics. They find the approach original and non-trivial, with formal proofs supporting the claims and convincing experimental results across diverse settings. They consider that the authors’ responses and additional experiments satisfactorily address most of their questions. I recommend acceptance.