Nonlocal Attention Operator: Materializing Hidden Knowledge Towards Interpretable Physics Discovery
Abstract
Reviews and Discussion
This work presents a new neural network for tackling the forward and backward problems of PDEs, especially for small physics systems. The forward problem simulates the outcome of a physics system given the inputs, while the backward problem estimates the parameters of a physics system given the inputs and outputs. The key idea of the presented method is to exploit the common rules underlying multiple physics systems. This work found that using an attention layer to model multiple physics systems can significantly help the model solve the backward problem, also known as the discovery problem. The idea is intuitive: looking at multiple similar rules is helpful for discovering a new rule.
Strengths
The idea is interesting and clever. The nature of sharing knowledge among multiple systems is somewhat related to in-context learning in LLMs.
Weaknesses
The experiments could be more extensive and solid, especially in supporting the main contribution. Although the experiment on the backward problem (discovery) reports the OOD results, the OOD setting is not far from the in-domain data, which raises concerns about the generalization ability.
The paper's writing is not very friendly to those with a deep learning background. The content around background knowledge and the connections between the attention mechanism and kernel methods is somewhat tangential to the main storyline. The idea is straightforward, so I would prefer a clean and compact narrative. However, this is not a major concern since I only have a deep learning background.
Questions
For Section 5.1, consider including a broader range of tasks and different types of kernels. It would also be beneficial to incorporate more OOD testing. Another related question is: Given that multiple training tasks come from the same family, could this be equivalent to a weaker version of learning more data pairs from a single system?
I am unsure whether the reported number of data pairs refers to the total data pairs across different tasks or for each individual task; I suspect it is the latter. If we provide more data pairs from a single task, could it achieve a similar level of performance? Conversely, if we increase the number of tasks and reduce the data pairs for each task, does this improve the generalization ability?
Limitations
Agree with the limitation section.
We thank the reviewer for their valuable time and constructive suggestions. We have generated more families of kernels, and performed additional tests on three different training settings (diverse tasks, single task with diverse function pairs, and diverse tasks with a reduced number of samples). Additionally, we have created an OOD test task with a substantially different kernel from the trained ones (i.e., Gaussian-type kernel versus sine, cosine and polynomial-type kernels). All additional test results can be found in the table of the one-page PDF file. We will also modify the manuscript to make the narrative cleaner and more compact. Our other responses:
Broader range of tasks for different types of kernels: Per the reviewer's suggestion, we have added two more types of kernels to the training dataset. The training dataset is now constructed based on 21 kernels of three groups, with samples on each kernel:
sine-type kernels: , .
cosine-type kernels: , .
polynomial-type kernels: (), , where is the degree- Legendre polynomial.
Based on this enriched dataset, beyond the original ''sin only'' setting, three additional settings are considered in Table 1 of the attached PDF. In part II of Table 1 (denoted as ''sin+cos+poly''), we consider a ''diverse task'' setting, where all samples are employed in training. Then, in part III we consider a ''single task'' setting, where only the first sine-type kernel
is considered as the training task, with all samples on this task. Lastly, in part IV we demonstrate a ''fewer samples'' setting, where the training dataset still consists of all tasks but with only samples on each task. For the purpose of testing, besides the in-distribution (ID) and out-of-distribution (OOD) tasks in the original manuscript, we have added an additional OOD task with a Gaussian-type kernel , which is substantially different from all training tasks.
As shown in Table 1, considering diverse datasets helps in both OOD tests (the kernel error is reduced from 9.14% to 6.92% in OOD test 1, and from 329.66% to 10.48% in OOD test 2), but not in the ID test (the kernel error slightly increases from 4.03% to 5.04%). We also note that since the ''sin only'' setting only sees systems with the same kernel frequency during training, it does not generalize to OOD test 2. In this test, including more diverse training tasks becomes necessary. Moreover, as the training tasks become more diverse, the intrinsic dimension of the kernel space increases, requiring a larger rank for the weight matrices. When comparing the ''fewer samples'' setting with the ''sin only'' setting, the former has better task diversity but fewer samples per task. One can see that the performance on the ID test deteriorates while the performance on OOD test 2 improves. This observation highlights the balance between the diversity of tasks and the number of training samples per task.
Difference from learning more data pairs from a single system: Although all 7 sine-type kernels correspond to the same frequency, they are very different from one another. If we train the model on a single kernel, the model does not generalize at all: the test error on the validation task does not improve during training, and the model always predicts an all-zero kernel. As shown in the ''Single task'' setting of the attached Table 1, the kernel and operator errors remain around 100% for all test tasks.
Reported number of data pairs: When we report the number of data pairs for each task, as in Example 1, it refers to the total number of data pairs of that task. We then form a training sample of each task by taking pairs from this task. With token size , each task contains samples. With 7 training tasks corresponding to 7 different sine-type kernels, there are samples in total. When we instead report the total number of data pairs, as in Examples 2-3, it is counted across all tasks. In operator learning settings, we note that permuting the function pairs in each sample should not change the learned kernel, i.e., one should have K[,]=K[,], where is the permutation operator. Hence, we have augmented the training dataset by permuting the function pairs in each task. Taking Example 3 for instance, with microstructures (tasks) and function pairs per task, we randomly permuted the function pairs and drew function pairs times per task. In this way we created 50000 samples in total (45000 for training and 5000 for testing) in Example 3. We will clarify this in the revised manuscript.
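For concreteness, below is a minimal sketch of this permutation-based augmentation (the array names, shapes, and sampling loop are illustrative assumptions, not the exact training pipeline):

```python
import numpy as np

def augment_task(u_pairs, f_pairs, token_size, n_augment, rng=None):
    """Form training samples for one task by randomly permuting its
    (u, f) function pairs and drawing `token_size` pairs per sample.

    u_pairs, f_pairs: arrays of shape (n_pairs, n_grid) holding the input
    and output functions of a single task (hypothetical shapes).
    """
    rng = rng or np.random.default_rng()
    n_pairs = u_pairs.shape[0]
    samples = []
    for _ in range(n_augment):
        idx = rng.permutation(n_pairs)[:token_size]  # permuted subset of pairs
        # each sample stacks the selected input/output pairs as tokens
        samples.append(np.stack([u_pairs[idx], f_pairs[idx]], axis=1))
    return np.stack(samples)  # (n_augment, token_size, 2, n_grid)
```

Since the learned kernel is invariant to the ordering of the function pairs, this augmentation reuses the same data without changing the target kernel.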
Thanks for the detailed response and new experiment results. I will raise my score.
We thank the reviewer again, for their discussion and for raising the score. We sincerely appreciate the reviewer's valuable time and suggestions, which helped us to improve the quality of this work.
This paper explores using large, language-model-style deep neural networks to build a foundation physics model that can solve inverse PDE problems. It proposes the Nonlocal Attention Operator (NAO) model with a kernel map that depends on both the input and output functions. The intermediate proofs help readers understand the underlying mathematical principles behind the attention mechanism, especially the relationship between attention and integration. The paper also presents experiments on radial kernel learning, solution operator learning, and heterogeneous material learning, all with decent results.
Strengths
- the mathematical proofs are comprehensive
- the ablation study compares performance among several variants
- the experiments are adequate to show the effectiveness of the model in different domains
Weaknesses
- there are very few comparisons with other works in the experimental results
Questions
- is it possible for you to publish your source code on GitHub/GitLab so that others can use your framework for practical applications and researchers can build on top of your work?
Limitations
- since the equations are quite complex, replication of this work may not be consistent across different people; thus, the lack of open-source code will limit its use/adoption in practical applications.
We thank the reviewer for their valuable time and constructive suggestions. Our source code, trained models, and datasets will be made publicly available on Github upon acceptance of the paper, so as to guarantee reproducibility of all results. Our other responses:
Limited baselines: We did not compare with many other works because very few existing models are capable of discovering hidden physics or the governing physical parameters directly from data. Beyond the direct application of the attention model as in the discrete NAO setting, the most relevant work to our best knowledge is the adaptive Fourier neural operator, which adopted a convolution-based attention mechanism and which we compared to in the ablation study. However, it is unable to learn a good solution across tasks (as indicated by the large errors in the in-distribution and out-of-distribution tests), nor can it provide a physical interpretation for the inverse problem.
To further address the reviewer's concern, we have added an autoencoder model as an additional baseline in Example 1; see Table 1 of the one-page PDF. In this setting, an encoder is constructed to map the concatenated data pairs [,] to the hidden kernel K[,] as the latent representation, and then in the decoder the kernel is multiplied with , following the last step of Eq.(6): K[,]-> . We note that the concatenated data pairs [,] are of size as the input to the encoder, and the kernel is of size as the output. Hence, even a linear encoder contains around trainable parameters. When the autoencoder model is of a size similar to NAO's, its expressivity is very limited.
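For concreteness, a minimal sketch of such an autoencoder baseline is given below (the layer sizes, names, and hidden width are illustrative assumptions rather than the exact implementation used in Table 1; as noted above, even a purely linear encoder is already large):

```python
import torch
import torch.nn as nn

class KernelAutoencoder(nn.Module):
    """Encoder maps the concatenated (u, f) pairs to a kernel estimate;
    the decoder applies that kernel to u to reproduce f (a sketch of the
    baseline described above, with assumed sizes)."""

    def __init__(self, n_grid, n_pairs, hidden=512):
        super().__init__()
        in_dim = 2 * n_grid * n_pairs      # concatenated [u, f] data pairs
        out_dim = n_grid * n_grid          # discretized kernel K
        self.n_grid = n_grid
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.GELU(), nn.Linear(hidden, out_dim)
        )

    def forward(self, u, f):
        # u, f: (batch, n_pairs, n_grid)
        z = torch.cat([u, f], dim=-1).flatten(1)        # (batch, 2*n_pairs*n_grid)
        K = self.encoder(z).view(-1, self.n_grid, self.n_grid)
        f_pred = torch.einsum('bij,bpj->bpi', K, u)     # decoder: apply kernel to u
        return K, f_pred
```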
Dear reviewer eNuS
Could you please take a look at the responses of the authors and let us know your thoughts on them? Are you satisfied with the responses and do you have some updates on your comments? Please also take a look at other reviews and share with us your thoughts on whether the paper is ready for publication. We need your response so that we can make an informed decision on this paper.
AC
This paper introduces a novel approach called the Nonlocal Attention Operator (NAO) for simultaneously solving forward and inverse PDE problems across multiple physical systems. The key innovation is using an attention-based kernel map that learns to infer resolution-invariant kernels from datasets, enabling both physics modeling and mechanism discovery.
Thank you to the authors for their responses. It is great to see a detailed complexity analysis; still, the quadratic scaling is definitely a cause for concern -- hence the reasons for relaxing kernel-based formulations of self-attention to linear-scaling versions (e.g., Mamba, Informer, etc.). However, I can see how similar relaxations can be applied to NAO. So this is very helpful.
Strengths
The main strengths I find are as follows:
(i) Bridging forward and inverse PDE modeling: The approach uniquely addresses forward PDE solving (prediction) and inverse PDE solving (discovering hidden mechanisms) in a unified framework.
(ii) Interpretability: The discovered kernel provides meaningful physical interpretation, addressing a common criticism of black-box neural network approaches.
(iii) Theoretical foundation: The authors provide a theoretical analysis showing how the attention mechanism provides a space of identifiability for kernels, helping to resolve ill-posed inverse problems. The authors also provide, as part of their approach, extensibility to various forms of operators, including those with radial interaction kernels and heterogeneous interactions.
(iv) Impact: Novel mechanisms for modeling and learning about physical phenomena with a focus on interpretability is a significant step in the right direction for me. I am unsure that this will directly lead to contributions in the foundation model space (I think it's a bit of a stretch -- see weaknesses for details). However, I can see that this work can be gainfully leveraged to drive insights for foundation model building.
My reading of the analysis provided also yielded the following strengths (specifically regarding the formal rigor of the proposed approach):
The authors present a general form of the operator in Eq. (3), which encompasses many physical systems. This generality is a strength of the approach. The kernel map (Eq. 4-7) uses a multi-layer attention model to learn the mapping from data pairs to kernel estimations. The use of both input (u) and output (f) in the attention mechanism is novel and crucial for addressing the inverse problem. The derivation of a continuum limit of the discrete attention model (Eq. 8) provides insight into the operator's behavior as the number of data points increases. Lemmas 4.1 and 4.2 provide valuable insights into the limit behavior of the attention-parameterized kernel and the space of identifiability for kernels. This analysis helps to explain why the method can effectively solve ill-posed inverse problems.
Weaknesses
I find only one glaring weakness, which is likely to surface during the implementation of this approach for physical systems of appreciable size and complexity: the method involves multiple integrations and matrix operations, which may be computationally expensive for large-scale problems. A discussion of computational efficiency and scalability would be beneficial. On a related note, it is also not clear to me whether this method is data-efficient.
A minor weakness I see is that I do not find that this method is easily integrated with current approaches to physics-informed machine learning, e.g., through simple extensions to pre-existing libraries. A brief discussion on this very practical aspect might strengthen the paper.
Questions
- How does the computational complexity of NAO scale with the number of data points and the dimensionality of the problem? Provide a detailed analysis of computational costs and compare them with existing methods for both forward and inverse PDE solving.
- What are the minimum data requirements for NAO to perform effectively, particularly for the inverse problem? It may be beneficial to provide an analysis of performance vs. data quantity to guide potential users in determining data collection needs.
- How can NAO be integrated with existing physics-based models or simulations to enhance their capabilities? Maybe discuss potential hybrid approaches combining NAO with existing or traditional methods.
Limitations
I read the limitation sections, and the authors provide, I think, an inkling of the answers to the questions that I have raised. I hope to see a more comprehensive discussion on the same, with the necessary addendums made to the limitations section.
We thank the reviewer for the valuable comments and suggestions. Our response:
Number of trainable parameters and computational complexity: Denoting as the size of the spatial mesh, the number of data pairs in each training model, the column width of the query/key weight matrices and , the number of layers, and the number of training models, the number of trainable parameters in a discrete NAO is of size . For the continuous-kernel NAO, taking, for example, a three-layer MLP as a dense net of size for the trainable kernels and , the corresponding number of trainable parameters is . Thus, the total number of trainable parameters for a continuous-kernel NAO is .
The computational complexity of NAO is quadratic in the length of the input and linear in the data size, with flops in each epoch in the optimization. It is computed as follows for each layer and the data of each training model: the attention function takes flops, and the kernel map takes flops; thus, the total is flops. In inverse PDE problems, we generally have , and hence the complexity of NAO is per layer per sample.
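To make the scaling concrete, below is a schematic single-layer sketch of an attention-style kernel map (the shapes and layout are our assumptions for illustration; normalization and the multi-layer structure of NAO are omitted):

```python
import torch

def kernel_map_layer(x, Wq, Wk):
    """Schematic attention-style kernel map for one layer.

    x:  (n, 2*d) -- d input/output function pairs stacked column-wise,
                    each discretized on an n-point spatial mesh
    Wq, Wk: (2*d, r) -- query/key weight matrices of column width r

    Forming q and k costs O(n*d*r) flops; the product q @ k.T costs
    O(n^2*r), i.e. quadratic in the mesh size n and linear in the
    number of data pairs d.
    """
    q = x @ Wq      # (n, r)
    k = x @ Wk      # (n, r)
    K = q @ k.T     # (n, n) kernel estimate over the spatial mesh
    return K
```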
Compared with other methods, it is important to note that NAO solves both forward and ill-posed inverse problems using multi-model data. Thus, we do not compare it with methods that solve these problems for single-model data, for instance the regularized estimator in Appendix B. Methods solving similar problems are the original attention model [1], convolutional neural networks (CNNs), and graph neural networks (GNNs). As discussed in [1], these models have a similar computational complexity, if not higher. In particular, the complexity of the original attention model is , the complexity of a CNN is with being the kernel size, and a full GNN has complexity .
[1] Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Data efficiency: For NAO to perform well, the minimal amount of data depends on a balance between multiple factors, including the intrinsic dimension (smoothness) of the kernel, the token size for each sample, the rank of the weight matrices, the number of samples in each task, and the diversity of training tasks. Thus, it is difficult to provide a theoretical guarantee. In the original manuscript, with all other factors fixed, we tested the dependence on the token dimension and the rank of the weight matrices in Table 1. To further elaborate on the importance of sample number and task diversity, in Table 1 of the one-page PDF we have further performed tests on three settings: 1) diverse tasks; 2) single task with diverse function pairs; and 3) diverse tasks with a reduced number of samples. In particular, in the diverse task setting, the training dataset is constructed based on 21 kernels of three groups, with samples on each kernel:
sine-type kernels: , .
cosine-type kernels: , .
polynomial-type kernels: ( ), , where is the degree- Legendre polynomial.
In the ''diverse task'' setting, all samples are employed in training. In the ''single task'' setting, only the first sine-type kernel:
is considered as the training task, with all samples on this task. Then, in the ''fewer samples'' setting, the training dataset still consists of all tasks but with only samples on each task. For the purpose of testing, besides the in-distribution (ID) and out-of-distribution (OOD) tasks in the original manuscript, we have added an additional OOD task with a Gaussian-type kernel , which is substantially different from all training tasks.
As shown in the results in Table 1 of the one-page PDF, for the ID test task a small number of training tasks is sufficient, and the model performs well in all settings except the ''single task'' one. When the test task has a larger discrepancy from the training tasks, as in the OOD test 2 setting, it becomes necessary to increase the training task diversity, and the required rank size should also increase to obtain better model expressivity. When comparing the ''fewer samples'' setting with the ''sin only'' setting, the former has better task diversity but fewer samples per task. One can see that the performance on the ID test deteriorates while the performance on OOD test 2 improves. This observation highlights the balance between the diversity of tasks and the number of training samples.
Combination with physics-based modeling approaches: When the physics structure is known, NAO can directly encode such information in its architecture, such as the kernel structure of the nonlocal diffusion model in Example 1. By doing so, we anticipate exploring the kernel on a lower-dimensional manifold, which reduces the required amount of training data.
On the other hand, NAO could also help physics-based modeling approaches by providing a surrogate foundation model when there is a large class of physical models to be estimated and simulated. In this setting, NAO can be first trained using data generated by these models, and then used as an efficient surrogate model. It will be useful for sampling and uncertainty quantification tasks.
We thank the reviewer again for reading our rebuttal and for the valuable further suggestions. We sincerely appreciate the reviewer's helpful discussions, which helped us to improve the quality of this work.
Dear reviewer J2Mv
Could you please take a look at the responses of the authors and let us know your thoughts on them? Are you satisfied with the responses and do you have some updates on your comments? Please also take a look at other reviews and share with us your thoughts on whether the paper is ready for publication. We need your response so that we can make an informed decision on this paper.
AC
This paper proposes a novel neural operator architecture called Nonlocal Attention Operator (NAO) for learning both forward and inverse problems in physical systems from data. The key contributions are:
- A new attention-based neural operator that can simultaneously learn the forward mapping (physics modeling) and inverse mapping (physics discovery) for PDE systems.
- Theoretical analysis showing how the attention mechanism provides a kernel map that explores the space of identifiability for kernels, helping to resolve ill-posedness in inverse problems.
- Empirical demonstration of NAO's advantages over baseline methods, especially for generalization to unseen systems and data-efficient learning in ill-posed inverse problems.
- Interpretability of the learned kernels, allowing insight into the discovered physical mechanisms.
The authors evaluate NAO on several PDE learning tasks, including radial kernel learning, solution operator learning for Darcy flow, and heterogeneous nonlinear material modeling.
Strengths
- The paper presents a novel neural operator architecture that leverages attention in a unique way for PDE learning. The idea of using attention to build a kernel map that can generalize across multiple physical systems is innovative.
- The theoretical analysis in Section 4 provides rigorous justification for how the attention mechanism helps address ill-posedness. The empirical evaluations are comprehensive, testing multiple aspects like generalization, data efficiency, and interpretability.
- The paper is generally well-written and clearly structured. The motivation, methodology, and results are presented logically.
- The ability to simultaneously learn forward and inverse mappings for PDE systems, while providing interpretable kernels, could have broad impact in scientific machine learning and physics-informed AI. The zero-shot generalization capability is particularly notable.
Weaknesses
- The experimental evaluation is limited to relatively simple PDE systems. It's unclear how well NAO would scale to more complex, high-dimensional PDEs encountered in real-world applications.
- The interpretability claims could be further substantiated. While learned kernels are visualized, there's limited discussion on how these provide physical insights beyond matching ground truth.
Questions
Can you provide more insight into how the learned kernels can be interpreted to gain physical understanding? Perhaps a case study showing how NAO discovers meaningful structure in a physical system would be illuminating.
Limitations
Yes.
We thank the reviewer for the insightful comments and questions. Our response:
Interpretability for physical understanding: The physical meaning of the learned kernel is application-dependent: in the context of material modeling, the learned kernel characterizes the interaction between material points; in turbulence modeling problems, the interaction range of the kernel suggests the characteristic length of the eddies; in fracture mechanics, the vanishing of kernel values between two neighboring regions may indicate a crack between these two regions; just to name a few. Generally speaking, identifying the kernel is instrumental in identifying influential factors, characterizing spatial dependence, and analyzing the forward operator.
(i) The learned kernels themselves are fundamental for understanding the physical model. They indicate the range and strength of interactions between components. For example, a learned interaction kernel can reveal interactions between particles or agents. In the Darcy problem setting, by summing the kernel strength along each row, one can obtain the interaction strength of each material point with its neighbors. Then, since such a strength is related to the permeability field , the underlying microstructure can be recovered. In Figure 2 of the one-page PDF, we demonstrate the discovered microstructure on a test specimen. We note that because the discovered microstructure is smoothed out due to the continuous setting of our learned kernel (as shown in the middle plot), a thresholding step was performed to recover the two-phase microstructure. The discovered microstructure (right plot) matches well with the hidden ground-truth microstructure (left plot), except in the regions near the domain boundary. This is due to the Dirichlet-type boundary condition ( on ) imposed in all samples. As a result, the measurement pairs contain no information near the domain boundary , making it impossible to identify the kernel near the boundary.
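A rough sketch of this row-summation and thresholding post-processing is given below (the array names and the median threshold are illustrative assumptions; the actual threshold used for Figure 2 may differ):

```python
import numpy as np

def recover_microstructure(K, threshold=None):
    """Recover a two-phase microstructure from a learned n x n kernel.

    Sum the kernel along each row to obtain the interaction strength of
    each material point with its neighbors, then threshold to obtain a
    binary (two-phase) indicator, as described above.
    """
    strength = K.sum(axis=1)                   # per-point interaction strength
    if threshold is None:
        threshold = np.median(strength)        # assumed: split phases at the median
    return (strength > threshold).astype(int)  # 0/1 phase indicator per point
```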
(ii) The learned kernel characterizes the output's spatial dependence on the input. The learned kernel in the nonlocal diffusion model of our Example 1 demonstrates this spatial dependence. The range of this spatial dependence characterizes the characteristic lengths of the system, which is of particular interest in multiscale problems (see, e.g., [1]).
(iii) The learned kernels facilitate the analysis of the forward operator, including its spectral properties, stiffness for computation, and error bounds for output prediction under perturbations in the input.
We will add these discussions and the additional figure to the revised manuscript.
[1] You, Huaiqian, et al. "A data-driven peridynamic continuum model for upscaling molecular dynamics." Computer Methods in Applied Mechanics and Engineering 389 (2022): 114400.
Dear reviewer eXH4
Could you please take a look at the responses of the authors and let us know your thoughts on them? Are you satisfied with the responses and do you have some updates on your comments? Please also take a look at other reviews and share with us your thoughts on whether the paper is ready for publication. We need your response so that we can make an informed decision on this paper.
AC
We thank the reviewers for the constructive comments and for recognizing the novelty and broad impact of our work, the soundness and solid theoretical foundation of our formulation, the effectiveness and meaningful physical interpretability of our model, as well as the comprehensive experimental studies. In our response, we have answered all the questions raised by the reviewers and clarified potential misunderstandings. Moreover, we have performed additional tests on: an additional baseline (autoencoder), three different training settings (diverse tasks, single task with diverse function pairs, and diverse tasks with a reduced number of samples), and an OOD test task with a substantially different kernel from the trained ones (i.e., Gaussian-type kernel versus sine, cosine and polynomial-type kernels). All additional test results can be found in the table of the one-page PDF file. Lastly, to provide further physical interpretability of the discovered kernel, we demonstrate NAO's capability in recovering the underlying microstructure in the Darcy's problem (Example 2). A comparison of the true microstructure and the discovered one can be found in Figure 2 of the one-page PDF file.
We sincerely appreciate the time and efforts of the AC and all reviewers. Your comments have helped us improve the quality of our manuscript, and we will incorporate them into the revised manuscript. We will also release the source codes, the trained models, and the datasets upon paper acceptance to guarantee reproducibility of all experiments. We are more than happy to answer any follow-up questions during the discussion period.
I have read the rebuttal. Thanks for your work!
We thank the reviewer again for reading our rebuttal and for the kind response. We sincerely appreciate the reviewer's valuable time and suggestions, which helped us to improve the quality of this work.
This paper presents an interesting study that utilizes a novel attention-based neural architecture for modeling complex physical systems. The authors introduce an attention mechanism specifically tailored to address physical problems involving PDE systems. The paper includes both theoretical discussions and strong empirical results. The reviewers unanimously agree that this is a solid contribution with the potential to bring new insights to the field.