5.5

/10

Poster4 位审稿人

最低5最高6标准差0.5

3.0

置信度

正确性2.8

贡献度2.5

表达2.5

NeurIPS 2024

How does PDE order affect the convergence of PINNs?

Chang hoon Song,Yesom Park,Myungjoo Kang

OpenReview PDF

提交: 2024-05-14更新: 2025-01-05

摘要

关键词

Physics-Informed Neural NetworksVariable SplittingHigher-order PDEsGradient FlowConvergence

评审与讨论

审稿意见

评分: 5置信度: 22024-06-22

Building upon the work of Gao et al. [22], the authors extend the analysis of the gradient flow (GF) of PINNs (two-layer MLP is assumed) to general $k^\mathrm{th}$ order partial differential equations (PDEs) and the $p^\mathrm{th}$ power of Rectified Linear Unit (ReLU) activation function with general $p$ .
The authors achieve tighter bounds than those obtained by Gao et al [22].
The width of the network necessary for the convergence of GF increases exponentially with the power $p$ of $\mathrm{ReLU}^p$ .
The optimal power $p$ is determined by the order $k$ of the governing PDE; $p = k +1$ .
The GF convergence of PINNs deteriorates with increasing dimensions.
To address these challenges, we mathematically demonstrate the efficacy of a variable splitting strategy, a conventional order-reduction approach.
An order reduction through variable splitting is proposed based on the theoretical findings above.

Overall, I enjoyed reading the manuscript; however, it would benefit from some revisions.

优点

The topic is relevant and very interesting. The theoretical relationship between PDE's orders and PINN's convergence has been elusive but is a practically important. A challenge comes from the fact that PINN's objective contains the derivatives of PINN's outputs w.r.t. its inputs (Gao et al. [22] is one of pioneering works in this sense).
Theoretical contributions include:
- Building upon the work of Gao et al., the authors extend the analysis of the GF of PINNs, composed of two-layer MLPs, to general $k^\mathrm{th}$ order PDEs and the $p^\mathrm{th}$ power of ReLU activation function with general $p$ . The authors achieve tighter bounds than those obtained by Gao et al.
- The width of the network necessary for the convergence of GF increases exponentially with the power $p$ of $\mathrm{ReLU}^p$ .
- The optimal power $p$ is determined by the order $k$ of the governing PDE; $p = k +1$ .
- The GF convergence of PINNs deteriorates with increasing dimensions.
- To address these challenges, we mathematically demonstrate the efficacy of a variable splitting strategy, a conventional order-reduction approach.
This paper is easy to follow.
The code is submitted for reproduction, which is important because studies on PINNs are often hard to reproduce due to their strong dependence on random seeds. However, the submitted code does not work, unfortunately (see Weaknesses below).

缺点

Major comments

The most critical concern is that experiments were conducted using Adam, not GD; thus, Section 5 does not validate the theoretical results, which are based on GF.
- While training PINNs with GD is challenging (as noted in line 320), I strongly recommend including experimental results using GD, which would significantly enhance the paper's quality. Although some preceding papers have used Adam or other irrelevant optimizers to validate their theoretical results built upon GF, such experiments are, in my opinion, obviously ill-advised.
- A continuous-time limit of Adam is given in [Malladi et al., 2022 (https://arxiv.org/abs/2205.10287)], which is, of course, different from GF.
While GF is easy to handle in theory, GF's underlying assumptions are too restrictive in practice. GF is be a good approximation only when (1) PINNs are trained without random collocation points or are trained on random collocation points without resampling, (2) full-batch training is assumed, and (3) sufficiently small learning rates are used. There three conditions are too restrictive in view of current PINN experiments in the literature. Convergence analyses of DNNs are often criticized for their gaps between theory and practice. I greatly appreciate if these concerns are resolved.
- Condition (1) is mentioned in line 358. I would like to ask a question about this point, together with Condition (2), in Questions below.
- Condition (3) is mentioned in line 361. It is shown in [Miyagawa, 2022 (https://openreview.net/forum?id=qq84D17BPu)] that theoretical predictions of gradient flow deviate from experimental results as $t \rightarrow \infty$ even when learning rates are impractically small. This work would be very helpful and significantly extend the present work, adapting it to practical experimental settings. Please see Questions below.

Code

Overall, I would recommend that the authors follow the official code submission guidelines and templates (https://github.com/paperswithcode/releasing-research-code).

The 'utils' module is missing in the submitted codes, and they do not work.
Adding requirements.txt would be recommended.

Minor comment

$\mathrm{ReLU}^p$ is called the rectified power unit (RePU) in [Bokati et al. (https://scholarworks.utep.edu/cgi/viewcontent.cgi?article=2717&context=cs_techrep)]. Please see also the references therein. I would recommend citing them in the submitted paper.

Typos

Line 154: Impact of PDE order on convergence of PINNs -> Impact of PDE Order on Convergence of PINNs
Line 203: $\delta << 1$ -> $\delta \ll 1$
Line 372: keller-segel -> Keller-Segel
Line 373: da vinci-euler-bernoulli -> da Vinch-Euler-Bernoulli
Line 414: pinns -> PINNs
Line 462: i -> I
Line 507: Dgm -> DGM

问题

Can the theoretical analysis given in this paper generalize to other activation functions, e.g., leaky $\mathrm{ReLU}^p$ , softplus, and GeLU, sine, etc.?
Can the theoretical analysis given in this paper generalize beyond two-layer networks?
How can we extend this work to non-gradient-flow-based optimization problems, which are more prevalent and practical, e.g., random collocation points with resampling (i.e., SGD-like training)?
To what extent theoretical predictions deviate from its corresponding experimental results when SGD-like training is used?
As mentioned above, it is shown that theoretical predictions of gradient flow deviate from experimental results as $t \rightarrow \infty$ even when learning rates are impractically small as shown in [Miyagawa, 2022 (https://openreview.net/forum?id=qq84D17BPu)]. How can we alleviate this issue, or is this possibly not a problem in the present paper?
(Footnote 3 on page 5) Is there any other evidence or related work that smoother activation functions require larger networks for convergence? The opposite, i.e., "smoother activation functions help convergence and reduce the required size of width," would also sound valid in the context of general DNN training. I would greatly appreciate if the authors discuss this point further.
- Inequality (10) and (16) seem to be sufficient conditions for convergence and are not necessary (please correct me if I missed something). Therefore, the intuitive explanation in footnote 3 on page 5 would be valid only in limited cases.

局限性

Yes (described in Section 6).

作者回复

2024-08-07

In order to provide a concise response to theoretical questions, we structured our proof:

(S1) Determine the condition for the positive definiteness(PD) of the initial Gram matrix.
(S2) Establish an upper bound for the initial loss in terms of a polynomial of the initial weights.
(S3) Identify a radius of an open ball within which the Gram matrix preserves PD.
(S4) Compute the GF and demonstrate that the flow converges within the ball.

Q1. The most critical concern is ...

I strongly recommend ...
R: We agree with the concern that the experiments in Section 5 do not fully validate our theorems as they use Adam.
In response to the feedback, we conducted experiments using the GD. As achieving convergence with GD for higher-order PDEs is challenging, we compare for the heat equation in Eq. (18). Furthermore, in accordance with the request of Reviewer M6m4, experiments were conducted on the convection-diffusion equation. We train models via GD with a learning rate(lr) of 1e-1. PINNs on the convection-diffusion equation employ a reduced lr of 1e-2 as it diverges with a lr of 1e-1. Figures A3 and A4 of the attachment present fast and remarkably stable convergence of VS-PINNs with smaller variance.
Moreover, we have designed experiments with GD that can clearly support our theory. A detailed description and discussion can be found in the common response.
We believe these provide further validation of our findings. The results of the experiments in Section 5 may suggest that results to those with GD can be achieved with Adam although we have not proven.
A limit of Adam ...
R: We append experiments using GD.
extend to non-GF-based ...
R: We coud apply our theory to SGD-like optimization. In the context of stochastic approximation theory (SAT), errors on a gradient originate from the randomness is interpreted as noise when estimating the GF. Following the SAT, the noisy GF converges to a fixed point of the clean GF. Consequently, our theory could be extended.
However, we acknowledge that S1 is not applicable when collocation points are resampled. In S1, we use the fact that gradients at each collocation point are linearly independent if width is sufficiently large, which does not hold as there are infinitely many collocation points in distribution. Assuming S1, we expect that the S2-S4 are applicable, given that expectation forms and uniformly bounded losses are employed.
As mentioned above, ...
R: We acknowledge that GF and GD exhibit different dynamics. However, we believe our theory can be adapted to GD, given backward analysis, including [Miyagawa, 2022].
Theorem 5.1 and its corollaries in [Miyagawa] show that GF and GD dynamics differ in the limit $t\rightarrow\infty$ , assuming a scale-invariant layer is present. Our theory, however, is based on MLPs without normalization layers, and thus does not meet the condition in the theorem.
Instead, we could apply Theorem 3.3 from [Miyagawa], which states that the GD dynamics follows GF with a counter term. Since the term is also a polynomial, we anticipate that S2-S4 can be adjusted. In S1, we expect our induction on a maximum degree can be adapted because the leading term occurs only in the counter term. Therefore, we expect to prove the convergence of GD based on the backward analysis.

Q2. I would recommend that ...
R: We apologize and thank you for pointing this out. We will rectify it by including the 'utils' module and providing a requirements.txt file to specify the necessary dependencies. We regret any inconvenience this may have caused.
Q3. ReLU^p is called the RePU
R: We appreciate your recommendation to cite the references related to the RePU. We will include these citations.
Q4. Generalize to other activation?
R: S1 effectively exploits the non-smooth nature of the RePU. This approach can be expanded to include other non-smooth activation functions like Leaky RePU. Additionally, it seems feasible to bound the loss function using a polynomial of the parameters. Therefore, extending the S2-S4 to this type of activation function seems plausible.
However, our approach in S1 is not applicable for smooth activations and it is difficult to establish the PD of the Gram matrix. This leads some studies to assume the PD of the Gram matrix and analyze the convergence of the optimization flow. Assuming S1, our results would extend to activation functions whose derivatives are polynomially bounded.
Q5. Generalize beyond two-layer?
R: S2-S4 of our theory would remain feasible as the composition of polynomials is a polynomial. However, the non-smooth point can be entangled, making the applicability of S1 less straightforward.
Q6. Is there any other evidence that ...
R: There are theoretical papers that analyze the impact of activation in terms of optimization convergence. For instance, [Panigrahi, 2020] computes the eigenvalue of the Gram matrix for both non-smooth and smooth activations. The authors provided evidence indicating that when data span a low dimensional space, the smallest eigenvalue of the Gram matrix can be small leading to slow training. It is also demonstrated that the smallest eigenvalue of the Gram matrix associated with non-smooth activations is larger than that of smooth activations under minimal assumptions on data. Consequently, the findings imply that training will be more rapid for non-smooth activation.
Q7. Seem to be sufficient condition
R: Although they are sufficient conditions, the leading terms originate from the PD of the Gram matrices, which is a prerequisite for convergence. Given that the PD of the Gram matrices inherently depends on the PDE and p, we contend that the insights regarding the detrimental impact of large p could be derived from the inequalities.
References Panigrahi et al., Effect of Activation Functions on the Training of Overparametrized Neural Nets, ICLR, 2020.

评论- Reply

2024-08-11

Thank you for your time and effort. The authors’ response addressed my concerns, particularly regarding the validity of the experiments, and answered my questions clearly. I encourage the authors to incorporate all the discussion from their response into the manuscript. I have increased my score from 3 to 5.

审稿意见

评分: 6置信度: 32024-07-04

Summary:

The paper discusses the convergence of the gradient flow of a PINN loss function for arbitrary-order linear PDEs. The analyzed neural networks are shallow MLPs with a power-of-ReLU activation function. The results established are of the NTK convergence type in the overparameterized regime and take the following form: Assume the width mm of the network is larger than a critical number $C$ , i.e.,

m > C

then, with high probability with respect to the initialization, the PINN loss will converge to zero when the neural network parameters follow the gradient flow of the typical PINN loss function.

The authors analyze the dependence of the constant $C$ on the PDE order, the dimension of the computational domain, and the power of the activation function. It is proven that a high PDE order, high dimension, and a large power of the employed ReLU activation deteriorate the convergence, i.e., require a larger constant $C$ to guarantee the convergence of the gradient flow. This leads the authors to propose first-order reformulations which, as proven in the manuscript and illustrated with numerical examples, mitigate the aforementioned problems. The main novelty of the preprint lies in the quantification of the constant $C$ .

General Impression: The paper is carefully written. Although the proofs are long and technical and deferred to the Appendix, the main part is an enjoyable and understandable read. However, I have a major issue with how the findings are communicated; see below in the paragraph on the weaknesses of this reply.

优点

Well written paper, understandable to read for a broad audience albeit technical proofs.
Addresses a challenging problem (training of PINNs) and tries to provide theoretical guidance on how to improve the training process.

缺点

There is no sharpness of the lower bound $m > C$ . Thus, the conclusions the authors draw from the lower bound could potentially be wrong. This is briefly touched upon in the conclusion section of the paper, but I strongly recommend changing the presentation; see the discussion below.
Achieving zero training loss is one step in a comprehensive mathematical analysis of PINNs. Other aspects, such as the discussion of the approximation error, the quadrature error, and the coercivity properties of the PINN loss function, complete the picture. I believe the authors can improve their preprint by contextualizing it better and providing pointers to the literature. Below, I have attached a list of references and briefly discussed their relevance to the overall theory.

Expansion on 1.

Drawing conclusions from a lower bound of the form $m>C$ is dangerous as for non-sharp estimates (like the present one) one can not be sure whether dependencies of $C$ on the data (PDE order, dimensionality etc) are artefacts of the proof or are truly required to obtain a lower bound. An exception would be if the lower bound is sufficiently tight, such that it is of practical value. This is not the case here: Setting $d=1$ , $k=1$ , $p=2$ and neglecting the log term (setting it to unity) yields

m \geq 2^{14} \cdot 2^{11} \cdot 2^{12} = 2^{37} \approx. 10^{11}.

Therefore, I strongly advise against presenting the results in a way that suggests a connection between the theoretical results (how $C$ depends on $p, k$ and $d$ ) and the observed optimization struggle of PINNs in the literature. Portraying results in this way leads to folklore in the community which is not backed by rigorous mathematics. At the very least, I advise the authors to clarify this early on in the introduction and before drawing conclusions from Theorem 3.2. I acknowledge that the authors discuss the limitations in the Conclusions section, but this is not where it belongs.

Expansion on 2.

The presented results should be better embedded in the existing literature. As a matter of fact, this will strengthen the authors points, especially with respect to the PDE reformulation in first-order systems. I recommend the following articles:

[1] The article discusses error analysis/coercivity estimates for PINNs with sharp coercivity estimates. See for instance equation (3.10) in [1]. In this equation $L$ is the loss and $\eta$ quantifies the mismatch between discrete loss (analyzed in the authors preprint) and population loss, which behaves according to Monte-Carlo integration and can be analyzed by Rademacher complexity arguments, as for example illustrated in [2]. Furthermore, the validity of equation 3.10 in [1] relies on coercivity properties of the PDE at hand and is illustrated for a number of linear PDEs in [1]. With the aforementioned articles, the errors analysis of PINNs is more comprehensive and I strongly believe readers should be pointed in this direction when reading the authors preprint.
The benefit of first-order reformulations of PINN type losses is well-known in the literature, also for different reasons than the ones the authors point out. For example [3] reports better results for jumping material coefficients and shows that strong formulations will not work at all in this case. Another aspect is treated in [1]. It is shown that first order reformulations lead to improved convergence rates, compare the error estimates for equation to the ones of Poisson equation. To achieve $H^{1/2}$ convergence for Poisson, one needs approximation in $H^2$ , for $H^{1/2}$ convergence for Darcy (which can be viewed as a first order reformulation of Poisson) one needs approximation in $H^1$ for the pressure and $H(\operatorname{div})$ for the velocity, which leads to improved rates.

Overall, I think the paper merits publication if the weaknesses are addressed and I am willing to raise my score in this case.

References:

[1] https://arxiv.org/abs/2311.00529

[2] https://arxiv.org/pdf/2308.16429

[3] https://www.sciencedirect.com/science/article/abs/pii/S0021999120304812

问题

See weaknesses discussion.

局限性

See weaknesses discussion.

作者回复

2024-08-07

Q1. There is no sharpness of the lower bound $m>C$ . Thus, ...

R: We acknowledge the reviewer’s concern regarding the sharpness of the bound we presented in our paper. We concede that we have not demonstrated the sharpness of the bound in our theorems. However, it is important to note that the leading term of the bound we derived is based on conditions necessary for the Gram matrix to be positive definite, which is a crucial property of the Gram matrix for ensuring the convergence of the gradient flow (GF) to a global optimizer. Since the Gram matrix is defined by the PDE loss and the network structure, factors such as the PDE order of the power of the ReLU activation inherently affect the positive definiteness of the Gram matrix.

Therefore, even though the bound we provide may not be sharp, we believe it can still offer valuable insights into how reducing the order and power improves convergence. It underscores how variations in PDE order and ReLU power affect convergence, thus providing the first theoretical demonstration of the relationship between the PDE order and the convergence of PINN, which has been empirically observed frequently. We believe this is an important step in understanding the relationship between PDE order, ReLU power, and the convergence of PINNs.

However, there is a limitation in that we have not proven the sharpness of the bound, and as you point out, we agree that this should be clearly stated in the manuscript. We will revise the manuscript to acknowledge this limitation and adjust our argument accordingly.

Q2. Achieving zero training loss is one step in ...

-Q2-1. [1] The article discusses error analysis/coercivity estimates for PINNs..

R: We would like to express our gratitude for your comprehensive and insightful feedback on our manuscript. We are particularly appreciative of the list of references and the valuable insights you have provided, which help to contextualize our work within the broader framework of mathematical analysis for PINNs.

In order to successfully employ the PINN approach for solving PDEs, it is essential to ensure that the following four conditions are met:

(C1) The network is capable of approximating the solution to the PDE.
(C2) The minimizer of the PINN population loss is the solution to the PDE.
(C3) The minimizer of the empirical loss approximates the minimizer of the population loss.
(C4) The minimizer of the empirical loss is obtainable. The universal approximation theory of the network addresses C1, while the existence of exact solution to the PDE on compact domain underpin C2. Consequently, theoretical analyses of PINNs are primarily centered on the generalization error analysis of C3 and the optimization error analysis of C4.

The papers you suggested focus on C3, which is concerned with examining the conditions for the convergence of the generalization error. In contrast, the present paper targets C4, namely the demonstration of the conditions for convergence of empirical loss. Besides, our result is based on the impact of the order of the governing PDE, whereas the suggested paper employed the coercivity of the PDE.

Despite this, there is a paucity of theoretical research in this area. Therefore, it is imperative to analyze C3 and C4 based on the characteristics of the given PDE.

As you noted, [1] examines generalization error by capitalizing on the coercivity of the PDE operator. On the other hand, Our work delves into the optimization error analysis in conjunction with the order of the PDE. While theoretical studies on PINN optimization exist, few have investigated the impact of the inherent nature of the PDE being solved. A substantial body of empirical evidence indicates that training PINNs becomes increasingly challenging as the PDE order increases, underscoring the significance of our research as a foundational step in this area.

Nevertheless, we concur with your point that a considerable number of steps remain in order to achieve a comprehensive understanding of the fundamental principles underlying PINNs. We believe that extending ideas in the vein of [1] could facilitate an investigation into the influence of the differential order of the PDE operator on the generalization error, thereby advancing our understanding.

We would like to reiterate our gratitude for your invaluable feedback and for sharing these references. We are eager to incorporate your suggestions to enhance our manuscript.

-Q2-2. The benefit of first-order reformulations of PINN type losses is ...

R: As the reviewer mentioned, both [1] and [3] examine the advantages of splitting PDEs into first-order formulations. In contrast to Reference [1], which examines the benefits from the perspective of S3, our paper focuses on how variable splitting enhances optimization in S4 by reducing the differential order included in the loss function. Extending our theoretical framework to analyze the impact of the variable splitting strategy on the generalization error of PINNs, as suggested by [1], represents an intriguing and significant research direction. We believe that incorporating discussions on these references would enrich our paper. Therefore, we plan to include them and the additional discussions in the revised version of our manuscript. We are confident that these enhancements will contribute to the depth and breadth of our work.

Thank you again for your invaluable feedback and suggestions. We appreciate your guidance in improving our manuscript.

2024-08-10

Thank you. My concerns are addressed and I will raise my score.

审稿意见

评分: 6置信度: 42024-07-08

This paper presents a theoretical analysis of the relation between PDE order and the convergence of PINNs. A tighter bound is obtained than the previous work. Inspired by the importance of reducing PDE order, the authors propose the VS-PINN, which employs the variate splitting strategy. Both theoretical analysis and empirical results are included to prove the effectiveness of the proposed method.

优点

Overall, I think this paper is theoretically solid and interesting.
The idea of using variable splitting is reasonable.

缺点

Practical efficiencies are expected for comparison, such as GPU memory, running time and model parameters. Although VS-PINN does not require the calculation of high-order derivatives, the newly added regularization term in Eq. 13 will also bring extra computation costs.
All the conclusions in Section 3 can be directly obtained from the previous paper [22]. I do not think the tighter bound brings better theoretical insights in terms of understanding “How does PDE order affect the convergence of PINNs?”
How about model performance in typical PDEs, such as Convection or Reaction?

问题

I suggest the authors provide a clear description of the proof for Theorem 3.2 and show how they obtain a tighter bound.
I think the equation in line 247 has a typo, which is $l\in[0, L+1]$ .
How to choose $L$ in practice?

局限性

I appreciate that they have discussed the limitations. There are some more powerful and advanced backbones, such as PINNsFormer [1]. More experiments on them will further demonstrate the effectiveness of the proposed method although they are hard to analyze in a theoretical aspect.

[1] PINNsFormer: A Transformer-Based Framework For Physics-Informed Neural Networks, ICLR 2024

作者回复

2024-08-07

Q1. Practical efficiencies are expected ...

R: Table A1 in the attachment shows GPU memory, running time (the mean of the 50 epochs), and the number of model parameters that correspond to experiments presented in our paper. Becasue VS-PINNs need as many networks as auxiliary variables, finer VS-PINN requires more parameters to be trained. However, its reduction of the differentiation order in the loss significantly reduces memory consumption, despite the additional loss terms. Regarding running time, splitting does not increase the overall training cost.

Q2. All the conclusions in Section 3 can be ...

R: We want to clarify that the conclusions of Section 3 of our paper are not directly derived from the previous work [22], as the bound in [22] is not dominated by the PDE and the power of ReLU $^p$ . On the other hand, our paper attains tighter bounds than [22]. Moreover, the dominating term of the proposed bounds is determined by PDE order and activation power, allowing us to observe the impact on GF convergence. To provide a more comprehensive understanding, we will delineate the leading terms of the bound of [22] and ours in the following paragraphs. We hope that the reviewer will carefully consider why our tighter bound is insightful.

(Leading term in [22]) In [22], Theorem 3.8 states that if the width $m$ is $\tilde{\Omega}\left(\delta^{-3}\right)$ then GF converges to a global optimizer of PINNs for second-order linear PDEs. The leading term of the bound is $\delta^{-3}$ , which stems from the product of triple $\delta^{-1}$ . One $\delta^{-1}$ came from the Markov inequality to bound initial loss at Section B.4. The remaining two $\delta^{-1}$ are required for the bound of $R_w$ at Eq. (74) and $R_a$ at Eq. (76), which are also derived from the Markov inequality immediately following Eq. (73). Therefore, the leading term $\delta^{-3}$ originated from the Markov inequality and is independent of the PDE order or the activation power.
(Leading term in ours) In contrast to using the Markov inequality, we obtain the bounds in a deterministic way; refer to Proposition C.6 and Lemma C.5 for the initial loss and $R_w$ , respectively. This yields a more precise bound for our Theorem 3.2, whose dominant term is contingent upon the power of the activation function and, thereby on the PDE order. Consequently, we obtain a considerably tighter bound that depends on the PDE order, whereas the previous bound from [22] was not affected by the order. From the relationship between the PDE order and the bound for convergence, we could derive a theoretical insight that reducing the PDE order enhances the convergence behavior of PINNs.

Q3. How about performance in typical PDEs?

R: In response to the reviewer's request, we conducted experiments on a convection-type equation. Since observing the effect of VS on first-order PDEs is unable, we convection-diffusion equation $u_t+u_x-u_{xx}/4=0$ , whose exact solution is $u\left(t,x\right) = e^{-\frac{1}{4}t}\sin\left(x-t\right)$ . We trained PINNs with $p=3$ and VS-PINNs with $p=2$ , using same settings with the heat Eq. (18) in the manuscript. Figures A2 and A3 in the attachment show that VS-PINNs reach lower train loss and achieve more stable convergence for both GD and Adam.

Q4. The equation in line 247 has a typo..

R: The symbol actually denotes an integer set defined on line 531 of the Appendix, which did not clearly specify in the main manuscript. We apologize for any confusion this may have caused. We will make sure that the necessary definition is included in the revised version of the paper.

Q5. How to choose $L$ in practice?

R: We agree that discussing the optimal choice of $L$ is crucial for effectively utilizing VS-PINNs. However, identifying the optimal $L$ is challenging because training deep learning models involves a multitude of complex considerations.

Our paper is focused specifically on one aspect of deep learning training: convergence of GF. According to our theory, splitting a given PDE into a system of first-order PDEs, such that $| \xi | =1$ , helps GF most likely to converge. This finest splitting results in a loss that includes only first-order derivatives, allowing the use of $p=2$ , which in turn improves the convergence of GF the most.

Experiments also show that breaking down a PDE into a system of lower-order PDEs tends to be most feasible for convergence. Moreover, Table A1 which was referenced in the previous response indicates that VS is an effective approach for optimizing GPU memory utilization. However, while finer splitting enhances convergence, it also increases the number of functions that need to be parameterized by networks. Therefore, in practice, the choice of L should be balanced based on the governing PDE and memory/computational constraints.

Furthermore, it is critical to consider other factors, so we acknowledge that this is an area requiring further exploration for a robust analysis of the optimal $L$ .

Q6. There are some more powerful and advanced backbones, ...?

R: We conducted experiments to assess the efficacy of our approach within the suggested PINNsFormer architectural framework. We implemented our VS strategy, conducting experiments based on the wave equation in Section 4 of the original paper of PINNsFormer [1]. All configurations specified in [1] were adhered to, including the use of the L-BFGS optimizer. To further examine the behavior of models, we also experimented with alternative optimizers, namely Adam and SGD, with a learning rate of 1e-4.

The results are depicted in Figure A5 of the attachment. As evidenced by the results, the VS does not yield a notable impact on the PINNsFormer framework for all three optimizers. It seems that the Transformer architecture is substantially different from the MLP structure examined in our paper, resulting in disparate outcomes and indicating that comparable results may not be attainable in this context.

2024-08-12

I would like to thank the authors for their responses and experiments. Since I have already held positive opinions on this paper, I will keep my score.

审稿意见

评分: 5置信度: 32024-07-09

This paper provides a theoretical understanding about the behavior of PINN when dealing with high-order or high-dimesional PDEs. Variable splitting is then proposed to decomposes the high-order PDE into a system of lower-order PDEs and facilitate the convergence of PINN.

优点

This paper extends Gao et al.'s work to higher order PDEs and ReLU activation functions and demonstrates that the gradient flow of PINNs converges to the global minimum with a high probability when the width of the network is sufficiently large.
Using the derived bound, this paper comprehensively analyzes the impact of PDE order and dimensionality on the behaviour of PINNs.
The numerical results further demostrates the theoretical results and validates the positive effect of utilizing variable splitting.

缺点

The numerical results pertaining to the parameter p indicate a contradiction. Specifically, the convergence of the PINN is observed to be faster with p = 4 compared to p = 3 , which stands in contrast to the assertion that “the convergence of the loss is enhanced as p decreases.” Additionally, there are typographical errors in Figure 1(a), where the legends should display p rather than k.
The derived bound demonstrates the substantial impact of both the PDE order and the activation power on the network’s width, indicating that as p increases, the width m must also increase exponentially to ensure convergence. However, the numerical experiments conducted in the paper do not vary the network width m to empirically verify this theoretical relationship. Instead, they maintain a fixed width across different experiments, focusing primarily on the effects of varying the power p of the ReLU activation function and the impact of variable splitting. While these experiments do validate the theoretical findings related to p and variable splitting, they fall short of demonstrating how changes in network width influence convergence behavior.

问题

See weaknesses.

局限性

See weaknesses.

作者回复

2024-08-07

Q1. The numerical results pertaining to the parameter ...

R: The theoretical results presented in this paper indicate that networks with a lower power $p$ have a higher probability of convergence. A reduction in $p$ would facilitate optimization, which may be observed as an acceleration in the convergence process. It should be noted, however, that our findings pertain to an improvement in the probability of convergence rather than to the speed of convergence.

We acknowledge that the experiments presented in our paper could be misinterpreted as demonstrating faster convergence. To address this, we conducted additional experiments to provide a more robust validation of the theoretical results. Details of these experiments can be found in the common response above. The experimental results indicate that as p increases, a wider network is required for convergence, which is consistent with our theoretical findings. These results will be included in the revised manuscript, and Section 5 of the submitted manuscript will be revised to clarify the existing experiment and avoid misinterpretation of the results as showing the effect of acceleration.

Q2. The derived bound demonstrates the substantial impact of ...

R: We are grateful for the reviewer's thoughtful review and for emphasizing the necessity for empirical validation regarding the relationship between network width $m$ required for convergence and power $p$ of activation. We devised an experiment to validate the impact of power $p$ and PDE order $k$ on the width size $m$ required for convergence, which can be found in the common response above. Please see the description and our discussion in the common response above.

2024-08-10

Thank you for addressing my concerns by conducting extensive numerical studies. I will raise my score.

作者回复

2024-08-07

Response to All Reviewers

We sincerely appreciate all the reviewers for their invaluable comments, recommendations, and suggestions. The opinions of the reviewers have been carefully considered and responding to their questions has enhanced the paper. We first introduce additional experiments that are commonly required to address the comments from the reviewers inw1 and BvZt. Subsequently, we address the individual responses to each reviewer below. We also attach a supplementary file for Figures and a Table. Hopefully, the replies could adequately address all the concerns.

Common Response: In response to the feedback from reviewer inw1, we designed and conducted additional experiments that better demonstrate the theoretical results of our paper. The objective of this experiment was to validate the effect of power and PDE order on the width size required for convergence, which represents the primary result of our theoretical findings. The experimental results obtained were found to be consistent with the theoretical results, which we believe addresses the concerns of the reviewers.

(Experimental setup): To demonstrate the influence of $p$ , $k$ on $m$ , we trained networks with various $m$ , ranging from $10^2$ to $10^6$ , for each $p$ and $k$ and investigated their convergence under GD optimization. Networks were trained by GD with a learning rate of 1e-8. The utilization of this small learning rate facilitates the observation of the preliminary convergence behavior without causing training instability or divergence. Given the considerable number of steps required for full training, we did not undertake full training. Training collocation points were set to uniform grids: 400 in the domain and 100 on each boundary of the domain.
To examine the influence of the PDE order $k$ on the width required for convergence, we additionally considered a Poisson equation ( $k=2$ ) in conjunction with the bi-harmonic equation ( $k=4$ ) presented in Eq. (508) of the submitted manuscript. The Poisson equation is subject to a homogeneous Dirichlet boundary condition in accordance with Eq. (508) and the source function is set to yield the same solution as that of Eq. (508). Other training configurations are the same as the bi-harmonic equation represented in Appendix D.

The results are presented in Figure A1 of the attachment, which illustrate the loss behavior for networks with various widths $m$ for each $p$ . The results clearly support our theoretical finding that the width should increase to ensure convergence as $p$ increases. Moreover, a narrower network would converge when solving a lower-order equation (Poisson), suggesting that a higher-order PDE (bi-harmonic) requires a wider network. In summary, the results show that the width needed for convergence grows with p and the PDE order. This validates our paper's theoretical results and shows the importance of our study in analyzing how the PDE order and the activation power affect convergence of PINNs.

Supplementary file ↓

最终决定Accept (poster)

2024-09-25

The paper studies the behavior of neural network based solvers for high order PDEs. The authors proposed variable splitting approach to reduce the order of the PDE to facilitate the solutions. Overall the reviewers find the work interesting and solid, and all their concerns are addressed over the discussion period. Hence the meta-reviewer recommends acceptance of the paper.

For the final version, it would be worthwhile to add some discussion on the treatment of boundary conditions, as it is not obvious how to impose correct boundary conditions for the new variable introduced in the splitting.