PaperHub
NeurIPS 2025 · Poster · 4 reviewers
Overall rating: 6.8/10 (individual scores: 5, 4, 4, 4; min 4, max 5, std 0.4)
Confidence: 3.8
Novelty: 3.0 · Quality: 3.5 · Clarity: 3.5 · Significance: 3.3

Gradient Alignment in Physics-informed Neural Networks: A Second-Order Optimization Perspective

Submitted: 2025-05-09 · Updated: 2025-10-29
TL;DR

This work identifies gradient conflicts as a key challenge in training PINNs and shows that quasi second-order methods—especially SOAP—effectively resolve them, leading to 2-10x accuracy gains on 10 PDE benchmarks.

Abstract

Keywords
Scientific Machine Learning · Physics-informed Neural Networks · Partial Differential Equations · Second-order Optimization · Gradient Conflicts

Reviews and Discussion

Review
Rating: 5

This paper studies the gradient alignment of multiple objectives in Physics-informed neural networks (PINNs). Training of PINNs faces optimization challenges due to the composite nature of the loss function, which consists of competing objectives (e.g. the initial and boundary conditions and a PDE residual). By introducing a new alignment score, defined as

$$\mathcal{A}(v_1, v_2, \dots, v_n) := 2 \left\| \frac{1}{n} \sum_{i=1}^n \frac{v_i}{\|v_i\|} \right\|^2 - 1,$$

which generalizes the cosine similarity measure to multiple vectors, the authors identify the misalignment of gradient directions of competing objectives as a critical obstacle. Building on this result, the authors show that (quasi) second-order optimization methods can mitigate these gradient conflicts resulting in superior performance over classically used methods, including Adam, and that a recently proposed optimizer SOAP performs best among the tested methods. Finally, this paper shows empirically that PINNs trained with SOAP are able to learn challenging PDEs, including turbulent flows, which was not possible before with PINNs.
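For concreteness, a minimal NumPy sketch of this score as I read the definition (my own illustration, not the authors' code):

```python
import numpy as np

def alignment_score(grads):
    """A(v_1, ..., v_n) = 2 * || (1/n) * sum_i v_i / ||v_i|| ||^2 - 1.
    For n = 2 this reduces to the cosine similarity of the two vectors."""
    units = [g / np.linalg.norm(g) for g in grads]  # normalize each vector
    mean_dir = np.mean(units, axis=0)               # average unit direction
    return 2.0 * float(mean_dir @ mean_dir) - 1.0

g = np.array([1.0, 0.0])
print(alignment_score([g, g]))   # 1.0  (identical directions)
print(alignment_score([g, -g]))  # -1.0 (opposing directions)
```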

Strengths and Weaknesses

Strengths

  • This paper is overall clearly written and easy to follow.
  • This paper introduces a novel alignment score, which shows that the update directions of multiple objectives (they are not strictly gradient directions anymore due to the diagonal preconditioning) in Adam are misaligned, while this is less the case in quasi second-order methods, such as Kron, Muon and SOAP. This alignment score also correlates strongly with the relative L2-error achieved by each optimizer across multiple benchmark problems.
  • I am not an expert in PINNs, but the empirical results that the authors achieve in this work with the SOAP optimizer appear to be quite impressive, which advances the state-of-the-art of what PINNs can achieve.
  • The benchmarks that the authors provide can serve as a valuable reference for future work on PINNs.

Weaknesses

  • The novelty of this paper seems a bit limited: SOAP has already been proposed, and gradient conflicts appear to be a known problem in training PINNs, as there is previous work proposing specifically designed optimizers (e.g. ConFiG) to address it (even though their overall performance is below SOAP).
  • The provided explanation of SOAP's superior performance is not very convincing to me. Proposition 3 only addresses the inter-step alignment (which is equivalent to cosine similarity), not the intra-step alignment, which seems to be the more critical point regarding gradient conflicts. Furthermore, Proposition 3 only provides a statement regarding the maximal (local) step size that can be taken to maintain a certain gradient alignment between consecutive steps, but it neither explains why a higher alignment score between consecutive steps is necessarily better nor why a larger step size is more desirable.

Questions

  • I am a little bit puzzled regarding the inter-step alignment score of SOAP. The plots show that the alignment score is almost or exactly 1.0, which would imply that the trajectory of the iterates is almost straight. This seems a bit unlikely to me, given that all the methods use mini-batches. Could the authors perhaps clarify on this?

  • Can the authors also elaborate on why a very high inter-step alignment score close to 1 is desirable? (which SOAP seems to achieve, see my question above.) Clearly, having a negative or close-to-zero inter-step alignment score is undesirable, but given the highly non-convex landscape, I would not expect the trajectory to be very straight. For instance, [1] also studies the directionality of trajectories in neural networks (although not in PINNs), but argues that a cosine similarity score of consecutive gradients close to 1 would be undesirable.

  • I had a closer look at Figure 5 and noticed that the relative L2-error of SOAP dominates all other methods except for the Korteweg-De Vries problem, which aligns with the higher intra-step alignment score. Here Muon has the lowest relative L2-error for three quarters of the training trajectory, while having comparably low alignment scores, even lower than Adam and Kron (after the initial 20% of steps). Do the authors have any explanation for this?

  • Given that all these methods have different computational costs, can the authors also provide figures for relative L2-error vs. wall-clock time?

[1] Singh, Sidak Pal, et al. "The directionality of optimization trajectories in neural networks." The Thirteenth International Conference on Learning Representations. 2025.

Limitations

yes.

Final Justification

After a thorough discussion with the authors, the contributions of this submission were further clarified to me.

Concretely, the benefit of the gradient alignment scores is clearer: they are a cheaper metric to track than the eigenspectrum of the Hessian (which can be used to study the ill-conditioning of the loss landscape). Moreover, when an explicit preconditioning matrix is not available (for instance because the optimizer performs some form of implicit preconditioning, as in SOAP) or only coarse approximations are available, the gradient alignment scores appear to be a more reliable metric for studying optimization behavior.

Furthermore, during the discussion phase it became clearer that SOAP (while not the novelty of this work) is not "just another quasi-Newton method": it not only performs better but is also more robust than more "classic" quasi-Newton and second-order methods such as NGD, which require high-precision formats such as float64 to converge and perform well, whereas SOAP also works robustly with float32. As the authors elaborate, this can be understood intuitively: previous methods try to approximate the inverse of the Hessian, which is numerically sensitive, while SOAP performs implicit preconditioning. SOAP was developed in the context of LLMs, where this problem of numerical sensitivity is not present. Thus, identifying this and transferring it to the domain of PINNs is valuable in itself.

Finally, a further discussion provided by the authors putting this contribution in the broader context of other quasi-Newton methods will provide a good reference point for the community to train PINNs more reliably and robustly in the future.

Therefore, after this discussion phase, I have decided to adjust my score from an initial 3 ("Borderline reject") to a 5 ("Accept").

Formatting Concerns

Author Response

We sincerely thank the reviewer for their careful reading and thoughtful feedback. We appreciate the opportunity to clarify our work and address the concerns raised. Below, we provide detailed responses and additional insights that we hope will strengthen the overall understanding of our contributions.

W1. The novelty of this paper seems a bit limited...

While SOAP is used as the primary optimizer, our contributions extend significantly beyond the application of an existing algorithm.

  • Our core novelty lies in identifying and formalizing directional gradient conflicts as a fundamental bottleneck in PINN optimization. To this end, we introduce a gradient alignment score to quantify such conflicts, which generalizes cosine similarity to multiple loss components.

  • We further provide rigorous theoretical analysis demonstrating that such conflicts are intrinsic to PINNs at initialization (Proposition 2). Additionally, we establish a novel connection between SOAP and Newton’s method (Theorem 1), which offers a principled explanation for why second-order preconditioning can effectively resolve these conflicts.

  • Our empirical results include the first successful PINN applications to high-Reynolds-number Kolmogorov flows at $\text{Re} = 10^4$.

SOAP is one instantiation of a broader principle: that quasi second-order preconditioning mitigates directional conflicts in scientific ML tasks. Thus, the main methodological contribution lies in conceptually understanding, diagnosing, and overcoming optimization barriers in PINNs and related scientific ML models, not in the application of a new optimizer per se.

W2. The provided explanation of SOAP's superior performance is not very convincing...

Our rationale for SOAP’s superior performance is as follows:

  • Figures 2 and 5 show a strong positive correlation between inter-step gradient alignment and the predictive accuracy of PINNs, suggesting that higher alignment scores empirically lead to better performance.

  • In Theorem 1, we show that SOAP approximates Newton’s method under certain assumptions.

  • Proposition 3 further implies that more accurate curvature approximations—such as using the Hessian—lead to higher gradient alignment, enabling more stable updates and faster convergence.

In summary, these theoretical results and empirical observations together explain why SOAP’s approximate second-order nature yields improved gradient alignment and, consequently, superior PINN performance.

Q1. The plots show that the alignment score is almost or exactly 1.0, which would imply that the trajectory of the iterates is almost straight.

We compute the inter-step alignment score between consecutive gradient steps and report these scores at regular intervals (e.g., every 100 iterations). To reduce the effect of stochastic noise, alignment scores are computed using the loss evaluated on the full set of test collocation points (i.e., without mini-batching), while training continues to use mini-batches. Note that our training batch size is large (e.g. 8192), so switching to mini-batches for computing alignment scores would not significantly affect the results.

It is worth noting that while SOAP shows some small oscillations in alignment scores, these fluctuations are much less pronounced than those of other methods. To improve clarity and readability, we apply a running average to smooth all curves, which explains why the reviewer observes SOAP's alignment scores as nearly constant.

We emphasize that a high inter-step alignment score close to 1 does not imply that the entire optimization path is nearly straight. Since the alignment is measured only between consecutive steps, small deviations can accumulate over many iterations, resulting in a significantly curved overall trajectory. Therefore, a high inter-step alignment score does not guarantee a perfectly straight optimization path throughout training.
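For concreteness, a minimal sketch of this measurement protocol (with hypothetical helper names; the logging interval and smoothing window here are illustrative assumptions, not the exact values used in the paper):

```python
import numpy as np

def inter_step_alignment_trace(full_batch_grad, iterates, log_every=100):
    """Cosine similarity between consecutive gradients, logged at fixed intervals.
    `full_batch_grad(theta)` (gradient on all test collocation points) and
    `iterates` (the saved parameter snapshots) are hypothetical placeholders."""
    scores, prev = [], None
    for t, theta in enumerate(iterates):
        if t % log_every != 0:
            continue
        g = full_batch_grad(theta)  # full-batch gradient, no mini-batch noise
        if prev is not None:
            scores.append(float(g @ prev) / (np.linalg.norm(g) * np.linalg.norm(prev)))
        prev = g
    return np.array(scores)

def running_average(x, window=10):
    """Moving average used to smooth the plotted alignment curves."""
    return np.convolve(x, np.ones(window) / window, mode="valid")
```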

Q2. Why is a very high inter-step alignment score close to 1 possible and desirable?

On possibility: It is important to note that achieving an alignment score close to 1 can be trivial if the learning rate is made sufficiently small. In this case, parameter updates become negligible, and consecutive gradients naturally align since they hardly change. What truly matters is maintaining a nearly linear trajectory while using a reasonable step size.

On desirability: Intuitively, high inter-step alignment indicates that the optimization trajectory is more linear, which is often a sign of stable and efficient convergence. This is empirically supported in our results (Figures 2 and 5), where higher alignment correlates with faster error convergence.

To build more intuition, consider the simplest case of a quartic objective function. If the Hessian is used as the preconditioner, every update points along the same ray toward the minimizer, so the optimization trajectory becomes a straight line, resulting in an inter-step alignment score exactly equal to 1. (For a quadratic objective, this idealized optimizer even reaches the local minimum in a single step.)

Although training deep networks involves complex, non-convex objectives, good preconditioners still help locally whiten the curvature, making the loss landscape more isotropic. This allows updates to approximate those in well-conditioned quadratic problems. As long as the curvature (i.e., Hessian) does not change drastically between steps, the optimization path remains stable, and high inter-step alignment naturally emerges—even in non-convex settings. Thus, high alignment is not only possible, but also reflects desirable optimization behavior when achieved under a meaningful learning rate.
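To make this intuition concrete, here is a minimal toy sketch (our own illustration, not part of the paper's experiments) contrasting Newton-preconditioned updates with plain gradient descent on an ill-conditioned quadratic:

```python
import numpy as np

# Toy objective L(x) = 0.5 * x^T A x with an ill-conditioned Hessian A.
A = np.diag([1.0, 100.0])

def cos(u, v):
    return float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def inter_step_alignments(precondition, eta, steps=5):
    x, prev, sims = np.array([1.0, 1.0]), None, []
    for _ in range(steps):
        g = A @ x                                            # gradient of the quadratic
        step = np.linalg.solve(A, g) if precondition else g  # Newton direction vs. plain GD
        if prev is not None:
            sims.append(cos(step, prev))
        x, prev = x - eta * step, step
    return sims

print(inter_step_alignments(precondition=True, eta=0.5))     # all 1.0: straight trajectory
print(inter_step_alignments(precondition=False, eta=0.019))  # near -1: zig-zag from ill-conditioning
```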

Q3. Alignment Scores in Korteweg-De Vries

While our theoretical insights and metrics such as alignment scores explain general trends across problems, they are not sufficient to capture all problem-specific dynamics.

There are several potential reasons for this discrepancy:

  • Alignment is a proxy, not a guarantee of performance. It captures how well gradients are aligned across time (inter-step) or loss terms (intra-step), but does not account for other factors like curvature mismatch, optimizer noise characteristics, or problem-specific regularities.

  • Deep learning optimization inherently involves non-convex landscapes and complex interactions; thus, our theoretical results—while broadly predictive—may not hold uniformly across all settings. This instance could be one where assumptions behind our alignment-based interpretation do not fully apply.

  • Muon’s update rule may incidentally match the structure of the KdV problem better, even without high alignment scores. This may point to implicit regularization or favorable dynamics specific to this PDE.

Overall, we believe this deviation is indeed intriguing and motivates future work toward a more general and dedicated theory that explains not just average trends but also exceptions.

Q4. Provide figures for relative L2-error vs. wall-clock time.

Thank you for the suggestion — this is an excellent point. We will include such figures in the revised version. In our experiments, for the same amount of training time, SOAP consistently achieves the lowest relative L2-error compared to other methods.

Comment

Dear authors,

thank you for providing detailed answers to my concerns.

  • W1: I do not fully agree that you have sufficient evidence that quasi second-order preconditioning mitigates directional conflicts in scientific ML tasks.

    • As you have also mentioned in your work, the problem of directional gradient conflicts has been identified already previously, for instance in Liu et al. [2024]. So the novelty lies solely in the formalization thereof (which is important and valuable.)
    • Proposition 2 indeed shows that such conflicts are intrinsic to PINNs at initialization (But as mentioned in the point above, this issue has also been identified earlier), but does not distinguish between different optimizers apart from the constants. I briefly checked the appendix, but was not able to see that the constants for Shampoo or SOAP are smaller than for GD or Adam.
    • Theorem 1 establishes the connection between SOAP and Newton's method, but does not necessarily explain why second-order methods resolve gradient conflicts. For instance, Rathore et al. [2024] study the ill-conditioning of the loss landscape in PINNs and the effectiveness of quasi-Newton methods might simply lie in the fact that they improve conditioning. The success of SOAP might then simply be traced back to the fact that it improves conditioning much more than other methods. (In this case I can see more why the connection of SOAP and Newton's method would matter.)
  • W2:

    • I think that it is tricky to use Proposition 3 as an argument for better performance of SOAP on PINNs. What Proposition 3 states, to my understanding, is a threshold on the maximal learning rate for any preconditioning method to guarantee a certain inter-gradient alignment score. However, this is not how one chooses the learning rate in practice. Furthermore, based on Proposition 3 and the argument that higher alignment scores are more desirable, one could choose infinitesimally small learning rates to have $\mathcal{A}(g_t, g_{t+1}) \to 1$, but of course this will not lead to faster convergence since the weights would change very slowly. Instead, what I see more in Proposition 3 is that, given the same desired alignment score, having a better preconditioner allows for a larger learning rate, but it does not tell us anything about what and why larger alignment scores are actually desired in practice.
    • I agree that Figure 2 and Figure 5 show qualitatively good correlation between the inter-step alignment score and relative L2-error, but to make such a claim one needs to do a statistical analysis, for instance by computing the $R^2$ coefficient of determination to quantify how well (for instance) the average inter-step and intra-step alignment scores can predict the final L2 error for each problem.
  • Q2: Thank you for your explanation.

    • I agree that it is possible in theory to achieve high inter-step alignment score with sufficient small learning rate. However my question was regarding how this is possible with meaningful learning rates.
    • I agree with your argument that for quadratic and quartic objective functions, the optimization trajectory becomes a straight line when preconditioned with the Hessian, and that one could transfer this to more general, non-convex settings if one assumes that the curvature does not change too much between consecutive steps. This would imply that one can reduce the update frequency of the preconditioner in SOAP, which does not seem to be the case. (Updating every 10 steps for a training run of 10,000 steps seems quite frequent, and for KdV specifically one can already see a degradation when reducing the frequency from every 2 to every 10 steps (Fig. 4), yet the alignment scores are very close to 1 (Fig. 5).) Perhaps providing the numerical value range of alignment scores for each problem and each optimizer could clarify this?
  • Q3: Thank you for this explanation, this makes sense to me.

Again I would like to thank the authors for their time and effort in providing a detailed response to my concerns and questions.

To summarize, my two main concerns remain as follows:

  • Why do preconditioning methods (theoretically) improve intra-step alignment?
  • Why is it desirable to have an alignment score as high as possible? (At least Fig. 5 suggests this.)

As for now, I see the main contributions in formalizing the gradient alignment in PINNs and providing impressive empirical results. However, since this was achieved using the SOAP [Vyas et al. 2024] optimizer, and the explanation for the superior performance of SOAP (also compared to other preconditioning methods, such as Kron or Muon) is not sufficient, I will keep my score for now.


Liu, Qiang, Mengyu Chu, and Nils Thuerey. "Config: Towards conflict-free training of physics informed neural networks." arXiv preprint arXiv:2408.11104 (2024).

Rathore, Pratik, et al. "Challenges in training pinns: A loss landscape perspective." arXiv preprint arXiv:2402.01868 (2024).

Comment

We sincerely thank the reviewer for the thoughtful follow-up. Regarding your primary concerns:

  • We can prove that higher intra- and inter-step alignment scores lead to faster loss decay under the assumption of $L$-Lipschitz continuous gradients. Please see our detailed proposition and proof sketch below.

  • While we observe strong empirical evidence that quasi-second-order optimizers improve intra-step gradient alignment during training, providing a rigorous theoretical justification remains an open challenge. We are actively investigating this direction.

Proposition

Let $\mathcal{L} = \sum_{k=1}^n \mathcal{L}_k : \mathbb{R}^d \rightarrow \mathbb{R}$ be a smooth function with $L$-Lipschitz continuous gradient, optimized via gradient descent:

$$\theta_{t+1} = \theta_t - \eta g_t, \quad \text{where } g_t = \nabla \mathcal{L}(\theta_t) = \sum_{k=1}^n g_t^k, \quad g_t^k := \nabla \mathcal{L}_k(\theta_t)$$

with step size $\eta \leq \frac{1}{L}$. Define:

  • Loss drop: $\Delta_t := \mathcal{L}(\theta_t) - \mathcal{L}(\theta_{t+1})$
  • Inter-step alignment: $A_t^{\text{inter}} := \frac{\langle g_t, g_{t-1} \rangle}{\|g_t\|\,\|g_{t-1}\|}$
  • Intra-step alignment: $A_t^{\text{intra}} := 2\left\| \frac{1}{n} \sum_{k=1}^n \frac{g_t^k}{\|g_t^k\|} \right\|^2 - 1$

Then, for any $t \geq 1$:

  1. If $\|g_t^k\| = \lambda$ for all $k$, then:

     $$\Delta_t \geq \left(\eta - \frac{L\eta^2}{2}\right)\|g_t\|^2 = \left(\eta - \frac{L\eta^2}{2}\right) \cdot \frac{n}{2}\left(A_t^{\text{intra}} + 1\right) \sum_{k=1}^n \|g_t^k\|^2$$

  2. For general $g_t$ and $g_{t-1}$, the sum of two consecutive loss drops satisfies:

     $$\Delta_{t-1} + \Delta_t \geq \eta (1 - L\eta)\, \|g_{t-1}\|\, \|g_t\|\, A_t^{\text{inter}} + \eta \left(1 - \frac{L\eta}{2}\right) \|g_{t-1}\|^2 - \frac{L\eta^2}{2} \|g_t\|^2$$

Consequently, we can see that higher intra- and inter-step alignment scores lead to a larger loss drop.

Proof Sketch

(1) Loss drop bound via intra-step alignment.

Using the standard descent lemma for $L$-smooth functions, we get:

$$\mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_t) - \eta \|g_t\|^2 + \frac{L\eta^2}{2} \|g_t\|^2 \quad \Rightarrow \quad \Delta_t \geq \left(\eta - \frac{L\eta^2}{2}\right) \|g_t\|^2$$

Assume all component gradients $g_t^k$ have the same norm $\lambda$. Then:

$$g_t = \sum_{k=1}^n g_t^k = \lambda \sum_{k=1}^n u_k, \quad u_k := \frac{g_t^k}{\lambda}$$

Compute:

$$\|g_t\|^2 = \lambda^2 \left\| \sum_{k=1}^n u_k \right\|^2 = \lambda^2 \left(n + \sum_{i \neq j} \langle u_i, u_j \rangle\right)$$

Now, use the intra-step alignment:

$$A_t^{\text{intra}} + 1 = 2 \left\| \frac{1}{n} \sum_{k=1}^n u_k \right\|^2 = \frac{2}{n^2} \left(n + \sum_{i \neq j} \langle u_i, u_j \rangle\right) \quad \Rightarrow \quad \|g_t\|^2 = \frac{n}{2}\left(A_t^{\text{intra}} + 1\right) \sum_k \|g_t^k\|^2$$

Substitute into the bound on $\Delta_t$ to conclude.

(2) Two-step loss drop bound via inter-step alignment.

Apply the descent lemma to $\mathcal{L}(\theta_{t+1})$ at the point $\theta_{t-1}$:

$$\mathcal{L}(\theta_{t+1}) \leq \mathcal{L}(\theta_{t-1}) + \langle g_{t-1}, \theta_{t+1} - \theta_{t-1} \rangle + \frac{L}{2} \|\theta_{t+1} - \theta_{t-1}\|^2$$

Use the update rule:

$$\theta_{t+1} - \theta_{t-1} = -\eta(g_{t-1} + g_t)$$

and expand:

$$\Delta_{t-1} + \Delta_t = \mathcal{L}(\theta_{t-1}) - \mathcal{L}(\theta_{t+1}) \geq \eta \|g_{t-1}\|^2 + \eta \langle g_{t-1}, g_t \rangle - \frac{L\eta^2}{2} \left(\|g_{t-1}\|^2 + 2 \langle g_{t-1}, g_t \rangle + \|g_t\|^2 \right)$$

By definition of $A_t^{\text{inter}}$:

$$\Delta_{t-1} + \Delta_t \geq \eta (1 - L\eta)\, \|g_t\|\, \|g_{t-1}\|\, A_t^{\text{inter}} + \eta \left(1 - \frac{L\eta}{2}\right) \|g_{t-1}\|^2 - \frac{L\eta^2}{2} \|g_t\|^2$$
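As a quick numerical sanity check of item 1 (a toy illustration under the stated assumptions, not part of the paper), one can verify the bound for plain gradient descent on a two-term quadratic objective:

```python
import numpy as np

# Two-term quadratic: L(theta) = sum_k 0.5 * ||theta - c_k||^2, so the full gradient
# is 2 * (theta - mean(c)) and the smoothness constant is L_smooth = 2.
c = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
L_smooth, eta, n = 2.0, 0.25, len(c)           # eta <= 1 / L_smooth

loss  = lambda th: sum(0.5 * np.sum((th - ck) ** 2) for ck in c)
grads = lambda th: [th - ck for ck in c]

def intra(gs):                                 # A_t^intra as defined above
    m = np.mean([g / np.linalg.norm(g) for g in gs], axis=0)
    return 2.0 * float(m @ m) - 1.0

theta0 = np.array([0.0, 2.0])                  # both component gradients have norm sqrt(5)
g0 = sum(grads(theta0))
theta1 = theta0 - eta * g0                     # one gradient descent step

drop  = loss(theta0) - loss(theta1)
bound = (eta - L_smooth * eta**2 / 2) * (n / 2) * (intra(grads(theta0)) + 1) \
        * sum(np.linalg.norm(g) ** 2 for g in grads(theta0))
print(drop, ">=", bound)                       # 3.0 >= 3.0 (tight for a quadratic objective)
```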

Additional Comments:

  • The assumption $\|g_t^k\| = \lambda$ for all $k$ is not restrictive in the PINNs setting, where balancing the scales of different losses is common practice. In fact, we adopt the weighting scheme proposed in Wang et al. [1] (Appendix G.2), which ensures that all weighted gradients have equal norm.

  • We humbly believe that successfully applying a method from domain A to domain B holds value by itself, even if it may seem straightforward. For example, adapting the Transformer from NLP to computer vision via Vision Transformers (ViT) required careful adaptation and validation. Similarly, while SOAP has shown efficiency in LLM training, we explore its value in the context of PINNs. Such transfers are often nontrivial; understanding if, when, and how an idea can meaningfully advance another field requires significant insight. We hope our work demonstrates this effort.

[1] Wang, Sifan, Yujun Teng, and Paris Perdikaris. "Understanding and mitigating gradient flow pathologies in physics-informed neural networks." SIAM Journal on Scientific Computing 43.5 (2021): A3055-A3081.

Comment

Dear authors,

thank you for your thorough response and providing additional proofs to connect the intra-step and inter-step score to a larger loss decrease (albeit only for GD).

Regarding the proof sketches:

    1. Loss drop bound via intra-step alignment: Thank you, I think it is also intuitive that a larger intra-step score implies a larger loss decrease under the assumption of balanced gradients, since this implies that the gradient norm is larger if all gradients are more aligned.
    2. Two-step loss drop bound via inter-step alignment: How does this connect to Proposition 3, as GD can be thought of as having the identity as preconditioner?

Regarding "applying a method from domain A to domain B":

  • I absolutely agree with the authors that transferring concepts and methods from a domain to another is valuable in itself. Could the authors highlight the specific challenges in applying SOAP to PINNs, which were different from applying SOAP to LLMs? For instance, the sensitivity of SOAP to the update frequency was also discussed in Vyas et al. [2025] (Section 6.2 therein). The requirement of balancing gradients, which the authors have elaborated above is indeed specific to PINNs, but has already been known for some time based on the reference provided, and is not specific to SOAP.

Some of my concerns and questions from my previous comment remain. My main concern is the following:

  • While the empirical experiments do suggest the superior performance of preconditioned methods, in particular SOAP, and one at least observes empirical correlation to inter- and intra-step alignment scores, it is not clear why (theoretically) one would expect quasi-second order methods to do so.

  • In contrast, it seems more natural to me to study the more classic notion of ill-conditioning by studying the spectrum of the Hessian, as preconditioned methods were indeed developed for this. Since ill-conditioning is a known problem in PINNs, e.g. Rathore et al. [2024], I wonder why the authors have not studied this before.

  • If preconditioned methods do improve the conditioning of the landscape significantly, and SOAP does this particularly better than other methods, then this might already be sufficient to explain the performance of these methods (and one could ask whether one needs inter- and intra-step alignment scores at all). In contrast, if this is not the case, this would also add value to the proposed alignment scores and to this work. I have the suspicion that the alignment scores are also highly connected to the (local) conditioning of the loss landscape. (Which would also provide valuable understanding, but I am not asking for such a discussion.)

  • I know that computing the eigenspectrum is computationally expensive, so I'm not asking to do this for all problems. If you have time to do it for one or two of the smaller settings in your benchmark, this would already be much appreciated.

I again would like to thank the authors for their time and effort that they have spent so far in the discussion.

Comment

We thank the reviewers again for their active and continued engagement, which has greatly improved the paper. Regardless of acceptance, we will incorporate all relevant revisions into the next version.

Extend new proposition to preconditioned gradient descent

We thank the reviewer for this question. Our new proposition can indeed be naturally extended to preconditioned gradients by replacing $g_t$ with $P_t g_t$, and the same results still hold. This follows from our use of the $L$-Lipschitz condition, which holds for any vectors $\phi$ and $\theta$:

$$\mathcal{L}(\phi) \leq \mathcal{L}(\theta) + \langle \nabla \mathcal{L}(\theta), \phi - \theta \rangle + \frac{L}{2}\|\phi - \theta\|^2$$

For completeness, we state the revised proposition below.

Proposition

Let $\mathcal{L} = \sum_{k=1}^n \mathcal{L}_k : \mathbb{R}^d \rightarrow \mathbb{R}$ be a smooth function with $L$-Lipschitz continuous gradient, optimized via preconditioned gradient descent:

$$\theta_{t+1} = \theta_t - \eta h_t, \quad g_t = \nabla \mathcal{L}(\theta_t) = \sum_{k=1}^n g_t^k, \quad g_t^k := \nabla \mathcal{L}_k(\theta_t)$$

and $h_t = P_t g_t$ for some positive definite preconditioning matrix $P_t$ with $\lambda_{\min}(P_t) \geq \mu > 0$ and $\lambda_{\max}(P_t) \leq M$, with step size $\eta \leq \frac{1}{LM}$. Define:

  • Loss drop: $\Delta_t := \mathcal{L}(\theta_t) - \mathcal{L}(\theta_{t+1})$
  • Inter-step alignment: $A_t^{\text{inter}} := \frac{\langle h_t, h_{t-1} \rangle}{\|h_t\|\,\|h_{t-1}\|}$
  • Intra-step alignment: $A_t^{\text{intra}} := 2\left\| \frac{1}{n} \sum_{k=1}^n \frac{h_t^k}{\|h_t^k\|} \right\|^2 - 1$, where $h_t^k := P_t g_t^k$

Then, for any $t \geq 1$:

  1. If $\|h_t^k\| = \lambda$ for all $k$, then: $$\Delta_t \geq \left(\eta - \frac{L\eta^2}{2}\right)\|h_t\|^2 = \left(\eta - \frac{L\eta^2}{2}\right) \cdot \frac{n}{2}\left(A_t^{\text{intra}} + 1\right) \sum_{k=1}^n \|h_t^k\|^2$$
  2. For general $h_t$ and $h_{t-1}$, the sum of two consecutive loss drops satisfies: $$\Delta_{t-1} + \Delta_t \geq \eta (1 - L\eta)\, \|h_{t-1}\|\, \|h_t\|\, A_t^{\text{inter}} + \eta \left(1 - \frac{L\eta}{2}\right) \|h_{t-1}\|^2 - \frac{L\eta^2}{2} \|h_t\|^2$$

Superior performance of quasi-second order methods

If we agree that quasi-second-order methods can be interpreted as preconditioned gradient methods, then Proposition 3 shows that all such preconditioning schemes promote stronger inter-step gradient alignment under a fixed learning rate. Combined with the revised proposition, this implies that quasi-second-order methods lead to higher alignment scores, which in turn yield faster loss decay.

Applying a method from domain A to domain B

We appreciate the reviewer's point. Our main message is that the challenge lies not only in adapting a method from domain A to domain B, but also in recognizing its value in a context where its relevance may not be obvious. With numerous methods being proposed every day, many claiming state-of-the-art performance, it is far from obvious which ones will truly be effective for the problem of interest. Once the answer is known, it often feels trivial or expected, but reaching that answer is not.

To elaborate further: without our work, if someone were asked what the best or alternative optimizer for PINNs is, the likely answer would still be Adam, with no clear alternative in common use. Our work provides a concrete step forward: we show strong evidence that SOAP can serve as a drop-in replacement for Adam, offering improved performance with minimal tuning.

Studying the spectrum of the Hessian

We appreciate the reviewer's thoughtful suggestion. We would like to clarify that quantifying the spectrum of the preconditioned loss landscape is not straightforward in practice. Modern preconditioning methods like SOAP do not construct an explicit preconditioner matrix; instead, they operate on gradients directly, with the underlying preconditioning matrix only implicitly approximated. In practice, we only have access to the raw gradients $g_t$ and their preconditioned versions $P_t g_t$ (i.e., gradients after applying the optimizer) at each iteration, while the underlying preconditioning matrix $P_t$ remains implicit.

While computing the Hessian spectrum of the original (unpreconditioned) loss is feasible for small-scale networks, this does not capture the conditioning effects induced by the optimizer. This limitation is precisely what motivated us to introduce gradient alignment scores as a more tractable alternative. These scores capture how the optimizer affects the optimization trajectory, even when an explicit preconditioned matrix is unavailable.

Comment

Thank you for your timely response and further elaborations that you provide.

I strongly appreciate the effort of revising your proposition and additionally explaining your point on applying a method from domain A to domain B.

Admittedly, I am not working on PINNs myself, so I do not know what is considered as "common" optimizers used to train PINNs, but I am aware of a number of recent work which suggest that quasi-Newton or second order methods can be beneficial to PINNs, see for instance [Bonfanti et al., 2024], which use a Levenberg-Marquart algorithm, [Dangel et al., 2024], which uses K-FAC or [Rathore et al., 2024] that uses a combination of Newton's method with preconditioned conjugate gradient.

Thus, I would also argue that it would be a slight overstatement to say that this work identifies the benefit of quasi-Newton methods to training PINNs. Nevertheless, I agree with your elaboration that often things appear more trivial in retrospect.

Regarding studying the spectrum of the Hessian: Thank you for this explanation. This is indeed a strong point in favor of studying the gradient alignment scores. I also agree with you that tracking these scores is much more tractable than studying the eigenspectrum.

As I am not an expert on SOAP, I had a look at the original work and noticed that, at least for (idealized) Shampoo, a single step can be written as $W_t = W_{t-1} - \eta H^{-1/2} G_t$, with $H = L \otimes R / \text{Trace}(L)$, so at least for this "close relative" of SOAP one could estimate the (per-layer) conditioning, I believe. Also, if I remember correctly, Theorem 1 connects SOAP to Newton's method and states that it estimates the (inverse square root of the) Gauss-Newton matrix efficiently. Would it be possible to use this to estimate the conditioning effects somehow? Given the limited amount of time, I am not asking the authors to conduct these experiments, but I would appreciate it if you could add such experiments, if feasible, and a discussion in this regard in a revised version. In any case, I see the merit in the gradient alignment scores. If they are strongly correlated with the condition number, they will provide a cheaper alternative metric to track (especially when an explicit preconditioning matrix or an approximation thereof is not readily available, as you say). And if they are not connected to the condition number, then the benefit of quasi-Newton methods for PINNs goes beyond only improving the ill-conditioning of the landscape and requires the gradient alignment scores to explain the performance.

I thank the authors for the valuable discussion which clarified the contributions of this paper more clearly to me. I will increase my score and suggest acceptance.

I would appreciate if the authors could add a section on previous quasi-Newton methods for PINNs in the related work section and also add a discussion regarding the aspect of ill-conditioning/preconditioning of the loss landscape.


Bonfanti, Andrea, Giuseppe Bruno, and Cristina Cipriani. "The challenges of the nonlinear regime for physics-informed neural networks." Advances in Neural Information Processing Systems 37 (2024): 41852-41881.

Dangel, Felix, Johannes Müller, and Marius Zeinhofer. "Kronecker-factored approximate curvature for physics-informed neural networks." Advances in Neural Information Processing Systems 37 (2024): 34582-34636.

Rathore, Pratik, et al. "Challenges in training pinns: A loss landscape perspective." arXiv preprint arXiv:2402.01868 (2024).

Comment

We thank the reviewer for their follow-up feedback and for their openness to our perspective.

We fully agree that it would be an overstatement to claim that this work identifies the benefit of quasi-Newton methods for training PINNs.

Over the years, we have been striving to push the frontier of PINN accuracy, motivated by the hope that PINNs could potentially become reliable solvers for real-world applications. As part of this effort, we have systematically explored the effectiveness of various (quasi-)Newton optimization methods for PINNs, including but not limited to L-BFGS, SSBroyden, Gauss-Newton, LM, and natural gradient descent.

However, through empirical studies, we found that many of these methods lack robustness in practice:

  • they perform poorly under float32 precision,
  • they are unstable with mini-batches,
  • they often fail to deliver improved accuracy when scaling to larger networks.

A key underlying reason is that these methods aim to approximate the inverse of the full Hessian, which becomes increasingly ill-conditioned as the number of parameters grows. This ill-conditioning makes the resulting update steps highly sensitive to precision and noise. Please refer to our response to Reviewer aDSZ, where we show that NGD exemplifies all of the aforementioned limitations.

Because of this, we were excited to discover that SOAP provides a scalable and robust alternative. It performs reliably across large networks, supports mini-batch training, and is stable under float32 precision. These properties make SOAP particularly well-suited for practical PINN applications and open a promising path toward real-world deployments.

In light of this, and in conjunction with the feedback from Reviewer aDSZ, we will revise the paper accordingly:

  • We will remove or revise the overstatement regarding the role of quasi-Newton methods in PINNs.
  • We will update the introduction to more clearly position our work within the broader context of (quasi) second-order optimization methods applied to PINNs.
  • We will add the new proposition to theoretically justify why higher alignment scores are preferable.
  • We will add a new set of comparative experiments with natural gradient descent (NGD), presented in a dedicated subsection.
  • We will expand the discussion on when second-order methods may be beneficial, as well as the limitations that motivate SOAP—such as memory overhead, instability in float32, and sensitivity to initialization.

Regarding the investigation of the preconditioned loss landscape and its conditioning, we share the reviewer's interest, as it could further strengthen our work. To this end, we consider the Burgers' and Allen-Cahn equations. We trained a simple MLP with architecture $[2, 32, 32, 1]$, which allows us to compute the exact layer-wise Hessian.

We compare the maximum eigenvalue of the exact layer-wise Hessian with that of the Kronecker-approximated Hessian suggested by the reviewer. If the eigenvalues match closely, this provides evidence that the preconditioner behaves similarly to the Newton preconditioner.

Let

  • $H_{\text{exact}}$ denote the exact Hessian of the largest hidden layer ($W \in \mathbb{R}^{32 \times 32}$),
  • $H_{\text{approx}} = L \otimes R / \text{Tr}(L)$, where $L = GG^T$, $R = G^TG$, and $G$ is the gradient matrix after training.
| Hessian Type | Max Eigenvalue (Order of Magnitude) |
|---|---|
| Exact layer-wise Hessian ($H_{\text{exact}}$) | $\sim O(10^2)$ |
| Kronecker approximation ($H_{\text{approx}}$) | $\sim O(1)$ |

Upon further investigation, we found that the elements of the gradient matrix $G$ typically lie in the range $O(10^{-3})$ to $O(10^{-2})$. Since

  • $\lambda_{\max}(G^TG) = \lambda_{\max}(GG^T) = \|G\|_2^2 \leq \|G\|_F^2 \approx 1$,
  • $\lambda_{\max}(A \otimes B) = \lambda_{\max}(A) \cdot \lambda_{\max}(B)$,
  • $\text{Tr}(GG^T) \geq \lambda_{\max}(GG^T)$,

we obtain $\lambda_{\max}(H_{\text{approx}}) = \frac{\lambda_{\max}(L) \cdot \lambda_{\max}(R)}{\text{Tr}(L)} \leq \|G\|_F^2 \approx 1$.
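For reference, the quantity above can be computed directly without forming the Kronecker product; a minimal sketch (where the gradient matrix $G$ is random stand-in data, not the actual trained gradients):

```python
import numpy as np

# Stand-in gradient matrix for a 32x32 layer, with entries on the order of
# 1e-3 to 1e-2 as reported above (random data, not the actual trained gradients).
rng = np.random.default_rng(0)
G = rng.uniform(1e-3, 1e-2, size=(32, 32))

L = G @ G.T
R = G.T @ G

# lambda_max(L (x) R / Tr(L)) = lambda_max(L) * lambda_max(R) / Tr(L),
# computed without ever forming the 1024 x 1024 Kronecker product.
lam_max_approx = np.linalg.eigvalsh(L).max() * np.linalg.eigvalsh(R).max() / np.trace(L)

print(lam_max_approx, "<=", np.linalg.norm(G, "fro") ** 2)  # bound ||G||_F^2 used above
```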

At this point, due to the approximations involved, we are unable to definitively conclude which of the following is the case:

  • The Kronecker-approximated Hessian does not approximate the true (implicit) preconditioner induced by SOAP,
  • The formula used does not approximate the true Hessian,
  • The SOAP optimizer may not significantly alter the curvature of the loss landscape at convergence.

We are continuing this line of investigation and will include any conclusive findings in the revised version of the paper.

Comment

I thank the authors for their thorough response and outlining the changes for the revised version. My questions and concerns have been addressed properly and I am suggesting the acceptance of this work.

Review
Rating: 4

The paper thoroughly investigates the issue of gradient misalignment that arises during training of PINNs when combining the physics-informed loss (PDE residuals) and the supervised one (boundary conditions, initial condition, or data). The authors propose a novel gradient alignment score, inspired by cosine similarity, to quantify these conflicts and demonstrate how and why quasi-second-order optimization alleviates them. The paper combines the theoretical analysis with extensive numerical validation across benchmark PDEs.

Strengths and Weaknesses

Overall, the paper is in a very good format. In particular:

  • The paper is very well written and provides a clean and complete understanding of the problem studied, both numerically and theoretically.
  • The underlying problem is a critical issue in PINNs for years, and the analysis proposed by the authors gives a clear and easy understanding of the problem
  • The alignment score introduced gives a very intuitive understanding of the empirical results, while carrying interesting theoretical advancements

However, there are some factors that are concerning.

  • The problem of loss balancing in PINNs has already been studied and solved in several prior works.
  • The authors strongly recommend the usage of SOAP, but the theoretical backbone suggests that a pure second-order method would be the optimal choice. A comparison with methods such as Natural Gradient Descent, Levenberg-Marquardt, or other quasi-Newton approaches would bring value to the choice of SOAP.
  • The improvements in the relative $L^2$ error shown in Table 2 are in general unimpressive. The usage of second-order methods can boost the accuracy of PINNs by several orders of magnitude, even when the underlying network is small (making it feasible to adopt such methods).
  • According to the experimental setup described in the appendix, PINNs trained in the numerical evaluation use random weight factorization, random Fourier features, a PirateNet architecture and gradient norm-based loss weighting. Some of these methods are implicitly addressing the gradient alignment issue, making it harder to understand the standalone impact of SOAP.

Questions

I have the following questions:

  • Second-order training for PINNs can be very effective even with small (approx. 2000 parameters) and vanilla NN architectures. How does the SOAP optimizer perform in this scenario? Why not include such results to highlight the effect of SOAP alone?
  • Based on the theoretical findings, I would have expected SOAP to reach much higher accuracy. Quasi-Newton methods such as L-BFGS can at times struggle when the Hessian matrix they approximate is ill-conditioned (as is often the case with PINNs). Can it be that the SOAP preconditioner is sensitive to the approximation of singular Hessians?

Limitations

The authors clearly discuss the limitations of their method.

Final Justification

The thorough discussion with the authors, especially the additional experiments and discussion on second-order optimizers, gives the paper a different perspective, and I recommend acceptance.

Formatting Concerns

I have no concerns regarding the formatting of the paper.

Author Response

We appreciate the reviewer's careful reading and insightful questions, particularly regarding the comparison with second-order methods and the theoretical expectations for SOAP's performance. These points have prompted valuable clarifications. We will address each point in detail below.

W1. The problem of loss balancing in PINNs has been already studied and solved in several literature.

While the issue of loss balancing in PINNs has indeed received significant attention, we respectfully disagree with the claim that it has been solved. As summarized in our paper, gradient conflicts in PINNs can be broadly categorized into two types:

  • Type I: Conflicts due to imbalanced gradient magnitudes across loss terms.
  • Type II: Conflicts due to gradients having opposing directions.

Type I conflicts have been extensively explored, beginning with Wang et al. [1] and followed by numerous loss weighting schemes proposed in subsequent works. In contrast, Type II conflicts remain largely underexplored. To the best of our knowledge, only two recent works [2, 3] explicitly discuss this issue.

Moreover, even methods specifically designed to handle Type II conflicts—such as ConFiG [2]—do not outperform our baseline on standard benchmarks (see Figure 4). This suggests that directional gradient conflict in PINNs remains an open challenge.

Our work makes three key contributions in this regard:

  • We show that directional gradient conflicts (Type II) are pervasive during PINNs training, as evidenced by the proposed gradient alignment score (Figure 5).

  • We prove that Type II conflicts are intrinsic to PINNs optimization—even for simple problems like the Poisson equation (Proposition 2).

  • We demonstrate that quasi-second-order preconditioners, such as SOAP, can effectively alleviate Type II conflicts and lead to improved optimization and accuracy (Theorem 1 and Proposition 3).

Therefore, while prior work has made important progress on loss balancing, especially in mitigating Type I conflicts, our results suggest that Type II conflicts are far from resolved. We believe our contributions offer new insights and practical tools for tackling this underexplored yet critical challenge in PINNs.

W2, W3, Q1. Unimpressive Results from SOAP and Comparison with Second-Order Methods

We would like to clarify that while second-order methods (e.g., [4]) can achieve near machine-precision accuracy, this has only been demonstrated on relatively simple benchmarks—where solutions are smooth and exhibit no sharp transitions or discontinuities. In such cases, very small networks are often sufficient to approximate the solution, and convergence and memory issues typically do not arise.

Indeed, none of the existing second-order methods have shown strong performance on the more challenging benchmarks we consider, which involve sharper features and more complex dynamics. This is not merely due to missing evaluations: second-order methods face fundamental scalability challenges. They do not support mini-batching, are sensitive to hyperparameters, and require high-precision float64 arithmetic for stable convergence. However, float64 is less efficiently supported on GPUs, consumes twice as much memory, and is typically 2–4x slower than float32, which is the precision used in most neural network training, including all experiments reported in this manuscript.

To better illustrate these issues, we revisited the 2D Poisson and heat benchmarks from [4], which uses natural gradient descent (NGD) to train PINNs for solving PDEs. Using the official implementation from [4], we varied only the hyperparameters such as random seed and depth or width of the network and evaluated their performance under both float32 and float64 precision, comparing against SOAP method.

We tested MLPs with one to three hidden layers, such as [2, 32, 1], [2, 32, 32, 1], and [2, 32, 32, 32, 1], and used wider networks with the same depth for SOAP, such as [2, 256, 1], [2, 256, 256, 1], and [2, 256, 256, 256, 1]. All models were trained to convergence over ten random seeds. It is worth noting that a perfectly fair comparison is inherently difficult due to differing architecture preferences (e.g., SOAP benefits from overparameterization). Therefore, we report the best result for each method under its respective optimal hyperparameters.

We note that for some seeds, NGD failed to converge and produced highly inaccurate results; we excluded these runs from the mean/standard deviation. However, such failures are also acknowledged in [4], as reflected in the large maximum errors reported in Table 3 and Table 5 of that work. The mean and standard deviation of the resulting relative $L_2$ errors are summarized in the tables below.

Poisson 2D

| Architecture (Method) | Float32 | Float64 |
|---|---|---|
| [2, 32, 1] (NGD) | 4.87e-2 ± 3.61e-2 | 3.84e-7 ± 2.92e-7 |
| [2, 32, 32, 1] (NGD) | 1.65e-1 ± 3.93e-2 | 1.27e-6 ± 2.86e-7 |
| [2, 32, 32, 32, 1] (NGD) | 2.48e-1 ± 1.75e-2 | 3.14e-6 ± 5.10e-7 |
| [2, 256, 1] (SOAP) | 3.06e-6 ± 7.12e-7 | 6.08e-7 ± 2.13e-7 |
| [2, 256, 256, 1] (SOAP) | 1.87e-6 ± 5.19e-7 | 4.06e-7 ± 1.81e-7 |
| [2, 256, 256, 256, 1] (SOAP) | 1.35e-6 ± 3.45e-7 | 2.99e-7 ± 1.05e-7 |

Heat 2D

| Architecture (Method) | Float32 | Float64 |
|---|---|---|
| [2, 32, 1] (NGD) | 5.98e-2 ± 5.46e-2 | 7.68e-6 ± 1.85e-6 |
| [2, 32, 32, 1] (NGD) | 5.95e-1 ± 2.29e-4 | 2.32e-6 ± 1.17e-6 |
| [2, 32, 32, 32, 1] (NGD) | 5.95e-1 ± 4.58e-4 | 5.13e-6 ± 5.25e-7 |
| [2, 256, 1] (SOAP) | 4.61e-6 ± 8.12e-7 | 3.03e-6 ± 6.06e-7 |
| [2, 256, 256, 1] (SOAP) | 2.74e-6 ± 9.52e-7 | 2.04e-6 ± 4.10e-7 |
| [2, 256, 256, 256, 1] (SOAP) | 2.65e-6 ± 5.00e-7 | 1.33e-6 ± 2.61e-7 |

These results highlight several key limitations of NGD:

  • Extreme sensitivity to precision: NGD's performance deteriorates significantly under float32, often leading to divergence.
  • Inconsistent scalability: Increasing network depth/width does not always improve accuracy for NGD and often leads to worse performance or instability.

In contrast, SOAP achieves consistently strong performance across architectures and numerical precision, demonstrating both robustness and scalability. It can match or outperform NGD in accuracy, while avoiding the computational cost and stability issues associated with second-order methods.

W4. Combined Techniques in PINN Training May Confound SOAP’s Effect

To clarify, the alignment scores reported in Figures 2 and 5 are computed using the full training procedure described in the manuscript (Appendix G.2), which includes random feature embeddings, the PirateNet architecture, and gradient norm-based loss weighting, etc.

While it is true that these components can improve optimization to some extent, the effect of SOAP is both distinct and consistent. Specifically, under identical training configurations, we observe that SOAP significantly enhances both intra-step and inter-step gradient alignment compared to baseline methods (Figures 2 and 5). This suggests that SOAP directly addresses gradient misalignment and provides benefits beyond what is achievable by architecture or loss-balancing heuristics alone.

Q2. Concerns About SOAP’s Accuracy and Sensitivity to Ill-Conditioned Hessians

We thank the reviewer for this insightful comment. Our theoretical analysis (see Theorem 1) draws a connection between SOAP and Newton’s method under certain assumptions. However, as is often the case with theoretical guarantees in deep learning, these assumptions may not hold precisely in practice—particularly in the highly nonlinear and nonconvex optimization landscape of PINNs. Consequently, it is not unexpected that SOAP may not consistently match the idealized performance of a true second-order method.

Regarding the comparison to L-BFGS: while L-BFGS approximates the inverse Hessian using historical gradients and updates, it is known to be sensitive to noise, especially in mini-batch or stochastic training regimes common in PINNs. From our experience, we find that L-BFGS often underperforms our baselines, likely due to these instabilities.

In contrast, SOAP demonstrates greater robustness to mini-batch noise and appears to yield a more stable approximation of the inverse Hessian, even in ill-conditioned settings. This empirical robustness suggests that SOAP may better handle singular or near-singular Hessians. Nonetheless, a formal theoretical analysis of SOAP’s behavior in such regimes remains an important direction for future work.

[1] Understanding and mitigating gradient flow pathologies in physics-informed neural networks, SIAM SISC

[2] ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks, ICLR 2025

[3] Dual cone gradient descent for training physics-informed neural networks. NeurIPS 2024

[4] Achieving High Accuracy with PINNs via Energy Natural Gradient Descent, ICML 2023

Comment

W1 Indeed, perhaps "solved" is too harsh a term for this case. What I refer to in practice is that second-order methods have in general been shown to be effective in balancing the interplay between the two loss functions. Moreover, in cases where second-order methods cannot be applied, one could also resort to imposing the boundary conditions structurally in the PINN model, which is also a classical enhancement recommended in training PINNs [1]. Even in the latter scenario PINN training can be complex; however, no gradient alignment issues arise as only the PDE residuals have to be minimized.

W2, W3, Q1 I appreciate the authors' effort in providing further experiments, however, I do not find the experimental settings to be satisfactory:

  • The difference in model size is huge. In particular for the model with a single hidden layer, the authors could have easily used 256 hidden neurons for the NGD training as well.
  • Despite the difference in number of parameters, the float64 training reports results comparable to the SOAP implementation. The authors report slower training (float32 vs. float64) but do not present a comparison of wall-clock time or number of iterations for SOAP vs. NGD.
  • It is natural that, for some seeds, NGD fails to converge on problems such as Poisson, which present high-frequency components. In particular, the update step of NGD is sensitive to ill-conditioned matrices; in this case a more suitable choice such as Levenberg-Marquardt step regularization would be in order [2].
  • The authors do not mention the adoption of RFF, PirateNet, and RWF for the training with NGD.

While I do understand that the authors apply their model to more complex scenarios, I consider that a thorough and fair comparison with suitable second-order methods, even on simple scenarios, would be appropriate, as the theoretical framework developed by the authors suggests them as the optimal choice. A comparison of training time and a thorough discussion of second-order methods would, in my opinion, serve as a better motivation to adopt SOAP instead.

W4 I do agree that the effect of SOAP is visible in Figure 2 and Figure 5. However, I do not think that from this experimental setup it can be determined whether this effect is given by the use of SOAP alone, or its interaction with the other enhancements included in the architecture.

Q2 I thank the authors for the clarification.

[1] Wang, Sifan, et al. "An expert's guide to training physics-informed neural networks." arXiv preprint arXiv:2308.08468 (2023).

[2] Bonfanti, Andrea, et al. "The challenges of the nonlinear regime for physics-informed neural networks." Advances in Neural Information Processing Systems 37 (NeurIPS), 2024.

Comment

We thank the reviewer again for their comments and hope our responses have clarified the key points of our contribution.

W1

We respectfully disagree with the reviewer’s further assessment, which we believe is either inaccurate or lacking sufficient justification.

"Second-order methods have in general shown to be effective in balancing the interplay between the two loss functions."

We would greatly appreciate it if the reviewer could provide specific references, both in the general optimization literature and in the PINNs context.

"Even in the latter scenario PINN training can be complex, however, no gradient alignment issue arise as only the PDE residuals have to be minimized."

This claim only holds in highly simplified cases involving a single PDE. In realistic and complex problems, minimizing multiple PDE residuals is unavoidable. For example, in the Navier–Stokes equations, even with boundary and initial conditions enforced, one must still minimize the momentum equations and the continuity equation, amounting to at least three coupled loss terms. Gradient conflicts naturally emerge in such multi-objective formulations and are central to understanding and improving training dynamics in PINNs.

W2, W3, Q1

"The authors could have easily used 256 hidden neurons for the NGD training as well."

We use 32 neurons in our architecture because attempting to use [2, 256, 256, 1] results in an out-of-memory (OOM) error on the NVIDIA A6000 GPU (48 GB), as it requires approximately 77.47 GB of memory. For completeness, we also include NGD results for the [2, 256, 1] architecture.

Our goal is to compare the best accuracy that each method can achieve, which is the foremost priority in PINNs research, as in any numerical method. We therefore run all methods for sufficiently long training durations to allow full convergence.

Poisson

| Architecture (Method) | Float32 | Float64 |
|---|---|---|
| [2, 32, 1] (NGD) | 4.87e-2 ± 3.61e-2 | 3.84e-7 ± 2.92e-7 |
| [2, 32, 32, 1] (NGD) | 1.65e-1 ± 3.93e-2 | 1.27e-6 ± 2.86e-7 |
| [2, 32, 32, 32, 1] (NGD) | 2.48e-1 ± 1.75e-2 | 3.14e-6 ± 5.10e-7 |
| [2, 256, 1] (NGD) | 1.56e-1 ± 6.34e-2 | 6.10e-7 ± 1.68e-7 |
| [2, 256, 1] (SOAP) | 3.06e-6 ± 7.12e-7 | 6.08e-7 ± 2.13e-7 |
| [2, 256, 256, 1] (SOAP) | 1.87e-6 ± 5.19e-7 | 4.06e-7 ± 1.81e-7 |
| [2, 256, 256, 256, 1] (SOAP) | 1.35e-6 ± 3.45e-7 | 2.99e-7 ± 1.05e-7 |

Heat 2D

| Architecture (Method) | Float32 | Float64 |
|---|---|---|
| [2, 32, 1] (NGD) | 5.98e-2 ± 5.46e-2 | 7.68e-6 ± 1.85e-6 |
| [2, 32, 32, 1] (NGD) | 5.95e-1 ± 2.29e-4 | 2.32e-6 ± 1.17e-6 |
| [2, 32, 32, 32, 1] (NGD) | 5.95e-1 ± 4.58e-4 | 5.13e-6 ± 5.25e-7 |
| [2, 256, 1] (NGD) | 5.95e-1 ± 1.05e-3 | 8.69e-6 ± 6.49e-6 |
| [2, 256, 1] (SOAP) | 4.61e-6 ± 8.12e-7 | 3.03e-6 ± 6.06e-7 |
| [2, 256, 256, 1] (SOAP) | 2.74e-6 ± 9.52e-7 | 2.04e-6 ± 4.10e-7 |
| [2, 256, 256, 256, 1] (SOAP) | 2.65e-6 ± 5.00e-7 | 1.33e-6 ± 2.61e-7 |

As observed, increasing the number of neurons worsens NGD performance due to ill-conditioning of the preconditioning matrix as the parameter count grows. This explains the higher errors in larger architectures and the need for double precision, highlighting a fundamental scalability bottleneck that limits NGD's practical applicability.

"The float64 training reports comparable results to the SOAP implementation."

We note that the reviewer emphasizes the float64 performance of the second-order method while overlooking a key advantage of SOAP—its stability and effectiveness under float32 precision. Since float32 is standard in practical deep learning due to memory and compute constraints, the second-order method's sensitivity to reduced precision is a significant limitation. We would appreciate the reviewer’s perspective on this issue.

"It is natural that, for some seed NGD fails in convergence of problems as Poisson which presents high frequency components. "

For both the Poisson and heat equations considered in our experiments (the same problem setup as in the NGD paper), the exact solutions are:

$$u^*(x, y) = \sin(\pi x)\sin(\pi y)$$

and

$$u^*(t, x) = \exp\left(-\frac{\pi^2 t}{4}\right)\sin(\pi x)$$

which involve only low-frequency modes. The fact that the second-order method exhibits sensitivity to random initialization even on such simple problems raises additional concerns about its reliability and robustness in practical settings.
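As a quick sanity check on this point, both closed-form solutions can be verified symbolically. The sketch below assumes the problem setups implied by these expressions, namely a Poisson forcing of $2\pi^2 \sin(\pi x)\sin(\pi y)$ and a heat equation with diffusivity $1/4$; these coefficients are inferred from the displayed solutions rather than taken from the NGD paper.

```python
# Minimal SymPy check of the closed-form solutions above.
# The forcing term and diffusivity are assumptions inferred from the solutions themselves.
import sympy as sp

x, y, t = sp.symbols("x y t")

# Poisson: u = sin(pi x) sin(pi y) solves -Laplace(u) = 2*pi^2*sin(pi x)*sin(pi y).
u_poisson = sp.sin(sp.pi * x) * sp.sin(sp.pi * y)
poisson_residual = -(sp.diff(u_poisson, x, 2) + sp.diff(u_poisson, y, 2)) \
    - 2 * sp.pi**2 * sp.sin(sp.pi * x) * sp.sin(sp.pi * y)
print(sp.simplify(poisson_residual))  # prints 0

# Heat: u = exp(-pi^2 t / 4) sin(pi x) solves u_t = (1/4) * u_xx.
u_heat = sp.exp(-sp.pi**2 * t / 4) * sp.sin(sp.pi * x)
heat_residual = sp.diff(u_heat, t) - sp.Rational(1, 4) * sp.diff(u_heat, x, 2)
print(sp.simplify(heat_residual))  # prints 0
```

Both solutions consist of a single low-frequency Fourier mode, consistent with the smoothness argument above.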

“The authors do not mention the adoption of RFF, PirateNet, and RWF for the training with NGD."

As we stated previously, for both the Poisson and heat problems, we use a standard MLP as the backbone architecture for both the SOAP and NGD optimizers.

Comment
  1. The first rationale is based on experience, since second-order methods are a common choice in many multi-objective optimization scenarios; secondly, in [2] cited above, the convergence results show a unitary convergence rate in each optimization direction for second-order methods; and finally, the authors' abstract claims that "second-order optimization methods inherently mitigate gradient conflicts". Nevertheless, the authors mention an interesting point that I was overlooking, namely the case of multiple PDEs. I would also like to clarify that I do not want to argue that pure second-order optimization is the optimal choice, but rather that I would have expected more attention to it in the draft, since their theoretical analysis points in that direction.

  2. I thank the authors for including the test on the [2, 256, 1] architecture (the only computationally feasible one). In my perspective, using float64 is a reasonable approach for optimization algorithms that concern engineering applications (such as solving a PDE with a PINN that does not use data from the correct solution). However, this is only my perspective, as the authors' point about stability under float32 precision is also of great importance.

I would like to emphasize my great appreciation of the authors' effort to address my concerns, which have been thoroughly covered. I still believe that the gap between second-order and quasi-second-order methods is a weakness of the paper. However, the authors' comments clearly position their perspective on pure second-order optimization, and the additional results give good support to their claims. I am assuming that the authors will include (part of) our discussion in a revised draft, so I will increase my score.

Comment

We sincerely appreciate the reviewer’s continued engagement and thoughtful comments. The requested experiments have truly strengthened the paper and helped clarify key distinctions between second-order and quasi-second-order methods in practice.

We fully agree that providing a fair and detailed comparison with classical second-order methods such as NGD improves the motivation and positioning of our proposed approach. Regardless of the paper’s acceptance status, we will revise the draft accordingly:

  • We will move Table 1 to the appendix.

  • We will incorporate the new comparison results with NGD in a dedicated subsection, including performance metrics for different architectures, both in float32 and float64 precision.

  • We will include a clearer discussion of when pure second-order methods may be preferable, and when their limitations (e.g., memory cost, instability under float32, sensitivity to initialization) motivate the use of SOAP.

Once again, we thank the reviewer for their thoughtful feedback and openness to multiple perspectives, and we are glad that our dialogue has helped clarify our intent and contributions. We are committed to revising the paper to reflect this exchange.

Review
4

This work focuses on the gradient alignment problem of PINNs, which is an important topic that has been researched extensively. The authors locate the issue both theoretically and empirically. They propose to use quasi second-order methods, especially SOAP, to resolve gradient conflicts, which shows superior accuracy.

Strengths and Weaknesses

Strengths:

  • The decomposition of two types of gradient conflicts and inter-step/intra-step gradient alignment.
  • Theoretical explanation of the model's dynamics.
  • Good performance on a range of datasets

Weaknesses:

  1. Some datasets are not very challenging, on which Adam has already achieved sub 1% error rates. I recommend testing SOAP on some hard datasets such as Poisson2d-MS and Wave2d-CG from PINNacle, where the boundary conditions are more difficult.
  2. Didn't show why second-order methods can help with inter-step gradient alignment.
  3. The method-wise contribution is limited by borrowing SOAP, which affects the novelty of this work.
  4. One key contribution as claimed is "We reveal that SOAP [1] can be viewed as an efficient approximation of the Newton preconditioner." But this is revealed by SOAP, not this work.

Questions

  1. The authors only focus on type II of the gradient conflicts, "where gradients have similar magnitudes but opposing directions." It would be interesting to see Type I conflicts during PINN training, particularly while second-order methods are applied, and whether the two types are orthogonal. By orthogonality I mean whether different learning rate annealing algorithms can work with SOAP.
  2. As the authors listed, there are many quasi second-order approaches. Why does specifically SOAP (and sometimes Muon) perform better than others? Insights or analyses are appreciated.

Limitations

They are mentioned above.

Final Justification

In discussions with the authors, W2, W3, and Q1 were resolved; Q2 is partially resolved, and I still hold concerns about W1 and W4. However, I think this is a good and insightful paper overall.

Formatting Issues

None that I am aware of

Author Response

We sincerely thank the reviewers for their careful reading, thoughtful comments, and constructive suggestions. We appreciate the opportunity to clarify our contributions and address the raised concerns. Below, we respond to each point in detail.

W1. Some datasets are not very challenging ...

Target Accuracy: While sub-1% relative error may be acceptable in some machine learning tasks, it remains insufficient for many scientific computing applications, where high-fidelity simulations often require accuracy approaching machine precision. Our goal is therefore not merely to outperform Adam in moderately challenging cases, but to reduce errors as much as possible across a broad spectrum of PDEs. As demonstrated in our results, SOAP consistently outperforms our strong baseline with Adam by up to an order of magnitude. This improvement is highly significant in scientific computing contexts.

Benchmarks: We believe our current experimental suite is already comprehensive and representative. Our paper includes 10 diverse benchmarks, ranging from relatively simple problems (e.g., 1D wave equation) to significantly more complex ones, such as coupled reaction-diffusion systems and Navier–Stokes equations at high Reynolds numbers. This coverage exceeds that of most related PINNs papers published at top conferences [1–7], which typically evaluate on only 3–5 relatively simple tasks.

Importantly, across all benchmarks, SOAP consistently demonstrates superior performance and robustness compared to other optimizers. This provides strong empirical support for our claim that (quasi) second-order preconditioning helps resolve directional gradient conflicts, thereby improving the accuracy of PINNs.

W2. Didn't show why second-order methods can help with inter-step gradient alignment.

In Proposition 3, we prove that preconditioned gradient descent with preconditioner PP maintains high inter-step alignment, specifically
$$A(g_t, g_{t+1}) \geq 1 - \epsilon$$

provided the learning rate satisfies
$$\eta \leq \sqrt{\frac{2\epsilon \|g_t\|^2}{\|H P^{-s} g_t\|^2}}.$$

For Newton’s method, where $P = H$ and $s = 1$, this upper bound simplifies to

$$\eta_{\max} = \sqrt{2\epsilon},$$

removing dependence on the Hessian’s condition number. This enables larger and more stable learning rates while preserving alignment. In fact, we show that the closer the preconditioner is to the true Hessian, the greater the inter-step gradient alignment and the more aggressive the learning rate that can be safely used.
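For completeness, the intermediate algebraic step behind this simplification is a direct substitution into the bound above (a restatement of the Newton case of Proposition 3, not an additional assumption):

$$P = H,\ s = 1 \;\Rightarrow\; \|H P^{-s} g_t\| = \|H H^{-1} g_t\| = \|g_t\|, \qquad \text{so} \qquad \eta \leq \sqrt{\frac{2\epsilon \|g_t\|^2}{\|g_t\|^2}} = \sqrt{2\epsilon}.$$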

W3. The method-wise contribution is limited by borrowing SOAP, which affects the novelty of this work.

While SOAP is used as the primary optimizer, our contributions extend significantly beyond the application of an existing algorithm.

  • Our core novelty lies in identifying and formalizing directional gradient conflicts as a fundamental bottleneck in PINN optimization. To this end, we introduce gradient alignment score to quantify such conflicts, which generalizes cosine similarity to multiple loss components.

  • We further provide rigorous theoretical analysis demonstrating that such conflicts are intrinsic to PINNs at initialization (Proposition 2). Additionally, we establish a novel connection between SOAP and Newton’s method (Theorem 1), which offers a principled explanation for why second-order preconditioning can effectively resolve these conflicts.

  • Our empirical results include the first successful PINN applications to high-Reynolds-number Kolmogorov flows at $\text{Re} = 10^4$.

SOAP is one instantiation of a broader principle: that quasi second-order preconditioning mitigates directional conflicts in scientific ML tasks. Thus, the main methodological contribution lies in conceptually understanding, diagnosing, and overcoming optimization barriers in PINNs and related scientific ML models, not in the application of a new optimizer per se.

W4. One key contribution as claimed is "We reveal that SOAP can be viewed as an efficient approximation of the Newton preconditioner." But this is revealed by SOAP, not this work.

We respectfully disagree with this assessment and kindly ask the reviewer to revisit the original SOAP paper by Vyas et al. [10], which provides key theoretical insights:

  • Vyas et al. establish that Shampoo with power 1/2 is equivalent to Adafactor operating in the eigenbasis of Shampoo’s preconditioner (Claim 1, Section 4.1).
  • Building on prior work by Morwani et al. (2024), Vyas et al. note that Shampoo with power 1/2 approximates the optimal Kronecker factorization of the Adagrad preconditioner.
  • Leveraging these insights, SOAP was designed to apply Adam (rather than Adafactor) in the eigenbasis of the Shampoo preconditioner, aiming for improved stability and efficiency.

Importantly, the SOAP paper does not provide any theoretical analysis connecting SOAP to Newton's method, nor does it claim that SOAP approximates Newton steps. In contrast, our work is the first to formally establish this connection.

Q1. The authors only focus on type II of the gradient conflicts...

Our decision to study Type II gradient conflicts was deliberate and grounded in the following rationale:

  • Type I conflicts (i.e., imbalanced gradient magnitudes) have been extensively studied in the PINN literature:

    • Prior work, including Wang et al. [8,9] and many follow-ups, has shown that such conflicts are prevalent.
    • A wide range of adaptive loss weighting schemes (e.g., GradNorm, NTK-based weighting) has been developed to address them.
  • In contrast, Type II conflicts (i.e., directional gradient misalignment) remain largely underexplored:

    • To the best of our knowledge, only a few works ([5,6]) have explored these issues.
    • Our work is among the first to systematically demonstrate that directional conflicts are widespread in PINN training; see Figures 2 and 5 for empirical evidence based on our proposed gradient alignment score.
    • We further provide theoretical analysis showing that Type II conflicts are intrinsic to PINNs even for simple equations like the 1D Poisson problem (Proposition 2).

Importantly, the two types of conflicts often coexist and require complementary solutions. In our training procedure (described in Appendix G.2), we apply learning rate annealing (as in [7]) to help mitigate Type I conflicts. SOAP specifically targets Type II conflicts. All ten benchmarks reported in our paper use this unified training protocol, demonstrating the robustness and compatibility of SOAP with standard loss-weighting techniques.

In summary, both conflict types are present in PINN training. Our work complements prior research on Type I conflicts by introducing new tools and theoretical insights to understand and resolve the less-addressed but equally important Type II conflicts.
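To make the division of labor concrete, the sketch below pairs a gradient-norm-based global weighting rule with an arbitrary preconditioned optimizer such as SOAP. The weighting rule shown here is one common instance of the learning rate annealing idea and is included only as an illustration; it is not the exact protocol of Appendix G.2, and `losses`, `params`, and `optimizer` are generic placeholders.

```python
# Illustrative sketch: adaptive loss weights handle Type I (magnitude) conflicts,
# while the optimizer's preconditioning handles Type II (directional) conflicts.
# Not the exact Appendix G.2 protocol; the weight update is a generic
# gradient-norm balancing rule.
import torch

def grad_norm(loss, params):
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.sqrt(sum((g ** 2).sum() for g in grads if g is not None))

def training_step(losses, params, optimizer, weights, alpha=0.9):
    """losses: dict of unweighted loss terms; weights: dict of running global weights."""
    norms = {k: grad_norm(v, params) for k, v in losses.items()}
    total_norm = sum(norms.values())
    for k in weights:
        # Type I mitigation: terms with small gradients receive larger weights.
        weights[k] = alpha * weights[k] + (1 - alpha) * float(total_norm / (norms[k] + 1e-12))
    optimizer.zero_grad()
    total = sum(weights[k] * losses[k] for k in losses)
    total.backward()
    optimizer.step()  # Type II mitigation comes from the optimizer itself (e.g., SOAP).
    return total.detach()
```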

Q2. Why does specifically SOAP (and sometimes Muon) perform better than others?

Formally, we prove that SOAP approximates the Newton preconditioner $H^{-1}$, whereas Muon and Kron approximate only $H^{-1/2}$. From Proposition 3, we show that full inverse preconditioning (i.e., $H^{-1}$) enables higher stable learning rates and better inter-step gradient alignment. While Muon and Kron provide some alignment benefits, they are inherently limited by the weaker curvature information encoded in their preconditioners. Consequently, their performance falls short of SOAP.

In summary, SOAP performs better because it more closely approximates Newton’s method, achieving superior alignment and optimization efficiency.
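To illustrate this difference concretely, the toy NumPy sketch below runs preconditioned gradient descent on an ill-conditioned quadratic. It is a hypothetical example rather than the paper's setup: the exact diagonal Hessian is used directly, whereas SOAP, Muon, and Kron only build approximations. Still, it shows why the full inverse tolerates a much larger stable step size than the half-power preconditioner.

```python
# Toy comparison of P = H^{-1} (Newton-like) vs. P = H^{-1/2} preconditioning
# on f(x) = 0.5 * x^T H x with an ill-conditioned diagonal Hessian.
import numpy as np

H = np.diag([1.0, 100.0, 10000.0])      # condition number 1e4
x0 = np.array([1.0, 1.0, 1.0])

def run(power, lr, steps=50):
    """Gradient descent with update x <- x - lr * H^{-power} * grad, where grad = H x."""
    P = np.diag(np.diag(H) ** (-power))
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * P @ (H @ x)
    return 0.5 * x @ H @ x               # final loss value

print("P = H^-1,   lr = 1.0  ->", run(1.0, 1.0))    # converges immediately
print("P = H^-1/2, lr = 1.0  ->", run(0.5, 1.0))    # blows up
print("P = H^-1/2, lr = 0.01 ->", run(0.5, 0.01))   # stable only with a much smaller step
```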


[1] Characterizing Possible Failure Modes in Physics-Informed Neural Networks, NeurIPS 2021

[2] Separable Physics-Informed Neural Networks, NeurIPS 2023

[3] Mitigating Propagation Failures in Physics-Informed Neural Networks using Retain-Resample-Release (R3) Sampling, ICML 2023

[4] PINNsFormer: A Transformer-Based Framework for Physics-Informed Neural Networks, ICLR 2024

[5] ConFIG: Towards Conflict-free Training of Physics Informed Neural Networks, ICLR 2025

[6] Dual Cone Gradient Descent for Training Physics-Informed Neural Networks, NeurIPS 2024

[7] PIG: Physics-Informed Gaussians as Adaptive Parametric Mesh Representations, ICLR 2025

[8] Understanding and Mitigating Gradient Flow Pathologies in Physics-Informed Neural Networks, SISC

[9] When and Why PINNs Fail to Train: A Neural Tangent Kernel Perspective, JCP

[10] Vyas, Nikhil, et al. "SOAP: Improving and Stabilizing Shampoo Using Adam." arXiv preprint arXiv:2409.11321 (2024).

Comment

Thanks to the authors for their detailed responses to my questions, most of which are addressed. I would like to make a comment on Weakness #4. In my opinion, this contribution is not surprising, because Shampoo and related methods are preconditioning methods that approximate $H^{-1}$, which is the Newton preconditioner in the Gauss–Newton algorithm. Overall, I do acknowledge the contribution of this work and its leading role.

Comment

We sincerely thank the reviewer again for their constructive feedback and for acknowledging the overall contribution of our work.

Review
4

This paper investigates a key optimization challenge in PINNs, where conflicting gradients from different loss terms hinder effective training. The authors propose a novel gradient alignment score, distinguishing between intra-step (within a step) and inter-step (across steps) misalignments. They reveal that first-order optimizers like Adam suffer from poor alignment, especially during early training.

To address this, they propose to use quasi-second-order optimizers, especially SOAP, which leverages preconditioning to mitigate gradient conflicts. Extensive experiments across 10 PDE benchmarks demonstrate better accuracy and faster convergence.

Strengths and Weaknesses

Strengths

  1. The idea of formalizing and quantifying gradient conflicts via alignment scores in PINNs is novel.
  2. The paper provides solid theoretical definitions (the alignment score), along with analysis of why quasi-second-order methods mitigate conflicts. The proposed SOAP optimizer is well-motivated and rigorously tested.
  3. The writing is very clear.

Weakness

  1. The theoretical analysis focuses mainly on 1D Laplace equations at initialization.
  2. What are the results of Adam + L-BFGS, as it is a very useful technique for training PINNs?
  3. The sensitivity of performance to learning rates in SOAP vs. Adam is not deeply discussed.

Questions

None.

Limitations

None.

Formatting Issues

None.

Author Response

We appreciate the reviewer's thoughtful and constructive feedback. Below, we address each of the raised concerns in detail, incorporating additional experiments and clarifications to strengthen our submission.

W1. The theoretical analysis focuses mainly on 1D Laplace equations at initialization.

We agree that extending the analysis to higher-dimensional or nonlinear PDEs would strengthen the theoretical contribution. However, the 1D Laplace equation provides the most tractable setting for establishing rigorous analytical results, allowing us to clearly demonstrate that gradient conflicts are intrinsic to PINNs at initialization. Importantly, our key theoretical insights are consistently validated through extensive experiments across a broad range of benchmarks (Figure 2 and Figure 5 of the manuscript), supporting their generality beyond the 1D setting.

W2. What are the results of Adam + L-BFGS, as it is a very useful technique for training PINNs?

The Adam + L-BFGS optimization procedure was introduced in the original PINNs paper by Raissi et al. [1]. At the time, Adam alone often struggled to reduce the PINN loss to sufficiently small values, motivating the use of L-BFGS as a second-stage optimizer for further refinement.

Since then, numerous techniques—such as adaptive loss weighting, Fourier features, PirateNet, and causal training—have substantially improved the loss convergence of PINNs. With these advances, the training loss can now reach very small values (e.g., below $10^{-8}$), as demonstrated in Figures 7–9 of our paper. In this regime, the additional benefit of L-BFGS becomes negligible. This is because the networks are typically trained with float32 precision, making L-BFGS prone to numerical instability when the loss is already very small.

For benchmarks without time-marching, we conducted additional experiments by fine-tuning the trained models with Adam followed by L-BFGS. The results are summarized below. As can be seen, the effect of adding L-BFGS is minimal across all cases.

| Benchmark | Adam | Adam + L-BFGS |
|---|---|---|
| Wave | $5.15 \times 10^{-5}$ | $5.08 \times 10^{-5}$ |
| Burgers | $8.20 \times 10^{-5}$ | $8.20 \times 10^{-5}$ |
| Allen–Cahn | $2.24 \times 10^{-5}$ | $2.25 \times 10^{-5}$ |
| Korteweg–de Vries | $7.04 \times 10^{-4}$ | $7.33 \times 10^{-4}$ |
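For reference, the fine-tuning stage described above follows the standard closure-based L-BFGS pattern in PyTorch. The sketch below is a generic version of that pattern rather than our exact training code; `model`, `compute_pinn_loss`, and the step counts are placeholders.

```python
# Generic Adam -> L-BFGS fine-tuning pattern (illustrative; `model` and
# `compute_pinn_loss` are hypothetical placeholders, not the paper's code).
import torch

def finetune_with_lbfgs(model, compute_pinn_loss, adam_steps=10000, lbfgs_steps=500):
    # Stage 1: first-order training with Adam.
    adam = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(adam_steps):
        adam.zero_grad()
        loss = compute_pinn_loss(model)
        loss.backward()
        adam.step()

    # Stage 2: full-batch L-BFGS refinement via the closure-based API.
    lbfgs = torch.optim.LBFGS(model.parameters(), lr=1.0, max_iter=lbfgs_steps,
                              history_size=50, line_search_fn="strong_wolfe")

    def closure():
        lbfgs.zero_grad()
        loss = compute_pinn_loss(model)
        loss.backward()
        return loss

    lbfgs.step(closure)
    return model
```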

W3. The sensitivity of performance to learning rates in SOAP vs. Adam is not deeply discussed.

To investigate the sensitivity of performance to the learning rate, we conducted an ablation study using the Allen–Cahn benchmark. The table below summarizes the final relative $L^2$ errors obtained by Adam and SOAP across a range of learning rates.

Both optimizers demonstrate reasonable stability over learning rates ranging from 0.01 to 0.0001. However, Adam exhibits slightly higher sensitivity. In contrast, SOAP maintains consistently low error across all tested learning rates, highlighting its robustness to learning rate selection.

| Learning Rate | Adam | SOAP |
|---|---|---|
| 0.01 | $8.70 \times 10^{-5}$ | $5.45 \times 10^{-6}$ |
| 0.005 | $1.31 \times 10^{-4}$ | $4.40 \times 10^{-6}$ |
| 0.001 | $2.68 \times 10^{-5}$ | $\mathbf{2.84 \times 10^{-6}}$ |
| 0.0005 | $\mathbf{2.14 \times 10^{-5}}$ | $3.87 \times 10^{-6}$ |
| 0.0001 | $3.68 \times 10^{-4}$ | $6.34 \times 10^{-6}$ |

[1] Raissi, Maziar, Paris Perdikaris, and George E. Karniadakis. "Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations." Journal of Computational Physics 378 (2019): 686–707.

Final Decision

This paper identifies directional gradient conflicts as a key challenge in training PINNs and introduces a gradient alignment score to diagnose them. It shows that quasi-second-order methods mitigate these conflicts, with the SOAP optimizer—closely tied to Newton’s method—achieving 2–10× accuracy gains on 10 PDE benchmarks, including turbulent flows up to Reynolds 10,000. The work is praised for its clear formalization, intuitive alignment score, strong theory–experiment balance, and effective optimizer design. Concerns include limited novelty, narrow theory, modest gains relative to known second-order methods, missing baselines, and unclear standalone impact of SOAP given prior conflict-mitigation techniques.

During the rebuttal period, the authors offer comprehensive comparisons with NGD, a new theoretical link between gradient alignment and faster loss decay, detailed ablations on hyperparameters and costs, and an extended discussion of the benefits and limitations of second-order methods, which address most of the reviewers' concerns. During the AC-reviewer discussion period, Reviewer SoJw and Reviewer aDSZ express a positive attitude toward acceptance.

Overall, the pros outweigh the cons, and the AC recommends acceptance, while strongly encouraging the authors to incorporate the rebuttal into the final version of the paper.