PaperHub
Rating: 8.2/10 · Spotlight · 4 reviewers
Scores: 5, 5, 5, 5 (min 5, max 5, std 0.0)
Confidence: 3.0
Novelty: 3.3 · Quality: 3.0 · Clarity: 2.8 · Significance: 3.0
NeurIPS 2025

Integration Matters for Learning PDEs with Backwards SDEs

Links: OpenReview · PDF
Submitted: 2025-05-10 · Updated: 2025-10-29
TL;DR

We show that the performance gap between BSDE and PINNs methods stems from discretization bias in Euler-Maruyama, and propose a Stratonovich-Heun BSDE method that eliminates the bias and restores BSDE performance.

Abstract

Keywords
Backward Stochastic Differential Equations, Partial Differential Equations, Deep Learning, Scientific Machine Learning

Reviews and Discussion

Review (Rating: 5)

The article is concerned with deep-learning-based methods for PDEs and, more specifically, the deep BSDE method based on a stochastic reformulation of the PDE. The authors study a specific loss function, the so-called self-consistent loss, with the Euler-Maruyama scheme used to discretize the underlying stochastic processes. The paper mathematically analyses the limiting behaviour of the loss function as the discretization step size tends to zero and proves that a bias arises. The authors then introduce an alternative discretization based on the Stratonovich formulation of the underlying stochastic differential equations and prove that in this case no bias arises.

Strengths and Weaknesses

Strengths

  • The paper raises an important point: PINNs and deepBSDE type methods have not been systematically compared in the literature
  • The paper is mostly well-written and presented
  • The observation that the Euler scheme might create a bias for the deep BSDE method is interesting and new
  • The numerical experiments appear to be sound and indicate that the new method performs well, at the expense of higher computational cost
  • The method appears to be new and innovative

Weaknesses

Imprecise mathematical statements: My main concern with the paper is its lack of mathematical rigour. Broadly speaking, the paper is written in a way that suggests there are probably conditions under which the conclusions hold true. However, in the present form of the paper it is impossible to see clearly what these conditions are. To provide a few examples:

  • To derive (4.3), which is crucial for the main contribution of the paper, the authors do not formally spell out any assumptions, but merely mention basic regularity assumptions on $R$, $H$, and $u_\theta$ and strong convergence of the EM scheme. What precisely are these conditions? What are the conditions for strong convergence of the EM scheme? Can you provide a reference where this statement can be found?
  • Also in the proofs, there are several statements that are too informal to be checked properly. For example, line 903: "assuming enough regularity on R". Again, can you specify what the precise assumptions are here? Similarly, line 905: "where the penultimate equality holds from the standard approximation properties of a left Riemann sum to the Riemann integral". Can you be more precise here? How do you get the desired decay with $\tau$? Moreover, some assumptions are just swept under the carpet, see line 909: "We assume that sufficient regularity for the order 1/2 strong convergence of EM jointly on …" This seems to be a strong assumption that is not mentioned in the main text. Here I feel the presentation would need to be significantly improved to clarify under which assumptions the claimed statements hold.

Background on PDEs limited: The paper does not provide context on the considered class of PDEs. For example, deep BSDE methods apply to specific types of parabolic PDEs (so-called Kolmogorov PDEs). Why are the considered PDEs relevant? In what applications do they arise? Which important PDEs are of this type?

Benchmarks not reflecting the state of the art: The overall motivation is based on numerical experiments from reference [6], which is from 2021 and does not reflect the state of the art. In [6] it appears that PINNs outperform BSDE methods, but in the meantime several other papers improving upon the deep BSDE method have appeared. To justify the strong conclusions made in the paper, some of these methods would need to be considered as well. Examples:

  • Christian Beck, Sebastian Becker, Patrick Cheridito, Arnulf Jentzen, Ariel Neufeld. Deep Splitting Method for Parabolic PDEs. SIAM Journal on Scientific Computing.
  • Riu Naito, Toshihiro Yamada. An acceleration scheme for deep learning-based BSDE solver using weak expansions. International Journal of Financial Engineering, 7(2), 2020.
  • Akihiko Takahashi, Yoshifumi Tsuchida, Toshihiro Yamada. A new efficient approximation scheme for solving high-dimensional semilinear PDEs: Control variate method for Deep BSDE solver. Journal of Computational Physics, 2022.
  • Jean-François Chassagneux, Junchao Chen, Noufel Frikha. Deep Runge-Kutta schemes for BSDEs. arXiv preprint 2212.14372.
  • Huyên Pham, Xavier Warin, Maximilien Germain. Neural networks-based backward scheme for fully nonlinear PDEs. SN Partial Differential Equations and Applications.

Questions

  • Can you spell out formally the assumptions that guarantee that (4.3) holds? Moreover, what are the conditions for strong convergence of the EM scheme? Can you provide a reference where this statement can be found?
  • Can you provide a summary of all the assumptions needed to make the overall conclusion of the paper valid?
  • Can you specify what precise assumptions are needed to make lines 903, 905, 909 valid?
  • Can you provide more background on the considered PDEs? Why are these relevant? In which applications does each of them arise?
  • Can you discuss how your method compares to state-of-the art methods based on stochastic representations?

Limitations

Yes

Final Justification

Overall, the paper is well-written and mathematically rigorous. It provides an analysis of the deep BSDE method and shows how the Euler scheme might create a bias for the deep BSDE method. The numerical experiments indicate that the new method performs well. In the rebuttal, the authors have addressed my concerns about mathematical rigour and have suggested a new theorem to summarize the main findings. This will greatly clarify the presentation. Thus, in my view the only weakness is the limited benchmark comparisons. Experiments could have been strengthened by comparing to further benchmark methods. However, even without these the paper is already good, as the main finding is the convergence issue of the Euler scheme.

Formatting Issues

None

Author Response

We thank the reviewer for their thoughtful and constructive review. We greatly appreciate the recognition of the importance of comparing PINNs and BSDE methods, and the novelty of our approach.

Response to W1/Q1/Q2/Q3 (mathematical statements and assumptions):

Thanks for bringing this up. We would like to clarify a few points. First, our main theorems (Thm. 4.1 and 4.2) are fully rigorous and include all the necessary assumptions in the theorem statements. On the other hand, the discussions that follow the formal theorem statements (including Eq. (4.3) and Lines 903, 905) do omit exact technical assumptions, as we opted for clarity/brevity of exposition—we wanted to convey how our main theorems could be interpreted without getting bogged down in technical details which we felt distracted from the main message. Nevertheless, given the feedback, we will add these details back into the draft.

To answer the reviewer’s questions, and to give a sense of how we plan to edit the paper (recall we are not allowed to upload a new draft), we’ll derive Eq. (4.3) more carefully below. This will not only illustrate the kinds of assumptions we make, but also fill in some of the analysis details (e.g., how we go from Riemann sum to integral) that the reviewer is asking about. The details presented here will be added to our revision.

The starting point for Eq. (4.3) is from the proof of Theorem 4.1 (Appendix C.1), where we show that:

$$\left|\ell_{\mathrm{EM},\tau}(\theta, x, t) - \tau^2\left( (R[u_\theta](x, t))^2 + \frac{1}{2} \mathrm{tr}\!\left( (H(x, t) \cdot \nabla^2 u_\theta(x, t))^2 \right) \right)\right| \leq L_{d,u_\theta}\, \tau^{5/2},$$

where $L_{d,u_\theta}$ is a finite constant that depends on the dimension $d$ and on the Lipschitz constant of $D^2 u_\theta$. Note that the latter is bounded, since we assume that $u_\theta \in C^{2,1}$ ($C^{2,1}$ is defined in Line 136).

Hence at this point, we have that:

$$\left|L_{\mathrm{EM},\tau}(\theta) - \frac{1}{N\tau^2} \sum_{n=0}^{N-1} \mathbb{E}_{\hat{X}_n}\left[ \tau^2\left( (R[u_\theta](\hat{X}_n, t_n))^2 + \frac{1}{2} \mathrm{tr}\!\left( (H(\hat{X}_n, t_n) \nabla^2 u_\theta(\hat{X}_n, t_n))^2 \right) \right) \right]\right| \leq L_1\tau^{1/2},$$

where $L_1 = L_{d,u_\theta}$ (this renaming was necessary to get OpenReview to render the previous equation).

Our next step is to argue that the map $(x, t) \mapsto F_\theta(x, t) := (R[u_\theta](x, t))^2 + \frac{1}{2} \mathrm{tr}( (H(x, t) \cdot \nabla^2 u_\theta(x, t))^2 )$ is Lipschitz with some constant $L_{F_\theta}$. This is the role that the "basic regularity assumptions on $R[u_\theta]$, $H$, and $u_\theta$" (Line 150) play; there is not a unique set of assumptions that render the map $F_\theta(x, t)$ Lipschitz. Nevertheless, we now specify one sufficient condition: (a) $H, f, h[u_\theta]$ are all bounded, and (b) $u_\theta$ and all its derivatives up to 2nd order are Lipschitz and bounded. This is certainly not the weakest assumption; one could instead assume growth conditions on all the relevant functions and their derivatives as $(x, t)$ goes to infinity, and then combine this with a high-probability bound on the trajectories $X_t$ for $t \in [0, T]$. In the interest of simplicity for this rebuttal, we opt for the simpler stated assumption.

With the Lipschitz condition on $F_\theta$ in place, we can now apply order $1/2$ strong convergence of EM. Note that reference [34] (cited on Line 147)---specifically Theorem 10.2.2 of [34]---gives conditions on the drift $f$ and diffusion $g$ of the forward SDE for strong convergence to occur; these conditions are roughly Lipschitz $f, g$ and some growth conditions on $f, g$ as a function of $x$. We then combine strong convergence with Lipschitz $F_\theta$ to argue:
$$|\mathbb{E}[ F(\hat{X}_n, t_n) - F(X(t_n), t_n) ]| \leq L_F\, \mathbb{E}\| \hat{X}_n - X(t_n) \| \leq L_F C \tau^{1/2}.$$

(we had to drop the subscripts on $\theta$ and write $X_{t_n}$ (from the paper) as $X(t_n)$ to get the equation to render). Note that if in the previous step we derive a Lipschitz constant $L_F(x)$ which depends on $x$, the definition of strong convergence involving second moments becomes important, as it allows us to decouple the Lipschitz constant from the bound on $\| \hat{X}_n - X(t_n) \|$ via Cauchy-Schwarz. Combining this with the previous bound, we have

$$\left| L_{\mathrm{EM},\tau}(\theta) - \frac{1}{N}\sum_{n=0}^{N-1} \mathbb{E}[ F(X(t_n), t_n) ]\right| \leq L_1 \tau^{1/2} + L_F C \tau^{1/2}.$$

Our final step is to treat the second term on the LHS as a Riemann sum approximation to a Riemann integral. From standard approximation properties of the left Riemann sum, we have
$$\left| \frac{1}{T} \int_0^T \mathbb{E}[F(X(t), t)]\, dt - \frac{1}{N}\sum_{n=0}^{N-1} \mathbb{E}[ F(X(t_n), t_n)] \right| \leq \frac{L_F T}{2N} = \frac{L_F \tau}{2}.$$

This verifies the claimed Eq. (4.3):
$$\left| L_{\mathrm{EM},\tau}(\theta) - \frac{1}{T}\int_0^T \mathbb{E}[ F(X(t), t) ]\, dt \right| \leq L_1 \tau^{1/2} + L_F C \tau^{1/2} + L_F \tau/2 = O(\tau^{1/2}).$$
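As an aside, the order-$1/2$ strong convergence of EM invoked above is easy to check numerically. The following Python sketch (our own illustration, not code from the paper) estimates the strong error of EM on geometric Brownian motion, whose exact solution is available in closed form; quartering the step size should roughly halve the error:

```python
import numpy as np

# Strong error of Euler-Maruyama on GBM dX = mu*X dt + sigma*X dB.
# Exact solution: X_T = X0 * exp((mu - sigma^2/2) T + sigma B_T),
# so the scheme and the exact solution can be coupled via the same noise.
def em_strong_error(n_steps, n_paths=200_000, mu=0.1, sigma=0.5,
                    T=1.0, x0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    tau = T / n_steps
    dB = rng.normal(0.0, np.sqrt(tau), size=(n_paths, n_steps))
    x = np.full(n_paths, x0)
    for n in range(n_steps):
        x = x + mu * x * tau + sigma * x * dB[:, n]   # EM step
    exact = x0 * np.exp((mu - 0.5 * sigma**2) * T + sigma * dB.sum(axis=1))
    return np.mean(np.abs(x - exact))

e_coarse = em_strong_error(16)   # tau = 1/16
e_fine = em_strong_error(64)     # tau = 1/64
# Order 1/2: quartering tau should roughly halve the strong error.
print(e_coarse / e_fine)
```

The observed error ratio for a 4x step-size refinement sits near $4^{1/2} = 2$, consistent with strong order $1/2$.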

We hope this clarifies the conditions under which Eq. (4.3) holds, and makes precise what is assumed in Lines 903, 905, and 909. We note that Eq. (4.8) is derived using similar arguments; we will also add its derivation to the paper.

Response to W2/Q4 (PDE background):

(Note that this question was also brought up by Reviewer sGAy, so we have duplicated our response in both rebuttals to make the rebuttals self-contained).

The non-linear PDEs we consider (cf. Eq. (3.1)) are the standard kind of PDEs considered for BSDE methods (see e.g., references [4-6]). There are many types of applications, including quantitative finance (e.g., multi-asset options pricing via Black-Scholes, local/stochastic volatility extensions from the work of Dupire and Heston), optimal control (e.g., Hamilton-Jacobi-Bellman equations for finding value functions, Hamilton-Jacobi-Isaacs equations for reach-avoid problems), reaction-diffusion in chemistry (e.g., the Allen-Cahn and Fisher-KPP equations), committor/exit-time probabilities for transition path analysis (e.g. [R1]), and so on. We will include a paragraph about these applications in our revised draft.

We also note that another reason we consider the specific elliptic/parabolic form (3.1) given in the paper (i.e., $h[u]$ does not depend on $\nabla^2 u$) is to favor the standard EM-BSDE formulation, since the form (3.1) means that the EM-BSDE approach does not need to compute Hessians of $u$ in the backwards SDE, whereas the Heun-BSDE does. In general, $h[u]$ can also depend on $\nabla^2 u$, as noted in e.g., [Pham et al., 19] referenced by the reviewer, allowing for fully nonlinear second-order PDEs.

[R1]: E, Vanden‑Eijnden (2006), Towards a Theory of Transition Paths. J. Stat. Phys.

Response to W3/Q5 (other benchmarks/references):

Thanks for pointing us to these references; we will incorporate them into our related work. For ease of reference, we will refer to these papers as [C1] to [C5]. In regards to direct comparison, our work focuses on the impact that the most widely used integration scheme in practice (Euler-Maruyama) has on the performance of BSDEs. Almost all the references above focus on enhancements to the BSDE loss, and are also compatible with our proposed Heun-BSDE scheme. For example, we could modify our Heun-BSDE method to also utilize a control variate [C3] (i.e., split $u = u_1 + u_2$, each with its own properties), to use Heun-BSDE in operator-splitting settings [C1], or to apply it to fully non-linear PDEs [C5], comparing these methods to their EM-BSDE + technique baseline. While this is an interesting direction which we will certainly incorporate in our revision, we believe the merits of our work can be judged on the basis of our current experiments, which are focused on a direct comparison of EM vs. Heun without extra confounding factors.

We next specifically discuss higher-order Itô integration schemes and the paper [C4]. The next step up in complexity from a standard Euler scheme is a trapezoidal integration scheme; Heun integration is the explicit form of this. Direct application of Heun to integrate an SDE converges to the Stratonovich solution, which forms the basis of the study in our work. However, this is not the only approach. There are e.g., implicit Crank-Nicolson (CN) schemes and also higher-order explicit Runge-Kutta (RK) approaches, as discussed in [C4]. We will certainly consider an empirical comparison in our future draft. However, we note several points:

  • First, the Heun-BSDE + Stratonovich formulation presented has several algorithmic advantages compared to other approaches: (a) it is an explicit method, and hence does not require solving implicit equations at every step, (b) it is very simple to implement, unlike e.g., higher-order RK methods for Itô which require non-trivial implementations for handling diffusion terms that are not commutative---which is essentially always the case for backwards SDEs, where the diffusion term depends on $\nabla u$. Thus, part of our contribution is also to advocate for the advantages of reinterpreting the SDEs as Stratonovich, in addition to using more sophisticated integrators.

  • Second, the schemes presented in [C4] are not directly comparable as described, since they are designed for the recursive BSDE loss formulations, where a separate network is trained at every discretization point. On the other hand, in this paper we focus on self-consistency losses (see the discussion starting on Line 116 for the difference), which are more practical as only one network is trained. The integration schemes can be applied in the self-consistency setting as well (which we will investigate in the revision), but this is not what is specifically considered in [C4].
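To make the Itô-vs-Stratonovich distinction underlying this discussion concrete, the following small Python sketch (our own illustration, not the paper's code) integrates the driftless scalar SDE $\mathrm{d}X_t = \sigma X_t\, \mathrm{d}B_t$ with both schemes: EM converges to the Itô solution, for which $\mathbb{E}[X_T] = X_0$, while the explicit Heun (predictor-corrector) scheme converges to the Stratonovich solution, for which $\mathbb{E}[X_T] = X_0 e^{\sigma^2 T/2}$:

```python
import numpy as np

# EM vs. stochastic Heun on dX = sigma * X dB (no drift).
#   Ito limit:          E[X_T] = X0
#   Stratonovich limit: E[X_T] = X0 * exp(sigma^2 * T / 2)
sigma, T, n_steps, n_paths, x0 = 0.5, 1.0, 64, 200_000, 1.0
rng = np.random.default_rng(1)
tau = T / n_steps

x_em = np.full(n_paths, x0)
x_heun = np.full(n_paths, x0)
for _ in range(n_steps):
    dB = rng.normal(0.0, np.sqrt(tau), size=n_paths)        # shared noise
    x_em = x_em + sigma * x_em * dB                         # Euler-Maruyama step
    x_pred = x_heun + sigma * x_heun * dB                   # Heun predictor
    x_heun = x_heun + 0.5 * sigma * (x_heun + x_pred) * dB  # Heun corrector

em_mean, heun_mean = x_em.mean(), x_heun.mean()
strat_mean = np.exp(0.5 * sigma**2 * T)  # Stratonovich mean, about 1.133
print(em_mean, heun_mean, strat_mean)
```

With these parameters, the EM sample mean stays near $X_0 = 1$ while the Heun sample mean approaches $e^{\sigma^2 T/2}$, illustrating how the same driving noise yields two different limiting solutions depending on the integrator.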

Comment

First, I would like to thank the authors for their detailed response, addressing most of my concerns and leading to a positive overall impression of the paper.

As this was only partially addressed in the response, I would like to emphasise again that - given the strong statements made in the paper - it would be important that the paper provides a concise collection of all assumptions that are needed to derive these statements (and not just scatter them in the text). Although some assumptions are purely technical, for people aiming to build on this work it is important that all assumptions are given in one place.

Comment

We thank the reviewer for their response and appreciate the suggestion to clearly state all assumptions in one place. To implement this in the paper, we plan to do the following. We will turn what is currently Eq. (4.3) and Eq. (4.8) both into formal theorem statements (with full proofs in the appendices), which will contain a list of all necessary assumptions as part of the hypothesis of the statement. For example, Eq. (4.3) will be turned into the following:

Assumption X (Strong Order $1/2$ EM Convergence): The forward SDE pair $(f, g)$ satisfies the following conditions. There exist finite constants $L_1, L_2, C_1$ such that for all $s, t \in [0, T]$ and $x, y \in \mathbb{R}^d$:

  • $\| f(x, t) - f(y, t) \| + \| g(x, t) - g(y, t) \| \leq L_1 \| x - y \|$,
  • $\| f(x, s) - f(x, t) \| + \| g(x, s) - g(x, t) \| \leq L_2 (1 + \|x\|) |s - t|^{1/2}$,
  • $\| f(x, t) \| + \| g(x, t) \| \leq C_1 (1 + \|x\|)$.

Theorem: Suppose that (a) $H, f, h[u_\theta]$ are all bounded, (b) $u_\theta$ and all its derivatives up to 2nd order are Lipschitz and bounded, and (c) Assumption X holds. Then,
$$\left| L_{\mathrm{EM},\tau}(\theta) - \frac{1}{T}\int_0^T \mathbb{E}[ F(X(t), t) ]\, dt \right| \leq O(\tau^{1/2}).$$

We hope this plan for revising the manuscript addresses the reviewer's concern regarding collecting all assumptions in one place. Again, we thank the reviewer for bringing this point up, and we agree it makes the results easier to extend in future work.

Comment

Thank you for your reply. I think this is an excellent suggestion that will greatly clarify the presentation. Overall, the rebuttal has addressed my concerns. I have revised my rating and recommend the paper for acceptance.

Review (Rating: 5)

The authors introduce a new integrator, based on the Stratonovich viewpoint of SDEs, to reduce the biases induced by Euler-based solvers. The authors consider this in the case of solving backwards SDEs for use in semi-linear parabolic PDEs with applications in control. They illustrate theoretically and numerically the advantages of the Stratonovich-based BSDE integrator.

Strengths and Weaknesses

Strengths:

The proposed integrator has theoretical and empirical advantages over existing Euler-Maruyama and PINNs-based approaches for solving high-dimensional semilinear parabolic PDEs.

The problem is studied from a numerical analysis perspective, which provides useful results for other fields beyond machine learning.

Weaknesses:

The main weakness is the additional computation time induced by the proposed method.

The empirical results could probably use a bit more analysis. For example, in one case the vanilla EM method performs better, which is quite surprising given that only four empirical examples are studied across the methods.

This applies to a limited set of PDEs and perhaps some more examples should be motivated within the text.

Questions

It’s surprising to see how well FS-PINNs works in the experiments, do the authors have an idea as to why this is the case?

Do the authors have any suggestions for how they can improve upon the computation speed of the proposed method?

Under a similar computational budget, how do the different methods compare? E.g. for a time constrained problem, which would be the correct method to use?

I’m a bit curious as to why PINNs are much faster; my initial impression is that computing derivatives manually, especially in high dimensions, would be slower? Is this mainly because these are Hessian-vector products?

I think a study on dimensionality vs performance could also be quite a bit helpful.

From Table 3, is there any intuition for why 32-bit precision makes the Euler-Maruyama baselines perform better than 64-bit?

Limitations

Yes

Final Justification

While I still believe that the authors should do more numerical studies and should adequately describe what performance improvements were done, I think the paper could be an interesting contribution to the field (while echoing some of the other reviewers' comments that more motivation should be added within the manuscript to relate this class of PDEs to more well known concepts within the field).

Formatting Issues

None

Author Response

We thank the reviewer for their thoughtful and constructive review. We greatly appreciate the recognition of the usefulness of our numerical analysis perspective, with impact possibly beyond the ML community.

Response to W1/Q2/Q3 (Computational Costs):

(Note that similar questions were also brought up by Reviewers LXC6 and Fvez, so we have duplicated our response to make the rebuttals self-contained).

We agree that the computational costs of the Heun integration presented in the original draft are key limitations for practical implementation. Hence, we spent a bit of time investigating how we could improve our implementation. We are happy to report that we have improved our implementation—for all methods—and the result is that the computational overhead of the Heun implementation is much more reasonable. Due to the given rebuttal time constraints, we were unfortunately only able to re-run our new implementation for all methods on the 100D HJB case, but we expect the computational benefits from the implementation changes we made (which we describe below) to carry over to all PDE cases. We will of course re-run all methods over all PDE cases for the revised draft.

In the original Heun-BSDE implementation, the forward and backward SDE rollouts were performed simultaneously, evaluating $L_{\mathrm{Heun}_{k,\tau}}(\theta)$ at each step. As a result, the loss gradient calculation required back-propagation through the entire rollout; we had to resort to gradient checkpointing when combined with the Hessian calculation of $u$ in order to avoid running out of memory. In our new implementation, we first perform a forward SDE rollout to generate trajectories, $\mathcal{X} = \{ (x^i_t, t) \mid t \in [0, T],\ i = 1, \dots, N \}$, then calculate the loss in one pass by batching across $\mathcal{X}$. Additionally, we further optimize and calculate the loss on a subsample $\mathcal{B} \subset \mathcal{X}$, which importantly decouples the batch size from the trajectory length. Both the EM-BSDE and FS-PINNs methods can be (and were) optimized in the same manner, with the exception of the no-resetting method (EM-BSDE (NR)) (see Lines 229-232), which requires additional considerations due to its propagation of $Y_t$. In the following table, we report the average performance and overhead across three runs for each method with a batch size of 256:

| 100D HJB | RL2 | Runtime (s) | Overhead |
| --- | --- | --- | --- |
| PINNs | 0.1349 ± .0368 | 2170 ± 9 | 1x |
| FS-PINNs | 0.0611 ± .0137 | 2227 ± 4 | 1.03x |
| EM-BSDE | 0.3549 ± .0119 | 384 ± 2 | .18x |
| Heun-BSDE | 0.0464 ± .0050 | 3846 ± 78 | 1.77x |

These results align closely with the RL2 errors reported in Table 1, albeit with significantly lower runtimes (cf. Table 4). We also note some key details:

  • The updated trajectory rollout schemes for FS-PINNs and EM-BSDE are more efficient with nearly zero overhead. This is reflected in the runtime of FS-PINNs vs. PINNs.
  • The Heun-BSDE scheme requires approximately double the compute for Heun-based trajectory rollouts and loss calculation versus FS-PINNs, which is reflected in its overhead.
  • For the specific elliptic/parabolic PDE we consider in (3.1), EM-BSDE does not require the computationally expensive Hessian computation of $u$ in the backwards SDE, while Heun-BSDE (and (FS-)PINNs) do, leading to significantly faster runtimes for EM-BSDE. This does not necessarily hold for all PDEs, however. In general, $h[u]$ can also depend on $\nabla^2 u$, as noted in e.g., [Pham et al., 19] referenced by Reviewer DQqj, allowing for fully nonlinear second-order PDEs. Alternatively, when solving first-order PDEs, the Stratonovich formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma \circ \mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t) \cdot f(X_t,t)\right]\mathrm{d}t + Z_t^T \circ \mathrm{d}B_t$$
does not introduce a Hessian term, while the Itô formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t) \cdot f(X_t,t) + \frac{\sigma^2}{2}\mathrm{Tr}\left[\nabla^2_{xx} u(X_t,t)\right]\right]\mathrm{d}t + Z_t^T\,\mathrm{d}B_t$$
requires the Hessian term. In these cases, we actually expect Heun-BSDE to either match or be faster than EM-BSDE.

In addition, we ran the original EM-BSDE solver for compute time equal to that of the original Heun-BSDE method on the 100D HJB case, and as predicted by our theoretical analysis, we found that EM-BSDE does not improve significantly with more iterations (see table).

| Iterations | Run time | RL2 |
| --- | --- | --- |
| 100,000 | 0.61 hours | 0.3626 ± .0113 |
| 2,000,000 | 14.5 hours | 0.3394 ± .0093 |

While EM-BSDE converges faster than all other methods, the RL2 error quickly plateaus, leading to little improvement despite further optimization.

In the final revision, we will update all relevant results with the updated implementation discussed above and further include our insights on Hessian computations. Additionally, we agree that a study on dimensionality and run-time normalized results would provide valuable insights for the paper. Therefore, we plan to include results from both studies as plots for added clarity and completeness.

Response to W2/W3 (Empirical Analysis and Generality of PDEs considered):

First, we hypothesize that the EM-BSDE vs. Heun-BSDE performance on the 100D BZ case in Table 1 is due to optimization challenges arising from the high-dimensional, coupled nature of the problem, which may be confounding the performance results (cf. Lines 252-256). This is evidenced by the lack of convergence for PINNs, FS-PINNs, and EM-BSDE (NR), in addition to the recovery of the expected ordering at lower dimensions (10D BZ).

(Note that the following response’s question was also brought up by Reviewer DQqj, so we have duplicated our response to make the rebuttals self-contained).

Additionally, the non-linear PDEs we consider (cf. Eq. (3.1)) are the standard kind of PDEs considered for BSDE methods (see e.g., references [4-6]). There are many types of applications, including quantitative finance (e.g., multi-asset options pricing via Black-Scholes, local/stochastic volatility extensions from the work of Dupire and Heston), optimal control (e.g., Hamilton-Jacobi-Bellman equations for finding value functions, Hamilton-Jacobi-Isaacs equations for reach-avoid problems), reaction-diffusion in chemistry (e.g., the Allen-Cahn and Fisher-KPP equations), committor/exit-time probabilities for transition path analysis (e.g. [R1]), and so on. We will include a paragraph about these applications in our revised draft.

We also note that another reason we consider the specific elliptic/parabolic form (3.1) given in the paper (i.e., $h[u]$ does not depend on $\nabla^2 u$) is to favor the standard EM-BSDE formulation, since the form (3.1) means that the EM-BSDE approach does not need to compute Hessians of $u$ in the backwards SDE, whereas the Heun-BSDE does.

The empirical examples tested in the paper were deliberately chosen for their simplicity and the existence of analytical solutions, in order to focus on isolating the integration bias introduced by EM-BSDE. For additional examples of PDEs, we are currently working to apply these methods to more complex non-linear optimal control problems (e.g., a high-dimensional task involving collision avoidance for many Dubins cars), and will report on our results in the final paper.

[R1]: E, Vanden‑Eijnden (2006), Towards a Theory of Transition Paths. J. Stat. Phys.

Response to Q1 (FS-PINNs Performance):

The FS-PINNs performance is in some sense not surprising, since the RL2 metric we evaluate performance over uses the forward SDE trajectories to compute the collocation points (cf. Lines 239-240, Eq. (6.2)). Hence, the FS-PINNs sampling distribution in training matches the performance metric better versus the PINNs sampling distribution in training, which uses a normal distribution fit to match the spatial distribution of the forward SDE trajectories. This explains the difference in performance.

We note that our choice of RL2 metric is deliberate—using the forward SDE as the sampling distribution is a much more meaningful measure than using e.g., uniform or isotropic Gaussian over a high-dimensional spatial domain. Furthermore, for problems such as solving Hamilton-Jacobi-Bellman (HJB) equations, the forward SDE trajectories actually have semantic meaning for the problem (e.g., either open-loop or closed-loop trajectory rollouts for HJB), which is substantially more useful for downstream tasks than the typical uniform distributions used in standard PINNs approaches.

Response to Q4 (PINNs Speed):

With the exception of EM-BSDE, all models require Hessian computations in the tested cases. The perceived speed advantage of PINNs stems from the other methods requiring trajectory rollouts for sampling, which impacted speed in our original implementation. We address this issue in our new implementation described in (Computational Costs).

Response to Q6 (f32 vs f64):

We observed this as well, and believe one plausible explanation is that float32 acts as a form of implicit regularization due to the stochasticity introduced by lower precision (but importantly, the benefits are not enough to overcome the issues with EM-BSDE vs. Heun-BSDE, cf. Table 3). Equally likely, there could be subtle differences in how JAX, the CUDA layer, or the underlying GPU hardware handles f32 vs. f64 which cause the discrepancy---these types of numerical issues are unfortunately not uncommon in modern NN training pipelines. To further investigate, we ran a similar test for both EM-BSDE variants on the 100D HJB benchmark, which showed minimal differences between f32 and f64 (see table below, cf. Table 3). Interestingly, however, there does still remain a persistent gap between f32 and f64, which we leave to future investigation.

| 100D HJB | float32 | float64 |
| --- | --- | --- |
| EM-BSDE | 0.3776 ± .0365 | 0.3820 ± .0219 |
| EM-BSDE (NR) | 0.4459 ± .0410 | 0.4676 ± .0153 |
Comment

Thank you for your response, I'd be curious to hear about what led to the significant speed up (possibly something to do with jitting some of the functions in Jax?). I've updated my score accordingly.

Comment

First and foremost, thank you for the updated score and continued engagement.

As described in paragraph 2 of “Computational Costs”, the major speed-up is attributed to the separation of the forward and backward SDE rollouts. For additional detail, we first roll out the forward trajectory using jax.scan, then calculate the BSDE loss simultaneously for each trajectory point, utilizing a stop gradient to separate the forward rollout from the loss calculation. This gives us the added benefit of subsampling the loss calculation and a more streamlined gradient calculation without compromising on performance. Additionally, we implemented minor algorithmic improvements, such as pre-allocating Gaussian noise, bypass flags for redundant calculations, and vectorized initial-state functions, which also contributed to the speed-up.
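The two-phase structure described above can be sketched as follows. This is our own minimal NumPy illustration with hypothetical names (`rollout_forward`, `pointwise_loss`, a simple Ornstein-Uhlenbeck-style forward SDE and a stand-in loss), not the authors' JAX code, which uses jax.scan and a stop gradient:

```python
import numpy as np

# Two-phase scheme (illustrative sketch): phase 1 rolls out forward SDE
# trajectories with no gradient tracking; phase 2 evaluates the loss on a
# subsampled batch B of trajectory points, decoupling batch size from
# trajectory length. `drift`, `diffusion`, `pointwise_loss` are stand-ins.
def rollout_forward(x0, drift, diffusion, tau, n_steps, rng):
    xs, ts = [x0], [0.0]
    x = x0
    for n in range(n_steps):                      # phase 1: forward rollout only
        dB = rng.normal(0.0, np.sqrt(tau), size=x.shape)
        x = x + drift(x) * tau + diffusion(x) * dB
        xs.append(x)
        ts.append((n + 1) * tau)
    return np.stack(xs), np.array(ts)             # all (state, time) pairs in X

def subsampled_loss(xs, ts, pointwise_loss, batch_size, rng):
    # phase 2: average the loss over a random subset B of X,
    # independent of the rollout length
    flat_x = xs.reshape(-1, xs.shape[-1])
    flat_t = np.repeat(ts, xs.shape[1])
    idx = rng.choice(len(flat_x), size=batch_size, replace=False)
    return np.mean([pointwise_loss(flat_x[i], flat_t[i]) for i in idx])

rng = np.random.default_rng(0)
x0 = np.zeros((32, 2))                            # 32 paths in 2D
xs, ts = rollout_forward(x0, lambda x: -x, lambda x: 1.0, 0.01, 50, rng)
loss = subsampled_loss(xs, ts, lambda x, t: float(np.sum(x**2)), 64, rng)
```

In the JAX version the rollout would sit behind a stop gradient, so back-propagation only traverses the phase-2 loss evaluation rather than the entire rollout, which is what removes the need for gradient checkpointing.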

Review
5

Backward SDE-based deep learning methods offer a powerful framework for solving high-dimensional PDEs. Most existing approaches rely on the Euler–Maruyama (EM) scheme for time integration. This paper revisits that choice carefully and points out a discretization bias introduced by the EM scheme. To address this, the authors propose a Stratonovich-based BSDE formulation paired with a stochastic Heun integrator, which eliminates the bias. The paper presents detailed theoretical analysis of both schemes and supports the findings with three high-dimensional PDE examples, demonstrating notable improvements with the proposed method.

Strengths and Weaknesses

Strengths:

  1. The paper challenges a commonly used but under-examined component—time integration—in BSDE-based PDE solvers and highlights the significant role that the choice of integration scheme plays in solution quality. The work is highly original.

  2. The authors conduct a careful and thorough numerical analysis of discretization errors associated with both schemes. The implications are consistent with prior observations in the literature and are well supported by numerical experiments.

Weaknesses:

  1. As the authors themselves note, the method proposed in Ref [6] is closely related but is not included as a baseline. This seems like a notable omission.

  2. The additional computational cost introduced by the second-order derivatives required in the Stratonovich formulation appears to be a key limitation of the method. Further comparisons that account for this overhead would provide a clearer assessment of the method’s practical value.

Questions

Beyond the points raised above, here are several additional comments:

  1. Literature discussion contains several inaccuracies:

    • The claim in the abstract that "existing BSDE-based solvers have empirically been shown to underperform relative to PINNs in the literature" is quite misleading. In the main text, this is attributed to Ref [6], but the diffusion loss proposed in that work itself is based on a BSDE formulation and shows notably better accuracy than both PINNs and the original Deep BSDE method. So this statement undercuts the actual strengths shown by BSDE-based solvers in the high-dimensional PDE setting relative to PINNs; see also the recent development in arXiv:2409.08526.

    • In the introduction, the authors cite Refs [1,2]—the original PINN papers—as “scalable alternatives” meant to address the curse of dimensionality. This characterization is inaccurate. Those early PINN works (and many follow-ups) primarily focused on low-dimensional problems and were not designed with high-dimensional PDEs in mind. In contrast, BSDE-based methods were motivated from the start by high-dimensional settings, as discussed in Refs [4,22]. To the best of the reviewer's knowledge, interest in applying PINNs to high-dimensional PDEs only developed more recently; see, e.g., arXiv:2307.12306 and references therein.

    • Ref [3], the Deep Ritz method, is not a PINN-type approach; it is a distinct variational method and should not be grouped together with PINNs.

  2. Table 2 presents computational time differences across methods. It would be helpful to also compare performance under equal computation time rather than fixed iteration counts, as that would give a more meaningful picture of practical efficiency. Again, including the diffusion loss from Ref [6] as a baseline would be helpful.

  3. There is a substantial body of theoretical work on Itô-based BSDE formulations that supports the analysis of deep learning–based methods (e.g., arXiv:1811.01165, arXiv:2307.15496). It would be helpful if the authors could comment on whether a similar theoretical framework could be developed for the Stratonovich-based formulation. Such discussion could strengthen the theoretical appeal of the proposed method.

Limitations

As acknowledged, the overhead due to second-order derivatives is a main limitation.

Final Justification

The paper is technically sound and is likely to have a significant impact on the actively studied field of solving PDEs and related problems through an SDE formulation.

Formatting Issues

None

Author Response

We thank the reviewer for their thoughtful and constructive review. We greatly appreciate the recognition of both the originality of our work and its careful and thorough numerical analysis.

Response to W1/Q2 (Comparisons to [6]):

Thank you for raising this point. While we do not specifically use the term “diffusion loss” from [6] in the baselines of our experiments, the EM-BSDE implementation of the multi-step self-consistency losses $L_{BSDE,H}(\theta)$ evaluated in Section 6.2 is precisely the method described in [6]; we discuss this in both Footnote 2 and Lines 205-207. In our revision, we will improve the clarity of the writing describing the connection between the two losses. Given this connection, Figure 4 demonstrates exactly the comparison that the reviewer is inquiring about: tuning the skip length does improve performance, consistent with the empirical results in [6], but ultimately does not solve the bias issue in EM-BSDE (cf. Lines 303-305). Again, we will make this connection more explicit in our revision.

Response to W2/Q2 (Computational Costs):

(Note that similar questions were also brought up by Reviewers LXC6 and sGAy, so we have duplicated our response to make the rebuttals self-contained).

We agree that the computational costs of the Heun integration presented in the original draft are key limitations for practical implementation. Hence, we spent a bit of time investigating how we could improve our implementation. We are happy to report that we have improved our implementation—for all methods—and the result is that the computational overhead of the Heun implementation is much more reasonable (e.g., <2x overhead over PINNs/FS-PINNs). Due to the given rebuttal time constraints, we were unfortunately only able to re-run our new implementation for all methods on the 100D HJB case, but we expect the computational benefits from the implementation changes we made (which we describe below) to carry over to all PDE cases. We will of course re-run all methods over all PDE cases for the revised draft.

We now describe the implementation changes we made. In the original Heun-BSDE implementation, the forward and backward SDE rollouts were performed simultaneously, evaluating $L_{Heun_{k,\tau}}(\theta)$ at each step. As a result, the loss gradient calculation required back-propagation through the entire rollout; we had to resort to gradient checkpointing when combined with the Hessian calculation of $u$ in order to avoid running out of memory. In our new implementation, we first perform a forward SDE rollout to generate trajectories, $\mathcal{X}=\lbrace (x^i_t,t)\mid t \in [0,T],\, i=1,\dots,N \rbrace$, then calculate the loss in one pass by batching across $\mathcal{X}$. Additionally, we further optimize by calculating the loss on a subsample $\mathcal{B}\subset \mathcal{X}$, which importantly decouples the batch size from the trajectory length. Both the EM-BSDE and FS-PINNs methods can be (and were) optimized in the same manner, with the exception of the no-resetting method (EM-BSDE (NR)) (see Lines 229-232), which requires additional considerations due to its propagation of $Y_t$. In the following table, we report the average performance and overhead across three runs for each method with a batch size of 256:

100D HJB  | RL2            | Runtime (s) | Overhead
PINNs     | 0.1349 ± .0368 | 2170 ± 9    | 1x
FS-PINNs  | 0.0611 ± .0137 | 2227 ± 4    | 1.03x
EM-BSDE   | 0.3549 ± .0119 | 384 ± 2     | 0.18x
Heun-BSDE | 0.0464 ± .0050 | 3846 ± 78   | 1.77x

These results align closely with the RL2 errors reported in Table 1, albeit with significantly lower runtimes (cf. Table 4). We also note some key details:

  • The updated trajectory rollout schemes for FS-PINNs and EM-BSDE are more efficient with nearly zero overhead. This is reflected in the runtime of FS-PINNs vs. PINNs.
  • The Heun-BSDE scheme requires approximately double the compute for Heun-based trajectory rollouts and loss calculation versus FS-PINNs, which is reflected in its overhead.
  • For the specific elliptic/parabolic PDE we consider in (3.1), EM-BSDE does not require the computationally expensive Hessian computation of $u$ in the backwards SDE, while Heun-BSDE (and (FS)-PINNs) does, leading to significantly faster runtimes. This does not necessarily hold true for all PDEs, however. In general, $h[u]$ can also depend on $\nabla^2 u$ as noted in e.g., [Pham et al., 19] referenced by Reviewer DQqj, allowing for fully nonlinear second-order PDEs. Alternatively, when solving first-order PDEs (e.g., deterministic HJB problems), the Stratonovich formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma \circ \mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t)\cdot f(X_t,t)\right]\mathrm{d}t + Z_t^T \circ \mathrm{d}B_t$$
does not introduce a Hessian term, while the Ito formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t)\cdot f(X_t,t) + \frac{\sigma^2}{2}\mathrm{Tr}\left[\nabla^2_{xx} u(X_t,t)\right]\right]\mathrm{d}t + Z_t^T\,\mathrm{d}B_t$$
requires the Hessian term for calculation. In these cases, we actually expect Heun-BSDE to either match or be faster than EM-BSDE.
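As a concrete illustration of the integrator difference discussed above, here is a minimal sketch of a single Euler-Maruyama step versus a single stochastic Heun step for additive (state-independent) noise, where the Ito and Stratonovich forward SDEs coincide; the function names are illustrative, not the paper's implementation:

```python
def em_step(x, t, f, sigma, dt, dW):
    # Euler-Maruyama step for the Ito SDE dX = f dt + sigma dB
    return x + f(x, t) * dt + sigma * dW

def heun_step(x, t, f, sigma, dt, dW):
    # Stochastic Heun step for the Stratonovich SDE dX = f dt + sigma o dB:
    # Euler predictor, then trapezoidal (averaged) drift correction.
    x_pred = x + f(x, t) * dt + sigma * dW
    return x + 0.5 * (f(x, t) + f(x_pred, t + dt)) * dt + sigma * dW
```

The second drift evaluation per step is the source of the roughly 2x rollout cost of Heun-BSDE over EM-BSDE noted in the overhead table above.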

In addition, we ran the original EM-BSDE solver for the same compute time as the original Heun-BSDE method on the 100D HJB case and, as predicted by our theoretical analysis, found that EM-BSDE does not improve significantly with more iterations (see table).

Iterations | Run Time   | RL2
100,000    | 0.61 hours | 0.3626 ± .0113
2,000,000  | 14.5 hours | 0.3394 ± .0093

While EM-BSDE converges faster than all other methods, the RL2 error quickly plateaus, leading to little improvement despite further optimization.

Overall, the updated results align with the paper’s original claims while demonstrating the potential for practical application. In the final revision, we will update all relevant results with the updated implementation discussed above and further include our insights on Hessian computations. Additionally, we will include runtime-normalized performance plots (e.g. error vs wall-clock time) with each method for added clarity and completeness.

Response to Q1 (Literature review):

We thank the reviewer for the helpful clarifications regarding literature discussions. We agree that the abstract and introduction could be more precise in the positioning of the PINNs and BSDE literature. Specifically,

  • We acknowledge that, as noted by the reviewer, [6] demonstrates that BSDE solvers can exceed the performance of standard PINNs solvers in high-dimensional settings through, e.g., the proposed diffusion loss. However, as shown in [6], standard (EM-)BSDE methods do indeed struggle compared with PINNs; our statement about existing (EM-)BSDE solvers in the abstract was intended to refer to this phenomenon. It is this gap in performance, in addition to trying to answer why the diffusion loss empirically improves performance, that motivated the present analysis. We apologize for the confusion, and in the final version we will revise the abstract and introduction to clarify this point.
  • We agree that the characterization of early PINNs as “scalable alternatives” for high-dimensional PDEs should be clarified. As pointed out by the reviewer, the original PINNs works [1,2] cited primarily focused on low-dimensional PDEs. However, we do believe that the motivation for using PINNs-like methods for solving high-dimensional PDEs arose quite early in their development. For example, the initial work on Deep Galerkin Methods (DGMs) from [Sirignano and Spiliopoulos, 2018] was motivated by the search for mesh-free algorithms for high-dimensional PDEs. Furthermore, as another example, practitioners from the control theory community started to use PINNs methods to solve Hamilton-Jacobi reachability PDEs [Bansal and Tomlin, 2020] not long after the DGM work, with the explicit motivation of avoiding the curse of dimensionality of classic grid-based PDE solvers. Nevertheless, we agree that the work of [Hu et al., 2023] does seem to be the first to explicitly study the extent to which PINNs methods can be scaled to high dimensions (e.g., 100k dimensions) through algorithmic modifications to the training. We will certainly update our related work to reflect this chronology more accurately.
  • We agree that [3] is a distinct method from PINNs and should not be grouped under the PINNs umbrella; thanks for catching this mistake. We will correct this in our revision.

Response to Q3 (Stratonovich-based BSDE theory for deep-learning based methods):

This is an interesting question, and we believe the answer to be yes. We answer the question in the context of the self-consistency losses we consider, but similar answers should also hold for the recursive formulations (where separate models are estimated at every timestep; see Line 116 for the distinction). Our Theorem 4.2 combined with Eq. (4.8) is essentially the starting point for such a Stratonovich theory, showing that the error of the $L_{Heun,\tau}(\theta)$ loss scales as the PDE residual term $R[u_\theta]$ (integrated along trajectories of $X_t$) plus an $O(\tau^{1/2})$ term that arises from the discretization. Hence, by choosing the discretization small enough, the difference between the Heun loss and the PDE residual integral can be made arbitrarily small. What is left to derive is that a small PDE residual error implies control of, e.g., the distance between the true backwards SDE paths $(Y_t)$ and the approximate paths $(\hat{Y}^\theta_{nk})$ generated by the model $u_\theta$, or of other measures that quantify solution quality for downstream applications. This part, while technical in nature, should at a high level follow arguments similar to the cited papers for the Ito setting.

Comment

Thanks for the response. My main concerns were addressed, and I have updated my score.

Review
5

This paper studies backward stochastic differential equation (BSDE) based deep learning methods for solving high-dimensional partial differential equations (PDEs). These methods provide an alternative to the popular PINNs and have the advantage that the underlying equation does not need to be known, but their performance has been lagging behind PINNs. The authors identify the cause of this performance gap as the Euler-Maruyama (EM) integrator typically used for BSDE methods, which is based on the Ito SDE formulation and is shown by the authors to induce a significant discretization bias. The authors propose to employ the Stratonovich SDE interpretation and use the Heun integration scheme, which eliminates this issue. The authors empirically demonstrate that the fixed BSDE algorithm is competitive with PINNs in solving three PDE benchmark problems.

Strengths and Weaknesses

Strengths

S1. The paper addresses an important problem of engineering alternative approaches to PINNs for PDE solving which have qualitatively different characteristics (e.g., in the required information for training).

S2. The authors theoretically show that BSDE methods with EM integrator suffers from a bias term which can be problematic even in the limit of infinite training data, and demonstrate it through an informative example (Theorem 4.1; Figure 1). The algorithmic fix based on Stratonovich BSDEs and Heun integration are well-motivated and analyzed to avoid the discretization error (Theorem 4.2; Figure 1). A potential fix based on multi-step losses is further analyzed in Section 5.

S3. The experimental results in Section 6 support the theoretical claims made in Sections 4 and 5. The baselines are properly chosen, the results are presented clearly, and limitations are transparently discussed.

Weaknesses

My concerns are mostly about the limitations already discussed in the paper by the authors.

W1. As the authors discussed in Section 7, the fact that the proposed method requires computing the Hessian leads to a large computational overhead, including the requirement to use float64 precision. While this is properly acknowledged and discussed (including Appendix E.2), it would be informative to, e.g., have a controlled EM-BSDE baseline which is allowed to use a similar amount of computation for integration, if it is possible.

Minor comment

  • Line 35: involving -> involves?

Questions

I have no particular questions, but would like to hear the authors' opinion on W1.

Limitations

The authors discussed limitations in Sections 6, 7 and Appendix A.

Final Justification

The authors have clearly addressed the original main concern on computational cost through a careful algorithmic optimization.

Formatting Issues

I did not find particular formatting concerns.

Author Response

We thank the reviewer for their thoughtful and constructive review. We greatly appreciate the recognition of the importance of studying alternatives to PINNs approaches for solving PDEs, in addition to the well-motivated approach that we take in this work.

Response to W1 (Computational Costs):

(Note that similar questions were also brought up by Reviewers Fvez and sGAy, so we have duplicated our response to make the rebuttals self-contained).

We agree that the computational costs of the Heun integration presented in the original draft are key limitations for practical implementation. Hence, we spent a bit of time investigating how we could improve our implementation. We are happy to report that we have improved our implementation—for all methods—and the result is that the computational overhead of the Heun implementation is much more reasonable (e.g., <2x overhead over PINNs/FS-PINNs). Due to the given rebuttal time constraints, we were unfortunately only able to re-run our new implementation for all methods on the 100D HJB case, but we expect the computational benefits from the implementation changes we made (which we describe below) to carry over to all PDE cases. We will of course re-run all methods over all PDE cases for the revised draft.

We now describe the implementation changes we made. In the original Heun-BSDE implementation, the forward and backward SDE rollouts were performed simultaneously, evaluating $L_{Heun_{k,\tau}}(\theta)$ at each step. As a result, the loss gradient calculation required back-propagation through the entire rollout; we had to resort to gradient checkpointing when combined with the Hessian calculation of $u$ in order to avoid running out of memory. In our new implementation, we first perform a forward SDE rollout to generate trajectories, $\mathcal{X}=\lbrace (x^i_t,t)\mid t \in [0,T],\, i=1,\dots,N \rbrace$, then calculate the loss in one pass by batching across $\mathcal{X}$. Additionally, we further optimize by calculating the loss on a subsample $\mathcal{B}\subset \mathcal{X}$, which importantly decouples the batch size from the trajectory length. Both the EM-BSDE and FS-PINNs methods can be (and were) optimized in the same manner, with the exception of the no-resetting method (EM-BSDE (NR)) (see Lines 229-232), which requires additional considerations due to its propagation of $Y_t$. In the following table, we report the average performance and overhead across three runs for each method with a batch size of 256:

100D HJB  | RL2            | Runtime (s) | Overhead
PINNs     | 0.1349 ± .0368 | 2170 ± 9    | 1x
FS-PINNs  | 0.0611 ± .0137 | 2227 ± 4    | 1.03x
EM-BSDE   | 0.3549 ± .0119 | 384 ± 2     | 0.18x
Heun-BSDE | 0.0464 ± .0050 | 3846 ± 78   | 1.77x

These results align closely with the RL2 errors reported in Table 1, albeit with significantly lower runtimes (cf. Table 4). We also note some key details:

  • The updated trajectory rollout schemes for FS-PINNs and EM-BSDE are more efficient with nearly zero overhead. This is reflected in the runtime of FS-PINNs vs. PINNs.
  • The Heun-BSDE scheme requires approximately double the compute for Heun-based trajectory rollouts and loss calculation versus FS-PINNs, which is reflected in its overhead.
  • For the specific elliptic/parabolic PDE we consider in (3.1), EM-BSDE does not require the computationally expensive Hessian computation of $u$ in the backwards SDE, while Heun-BSDE (and (FS)-PINNs) does, leading to significantly faster runtimes. This does not necessarily hold true for all PDEs, however. In general, $h[u]$ can also depend on $\nabla^2 u$ as noted in e.g., [Pham et al., 19] referenced by Reviewer DQqj, allowing for fully nonlinear second-order PDEs. Alternatively, when solving first-order PDEs (e.g., deterministic HJB problems), the Stratonovich formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma \circ \mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t)\cdot f(X_t,t)\right]\mathrm{d}t + Z_t^T \circ \mathrm{d}B_t$$
does not introduce a Hessian term, while the Ito formulation
$$\mathrm{d}X_t = f(X_t,t)\,\mathrm{d}t + \sigma\,\mathrm{d}B_t, \qquad \mathrm{d}Y_t = \left[\partial_t u(X_t,t) + \nabla_x u(X_t,t)\cdot f(X_t,t) + \frac{\sigma^2}{2}\mathrm{Tr}\left[\nabla^2_{xx} u(X_t,t)\right]\right]\mathrm{d}t + Z_t^T\,\mathrm{d}B_t$$
requires the Hessian term for calculation. In these cases, we actually expect Heun-BSDE to either match or be faster than EM-BSDE.

In addition, we ran the original EM-BSDE solver for the same compute time as the original Heun-BSDE method on the 100D HJB case and, as predicted by our theoretical analysis, found that EM-BSDE does not improve significantly with more iterations (see table).

Iterations | Run Time   | RL2
100,000    | 0.61 hours | 0.3626 ± .0113
2,000,000  | 14.5 hours | 0.3394 ± .0093

While EM-BSDE converges faster than all other methods, the RL2 error quickly plateaus, leading to little improvement despite further optimization.

Overall, the updated results align with the paper’s original claims while demonstrating the potential for practical application. In the final revision, we will update all relevant results with the updated implementation discussed above and further include our insights on Hessian computations. Additionally, we will include runtime-normalized performance plots (e.g. error vs wall-clock time) with each method for added clarity and completeness.

Comment

Thank you for the response. My main concern has been addressed and I accordingly updated my assessment.

Final Decision

This paper investigates a critical issue in BSDE-based deep learning methods for solving PDEs: a performance gap relative to PINNs caused by a discretization bias from the standard Euler-Maruyama (EM) integrator. Through both compelling theoretical analysis and empirical evidence, the authors demonstrate that this bias is a major drawback. To resolve it, they introduce a novel approach that uses the Stratonovich SDE interpretation and the Heun integration scheme, which successfully eliminates the bias. While initial concerns about high computational overhead were noted, the authors' rebuttal effectively addresses this weakness, showing substantial efficiency improvements that make their fixed algorithm highly competitive. Overall, the work is a significant contribution, backed by rigorous theory and solid experiments, and presents a promising, viable alternative to PINNs.