PaperHub
Overall rating: 7.8/10 · Spotlight · 4 reviewers
Ratings: 5, 5, 4, 5 (min 4, max 5, std 0.4) · Confidence: 3.5
Novelty: 3.0 · Quality: 3.0 · Clarity: 2.5 · Significance: 2.5
NeurIPS 2025

Diffusion-Based Hierarchical Graph Neural Networks for Simulating Nonlinear Solid Mechanics

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29
TL;DR

We combine denoising diffusion probabilistic models and hierarchical graph neural networks to autoregressively simulate physical dynamics on unstructured meshes.

Abstract

Keywords
Neural PDE Solvers · Diffusion Models · Graph Neural Networks · Learned Simulation · AI4Science

Reviews & Discussion

Review (Rating: 5)

The paper introduces a method for learned, GNN-based simulation targeted at solid mechanics applications. It introduces a novel rolling diffusion method and hierarchical message passing scheme which allow fast & accurate predictions.

Strengths & Weaknesses

Strengths:

  • Quasistatic elasticity simulations are challenging for GNN simulators; yet the presented approach does seem to be quite effective at them
  • All the absolutely necessary ablations are there (even though comparisons to other HGNN or diffusion approaches would have been nice)
  • This is a nice demonstration that rolling diffusion can be a useful approach for learned simulators
  • Overall, the paper is well written and easy to follow

Weaknesses:

  • The main issue I see is that the paper splits its attention too much between rolling diffusion, HGNNs and applications in solid mechanics. The paper does the minimum diligence in all of these areas, but doesn't deliver fundamental insights in any of them that might transfer to other work. E.g., there are so many HGNN papers with slightly different choices of graph construction and message propagation, and this paper adds yet another option without properly evaluating it against other options. Similarly, I feel the effect of conditioning on not fully denoised previous timesteps is underexplored. There is one ablation on this with downstream performance, but it would have been much more interesting to develop an understanding of what this does to simulation rollouts; e.g. does this affect coherence of high-frequency details?
  • The choice of experimental domain is a bit odd. Quasistatic elasticity is a good domain for studying hierarchical message passing, but less so for diffusion-- these simulations are very deterministic and smooth. So diffusion a) seems much less important here but also b) the tradeoffs in conditioning on not fully resolved diffusion might look very different in a less smooth domain. So it would have been good to also include a domain with more chaotic dynamics on multiple scales (e.g. CFD, weather, fracture,...).
  • (Minor) The description of ROBI and corresponding figure could be improved; it's a bit of a notation-fest, and the figure is hard to parse. It took me a few passes to understand which step is conditioned on what exactly in diffusion & simulation time.

Questions

No major open questions. Overall I think this is a paper worth accepting, as this is an interesting approach which shows progress on a challenging problem. I would however encourage the authors to think about more thoroughly investigating the effect of some of the choices made in the paper, either as follow-up work if this paper is accepted, or as a major revision of this paper if it is not. E.g. I'd love to see a proper review of different ways to perform HGNN, and how this affects different physical domains, from smooth simulations with long-range effects such as the ones studied here, to local chaotic CFD simulations. Similarly, a deep dive into the various choices in diffusion & denoising and the effect this has on consistency and frequency distribution could be very interesting.

Limitations

yes, except for the points mentioned above.

Final Justification

I think this is a promising paper, and I'm in favor of accepting it to NeurIPS. The authors addressed all the issues raised in the review and performed additional experiments which will further strengthen the paper.

Formatting Concerns

none

Author Response

We appreciate the reviewer's positive comments on the effectiveness of our approach for quasistatic elasticity, the inclusion of necessary ablations, and the clarity of the paper. We address the concerns raised below.

The choice of experimental domain is a bit odd. Quasistatic elasticity is a good domain for studying hierarchical message passing, but less so for diffusion-- these simulations are very deterministic and smooth.

It might seem counterintuitive, but we found that preserving smoothness is often difficult for HGNNs, which makes diffusion attractive in this domain. Models trained on an MSE loss must accurately capture global deformations, and even slight shifts or rotations can increase error. HGNNs often accumulate errors from high frequencies that are less penalized, which leads to local mesh distortions and drifting out of the training distribution over time. Non-smooth solutions result in significant deviations in the displacement gradient, which directly impacts stress.

To analyze this behavior, we added a truncation parameter to the ROBIN algorithm, which allows us to stop the denoising process early. For example, we can stop after the first $K_I = 5$ diffusion steps (instead of denoising all $K = 20$ steps) and treat the reconstructed clean sample as our final prediction. We have evaluated the rollout RMSE of ROBIN for different truncation values $K_I$. For each edge $(i,j)$ of the fine mesh, we have also evaluated the displacement gradient $\|\mathbf{u}_i - \mathbf{u}_j\|_2 / \|\mathbf{x}_i - \mathbf{x}_j\|_2$ along that edge and calculated its RMSE.
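For clarity, a minimal sketch of this per-edge gradient metric (our paraphrase; tensor shapes and names are assumptions, not the authors' code). The gradient RMSE in the table is then computed between the predicted and reference edge gradients:

```python
import torch

# Per-edge displacement gradient ||u_i - u_j|| / ||x_i - x_j||, as defined
# above. u and x are (N, d) node tensors; edges is a (2, E) index tensor.
def edge_gradient(u: torch.Tensor, x: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
    i, j = edges
    return (u[i] - u[j]).norm(dim=-1) / (x[i] - x[j]).norm(dim=-1)
```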

All values are rollout (gradient) RMSE in units of $10^{-3}$.

| Inference variant | $K_I$ | BendingBeam RMSE | BendingBeam grad. RMSE | DeformingPlate RMSE | DeformingPlate grad. RMSE | ImpactPlate RMSE | ImpactPlate grad. RMSE |
|---|---|---|---|---|---|---|---|
| ROBIN - one step | 1 | 64.232 +/- 11.862 | 324.047 +/- 285.654 | 10.787 +/- 0.521 | 157.279 +/- 8.338 | 21.462 +/- 2.115 | 23.414 +/- 4.853 |
| ROBIN - two step | 2 | 48.936 +/- 7.248 | 200.437 +/- 128.285 | 7.263 +/- 0.179 | 150.325 +/- 48.476 | 15.788 +/- 1.714 | 8.609 +/- 1.019 |
| ROBIN - three step | 3 | 42.831 +/- 4.536 | 147.966 +/- 65.129 | 6.422 +/- 0.225 | 110.958 +/- 22.329 | 14.576 +/- 1.396 | 6.066 +/- 0.560 |
| ROBIN - five step | 5 | 35.812 +/- 3.299 | 93.999 +/- 20.897 | 5.729 +/- 0.299 | 73.134 +/- 5.636 | 13.011 +/- 0.977 | 3.677 +/- 0.371 |
| ROBIN - ten step | 10 | 30.349 +/- 2.409 | 37.582 +/- 4.711 | 5.181 +/- 0.347 | 41.587 +/- 1.758 | 12.379 +/- 0.923 | 2.610 +/- 0.316 |
| ROBIN | 20 | 29.002 +/- 1.997 | 12.765 +/- 0.716 | 4.960 +/- 0.355 | 29.098 +/- 1.494 | 12.330 +/- 0.958 | 2.553 +/- 0.334 |

The initial diffusion steps concentrate on global solution frequencies, significantly reducing the global RMSE. Later steps focus on high frequencies, reducing local gradient error. Over time, these high-frequency errors can lead to mesh distortions, as observed in MSE-trained HGNNs.

Similarly, I feel the effect of conditioning on not fully denoised previous timesteps is underexplored. […] e.g. does this affect coherence of high-frequency details? […] Similarly, a deep dive into the various choices in diffusion & denoising and the effect this has on consistency and frequency distribution could be very interesting.

We agree that investigating how conditioning on partially denoised states affects the high-frequency components would be helpful. The table below shows how the denoising stride $m$ affects the rollout RMSE and rollout gradient RMSE on all datasets.

All values are rollout (gradient) RMSE in units of $10^{-3}$.

| Inference variant | $m$ | BendingBeam RMSE | BendingBeam grad. RMSE | DeformingPlate RMSE | DeformingPlate grad. RMSE | ImpactPlate RMSE | ImpactPlate grad. RMSE |
|---|---|---|---|---|---|---|---|
| ROBIN | 1 | 29.002 +/- 1.997 | 12.765 +/- 0.716 | 4.960 +/- 0.355 | 29.098 +/- 1.494 | 12.330 +/- 0.958 | 2.553 +/- 0.334 |
| Modest denoising stride | 5 | 28.987 +/- 1.932 | 12.741 +/- 0.653 | 4.916 +/- 0.346 | 28.828 +/- 1.566 | 13.865 +/- 1.900 | 1.894 +/- 0.084 |
| Conventional inference | 20 | 28.896 +/- 1.906 | 12.747 +/- 0.696 | 4.960 +/- 0.355 | 28.977 +/- 1.556 | 14.431 +/- 1.933 | 1.916 +/- 0.074 |

ROBIN with $m=1$, in which the previous physical time step is always denoised only one diffusion step further, does not increase the gradient error on BendingBeam or DeformingPlate. However, there is a modest increase on ImpactPlate, and ROBIN achieves the lowest RMSE overall. We hypothesize that conditioning on partially denoised states keeps low-frequency components anchored and reduces drift. In contrast, conditioning on fully denoised states (larger $m$) reduces short-term, high-frequency error accumulation. However, this phenomenon does not seem to affect long rollouts. We plan to investigate this further in future work, as well as how it is influenced by alternative diffusion backbones and schedulers.

[…] there are so many HGNN papers with slightly different choices of graph construction and message propagation, and this paper adds yet another option without properly evaluating it against other options.

We acknowledge the many HGNNs in the literature and emphasize that our primary contribution is the synergy of diffusion with our AMPNs, not a new HGNN. Our AMPNs follow the architecture of algebraic multigrid methods to solve for frequencies at different scales. For example, we use an explicit solver layer at the coarse level to solve for global frequencies and shared intra-level message passing stacks for high frequencies w.r.t. the mesh scale of the level. We combine this approach with diffusion to specifically address all frequencies during training and inference. Our ablation studies highlight the synergy between the two. We lose accuracy on all tasks if we remove one of them, but we significantly outperform the baselines when they are used together.

To further demonstrate the benefits of combining our AMPNs with diffusion, we conducted an ablation study on the Bending Beam dataset, replacing our AMPNs with the HCMT architecture. Additionally, we compared our results to those of another hierarchical GNN baseline: the Bi-stride multiscale-Graph Neural Networks (BSMS-GNNs) [1].

BendingBeam, rollout RMSE in units of $10^{-3}$:

| Model | Rollout RMSE |
|---|---|
| ROBIN | 28.896 +/- 1.906 |
| ROBIN - HCMT | 111.049 +/- 1.146 |
| HCMT | 121.526 +/- 1.865 |
| BSMS-GNNs | 141.979 +/- 7.960 |
| MGN | 189.515 +/- 70.865 |

Although integrating diffusion into HCMT improves RMSE compared to HCMT alone, ROBIN significantly outperforms both. BSMS-GNNs are nearly as accurate as the HCMT baseline, but significantly more accurate than non-hierarchical MeshGraphNets. Only the combination of diffusion and AMPNs can accurately resolve the wide frequency spectrum of BendingBeam.

To further demonstrate the advantages of the AMPN architecture, we have fine-tuned ROBIN on the BendingBeamLarge dataset. This dataset contains simulations spanning 100 time steps and meshes with up to 16K nodes. Although ROBIN was trained on BendingBeam with 3 hierarchy levels (~750 nodes), the shared blocks of the AMPNs enable direct application of ROBIN to BendingBeamLarge meshes with 4 levels. We fine-tuned ROBIN for 750K iterations with a batch size of 1 and a learning rate of $10^{-6}$. For comparison, we also trained a ROBIN model from scratch with the same settings, but with a learning rate decaying from $10^{-4}$ to $10^{-5}$.

BendingBeamLarge (rollout RMSE in units of $10^{-3}$):

| ROBIN variant | Rollout time [s] | Rollout RMSE |
|---|---|---|
| pre-trained | 30.936 +/- 0.627 | 78.929 +/- 4.971 |
| untrained | 30.945 +/- 0.580 | 215.500 +/- 12.135 |

As expected, the pre-trained ROBIN model predicts solutions with a much lower RMSE than an untrained model. The pre-trained model can be quickly adapted to a large-scale dataset without architectural changes, which highlights ROBIN’s generalizability to different mesh sizes.

(Minor) The description of ROBI and corresponding figure could be improved; […] It took me a few passes to understand which step is conditioned on what exactly in diffusion & simulation time.

We thank the reviewer for pointing that out. We will clarify the description of ROBI, as well as Figure 2. In particular, we will clarify the conditioning process on the partially denoised, reconstructed state $\tilde{u}_{0|k}^{t-1}$ and explain how the reconstruction process of that state depends on previous time steps.

We appreciate the positive and constructive feedback from the reviewer and will make the required updates in the final version. We are pleased to address any additional inquiries during the discussion phase.

[1] Yadi Cao, et al. Efficient learning of mesh-based physical simulation with bi-stride multi-scale graph neural network. In International conference on machine learning. PMLR, 2023.

Comment

Thank you for the additional experiments. I think it's definitely worth including those in the paper, or at least the appendix -- I found especially the first experiment on truncated runs quite insightful (maybe with a figure, instead of a table).

I think this is a good paper, and will retain my (positive) score.

Comment

Dear reviewer,

Thank you for your positive feedback and continued support of our work. We will follow your recommendation to include the additional experiments in the appendix and present the truncated runs as a figure.

Review (Rating: 5)

This work proposes an auto-regressive hierarchical graph neural network and diffusion model based framework for modelling the dynamics of systems in solid mechanics from a Lagrangian frame. The paper extends PDE-Refiner (Lippe et al., 2023) with several key extensions. They extend onto problems on unstructured meshes with Lagrangian dynamics using a hierarchical GNN. The mesh hierarchy utilises a novel AMG-based approach from classical FE meshing, which creates visually more balanced and appealing coarse meshes compared to recent ML baselines. The authors also introduce a strategy, Rolling Diffusion Batched Inference (ROBI), that allows accelerated roll-out at inference time by initialising the denoising process of future states conditioned on the "early-denoising" of preceding states. Experiments are performed on 3 dynamical problems from solid mechanics, baselined against 2 SOTA models, MGN and HCMT. A sensitivity analysis of the ROBI procedure w.r.t. the denoising stride $m$ and the number of denoising steps $K$, and a model ablation study of key architecture components are also performed.

Strengths & Weaknesses

This paper would be a straight accept for me if the quality of the write-up was better, but I had to spend quite some time to extract the details. In particular, the clarity of the write-up in sections 3.1 and 3.2 regarding ROBI and the HGNN could be improved; I list my concerns below to see if they can be improved.

In section 3.1: Regarding the v-prediction target, the explanation was not clear. My eventual understanding is when k=K (first denoising steps) this is where we sample pure noise which is input to the model but also the target is closest to the 1 step direct target. Conversely when k~0 (latter denoising steps) the target is the remaining noise on the sample. So when the SNR is the highest the target is noise and when the SNR is the lowest the target is the signal. I found this quite counter intuitive and perhaps it can be clarified/reworded. Also, "One step models" are not defined, this confused me for a long time until I read PDE-Refiner paper.

  • There is some ambiguity about whether there are also neural nets used to predict $\mu$ and $\Sigma$. But are these different to the main denoiser that predicts the denoising velocity, and if so, how are they trained?
  • There are no definitions of t_B/k_B/t_b/k_b
  • It was not immediate to derive the formula on line 148; it could be pointed out that this comes from substituting $\epsilon$ from the noising step into Equation (1).
  • Clarifications on figure 3, what is meant by "pooling", as pooling is usually associated as a step within coarsening. In the caption does "1 (top)" refer to line c) ?

In the ROBI section, the writing would benefit from clarifying/reasserting that the network has 2 inputs, the t-1 condition and the current t sample at denoise level k, as this is non-standard in diffusion models; AR is a niche case. I now understand this is because the model takes in 2 arguments, the previous step as condition and the current noisy sample. It is the "early-denoising" of the preceding t-1 conditioning that gives a "trade-off" against accuracy when m<<K, as it is utilised from a less developed state. Similarly, for the sentence that starts "Using these properties, we begin", I thought for a long time the N(0,1) was a typo, given the prior sentence is motivating "sufficient to condition the next step".

In section 3.2: The description of root-node AMG coarsening is also unclear; I would not be able to implement it from the description. Neither the root-node nor the smoothing aspects of "algebraic root-node-based smooth aggregation" are explained. The description of upsampling/downsampling is also not well explained: "Those matrices are constructed by smoothing the sparse initial aggregation mapping using the adjacency matrix". I think I understand there is some kind of graph diffusion rewiring, but providing more details would again strengthen the paper.

I have some small evaluation concerns. No numerical figures are reported, just bar charts that are hard to read and interpret. For the rollout in Figure 5b, I understand this is a sensitivity study for different $m$ and $K$, but it is a slight red flag that ML baseline and even classic data generation times are not provided, especially since this is a diffusion model where inference time is a known challenge.

Questions

No

Limitations

Yes

Final Justification

The provided clarifications and model inference timings have addressed my concerns.

Formatting Concerns

No

Author Response

This paper would be a straight accept for me if the quality of the write up was better but I had to spend quite some time to extract the details. In particular the clarity of the write up in sections 3.1 and 3.2 […].

We thank the reviewer for the positive feedback and for identifying areas that can be improved in our write-up. We will revise sections 3.1 and 3.2 for the final version. Below, we address the reviewer’s questions and clarify several ambiguities.

Regarding the v-prediction target, the explanation was not clear. […] I found this quite counter intuitive and perhaps it can be clarified/reworded.

The v-prediction $\mathbf{v}_k^t = \sqrt{\bar{\alpha}_k}\,\boldsymbol{\epsilon}^t - \sqrt{1-\bar{\alpha}_k}\,\mathbf{u}_0^t$ indeed behaves in the opposite way to the SNR. At early denoising steps (low SNR, $\bar{\alpha}_k \approx 0$), the target behaves like the clean signal $\mathbf{u}_0^t$. At later steps (high SNR, $\bar{\alpha}_k \approx 1$), the target behaves like the residual noise $\boldsymbol{\epsilon}^t$. Although this may seem counterintuitive, it simplifies the learning process. When the sample is good (high SNR), correcting the small residual is easier than regenerating the entire signal. When the sample is random (low SNR), the input contains little useful structure, so directly predicting the signal is easier than estimating the error. This reparameterization also keeps target magnitudes more uniform across timesteps.
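To make the two regimes explicit, the limits follow directly from the definition above (our added one-line derivation):

```latex
% Limits of the v-prediction target: at the noisy end of the chain the
% target is (minus) the clean signal, at the nearly-clean end it is the noise.
\mathbf{v}_k^t = \sqrt{\bar{\alpha}_k}\,\boldsymbol{\epsilon}^t - \sqrt{1-\bar{\alpha}_k}\,\mathbf{u}_0^t
\;\Longrightarrow\;
\mathbf{v}_k^t \to -\,\mathbf{u}_0^t \;\;(\bar{\alpha}_k \to 0,\ \text{low SNR}),
\qquad
\mathbf{v}_k^t \to \boldsymbol{\epsilon}^t \;\;(\bar{\alpha}_k \to 1,\ \text{high SNR}).
```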

Also, "One step models" are not defined, this confused me for a long time until I read PDE-Refiner paper. […] There are no definitions of t_B/k_B/t_b/k_b.

We will introduce one-step models as autoregressive models that are trained to predict the solution for the next time step using the solution from the previous step. We will define $t_B$ and $k_B$ more clearly and use them consistently.

There is some ambiguity about whether there are also neural nets used to predict $\mu$ and $\Sigma$. But are these different to the main denoiser that predicts the denoising velocity, and if so, how are they trained?

In Denoising Diffusion Probabilistic Models [1], we train a model to predict either the clean sample $u_0^t$, the noise $\epsilon^t$, or, as in our case, the velocity $v^t$. The Gaussian mean $\mu_\theta$ is then computed analytically from this output. The covariance matrix $\Sigma_\theta$ can also be learned by the same network, but we assume an isotropic covariance $\Sigma = \sigma_k^2 \mathbf{I}$.
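For reference, a hedged restatement of the standard DDPM posterior mean [1] in terms of the reconstructed clean sample $u_{0|k}^t$ (our addition, consistent with the sampling step used elsewhere in this response):

```latex
% DDPM posterior mean, written in terms of the reconstructed clean sample
% u_{0|k}^t; sigma_k^2 I is the fixed isotropic covariance mentioned above.
\mu_\theta\!\left(u_k^t, k\right)
  = \frac{\sqrt{\bar{\alpha}_{k-1}}\,\beta_k}{1-\bar{\alpha}_k}\, u_{0|k}^t
  + \frac{\sqrt{\alpha_k}\,\left(1-\bar{\alpha}_{k-1}\right)}{1-\bar{\alpha}_k}\, u_k^t
```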

It was not immediate to derive the formula on line 148; it could be pointed out that this comes from substituting $\epsilon$ from the noising step into Equation (1).

Given a clean sample $u_0^t$ and noise $\epsilon^t$, the forward diffusion process produces at step $k$ a noisy sample $u_k^t = \sqrt{\bar\alpha_k}\,u_0^t + \sqrt{1-\bar\alpha_k}\,\epsilon^t$ [1]. We eliminate $\epsilon^t$ by substituting the v-prediction definition from Equation (1), $v_k^t = \sqrt{\bar\alpha_k}\,\epsilon^t - \sqrt{1-\bar\alpha_k}\,u_0^t$, and solve for the clean sample $u_0^t = \sqrt{\bar\alpha_k}\,u_k^t - \sqrt{1-\bar\alpha_k}\,v_k^t$ [1]. Given a model prediction $v_\theta(u_k^t, k, u_0^{t-1})$, we can reconstruct a clean sample $u_{0|k}^t = \sqrt{\bar\alpha_k}\,u_k^t - \sqrt{1-\bar\alpha_k}\,v_\theta(u_k^t, k, u_0^{t-1})$ using this equation.
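In code, this reconstruction is a one-liner; a minimal sketch (function and argument names are ours, for illustration):

```python
import torch

# Reconstruct the clean-sample estimate u_{0|k} from a noisy sample and a
# v-prediction, following the equation above. Names are illustrative only.
def reconstruct_clean(u_k: torch.Tensor, v_pred: torch.Tensor, a_bar_k: float) -> torch.Tensor:
    return (a_bar_k ** 0.5) * u_k - ((1.0 - a_bar_k) ** 0.5) * v_pred
```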

Clarifications on figure 3, what is meant by "pooling", as pooling is usually associated as a step within coarsening. In the caption does "1 (top)" refer to line c) ?

In our AMG-based hierarchy, we create coarse meshes and build up- and downsampling graphs that connect nodes across levels. Figure 3d shows the edges between levels 0 and 1 at the top and between levels 1 and 2 at the bottom. All the nodes at the top correspond to level 0 (Figure 3a), while the subset of nodes marked in bright blue corresponds to level 1 (Figure 3c). Similarly, all the nodes in the bottom part of Figure 3d correspond to the nodes in Figure 3c. Since "pooling" is misleading, Figure 3d will be renamed "AMG-based down- and upsampling graphs."

I understand this is a sensitivity study for different $m$ and $K$, but it is a slight red flag that ML baseline and even classic data generation times are not provided.

The table below compares the rollout time and rollout RMSE of ROBIN and the baselines.

Rollout time in seconds; rollout RMSE in units of $10^{-3}$.

| Model | BendingBeam time | BendingBeam RMSE | DeformingPlate time | DeformingPlate RMSE | ImpactPlate time | ImpactPlate RMSE |
|---|---|---|---|---|---|---|
| MGN | 21.398 +/- 0.153 | 189.515 +/- 70.865 | 24.418 +/- 0.217 | 8.761 +/- 0.293 | 2.927 +/- 0.031 | 54.068 +/- 5.878 |
| HCMT | 25.818 +/- 0.280 | 121.526 +/- 1.865 | 31.572 +/- 0.144 | 8.035 +/- 0.133 | 4.331 +/- 0.050 | 19.571 +/- 0.382 |
| ROBIN | 15.026 +/- 0.040 | 29.002 +/- 1.997 | 61.601 +/- 0.291 | 4.978 +/- 0.332 | 4.943 +/- 0.010 | 12.330 +/- 0.958 |

The numerical solver required 46.2 s on average (max. 186.1 s) on BendingBeam, 1157.2 s on DeformingPlate [2], and 742.6 s on ImpactPlate [3].

We also evaluated ROBIN on BendingBeamLarge (200 training, 20 validation, and 20 test simulations), containing simulations of 100 time steps on beam meshes with 5K–16K nodes. On average, the solver required 108.3 s (max. 4248.0 s) per simulation. We compare fine-tuning a model pre-trained on BendingBeam to training from scratch. Both are trained for 750K iterations with a batch size of 1, while fine-tuning uses a learning rate of $10^{-6}$ and training from scratch an exponential decay from $10^{-4}$ to $10^{-5}$.

BendingBeamLarge (rollout RMSE in units of $10^{-3}$):

| ROBIN variant | Rollout time [s] | Rollout RMSE |
|---|---|---|
| pre-trained | 30.936 +/- 0.627 | 78.929 +/- 4.971 |
| untrained | 30.945 +/- 0.580 | 215.500 +/- 12.135 |

We extended ROBI to include the option of truncating the denoising process after $K_I$ steps and using the partially denoised state as the final prediction. The table below shows (over 5 seeds) how different truncation levels $K_I$ compare to conventional inference and ROBIN. All evaluations are based on the same trained models.

Rollout time in seconds; rollout RMSE in units of $10^{-3}$.

| Inference variant | $K$ | $K_I$ | $m$ | BendingBeam time | BendingBeam RMSE | DeformingPlate time | DeformingPlate RMSE | ImpactPlate time | ImpactPlate RMSE |
|---|---|---|---|---|---|---|---|---|---|
| Conventional inference | 20 | 20 | 20 | 162.426 +/- 0.891 | 28.896 +/- 1.906 | 190.100 +/- 1.333 | 4.960 +/- 0.355 | 22.607 +/- 0.358 | 14.431 +/- 1.933 |
| ROBIN | 20 | 20 | 1 | 15.026 +/- 0.040 | 29.002 +/- 1.997 | 61.601 +/- 0.291 | 4.978 +/- 0.332 | 4.943 +/- 0.010 | 12.330 +/- 0.958 |
| ROBIN - ten step | 20 | 10 | 1 | 11.030 +/- 0.099 | 30.349 +/- 2.409 | 33.705 +/- 0.098 | 5.181 +/- 0.347 | 2.896 +/- 0.009 | 12.379 +/- 0.923 |
| ROBIN - five step | 20 | 5 | 1 | 9.879 +/- 0.188 | 35.812 +/- 3.299 | 20.243 +/- 0.038 | 5.729 +/- 0.299 | 1.907 +/- 0.011 | 13.011 +/- 0.977 |
| ROBIN - two step | 20 | 2 | 1 | 9.475 +/- 0.105 | 48.936 +/- 7.248 | 12.588 +/- 0.127 | 7.263 +/- 0.179 | 1.424 +/- 0.005 | 15.788 +/- 1.714 |
| ROBIN - one step | 20 | 1 | 1 | 9.488 +/- 0.117 | 64.232 +/- 11.862 | 7.113 +/- 0.042 | 10.787 +/- 0.521 | 1.324 +/- 0.006 | 21.462 +/- 2.115 |

In the ROBI section, the writing would benefit from clarifying/reasserting that the network has 2 inputs, the t-1 condition and the current t sample at denoise level k.

In the revision, we will state explicitly that at denoising level $k$ during inference, the model takes the current noisy sample $u_k^t$ and the partially denoised state $\tilde{u}_{0|k}^{t-1}$ as input to predict the denoising velocity $v_\theta(u_k^t, k, \tilde{u}_{0|k}^{t-1})$. We will clarify that the refinement level of the reconstructed sample $\tilde{u}_{0|k+1-m}^{t-1}$ increases with $m$.

The description of root-node AMG coarsening is also unclear […] but providing more details would again strengthen the paper.

We use PyAMG’s rootnode_solver [4] to create the AMG hierarchy. The solver takes a sparse square matrix as input and returns a hierarchical cycle. We pass the fine-mesh adjacency matrix as the system matrix, and the solver yields coarsened adjacency matrices, corresponding coarse nodes, and up- and downsampling matrices. The nonzeros in these matrices define the edges of our up- and downsampling graphs. Thus, the implementation requires only the fine-mesh adjacency, the rootnode_solver with its default settings, and graph construction based on the returned matrices (i.e., non-zero values).

The rootnode_solver itself aggregates nodes based on connectivity within the provided matrix and selects one root per aggregate. These roots become the next-level coarse nodes, and each fine node connects to its aggregate's root in the up- and downsampling matrices. These operators are then smoothed using the adjacency, effectively enlarging each coarse node’s receptive field beyond its aggregate. The implementation will be included in the code publication after the paper is accepted.
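A hedged sketch of this pipeline (our reconstruction from the description above, not the released implementation; whether extra solver options are needed for a pure adjacency matrix is an assumption):

```python
import numpy as np
import scipy.sparse as sp
import pyamg

# Build the AMG hierarchy from a fine-mesh adjacency matrix and read the
# coarse graphs and the smoothed up-/downsampling edges off the returned
# multilevel operators, as described in the response above.
def amg_hierarchy(adjacency: sp.csr_matrix, max_levels: int = 4):
    ml = pyamg.rootnode_solver(adjacency.astype(float), max_levels=max_levels)
    hierarchy = []
    for level in ml.levels[:-1]:
        # Nonzeros of the level operator give the intra-level edges.
        a = level.A.tocoo()
        intra_edges = np.stack([a.row, a.col])
        # Nonzeros of the smoothed prolongation P (n_fine x n_coarse) give
        # the down-/upsampling edges between consecutive levels.
        p = level.P.tocoo()
        inter_edges = np.stack([p.row, p.col])  # (fine node, coarse node)
        hierarchy.append((intra_edges, inter_edges))
    return hierarchy
```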

We thank the reviewer for their comments and for pointing out the ambiguities. We will incorporate the revisions into the camera-ready version, and we encourage the reviewer to reach out with further questions during the discussion.

[1] Jonathan Ho, et al. Denoising diffusion probabilistic models. NeurIPS, 2020.

[2] Tobias Pfaff, et al. Learning Mesh-Based Simulation with Graph Networks. ICLR, 2020.

[3] Youn-Yeol Yu, et al. Learning Flexible Body Collision Dynamics with Hierarchical Contact Mesh Transformer. ICLR, 2023.

[4] Nathan Bell, et al. PyAMG: Algebraic multigrid solvers in Python. Journal of Open Source Software, 2022.

Comment

I thank the authors for the clarifications, which I am confident can be polished in the final transcript. Thank you for providing the timings, this is particularly encouraging that the model inference time (in part thanks to ROBIN) is comparable to non-diffusion baselines. I update my score accordingly.

Comment

Dear Reviewer,

Thank you for your encouraging feedback and updated assessment. We're glad our clarifications were helpful and appreciate your valuable comments and feedback, which we will incorporate into the final version.

Review (Rating: 4)

Graph-based simulators are widely used for mesh-based data modalities; however, these models struggle with long-range dependencies and correlations. Furthermore, autoregressive rollouts lead to error accumulation, which introduces another source of error. This paper introduces ROBIN, a diffusion-based learned simulator, to address these issues. A rolling diffusion denoising scheme is introduced to accelerate the overall generation process by parallelizing the diffusion across time. Furthermore, a hierarchical graph neural network built on multigrid coarsening is introduced to allow for multiscale message passing.

Strengths & Weaknesses

Strengths

  1. The concept of a rollout diffusion inference and autoregressive diffusion is rapidly growing popularity for world models and other scenarios where actions and inputs are continuously introduced to the model. The rollout diffusion idea of the paper is interesting to see in the context of solid mechanics.

  2. The multi-resolution message passing opens doors to mixing simulations at different scales as well. This is something that can be explored by other researchers to have generations informed by a variety of scales.

Weaknesses

  1. A major concern is the time-parallel denoising scheme for inference. The authors mention that for the generation of T frames (timesteps) with K denoising steps, the time complexity will be O(KT). However, the first diffusion models introduced for spatiotemporal data were not trained or used at inference with rollouts. These models (and most current video generative models) denoise all frames (timesteps) in parallel with an inflated UNet, so the actual complexity would be O(K). It seems the paper uses the argument of efficiency throughout to develop the rollout-style denoising scheme to achieve O(K-m + mT).

  2. The paper provides the arguments for how they get to O(K-m + mT) in the "Rolling Diffusion-Batched Inference (ROBI)" section; however, their claim in the introduction is that they reduce the inference time from O(KT) to O(T).

  3. The novelty of ROBI is unclear. How does ROBI differ from current autoregressive diffusion frameworks?

  4. The provided experiments are limited. Given the claim that a new denoising scheme is introduced, the authors should provide comparisons with other autoregressive diffusion frameworks (mostly in image or video generation). Otherwise, there is a wide range of PDE problems that are more commonly used for benchmarking that can be used to compare methods with baselines.

  5. Baselines are old and limited. There has been major research focusing on PDE generation in recent years; however, the paper only compares to two baselines, one of which is older than 5 years. MGN is commonly not used as a standalone baseline, and it helps complete an overall picture when also having other baselines.

Questions

  1. Can the authors provide a clearer picture of the novelty and comparison with existing autoregressive diffusion methods? It is not clear how the rollout denoising differs from others!

  2. If it is possible to put the dependence of the denoising between steps into an equation, it would become much clearer to the reader. Can the authors formulate such an equation (even if it is a small inline equation)?

  3. Can the authors clarify why a parallel generation is not possible in this concept? This way the complexity simply becomes O(T)!

  4. Why did the authors focus on a very limited area? Is it possible to expand this to other PDEs? This would help in understanding the generalizability.

Limitations

Yes

Final Justification

The authors provided further comparisons with other frameworks demonstrating improved rollouts. Furthermore, the denoising scheme is novel in this context. I believe the experiments are done on a very specific subset of PDEs, and generalizability is still a concern; however, given the novelties, I believe the reasons to accept the paper are marginally stronger than the reasons to reject.

Formatting Concerns

No concerns

Author Response

We thank the reviewer for their careful reading and constructive criticism. Below, we clarify our contributions and explain how they differ from existing work.

Can authors provide a clearer picture of the novelty and comparison with existing autoregressive diffusion methods?

ROBIN is designed for deterministic physical simulations rather than perceptual video generation. It combines a diffusion model with hierarchical graph neural networks on unstructured meshes. This allows it to capture different solution frequencies and propagate information across large spatial distances and mesh resolutions. CNN-based video models cannot do this. In addition, ROBIN outputs residual predictions rather than absolute states, thereby improving accuracy and stability.

ROBIN advances the simulation from time $t-1$ to time $t$, conditioned only on the previous clean state. This preserves causality and time-shift equivariance while reducing memory usage and training time. Sequence-based diffusion models that denoise entire histories often require learning warm-up and cool-down phases [1]. Conditioning on multiple past steps has been found to reduce the accuracy of learned simulators [2,3].

Our model is trained using a v-prediction target, residual predictions and an exponential diffusion scheduler. It requires only 5 diffusion steps to surpass the baseline accuracy. During the rebuttal, we evaluated inference truncation. We stop after a chosen number of denoising steps and use the reconstructed sample as the prediction, which decreases inference time. The table below reports (over 5 seeds) on variants that stop denoising after the second denoising step (ROBIN - two step), along with the full ROBIN setting and the baselines. We added Bi-stride multi-scale GNNs (BSMS-GNNs) [4] as a third baseline on BendingBeam and will include results for DeformingPlate and ImpactPlate in the camera-ready version.

Rollout time in seconds; rollout RMSE in units of $10^{-3}$.

| Model | BendingBeam time | BendingBeam RMSE | DeformingPlate time | DeformingPlate RMSE | ImpactPlate time | ImpactPlate RMSE |
|---|---|---|---|---|---|---|
| ROBIN - two step | 9.475 +/- 0.105 | 48.936 +/- 7.248 | 12.588 +/- 0.127 | 7.263 +/- 0.179 | 1.424 +/- 0.005 | 15.788 +/- 1.714 |
| ROBIN | 15.026 +/- 0.040 | 29.002 +/- 1.997 | 61.601 +/- 0.291 | 4.978 +/- 0.332 | 4.943 +/- 0.010 | 12.330 +/- 0.958 |
| MGN | 21.398 +/- 0.153 | 189.515 +/- 70.865 | 24.418 +/- 0.217 | 8.761 +/- 0.293 | 2.927 +/- 0.031 | 54.068 +/- 5.878 |
| HCMT | 25.818 +/- 0.280 | 121.526 +/- 1.865 | 31.572 +/- 0.144 | 8.035 +/- 0.133 | 4.331 +/- 0.050 | 19.571 +/- 0.382 |
| BSMS-GNNs | 4.896 +/- 0.124 | 141.979 +/- 7.960 | - | - | - | - |

ROBIN outperforms state-of-the-art baselines with only two diffusion steps, whereas video diffusion models typically require many more. While extending ROBI to video models is an interesting future direction, our models are explicitly designed for simulation. We do not claim to have made any contributions toward overcoming the limitations of current image or video generation frameworks.

There has been major research focusing on PDE generation in recent years; however, the paper only compares to two baselines, one of which is older than 5 years.

As noted above, we added BSMS‑GNNs [4] as another strong baseline. HCMT currently achieves the highest accuracy on benchmarks such as DeformingPlate, and MeshGraphNet (MGN), despite being five years old, remains competitive. This demonstrates the importance of ROBIN in surpassing the accuracy limits of solid-mechanics learned simulators.

The authors mention that for the generation of T frames (timesteps) with K denoising steps, the time complexity will be O(KT). […] Can the authors clarify why a parallel generation is not possible in this concept?

ROBIN generates trajectories autoregressively, predicting each step based on the previous one. Forecasting large time steps or an entire rollout at once reduces the accuracy of physical simulators [2,5]. The most accurate solid‑mechanics simulators, such as HCMT, are autoregressive.

Since physical simulations are Markovian, conditioning only on the previous step is sufficient. Jointly denoising all time steps ignores the evolving state and violates causality. Even using more than one past step can reduce accuracy [2,3]. Additionally, parallel generation is less desirable because sequence models must retain many time steps and large meshes in memory, treating entire trajectories as single training samples. Our one step training uses single transitions, which improves memory and data efficiency while enabling large-mesh experiments.

We demonstrate this scalability on BendingBeamLarge, which contains 200 training, 20 validation, and 20 test simulations of 100 time steps on beam meshes with 5K–16K nodes. We fine-tuned a pre-trained model and compare it with a model trained from scratch (over 5 seeds). Both used 750K iterations and a batch size of 1. The pre-trained model is fine-tuned with a learning rate of $10^{-6}$ and the untrained model with an exponential decay from $10^{-4}$ to $10^{-5}$.

BendingBeamLarge (rollout RMSE in units of $10^{-3}$):

| ROBIN variant | Rollout time [s] | Rollout RMSE |
|---|---|---|
| pre-trained | 30.936 +/- 0.627 | 78.929 +/- 4.971 |
| untrained | 30.945 +/- 0.580 | 215.500 +/- 12.135 |

Pretraining substantially improves accuracy and enables adaptation to meshes with up to 16K nodes.

The paper provides the arguments for how they get to O(K-m + mT) in the "Rolling Diffusion-Batched Inference (ROBI)" section; however, their claim in the introduction is that they reduce the inference time from O(KT) to O(T).

ROBI reduces inference costs from $O(KT)$ to $O(K-m+mT)$. Our experiments set $m=1$, yielding $O(K+T)$. Since time-dependent simulations typically have many more time steps $T$ than the $K=5$–$20$ diffusion steps, this simplifies to $O(T)$. For example, with $K=20$, $m=1$, and $T=100$, ROBI needs $K-m+mT = 119$ batched denoising passes instead of $KT = 2000$ sequential ones. We will clarify this assumption in the camera-ready version.

If it is possible to put the dependence of the denoising between steps into an equation, it would become much clearer to the reader.

Given a prediction horizon $h=K$, a batch of previous clean states $\mathbf{u}_0^{t_j-1}$ and noise samples $\Delta\mathbf{u}_{k_j}^{t_j}$ for the time steps $t_j = t+j$ with $j\in\{0,\dots,h-1\}$ and increasing diffusion steps $k_j = 1+j$, ROBIN outputs the diffusion velocities $\mathbf{v}_\theta(\Delta\mathbf{u}_{k_j}^{t_j}, k_j, \mathbf{u}_0^{t_j-1})$. We approximate the clean samples

$$\Delta\tilde{\mathbf{u}}_{0}^{t_j} \approx \Delta\tilde{\mathbf{u}}_{0|k_j}^{t_j} = \sqrt{\bar{\alpha}_{k_j}}\,\Delta\mathbf{u}_{k_j}^{t_j} - \sqrt{1-\bar{\alpha}_{k_j}}\,\mathbf{v}_\theta(\Delta\mathbf{u}_{k_j}^{t_j}, k_j, \mathbf{u}_0^{t_j-1})$$

and compute the next noisy samples

$$\Delta\mathbf{u}_{k_j-1}^{t_j} = \frac{\sqrt{\bar{\alpha}_{k_j-1}}\,\beta_{k_j}}{1-\bar{\alpha}_{k_j}}\,\Delta\tilde{\mathbf{u}}_{0|k_j}^{t_j} + \frac{\sqrt{\alpha_{k_j}}\,(1-\bar{\alpha}_{k_j-1})}{1-\bar{\alpha}_{k_j}}\,\Delta\mathbf{u}_{k_j}^{t_j} + \sigma_{k_j}\,\boldsymbol{\varepsilon}^{t_j}, \qquad \boldsymbol{\varepsilon}^{t_j} \sim \mathcal{N}(0, I).$$

Next, we reconstruct the state of each time step using a cumulative sum over the approximated clean samples, $\tilde{\mathbf{u}}_0^{t_j} = \mathbf{u}_0^{t-1} + \sum_{i=0}^{j} \Delta\tilde{\mathbf{u}}_0^{t+i}$, starting from the last fully denoised sample $\mathbf{u}_0^{t-1}$. Finally, we move one physical time step further, such that $t_j = t+1+j$. Therefore, we remove the denoised time step $t$ from the batch, initialize the sample of the new time step $t+h$ as Gaussian noise $\Delta\mathbf{u}_K^{t+h} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and predict the next batch $\mathbf{v}_\theta(\Delta\mathbf{u}_{k_j}^{t_j}, k_j, \tilde{\mathbf{u}}_0^{t_j-1})$ with $\tilde{\mathbf{u}}_0^{t_j-1} = \mathbf{u}_0^{t} + \sum_{i=1}^{j} \Delta\tilde{\mathbf{u}}_0^{t+i}$. Note that the proposed state reconstruction is done before the model call and allows the model to treat all time steps as a batch, where each prediction is conditioned only on the previous reconstructed state, e.g., $\mathbf{v}_\theta(\Delta\mathbf{u}_2^{t+2}, 2, \tilde{\mathbf{u}}_0^{t+1})$ for time step $j=1$.
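To make the loop concrete, here is a minimal sketch of the procedure above for denoising stride $m=1$. This is our paraphrase, not the authors' code: the names (model, schedule arrays a_bar, a, b, sigma indexed 1..K) are assumptions, the warm-up/cool-down of the rolling window is omitted with a crude initialization, and the per-slot model calls would be a single batched call in practice.

```python
import torch

@torch.no_grad()
def robi_rollout(model, u0, K, T, a_bar, a, b, sigma):
    # Slot j holds the noisy residual of physical step t+j at diffusion step k = 1+j.
    du = [torch.randn_like(u0) for _ in range(K)]
    cond = [u0 for _ in range(K)]  # crude init; the warm-up phase is omitted
    out, u_last = [], u0
    for _ in range(T):
        du0 = []
        for j in range(K):  # conceptually one batched model call
            k = j + 1
            v = model(du[j], k, cond[j])  # v_theta(du_{k_j}^{t_j}, k_j, cond)
            d0 = a_bar[k] ** 0.5 * du[j] - (1 - a_bar[k]) ** 0.5 * v
            du0.append(d0)
            if k > 1:  # one ancestral DDPM step: k -> k-1
                du[j] = (a_bar[k - 1] ** 0.5 * b[k] / (1 - a_bar[k])) * d0 \
                    + (a[k] ** 0.5 * (1 - a_bar[k - 1]) / (1 - a_bar[k])) * du[j] \
                    + sigma[k] * torch.randn_like(du[j])
        # Slot 0 reached k = 0: commit its clean residual (sigma_1 ~ 0) and
        # roll the window one physical step forward with a fresh noise slot.
        u_last = u_last + du0[0]
        out.append(u_last)
        du = du[1:] + [torch.randn_like(u0)]
        # Rebuild conditioning states via a cumulative sum of clean estimates.
        cond, run = [], u_last
        for j in range(K):
            cond.append(run)
            if j + 1 < K:
                run = run + du0[j + 1]
    return out
```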

Why did the authors focus on a very limited area? Is it possible to expand this to other PDEs?

The present work focuses on solid mechanics, a significant engineering discipline encompassing manufacturing, structural mechanics, and lightweight design, where accurate learned simulation remains under-explored. Solid-mechanics problems exhibit global dependencies and high solution frequencies, which present a challenge for learned simulators. However, ROBIN can handle both by effectively combining diffusion with hierarchical GNNs. Although this study is limited to solid mechanics, the method could be applied to other PDEs, including fluid dynamics; diffusion-based and hierarchical CNN-based simulators have improved image-based fluid simulations [1,2]. We plan to apply ROBIN to mesh-based fluid simulations in future work.

We thank the reviewer for the suggestions and hope that we have addressed all concerns. We will include the new results in the revised paper and welcome further questions during the discussion phase.

[1] David Ruhe, et al. Rolling Diffusion Models. ICML, 2024.

[2] Phillip Lippe, et al. PDE-Refiner: Achieving Accurate Long Rollouts with Neural PDE Solvers. NeurIPS, 2023.

[3] Tobias Pfaff, et al. Learning Mesh-Based Simulation with Graph Networks. ICLR, 2020.

[4] Yadi Cao, et al. Efficient learning of mesh-based physical simulation with bi-stride multi-scale graph neural network. PMLR, 2023.

[5] Zongyi Li, et al. Learning Chaotic Dynamics in Dissipative Systems. NeurIPS, 2022.

[6] Kiwhan Song, et al. History-Guided Video Diffusion. arXiv, 2025.

Comment

Dear reviewer,

Thank you for your thoughtful and detailed feedback. We hope that our response has clarified the concerns you raised and addressed the specific points you noted. Given the short discussion period, we would be grateful to know whether our responses have resolved your concerns or if there are any other areas in which we could improve the revision.

We value your consideration and the opportunity to strengthen the paper.

Review (Rating: 5)

The authors proposed a novel method, Rolling Diffusion Batched Inference (ROBI), that can accelerate inference in DDPM-based simulation, and they combined it with a Hierarchical Graph Neural Network, pairing multiscale message passing with ROBI to provide fast, accurate diffusion-based simulations.

Strengths & Weaknesses

The strengths of this paper are:

  1. The paper has a clear model structure and mathematical formulation, which reinforces the theoretical soundness and reliability of the proposed model.
  2. The model's novelty is good; although many people are using diffusion models, the authors improve on them here.
  3. The paper features a detailed model structure that is easy to follow and understand, and the experiment is well-designed, clear, and straightforward.

Questions

  1. How is the computation cost compared to HGNN? Does it take longer or shorter to train?
  2. How does ROBI learn the parameters compared to conventional diffusion?

Limitations

yes

Formatting Concerns

no

Author Response

We appreciate the thorough and favorable review, especially the recognition of our meticulous mathematical formulation and well-designed experimental setup. Below, we will briefly address the questions and concerns raised.

How is the computation cost compared to HGNN? Does it take longer or shorter to train?

We agree that it is important to report the computation cost against the baselines. Previously, we only presented ROBIN runtimes for various diffusion inference settings in Figure 5. We report the rollout time and RMSE of ROBIN and the baselines below. We also added an option to truncate the denoising process to speed up ROBIN's inference further. In this setting, we stop after a chosen number of denoising steps, treating the reconstructed sample as the prediction and skipping the remaining steps. We report on variants that stop after the first (ROBIN - one step), second (ROBIN - two step), and fifth (ROBIN - five step) denoising steps.

Rollout time in seconds; rollout RMSE in units of $10^{-3}$.

| Model | BendingBeam time | BendingBeam RMSE | DeformingPlate time | DeformingPlate RMSE | ImpactPlate time | ImpactPlate RMSE |
|---|---|---|---|---|---|---|
| ROBIN - one step | 9.488 +/- 0.117 | 64.232 +/- 11.862 | 7.113 +/- 0.042 | 10.787 +/- 0.521 | 1.324 +/- 0.006 | 21.462 +/- 2.115 |
| ROBIN - two step | 9.475 +/- 0.105 | 48.936 +/- 7.248 | 12.588 +/- 0.127 | 7.263 +/- 0.179 | 1.424 +/- 0.005 | 15.788 +/- 1.714 |
| ROBIN - five step | 9.879 +/- 0.188 | 35.812 +/- 3.299 | 20.243 +/- 0.038 | 5.729 +/- 0.299 | 1.907 +/- 0.011 | 13.011 +/- 0.977 |
| ROBIN | 15.026 +/- 0.040 | 29.002 +/- 1.997 | 61.601 +/- 0.291 | 4.978 +/- 0.332 | 4.943 +/- 0.010 | 12.330 +/- 0.958 |
| MGN | 21.398 +/- 0.153 | 189.515 +/- 70.865 | 24.418 +/- 0.217 | 8.761 +/- 0.293 | 2.927 +/- 0.031 | 54.068 +/- 5.878 |
| HCMT | 25.818 +/- 0.280 | 121.526 +/- 1.865 | 31.572 +/- 0.144 | 8.035 +/- 0.133 | 4.331 +/- 0.050 | 19.571 +/- 0.382 |

A key advantage of ROBIN is that our AMPNs are trained as one-step autoregressive models that are conditioned only on the previous time step. We observed no additional training time compared to conventional autoregressive models. All methods used a fixed two-day training budget.

Furthermore, we conducted an upscaling experiment demonstrating that ROBIN's architecture is independent of mesh size. For larger meshes, the number of levels in our AMG-based hierarchy increases, and the AMPNs can handle them because of the shared model blocks. We fine-tuned the pre-trained ROBIN models on the new BendingBeamLarge dataset. While the BendingBeam dataset contains meshes with an average of 750 nodes, the large dataset contains an average of 9K nodes, with a maximum of 16K nodes. We trained both the pre-trained and untrained models for 750K training iterations with a batch size of 1. We fine-tuned the pre-trained models with a learning rate of $10^{-6}$, and we used a learning rate decay from $10^{-4}$ to $10^{-5}$ for the untrained models.

BendingBeamLarge (rollout RMSE in units of $10^{-3}$):

| ROBIN variant | Rollout time [s] | Rollout RMSE |
|---|---|---|
| pre-trained | 30.936 +/- 0.627 | 78.929 +/- 4.971 |
| untrained | 30.945 +/- 0.580 | 215.500 +/- 12.135 |

Pre-trained models converge quickly on meshes with about twelve times more nodes, which highlights ROBIN's ability to upscale and its fine-tuning efficiency.

How does ROBI learn the parameters compared to conventional diffusion?

We thank the reviewer for the insightful question. ROBI is an inference-time rollout scheme; training is identical to that of a conventional diffusion model. It does not have any additional parameters that need to be learned.

ROBI builds on the insight that each diffusion step progressively predicts higher-frequency features, so the denoising of future time steps can already start before the current step has been fully denoised. Although this does not reduce the number of required diffusion steps, it enables the parallel denoising of multiple steps. Figure 2 provides a schematic overview, and Figure 5 evaluates this effect. Empirically, we find that ROBI retains the accuracy of conventional diffusion inference while substantially speeding up the process on accelerated hardware. We will add a brief paragraph clarifying this relationship to the paper.

We want to thank the reviewer again for the helpful feedback. We hope that our clarifications and new results address the concerns raised. We encourage the reviewer to reach out to us during the discussion if there are further questions.

Comment

Dear reviewer,

Thank you for your positive and constructive feedback. We are grateful for your engagement with our work and for the encouraging comments you provided. If you have any further suggestions that could help to refine the paper, we would be glad to take them into account when revising it.

We are grateful for your consideration and support.

Final Decision

This paper proposes an improved diffusion model based on HGNNs to simulate solid mechanics, improving both quality and speed.

Strengths:

  1. The method is novel and has many interesting ideas, even though it builds on some existing work.
  2. The results are quite convincing and cover a good range of solid mechanics.

Weakness:

  1. The initial draft lacked some detailed discussions.

Rebuttal: Both the authors and reviewers actively participated in the discussion. The authors provided extensive new results. The reviewers were very positive about the rebuttal and raised their scores.

Justification of recommendation: A solid paper with novel ideas. Very strong rebuttal. Please include the new results and address the final comments in the final version.