PaperHub
6.3/10 · Oral · 4 reviewers
Scores: 7, 5, 6, 7 (min 5, max 7, std 0.8)
Confidence: 4.0
Correctness: 3.0 · Contribution: 3.0 · Presentation: 3.3
NeurIPS 2024

Learning rigid-body simulators over implicit shapes for large-scale scenes and vision

OpenReview · PDF
Submitted: 2024-05-14 · Updated: 2024-11-06

Abstract

Keywords
graph networks · learned simulation · physics · rigid body simulation · scaling

Reviews & Discussion

Review
7

This work presents a GNN-based rigid-body simulator augmented with learned signed distance fields (SDFs). The core idea is to connect surface nodes between objects by testing whether their signed distances are within a certain threshold. This way, the resultant graph networks have sparser edges than prior works, enabling larger-scale simulations of rigid bodies with contacts and collisions. The paper evaluated the proposed GNN simulator on synthetic and real-world scenes.

Strengths

The paper is well written. The main idea is very straightforward to follow and easy to implement. Obviously, using SDFs for rigid-body collision detection is not new in physics animation, but marrying it to a GNN rigid-body simulator looks new to me. For this particular problem setting in this paper (many rigid bodies with frequent contact), using SDF to prune the collision edges is very reasonable.

The large-scale scenes are also nice. While classic numerical simulators can handle these scenes without trouble, I don’t know any existing GNN-based simulators that can solve such big scenes and generate visually plausible results. However, I am not an expert in GNN, so I may have missed some state-of-the-art GNN simulation papers.

Weaknesses

I have one particular concern with the overall storytelling in the paper. The paper keeps drawing comparisons with mesh-based GNN simulators in its introduction, method description, and experiments, e.g., choosing MeshGraphNet and DPI as baselines. I think the overall storytelling is a bit biased and occasionally misleading, without stating the pros and cons of mesh-based GNNs fairly. Methods like MeshGraphNet and DPI can represent deformable objects and fluids, which are more complicated physical systems than rigid bodies. The proposed method in this paper enjoys the benefits of SDFs in collision handling because it focuses on rigid bodies only. For collisions between deformable bodies like those in MeshGraphNet and DPI, a precomputed SDF is not possible because shapes deform over time. Therefore, comparing this paper with MeshGraphNet and DPI is neither fair nor necessary: one wouldn’t use particle systems like DPI to simulate rigid bodies in the first place. I think the paper needs to provide a much more balanced statement about these methods. At a minimum, it should inform readers that by introducing SDFs in this new GNN simulation framework, it is no longer obvious to resolve deformable solids and fluids with contact that previous mesh-based GNNs can typically handle.

Questions

I am not sure about one critical technical detail in the method. From what I understand, the collision edges need to be dynamically created and updated during simulation: at each time step, based on the current relative positions of these objects, one checks the vertices from one object in the SDFs of another object. If the distance is smaller than a threshold, collision edges are then created. This process has to be computed on the fly. Could you confirm whether my understanding is correct?

Lines 69-71 suggest acceleration structures like BVH often rely on CPU implementations that are difficult to accelerate. BVH trees on GPUs are actually common, well-known, and efficient, e.g., "Fast Parallel Construction of High-Quality Bounding Volume Hierarchies" and "Maximizing Parallelism in the Construction of BVHs, Octrees, and k-d Trees". It might be good to tone down the claim here.

Constructing SDFs: Did the paper construct a narrow-band SDF or the full SDF on the whole grid as the training data for the network to fit? Also, did the network encourage unit SDF gradient norm during training?

Computing closest points: Equation 1 is correct in theory. In practice, due to the numerical representation of SDFs, one needs to apply Eqn 1 multiple times before obtaining a fairly accurate closest point on the zero-level set.

SDF-based inter-object edges: I am not sure I get the O(K^2) complexity for mesh-based simulators. Such simulators rarely go over all O(K^2) pairs in a brute-force manner. Instead, they apply broad-phase and narrow-phase collision detections to prune many pairs.

Limitations

The two large-scale scenes (Spheres-in-Bowl and Shoes) are nice, but both of them have some limitations. For the sphere scene, using a neural SDF is an overkill: collision detections between spheres can be computed in its closed form. For the Shoes, only one SDF is needed. These scenes are visually pleasing, but I don’t think they have touched the “true” large-scale rigid-body simulation with contact for the two reasons above. A “true” large-scale scene would be many rigid bodies with various shapes.

Overall, I think I lean positive. Moderately rewriting certain text should address most of my concerns with the paper.

Author Response

We thank the reviewer for the helpful feedback and insightful comments.

Representation of methods like MeshGraphNet and DPI is misleading. SDF-Sim cannot resolve deformable solids and fluids

We fully agree that MeshGraphNet and DPI can handle broader types of physical systems. We are happy to add a more balanced statement to reflect that.

We would like to clarify that when we draw comparisons with mesh-based GNN simulators, we generally refer to FIGNet and FIGNet*, as these two models have shown the most complex rigid dynamics so far among learned models. MeshGraphNet and DPI have shown amazing results in the realm of deformable simulation; however, rigid bodies have their own unique challenges. Rigid collisions and contacts are non-smooth and are notoriously hard to derive numerical approximations for. Rigid dynamics is chaotic in nature: tiny errors in the model or in the initial states can quickly accumulate and lead to large deviations in later steps of the object trajectory. Hence, in this work we focus on rigid bodies only.

A clarification about the construction of the collision edges.

The reviewer’s understanding is entirely correct: collision edges are created dynamically at every step of the simulation.
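To make the per-step construction concrete, here is a minimal illustrative sketch (not the paper's implementation) of SDF-thresholded collision edges for one time step. An analytic sphere SDF stands in for a learned SDF network; the names `sphere_sdf` and `collision_edges` are our own, and the 0.1 threshold mirrors the collision radius discussed in this thread:

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Analytic sphere SDF, standing in for a learned SDF network."""
    return np.linalg.norm(points - center, axis=-1) - radius

def collision_edges(nodes_a, sdf_b, threshold=0.1):
    """Indices of object-A nodes within the collision radius of object B,
    found by querying B's SDF at A's current node positions."""
    return np.nonzero(sdf_b(nodes_a) < threshold)[0]

# Object A: three surface nodes; object B: unit sphere at the origin.
nodes_a = np.array([[1.05, 0.0, 0.0],   # 0.05 from B's surface -> edge
                    [2.00, 0.0, 0.0],   # far away -> no edge
                    [0.95, 0.0, 0.0]])  # slightly inside B -> edge
edges = collision_edges(nodes_a, lambda p: sphere_sdf(p, np.zeros(3), 1.0))
print(edges)  # -> [0 2]
```

Because the node positions change each step, this query has to be repeated per step, which is exactly the dynamic edge construction the reviewer describes.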

Constructing SDFs: Did the paper construct a narrow-band SDF or the full SDF on the whole grid as the training data for the network to fit?

We don’t use SDF grids for training, as the reviewer might be suggesting. Instead, we randomly sample ‘query’ points around the object by taking points on the object surface and adding Gaussian noise with zero mean and σ = 0.1. We compute the SDFs for these query points and use them for training the learned SDFs. Therefore, the learned SDFs are not constrained to be narrow-band, although most of the training points fall within 2σ = 0.2 distance from the object surface. Figure S1(d-e) shows that the learned SDFs are most accurate within the [-0.2, 0.2] distance interval from the surface. In practice, we only care about accurate SDF estimates within the collision radius = 0.1, therefore the current SDF accuracy is sufficient.

We provide more details on SDF training in Appendix D.1.

Did the network encourage unit SDF gradient norm during training?

We did not find that it is necessary to specifically encourage the unit SDF gradient norm. We see this property naturally emerging during the training. We think it is due to our training strategy where we randomly sample the query points near the object surface instead of training on grid points.
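A minimal sketch of this near-surface sampling strategy, for illustration only: a unit sphere stands in for the object, the σ = 0.1 matches the value stated above, and `sample_training_pairs` is a name we introduce here, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_pairs(surface_pts, true_sdf, sigma=0.1):
    """Perturb surface points with zero-mean Gaussian noise and label
    the resulting query points with the ground-truth signed distance."""
    queries = surface_pts + rng.normal(0.0, sigma, surface_pts.shape)
    return queries, true_sdf(queries)

# Unit sphere as a stand-in object: uniform surface samples + analytic SDF.
u, v = rng.uniform(0, 2 * np.pi, 1000), np.arccos(rng.uniform(-1, 1, 1000))
surface = np.stack([np.sin(v) * np.cos(u),
                    np.sin(v) * np.sin(u),
                    np.cos(v)], axis=-1)
queries, labels = sample_training_pairs(
    surface, lambda p: np.linalg.norm(p, axis=-1) - 1.0)

# As in the rebuttal, most labels fall within roughly 2*sigma of the surface.
print(round(float((np.abs(labels) < 0.2).mean()), 2))  # close to 0.95
```

Training on such scattered near-surface points, rather than on grid points, is what the response above credits for the unit gradient norm emerging naturally.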

Computing closest points: Equation 1 is correct in theory. In practice, … one needs to apply Eqn 1 multiple times before obtaining a fairly accurate closest point.

Empirically, we find that it is sufficient to apply Eq. 1 once to find the closest point on the object surface, thanks to the accurate learned SDFs. We use only one projection step in all our experiments. We invite the reviewer to inspect Figure S1(a), where we compare the true closest point to the one-step projection from a learned SDF and show that the mean-squared error stays within [1e-4, 1e-3] for different network sizes.
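For intuition, a numerical sketch of the one-step projection x* = x - sdf(x) * grad sdf(x). We use an analytic sphere SDF, for which a single step is exact; a learned SDF would introduce the small errors the rebuttal cites from Figure S1(a). The function name and finite-difference gradient are our own choices:

```python
import numpy as np

def project_to_surface(x, sdf, eps=1e-4):
    """One application of Eq. 1: x* = x - sdf(x) * grad sdf(x),
    with the gradient taken by central finite differences."""
    grad = np.array([(sdf(x + eps * e) - sdf(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    return x - sdf(x) * grad

sphere = lambda p: np.linalg.norm(p) - 1.0   # exact SDF of the unit sphere
x = np.array([1.5, 0.0, 0.0])                # query point outside the sphere
x_star = project_to_surface(x, sphere)
print(np.round(x_star, 4))  # -> [1. 0. 0.]
```

With an exact SDF the step lands on the zero-level set directly; the reviewer's point is that an approximate SDF may need this applied a few times.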

O(K^2) complexity of SDF-based inter-object edges: … simulators rarely go over all O(K^2) pairs in a brute-force manner.

Here we refer to the number of constructed edges between the objects. Within the collision region, FIGNet and FIGNet* connect all mesh triangles of one object to all mesh triangles of another object, so the number of edges between the objects is O(K^2). In SDF-Sim, we connect the nodes of one object to the center of another object (within the collision region), so the number of edges grows only linearly in the number of nodes, O(K).

For mesh-based simulators, K^2 edges would be a worst-case scenario in case the collision region is huge and spans the entire scene, which is indeed unlikely to happen.

However, we can see that these asymptotic trends clearly manifest in Figure 7. The quadratic number of edges in FIGNet poses a significant problem, leading to 1M collision edges for 160 objects and causing the model to run OOM. SDF-Sim has <100k edges for the same simulation.
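For concreteness, a toy count of worst-case inter-object edges for a single colliding pair under the two asymptotics discussed above. This is illustrative bookkeeping, not either model's exact edge-construction rule, and the function names are ours:

```python
# K = nodes (or mesh faces) per object, for one colliding object pair.
def mesh_pair_edges(k):
    return k * k          # every face of A to every face of B: O(K^2)

def sdf_sim_pair_edges(k):
    return 2 * k          # each node of A to B's center, and vice versa: O(K)

for k in (100, 1_000, 10_000):
    print(k, mesh_pair_edges(k), sdf_sim_pair_edges(k))
# At k = 10,000: 100,000,000 quadratic edges vs 20,000 linear ones.
```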

Spheres-in-Bowl and Shoes use repeated objects. These scenes … did not represent the “true” large-scale rigid-body simulation with …many rigid bodies with various shapes.

We have an example of a large-scale scene with many objects of various shapes. It is shown in Figure 10 and replicated in the attached PDF for the reviewer's reference. The corresponding simulation video is shown on the project webpage ("Heaps of stuff" section). This simulation includes concave shapes (shoes, hanger) and thin structures (screwdriver, baking form). We will highlight it more in the final manuscript.

Spheres-in-Bowl: Although sphere SDFs are trivial, Spheres-in-Bowl is a nice example to study the scaling properties of the models due to a large number of collisions within any given time step. We show other large-scale examples with more complex shapes, such as a pile of shoes (Figure 2), a pile of knots (Figure 10 top) and a mix of different objects (Figure 10 bottom). Finally, the bowl itself does not have an analytical SDF expression and still requires an SDF for collision detection.

Lines 69-71 suggest acceleration structures like BVH often rely on CPU implementations that are difficult to accelerate. BVH trees on GPUs are actually common.

Thank you for pointing this out – we will remove this claim from the paper.

Comment

The rebuttal is helpful. I thank the authors for bringing Fig. 10 and the attached PDF to my attention. The responses to my other questions also look good. I will increase my rating.

Review
5

In this paper, a learning-based simulator based on graph neural networks (GNNs) is proposed that leverages signed distance function (SDF) shape representations of objects for efficient parallel processing. The GNN operates on a set of nodes and edges where the nodes are defined by the object centers of mass and contact points on the object surfaces. Edges are formed between object centers and collision points on other objects within a distance threshold. The SDF is used to query distances to the object shapes and determine approximate collision points using the gradients of the signed distance fields in a single step. The approach facilitates scenes with hundreds of objects and millions of sample points for collision checking on a V100 GPU. Results indicate that the method scales to larger scenes than mesh-based learning-based simulator baselines. Comparison to classical analytical simulators like NVidia PhysX, MuJoCo or Bullet is not provided.

Strengths

  • The proposed approach of using SDFs for learning-based simulation is novel and interesting.
  • The results section provides an interesting set of evaluations of scaling, run-time and accuracy performance. Results indicate improvements in scaling performance over the state-of-the-art learning-based methods.
  • The paper is well written and easy to follow.

Weaknesses

  • l. 182ff, why not differentiate through the Euler integration and shape-matching steps and take these processing steps into account for the learning approach? Are these steps differentiable?
  • The paper should also compare with classical analytical state-of-the-art simulators such as NVidia PhysX, MuJoCo, bullet. SDFs can be converted into mesh representations for these simulators. How does SDF-Sim relate to these approaches in performance?
  • The run-time evaluation is unclear. The plots in Fig. 7 suggest that the simulator requires several seconds per time step. What is the simulated time per time step? Is the simulator real-time capable? At which number of objects / collision sample nodes does the method lose the real-time property?
  • l. 207, please provide a reference for FIGNet* on first occurrence.
  • It is unclear if the edge pruning in FIGNet* is fairly adjusted to these datasets for the comparison with the proposed method. How does FIGNet* perform with variations in the pruning parameters? Can a similar distance threshold be used as in SDF-Sim? How does the performance tradeoff change with such parameter variation in Fig. 7? Can FIGNet* outperform SDF-Sim overall?
  • Sec. 4.4. how do classical analytical simulators such as PhysX, MuJoCo, bullet perform on this "large real-world scene" when transforming the extracted scene SDF to a mesh?
  • l. 314, DANO [10] uses NeRF representation of object shape, not SDF.
  • l. 327, the requirement of watertight objects can be a disadvantage too, e.g., if the objects are thin. Please discuss.
  • The broader impact argues that extracting a mesh for simulation from VolSDF reconstruction would be a specialized skill. This is questionable, as a simple MarchingCubes reconstruction might be sufficient to extract such a mesh.

Questions

Please address questions raised in paper weaknesses.

Limitations

The paper discusses several limitations of the proposed method.

Author Response

We thank the reviewer for the constructive feedback. We will add the discussions below into the paper. We will also correct the reference to FIGNet* and DANO description in the related work.

Are euler integration and shape matching differentiable?

Both Euler integration and shape-matching (using differentiable SVD) are differentiable, and we could potentially include those steps into the learning process. However, we suspect that the current loss on the nodes before shape-matching provides a fine-grained learning signal and forces the model to correct errors for each node individually.

Edge pruning in FIGNet* is fairly adjusted? Influence on Fig. 7? Can FIGNet* outperform SDF-Sim overall?

The edge pruning has one main parameter, which is the collision radius. We use the same collision radius of 0.1 for FIGNet, FIGNet* and SDF-Sim, matching the collision radius used in FIGNet and FIGNet* papers.

If we change the collision radius, we expect the relation between the models in Figure 7 to stay the same, because FIGNet* constructs O(K^2) edges within the collision region, while SDF-Sim has only a linear number of edges O(K), where K is the number of nodes. Given such asymptotic complexity, we do not think FIGNet* can outperform SDF-Sim.

Is the simulator real-time capable?

We perform the simulation for 200 time steps, i.e., 5 seconds of total simulation time.

We have added a runtime comparison to PyBullet in the attached PDF (Figure 2). SDF-Sim has a similar runtime to PyBullet up to ~30 objects per scene. Generally, learned simulators are not real-time, except on very small scenes.

We note that optimizing our code for speed was not our goal. MuJoCo and Bullet use low-dimensional collision meshes to achieve real-time performance, while all of our large simulation scenes use relatively detailed node sets. Additionally, a large part of the SDF-Sim runtime is spent on constructing the edges of the input graph, which can be sped up by faster edge pruning, batching SDF queries, or using specialized CUDA kernels.

Comparison to analytical simulators such as NVidia PhysX, MuJoCo, and Bullet

This is an interesting question. Note that we use Bullet simulations as the ground truth in the experiments with Movi B/C (Figure 6) and Spheres-in-Bowl (Figure 8). As shown in Figure 6, SDF-Sim remains consistent with the Bullet simulation, having low translation and rotation error after 50 rollout steps.

In the attached PDF, we also added new comparisons to Bullet in terms of runtime, penetrations and simulation stability (Figures 2 and 3).

We note that it was not our goal to outcompete state-of-the-art analytical simulators in runtime. We develop learned simulators because they have their own unique advantages that analytical simulators don't provide, specifically for robotics. The main difference is that learned simulators can be trained directly on real-world observations. They can track real object trajectories better than analytical simulators, mitigating the well-known sim-to-real gap [1]. Another common issue is precisely estimating the initial states, which analytical simulators rely on; learned simulators can compensate for these inaccuracies [1]. Finally, learned simulators are differentiable and can be used for optimization and design [2]. At the same time, we agree that learned simulators have not been optimized for runtime and are slower than analytical simulators. Therefore, in this paper we chose to only compare to other learned simulators.

[1] Graph network simulators can learn discontinuous, rigid contact dynamics. Allen et al. CoRL 2023

[2] Inverse Design for Fluid-Structure Interactions using Graph Network Simulators. Allen et al. NeurIPS 2022.

Sec. 4.4. comparison to PhysX, MuJoCo, bullet on SDFs from vision, where extracted SDF is transformed into a mesh

We expect PhysX, MuJoCo, and Bullet to be able to successfully handle this scene, since 80k nodes is feasible for these simulators.

We are not claiming that SDF-Sim is the only simulator that could do that. Instead, this experiment shows that: 1) SDF-Sim can generalize to new scenes, despite being trained on synthetic data; 2) we can plug the output from VolSDF directly into SDF-Sim, bypassing the mesh conversion; 3) FIGNet already runs OOM on this scene.

We will remove the wording “large” for this section and clarify the goals of this experiment in the paper.

The requirement of watertight objects can be a disadvantage too, if the objects are thin

Modeling thin surfaces, e.g. cloths, is non-trivial, and it is out of scope for this paper. Here we focus on modeling rigid objects, for which volumetric shapes are often a good approximation.

Although SDF-Sim is not designed to represent cloths, other graph-network simulators, e.g. MeshGraphNet [1], are able to do so. One can imagine combining the ideas from SDF-Sim and MeshGraphNet, where rigid objects are represented as SDFs for efficiency, while thin/deformable objects are represented as meshes.

[1] Learning mesh-based simulation with graph networks. Pfaff et al. ICLR, 2021

A simple MarchingCubes reconstruction might be sufficient to extract a mesh from VolSDF

Extracting meshes intended for simulation requires special handling for two reasons:

  1. Naive extraction with Marching Cubes leads to noisy meshes. VolSDF is a learned model and its zero values might not lie exactly on the surface. Therefore, the meshes produced by Marching Cubes can contain holes or floating faces and might require substantial clean-up before they can be used in the simulator, because analytical simulators rely on meshes being watertight to perform inside-outside tests.

  2. Analytical simulations need a “collision” mesh consisting of convex subcomponents, requiring an additional convex mesh decomposition which is often done manually.

These steps are feasible to do, but they require knowledge and tools that a deep learning researcher might not have encountered before.

Comment

The author response has addressed my concerns. I have no further questions at this point.

Review
6

This work tackles the problem of learning to simulate rigid-body objects, scaling up to very large scenes that may consist of hundreds of objects and around a million mesh vertices. Graph neural network (GNN) baselines are usually used but become intractable for large scenes, with high memory and computational costs. To solve this, this paper presents a new method based on signed distance functions (SDFs): first, a learnable SDF is fitted to each object, which can be queried quickly and is memory efficient. Then, the SDFs are leveraged to build a graph between the objects without connecting all vertices, reducing the number of edges needed. The method is evaluated against baselines on different simulation setups. It is noted that the proposed approach does not perform better than SOTA GNN-based methods when the scene is small enough for them to be applied. However, when scaling the scenes up, these baselines fail due to memory constraints, while the proposed method still runs. Some accompanying videos also show realistic qualitative results.

Strengths

  1. Generally, I found the paper easy to follow and the ideas clearly presented.
  2. The new SDF-based inter-object edges, which project a node of object 1 onto object 2 using object 2's SDF, are an interesting idea to reduce the number of edges and get nearly continuous information about where node 1 is approaching object 2.
    1. It is also noted that the worst-case complexity is reduced from quadratic to linear w.r.t. the number of nodes per object.
  3. Results show a reduction in memory and computational cost compared to baselines. The experiment with a growing number of objects (Section 4.2) displays clearly that the proposed method scales better than the baselines and allows simulating more complex cases where the GNN baselines cannot be run.
  4. The predicted simulation rollouts are qualitatively realistic without intersection, as seen on the provided videos.
  5. The proof-of-concept vision experiment suggests that once trained on clean objects from a dataset, the method could still generalize to "real" meshes coming from 3D reconstruction.

Weaknesses

  1. As is noted in the work, the proposed method is still a bit behind the SOTA GNNs in the accuracy of the prediction w.r.t. the ground-truth simulation. Thus, the proposed method is aimed more at larger scenes, where the baselines cannot be run.
  2. Learned SDFs for the objects, while fast to query, still need to be fitted first for each object, and their accuracy is limited by the network's capacity.
    1. In addition, every object's SDF model needs to be pushed onto the GPU, or swapped in and out. This work seems to use mostly a smaller set of repeated objects, and thus can reuse the same models. This could potentially be solved by weight sharing between models, as noted in the limitations.
  3. The vision experiment, simulating in an environment extracted from a multi-view NeRF reconstruction, is a bit underwhelming. While it is indeed just a proof-of-concept, it uses only a single object in a single scene, which is directly extracted as a rigid environment. I am unsure what the main point of this result is, as the pipeline seems to be mostly the same as the previous ones.
  4. There is not much motivation behind needing learned simulators. Maybe expanding on the limitations of the simulators themselves could make the work stronger.

Questions

  1. See Weakness 4.: why is it important to build learned simulators in the context of rigid-object simulation? Are the simulators also memory-constrained, and does that make them unusable in the large scenes discussed in the work?
    1. A learnable surrogate model that approximates a simulator can also be useful as a differentiable simulation, e.g., in order to optimize the scene or some objects. Could the presented approach be used in such a pipeline?
  2. A more open-ended question: assuming no intersection, all the SDFs are used in their positive domain. Would the proposed method also work for open-surface objects, using their UDFs (unsigned distance functions)? In practice, as there might be a tiny amount of penetration, the SDF can detect it (negative output) but the UDF cannot. I wonder if the method is resilient to this or may fail.

A minor suggestion: for Fig. 8, adding the standard-deviation or similar metric to the plots can be helpful, as they represent an aggregate of multiple simulations.

Limitations

Limitations are discussed in the paper: the need to train an SDF MLP per object in the scene, that the current accuracy is slightly below SOTA (when those GNN models can be applied), and that the method is currently limited to rigid objects only.

Author Response

We thank the reviewer for the positive evaluation of our paper and for the insightful comments.

Learned SDFs …. accuracy is limited by the network's capacity.

We agree that the accuracy of learned SDFs depends on network capacity. However, we found that even a basic MLP with 8 layers and 32 units per layer is sufficient to achieve <10e-5 error on SDF estimates and perform accurate simulation (Figures 9(a-c)). We did not find the SDF network capacity to be a limitation.

Every object's SDF-model need to be pushed on the GPU--or swapped in-and-out

Evaluating each SDF sequentially and swapping them in and out would be very inefficient and slow.

SDF weights are small, and we keep all object SDFs on the GPU throughout the simulation. Even with many different object shapes in a given scene, keeping SDFs in GPU memory is not an issue. To compute the distances, we evaluate all SDFs in parallel, leveraging GPU accelerators. As mentioned in Limitations, it is possible to further reduce memory consumption, if needed, by using amortized models, weight sharing, or faster SDF queries.
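A quick back-of-the-envelope check, assuming a plain fully-connected MLP of the size mentioned above (8 hidden layers of 32 units, 3D input, scalar output); the actual SDF architecture may differ:

```python
def mlp_param_count(in_dim=3, hidden=32, layers=8, out_dim=1):
    """Weights + biases of a plain fully-connected SDF network."""
    dims = [in_dim] + [hidden] * layers + [out_dim]
    return sum(a * b + b for a, b in zip(dims[:-1], dims[1:]))

n = mlp_param_count()
print(n, 4 * n)  # 7553 parameters, 30212 bytes (~30 KB) as float32
```

At roughly 30 KB per object, hundreds of per-object SDFs fit comfortably in GPU memory alongside the GNN, consistent with the claim above.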

This work seems to use mostly a smaller set of repeatable object

We emphasize that SDF-Sim is not specific to using repeated objects. We provide an example of a large-scale simulation with many different objects in Figure 10 in the original paper (also replicated in the Rebuttal PDF for the reviewer's reference).

Why is it important to build learned simulators in the context of rigid-object simulations? Are the simulators also memory-constrained?

This is a great question. The limitation of traditional simulators like MuJoCo or Bullet is that the simulations inevitably diverge from observations of real objects, the so-called sim-to-real gap [1]. Traditional simulators rely on hard-coded approximations of physical interactions that might not match the properties of real objects. It was shown that even with careful parameter tuning, analytical simulators cannot precisely model a real cube tossed on a table [2]. This is a major problem for robotics, which heavily relies on simulated environments, and a long line of research in robotics is dedicated to mitigating the sim-to-real gap.

In contrast, learned simulators do not suffer from the sim-to-real gap, as they can be directly trained or fine-tuned on real-world data. In fact, learned simulators can be better at tracking real objects than traditional simulators like MuJoCo, Drake or Bullet, even in the low-data regime [3].

Another advantage of learned simulators is that they are differentiable by nature and can be used for solving inverse problems, such as design or control (as noted in reviewer’s question below).

However, learned simulators used to be memory-constrained and work only on small scenes with up to 10 objects – this is exactly the problem we are targeting in the paper.

A learnable surrogate model … can also be useful as a differentiable simulation, e.g., in order to optimize the scene or some objects.

This is a great point. SDF-Sim is fully differentiable: the GNN and learned SDFs are differentiable by the nature of neural networks. We can also pass gradients through the construction of the input graph, from the GNN to the learned SDFs. One can use a differentiable version of SVD for the shape-matching step.

Generally, GNN-based models can be successfully used as differentiable simulators to optimize an object shape, such as an airplane wing, and can provide more stable gradients than traditional differentiable simulators [4]. We believe that SDF-Sim inherits these properties and can be used for differentiable simulation in the areas of mechanical design, robotics and more.

The vision experiment, simulating in an environment extracted from a multi-view NeRF reconstruction, is a bit underwhelming. I am unsure what is the main point of this result.

This experiment is a proof-of-concept that demonstrates that 1) SDF-Sim can generalize to real-world object shapes derived from 3D reconstruction, although it was trained on "clean" shapes, and 2) we can directly connect the VolSDF outputs to SDF-Sim, making a fully-differentiable vision-to-simulation pipeline.

The alternative approach to simulating this scene would be to convert the VolSDF output into a mesh and run a mesh-based simulator. However, generating clean meshes can be tricky, and the meshes may need to be cleaned up before they can be used in the simulator. Additionally, most rigid solvers like Bullet or MuJoCo rely on convex meshes, requiring an additional convex mesh decomposition, which is often done manually. In contrast, SDF-Sim directly takes an SDF as input, and we can bypass the step of converting SDFs into meshes.

Would the proposed method also work for open surface objects, using their UDFs (unsigned distance functions)? Can it detect penetration?

We hypothesize that SDF-Sim will work with UDF representations, assuming that UDFs are used for all objects and SDF-Sim is trained on UDF representations.

Learned simulators are generally able to detect penetrations using unsigned distances. For instance, the mesh-based FIGNet baseline uses only unsigned distances between individual triangles and can successfully reason about penetrations by pooling information from pairs of faces on the surfaces of the two objects. In the current SDF-Sim paper, we performed an experiment with unsigned distances computed from a mesh (Appendix E.6) and found that the accuracy is similar to our current model with SDF representations.

[1] What Went Wrong? Closing the Sim-to-Real Gap via Differentiable Causal Discovery. Huang et al. CoRL 2023.

[2] B. Acosta, W. Yang, and M. Posa. Validating robotics simulators on real-world impacts. IEEE Robotics and Automation Letters, 2022.

[3] Graph network simulators can learn discontinuous, rigid contact dynamics. Allen et al. CoRL 2023

[4] Inverse Design for Fluid-Structure Interactions using Graph Network Simulators. Allen et al. NeurIPS 2022.

Comment

I thank the authors for the added explanations and clarifications in the rebuttal; I believe my questions have been answered.

Review
7

This paper presents a neural network-based simulator of rigid body dynamics with contacts that is memory-efficient and thus scalable to scenes with hundreds of objects and up to 1.1 million nodes on a single GPU. The key idea is to use SDF as the geometry representation to simplify collision detection compared with mesh representations.

Strengths

The paper addressed an important and interesting problem, which is an efficient end-to-end simulator of rigid body dynamics. It has significant potential value in various fields, including robotics, AR, etc. Replacing mesh representations with SDFs allows the framework to largely reduce the computational complexity when modeling the collisions between objects. This key idea has been clearly conveyed and experiments validated this claim.

Section 2 provided a nice context that educates readers about the basic knowledge required to understand the framework of this work. The methodology presentation in Section 3 is clear, compact, and easy to follow. The experiments laid out the performance of the proposed method from small-scale scenes to significantly larger-scale scenes, which is beneficial for readers to appreciate the merits of this work.

The authors provided extensive details in the appendix about the implementation of the geometric representation, network modeling, training, and evaluation, which made the whole paper much more informative and improved the reader's understanding.

缺点

The weaknesses I found are mainly in the experiment section, in particular the limited explanation and interpretation of the experimental results.

First, I found the wording "small-scale datasets" and "small datasets" misleading. The conventional interpretation of a "small-scale dataset" refers to the size of the training data, while in this paper, I believe it actually means the simulation scene is of a relatively small scale, i.e., involving a limited number of objects and nodes. It is a core concept in the experiment section, and the potential gap between the interpretations could lead readers to totally different takeaway messages and create inconsistency between the claim and the evidence. According to Appendix B, 1500 trajectories are used for training in both Movi-B and Movi-C datasets, which reveals the actual scale of the training data.

On top of this point, there isn't enough discussion about why the proposed method performs slightly worse than the baseline methods on datasets with small scenes and outperforms in larger scenes. What is the difference between smaller scenes and larger scenes for these models? It is a key finding in this paper, but the discussion is largely omitted.

Appendix E.6 seems to try to answer this question, but I didn't quite understand this paragraph. What do you mean by "train an SDF-Sim architecture using the accurate distances directly computed from a mesh"? I thought that was exactly what was done in the paper. Why do an experiment with unsigned distances? I think this section created more confusion.

A minor issue is in Appendix E.4. and Figure S3. The figure shows the baseline in purple (blue to me) and SDF-sim in red (orange to me), while the text says the opposite. Please make them consistent.

问题

I have a question about the surface nodes. In the paper, the authors didn't explain how the surface nodes are generated. From Appendix E.5, it seems that the surface nodes used in the experiments are the mesh vertices used for collision simulation. Is that correct? In E.5, using sampled points on the SDF as surface nodes achieves similar accuracy with far fewer nodes. Why isn't that the default design in the main paper? Is there a trade-off?

局限性

The authors adequately addressed the limitations. Neither the authors nor the reviewer foresee a potential negative societal impact of this work.

作者回复

We thank the reviewer for appreciating the value of the paper and for the valuable comments.

The wording "small-scale datasets"

Indeed, here we mean the datasets have small-scale scenes with up to 10 objects, while later in the paper we scale the simulator to scenes with hundreds of objects. We will update the wording in the paper.

What is the difference between smaller scenes and larger scenes?

The key difference is that having an order of magnitude more objects in large scenes leads to many more collisions. We specifically design large scenes to have stacked objects, as this setup is considered to be a particularly hard problem in rigid simulation. It requires resolving many collisions within the same time step and propagating the collision response across a chain of multiple objects. In contrast, in small scenes, the collisions are sparse and involve only pairs of objects.

We suspect that FIGNet* and FIGNet may have some slight accuracy advantage in pairwise collisions due to the explicit computation over the mesh. SDF-Sim is better at propagating those collisions across many objects and therefore has better performance on more challenging scenes with stacked objects.

Appendix E.6: What do you mean by "train an SDF-Sim architecture using the accurate distances directly computed from a mesh"? Why do an experiment with unsigned distances?

Thank you for pointing out the confusion – we will update the section to make it clearer.

The distances predicted by learned SDFs may not be perfectly accurate, because they come from a learned model. In Appendix E.6 we aim to test whether this is an issue for the simulator. To do so, we compute the distances using a brute-force distance computation between query points and each triangle in the mesh. Then we train the simulator using these distances instead of learned SDFs. It is tricky to estimate the SDF sign using this approach (whether the point is inside/outside of the mesh), so we use the unsigned distances. We found that using learned SDFs versus pre-computed distances makes little difference in practice.
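The brute-force computation described above can be sketched as follows (a minimal illustration under our own assumptions, not the authors' code; the closest-point routine is the standard region-classification test from Ericson's "Real-Time Collision Detection"):

```python
import numpy as np

def closest_point_on_triangle(p, a, b, c):
    # Classify p against the Voronoi regions of triangle (a, b, c):
    # vertex, edge, or face region, then project accordingly.
    ab, ac, ap = b - a, c - a, p - a
    d1, d2 = ab @ ap, ac @ ap
    if d1 <= 0 and d2 <= 0:
        return a                                   # vertex region A
    bp = p - b
    d3, d4 = ab @ bp, ac @ bp
    if d3 >= 0 and d4 <= d3:
        return b                                   # vertex region B
    vc = d1 * d4 - d3 * d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        return a + (d1 / (d1 - d3)) * ab           # edge AB
    cp = p - c
    d5, d6 = ab @ cp, ac @ cp
    if d6 >= 0 and d5 <= d6:
        return c                                   # vertex region C
    vb = d5 * d2 - d1 * d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        return a + (d2 / (d2 - d6)) * ac           # edge AC
    va = d3 * d6 - d5 * d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        return b + ((d4 - d3) / ((d4 - d3) + (d5 - d6))) * (c - b)  # edge BC
    denom = 1.0 / (va + vb + vc)
    return a + ab * (vb * denom) + ac * (vc * denom)  # face interior

def unsigned_mesh_distance(p, vertices, faces):
    # Brute force: minimum distance from query point p over all triangles.
    # Note this yields an UNSIGNED distance; the inside/outside sign is
    # exactly what is hard to recover, as discussed above.
    return min(
        np.linalg.norm(p - closest_point_on_triangle(p, *vertices[f]))
        for f in faces
    )
```

In practice this per-triangle loop is far too slow for training-scale data without acceleration structures, which is part of why a learned SDF is attractive despite its approximation error.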

Surface nodes … are the mesh vertices used for collision simulation? In Appendix E.5 surface nodes are samples from the SDF. Why isn't it the default design? Is there a trade-off?

This is correct – for most experiments in the paper we used mesh vertices from the original collision meshes as surface nodes for SDF-Sim. The only exception is the experiment in section E.5, where the nodes were sampled from an SDF.

Indeed, we have shown that we can potentially scale the simulator further by re-sampling the nodes from an SDF. However, the trade-off is that node re-sampling introduces an additional pipeline step with its own set of hyperparameters (grid size, downsampling, etc.) that might need to be tuned for more complex shapes. In the main paper we chose to focus on the clear message that we can scale the simulator solely by using SDFs and adapting the GNN, as this architecture works universally well for different object shapes.
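One plausible form of the re-sampling step mentioned here is to evaluate the SDF on a grid and keep points near the zero level set (a hedged sketch with hypothetical function and parameter names, not the procedure from Appendix E.5; `band` and `max_nodes` are the kind of hyperparameters the authors say would need tuning):

```python
import numpy as np

def resample_surface_nodes(sdf, lo, hi, grid_size=24, band=0.1, max_nodes=500):
    # Evaluate the SDF on a regular grid over the bounding box [lo, hi]
    # and keep grid points inside a narrow band around the zero level set.
    axes = [np.linspace(lo[i], hi[i], grid_size) for i in range(3)]
    pts = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    near = pts[np.abs(sdf(pts)) < band]
    if len(near) > max_nodes:
        # Simple uniform downsampling; a real pipeline might prefer
        # farthest-point sampling for better surface coverage.
        near = near[np.linspace(0, len(near) - 1, max_nodes).astype(int)]
    return near
```

The grid resolution, band width, and downsampling budget all interact with shape complexity, which illustrates the extra tuning burden the authors cite as the reason mesh vertices remain the default.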

评论

I thank the authors for responding to my questions. My concerns are addressed. Hopefully the updated version of the paper reflects the clarifications. I also went through the other reviewers' comments and the rebuttal. Overall, I'm happy with this submission.

作者回复

We thank the reviewers for the thoughtful and constructive comments.

We are happy that the reviewers appreciated the contributions of our paper, specifically that the paper addressed an important and interesting problem … and has significant potential value in various fields, including robotics, AR (mwTT); marrying SDFs to a GNN rigid-body simulator is new (HA7E, nWDb); SDF-Sim largely reduces the cost of modeling collisions (mwTT, SAmo); predicted simulations are qualitatively realistic (SAmo); other existing GNN-based simulators cannot model such large scenes (HA7E). Reviewer SAmo also highlighted our vision experiment in Section 4.4 as a strength, noting that SDF-Sim could generalize to "real" meshes coming from 3D reconstruction.

A large-scale simulation with many different object shapes

We would like to draw the reviewers’ attention to an example of a large-scale simulation with a mix of different object shapes from Movi: the “Heaps of stuff” video on the webpage, also shown in Figure 10. This simulation includes concave shapes (shoes, hanger) and thin structures (screwdriver, baking form). We emphasize that the benefits of SDF-Sim are not specific to simulations with similar object shapes. We will bring this example forward to the first page in the final manuscript.

Why learned simulators?

We additionally describe why it is important to build learned simulators in response to reviewers nWDb and SAmo.

Additional results

We have added a PDF to address the reviewers' questions:

  • Re-iteration of “Heaps of stuff” simulation with a mix of objects (reviewers SAmo, HA7E)
  • New results with runtime comparison between SDF-Sim and Bullet on large scenes (reviewer nWDb)
  • Comparison of Penetration and Rollout RMSE metrics, with a new comparison to Bullet on large simulations (reviewer nWDb)

We will update the final manuscript to include reviewers’ suggestions.

最终决定

This paper describes an approach to significantly improving the scale of simulations that can be successfully addressed by GNN-based methods by employing learned Signed Distance Functions, which effectively capture the geometry of the objects and lead to accelerated computational performance. The paper was reviewed by a panel of experts who felt that the method was well explained and that the advantages in terms of the scale of the simulations that could be addressed were quite compelling. It was also noted that the approach could provide an avenue to more accurate simulations of physical systems, since it could be used to learn appropriate representations from sensor data, thus addressing the sim-to-real gap. The provided rebuttal effectively addressed all of the reviewers' comments.