Learning 3D Particle-Based Simulators from RGB-D Video

William F. Whitney*, Tatiana Lopez-Guevara*, Tobias Pfaff, Yulia Rubanova, Thomas Kipf, Kimberly Stachenfeld, Kelsey R. Allen

ICLR 2024

arXiv paper

Visual Particle Dynamics (VPD)

Visual Particle Dynamics learns a 3D particle-based simulator from multi-camera RGB-D videos. One or more camera images are encoded into a latent particle representation of the scene. A hierarchical graph neural network dynamics model predicts the change in location and features of the set of particles. Then a ray-based renderer decodes the particle set into images. The entire model is supervised end-to-end with a pixel-based loss.
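
As a concrete illustration of the first stage, with RGB-D input each pixel that has a valid depth can be unprojected into a 3D point carrying a feature vector. Below is a minimal NumPy sketch of this unprojection step, assuming a pinhole camera with intrinsics fx, fy, cx, cy; using the raw pixel color as the particle feature is a stand-in for the learned per-pixel features the encoder would attach.

```python
import numpy as np

def unproject_rgbd(rgb, depth, fx, fy, cx, cy, cam_to_world):
    """Back-project an RGB-D image into a particle set.

    Every pixel with a valid depth becomes one particle carrying a
    feature vector (here just its color; the real encoder attaches
    learned features).
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)
    pts_world = pts_cam.reshape(-1, 4) @ cam_to_world.T
    valid = depth.reshape(-1) > 0
    return pts_world[valid, :3], rgb.reshape(-1, 3)[valid]

# Toy usage: a 4x4 image at unit depth, camera at the world origin.
rgb = np.random.rand(4, 4, 3)
depth = np.ones((4, 4))
positions, features = unproject_rgbd(rgb, depth, fx=2.0, fy=2.0,
                                     cx=2.0, cy=2.0, cam_to_world=np.eye(4))
print(positions.shape, features.shape)  # (16, 3) (16, 3)
```

Particle sets produced from several cameras can simply be concatenated into one set, consistent with the model taking one or more camera images as input.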

Varying material parameters

In this video we show a model trained on a mix of deformable and rigid objects. VPD can infer that gray objects are rigid, while colorful objects are deformable, and predict dynamics accordingly. Ground truth is shown on the left, model predictions on the right.

Generalization to larger scenes with more objects

These videos were generated by a model trained on videos containing only two objects. The model generalizes to a much larger scene with 32 objects, represented using 4x as many particles as seen in training.

Long rollouts: Deformable Collision

A VPD model trained on DeformableCollision is rolled out on the test set for 50 frames. Ground truth is shown on the left, predictions are shown on the right.
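
Such a rollout is autoregressive: the dynamics model is repeatedly applied to its own predicted latent state, and each state is rendered to a frame. A minimal sketch of the loop, with hypothetical `dynamics_step` and `render` callables standing in for the trained GNN dynamics model and the ray-based renderer:

```python
import numpy as np

def rollout(positions, features, dynamics_step, render, num_steps=50):
    """Autoregressive rollout over the latent particle state.

    The dynamics model predicts the change in particle locations and
    the updated features; each new state is rendered to an image."""
    frames = []
    for _ in range(num_steps):
        delta_pos, features = dynamics_step(positions, features)
        positions = positions + delta_pos
        frames.append(render(positions, features))
    return frames

# Toy usage with stand-in components.
dummy_step = lambda p, f: (np.zeros_like(p), f)    # no motion
dummy_render = lambda p, f: np.zeros((64, 64, 3))  # blank frame
frames = rollout(np.random.rand(100, 3), np.random.rand(100, 8),
                 dummy_step, dummy_render)
print(len(frames))  # 50
```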

Viewpoint generalization

A VPD model trained on DeformableCollision is rolled out on the test set for 50 frames, and the latent state is rendered from a rotating camera at steps 25 and 50. Even though VPD was trained with only four fixed camera positions, it has no trouble generating views from arbitrary new angles due to its structured 3D representation.
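
Rendering a new viewpoint only requires handing the renderer a new camera pose over the frozen latent state. A sketch of generating a rotating orbit of poses with a standard look-at construction (OpenGL-style convention, camera looking down its local -z axis); the orbit radius and height are arbitrary choices for illustration:

```python
import numpy as np

def look_at(eye, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """Camera-to-world pose for a camera at `eye` looking at `target`
    (OpenGL convention: the camera looks down its local -z axis)."""
    forward = target - eye
    forward = forward / np.linalg.norm(forward)
    right = np.cross(forward, up)
    right = right / np.linalg.norm(right)
    true_up = np.cross(right, forward)
    pose = np.eye(4)
    pose[:3, 0], pose[:3, 1], pose[:3, 2] = right, true_up, -forward
    pose[:3, 3] = eye
    return pose

# An orbit of 60 cameras circling the scene at radius 2, height 1; each
# pose would be passed to the renderer with the frozen particle state.
angles = np.linspace(0.0, 2.0 * np.pi, 60, endpoint=False)
poses = [look_at(np.array([2.0 * np.cos(a), 2.0 * np.sin(a), 1.0]))
         for a in angles]
print(len(poses), poses[0].shape)  # 60 (4, 4)
```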

3D Point Cloud Editing

A VPD model trained on DeformableCollision can be edited by modifying its latent 3D point cloud directly, as the examples below show; a code sketch follows them.

Original Scene

Deleted Cylinder

Deleted Floor
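
Because the latent state is an explicit set of particle positions and features, deletions like the ones above reduce to masking out particles before rendering. A minimal sketch; the z-threshold used to drop the floor is purely illustrative:

```python
import numpy as np

def delete_particles(positions, features, keep_mask):
    """Drop particles from the latent scene; the renderer never sees them."""
    return positions[keep_mask], features[keep_mask]

positions = np.random.rand(500, 3)  # stand-ins for an encoded scene
features = np.random.rand(500, 8)

# An edit analogous to "Deleted Floor": remove everything below z = 0.05.
keep = positions[:, 2] > 0.05
positions, features = delete_particles(positions, features, keep)
print(positions.shape)
```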

3D Point Cloud Interactive Visualization: Deformable Collision

Here we show the latent particles of a rollout in the Deformable Collision dataset by applying the VPD renderer to each particle location to get a color, then plotting the colored points in 3D.
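
A sketch of this plotting procedure; `particle_colors` below is a stand-in for querying the trained VPD renderer at each particle location:

```python
import numpy as np
import matplotlib.pyplot as plt

def particle_colors(positions, features):
    """Stand-in for the VPD renderer query; here we pretend the first
    three feature dimensions decode to RGB."""
    return np.clip(features[:, :3], 0.0, 1.0)

positions = np.random.rand(1000, 3)  # latent particle locations
features = np.random.rand(1000, 8)   # latent particle features
colors = particle_colors(positions, features)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(positions[:, 0], positions[:, 1], positions[:, 2], c=colors, s=2)
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
plt.show()
```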

Rendering with partial observations

This video shows renders from a full 360° of viewpoints, conditioned on only a single input image. The model never observes the "back" of the block or some parts of the floor, so there are no particles in those regions. However, the surrounding particles carry enough information for the renderer to complete the scene.
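
This completion is possible because the ray-based renderer aggregates features from particles near each sample point along a camera ray, so regions with no particles of their own can still be decoded from their neighbors. The sketch below shows one much-simplified form of such ray-based decoding, loosely following volume rendering; the k-nearest-neighbor inverse-distance pooling and the toy `decode` function are assumptions for illustration, not the paper's exact renderer:

```python
import numpy as np

def render_ray(origin, direction, positions, features, decode,
               near=0.1, far=4.0, num_samples=64, k=8):
    """Decode one camera ray against the particle set: pool features of
    the k nearest particles at each sample (inverse-distance weights),
    decode a density and color, and alpha-composite front to back."""
    ts = np.linspace(near, far, num_samples)
    dt = ts[1] - ts[0]
    color, transmittance = np.zeros(3), 1.0
    for x in origin + ts[:, None] * direction:
        d = np.linalg.norm(positions - x, axis=1)
        idx = np.argsort(d)[:k]
        w = 1.0 / (d[idx] + 1e-6)
        feat = (w[:, None] * features[idx]).sum(0) / w.sum()
        sigma, rgb = decode(feat)
        alpha = 1.0 - np.exp(-sigma * dt)
        color += transmittance * alpha * rgb
        transmittance *= 1.0 - alpha
    return color

# Toy decoder: density from the feature norm, color from the first 3 dims.
decode = lambda f: (np.linalg.norm(f), np.clip(f[:3], 0.0, 1.0))
color = render_ray(np.zeros(3), np.array([0.0, 0.0, 1.0]),
                   np.random.rand(200, 3), np.random.rand(200, 8), decode)
print(color)
```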

Full rollout comparisons

These videos show ground truth alongside 32-timestep predictions from VPD and each of our baselines.

MuJoCo Block

Deformable Block

Deformable Multi

Deformable Collision

Kubric MOVi-A

Kubric MOVi-B

Kubric MOVi-C

Additional MuJoCo Block rollouts

These videos show additional results comparing VPD (left) to ground truth (right). This dataset tests the model's ability to handle rigid objects and sharp collisions.