Modeling the Real World with
High-Density Visual Particle Dynamics

We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles (see: Fig. 1). To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs), called Interlacers, which interleave linear-attention Performer layers with graph-based neighborhood attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bimanual robots observed by two RGB-D cameras. Compared to previous GNN approaches, our Interlacer dynamics is twice as fast at the same prediction quality, and achieves higher quality by scaling to 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality in robotic box-pushing and can-grasping tasks (see: paper).

Fig. 1: Overview of the HD-VPD model with learned Encoder, Dynamics, and Renderer modules. The Encoders encode RGB-D images into a point cloud representation with latent per-point features. The Dynamics module predicts the evolution of the scene, conditioned on the current scene as well as a kinematic skeleton representing the motion of the robot. The Renderer is a Point-NeRF-style model that generates images of the predicted future scene. The entire model is trained end-to-end with a pixel-wise L2 loss.
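As a rough illustration of the data flow described in Fig. 1, the sketch below wires together NumPy stand-ins for the encoder, dynamics, and renderer and computes the pixel-wise L2 training loss. All function names, shapes, and the naive splatting renderer are assumptions for exposition, not the actual learned modules.

```python
import numpy as np

FEAT_DIM = 8  # size of the latent per-point feature vector (illustrative)

def encode(rgbd):
    """Lift an RGB-D image (H, W, 4) into particles: 3D positions + latent features."""
    h, w, _ = rgbd.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    depth = rgbd[..., 3]
    # Toy pinhole-style unprojection (assumed camera model).
    xyz = np.stack([xs * depth, ys * depth, depth], axis=-1).reshape(-1, 3)
    # Stand-in for a learned per-point feature extractor: tile the RGB values.
    feats = np.tile(rgbd[..., :3].reshape(-1, 3), (1, FEAT_DIM // 3 + 1))[:, :FEAT_DIM]
    return xyz, feats

def dynamics(xyz, feats, kin_xyz):
    """Predict per-particle displacements and feature deltas (the Interlacer in the real model)."""
    delta_xyz = np.zeros_like(xyz)      # placeholder for the learned update
    delta_feats = np.zeros_like(feats)  # kin_xyz would condition the prediction
    return xyz + delta_xyz, feats + delta_feats

def render(xyz, feats, hw=(32, 32)):
    """Point-NeRF-style renderer replaced by a trivial pinhole splat for illustration."""
    img = np.zeros(hw + (3,))
    z = np.maximum(xyz[:, 2], 1e-6)
    u = np.clip((xyz[:, 0] / z).astype(int), 0, hw[1] - 1)
    v = np.clip((xyz[:, 1] / z).astype(int), 0, hw[0] - 1)
    img[v, u] = feats[:, :3]
    return img

# End-to-end training signal: pixel-wise L2 between the rendered prediction
# and the ground-truth next frame (random arrays here as stand-ins for data).
rgbd_t = np.random.rand(32, 32, 4)
rgb_t1 = np.random.rand(32, 32, 3)
kin_xyz = np.random.rand(64, 3)          # kinematic particles for the robot's action
xyz, feats = encode(rgbd_t)
xyz_next, feats_next = dynamics(xyz, feats, kin_xyz)
loss = np.mean((render(xyz_next, feats_next) - rgb_t1) ** 2)
```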

HD-VPD rollouts

Below we show many long rollouts of HD-VPD on held-out examples. These rollouts are HD-VPD's predictions up to 24 timesteps (12 seconds) into the future, conditioned on two input frames (not shown) and the robot's actions.

Each visualization has six panels, arranged in three columns of two:

Left column

Top panel: HD-VPD's prediction in particle space. This shows how the model predicts each point will move. Note that these particles are a latent representation, and their positions are not supervised during training.

Bottom panel: The kinematic particles describing the robot's motion. These are an input to the model and represent the action the robot will take.

Middle column

Top panel: The video prediction rendered by HD-VPD from the perspective of the overhead camera.

Bottom panel: The video prediction rendered by HD-VPD from the perspective of the (moving) wrist camera.

Right column

Top panel: The ground-truth outcome video from the overhead camera.

Bottom panel: The ground-truth outcome video from the wrist camera.

Quantitative results

Fig. 3: Left to right: prediction quality, dynamics speed, and training memory, analyzed as a function of the number of particles. Note: the GNN cannot be run with 65K or 131K particles due to memory limitations. The Interlacer with 131K particles provides the best prediction quality while staying competitive with the GNN baselines in dynamics speed, and with Performer-PCTs in memory requirements. (a) Test-set SSIM prediction quality increases with the number of particles, and the Interlacer with 131K points does best. (b) The Interlacer is faster than the GNN while handling many more particles. Performer-PCT is faster still, but achieves worse results. (c) Performer-PCT and the Interlacer use less memory than the GNN baseline, enabling them to scale to larger point clouds.

Push planning

Results for planning a desired 0.125 m push. The x-axis is the planned push distance for different candidate push plans; the y-axis is the cost of each plan as measured via an HD-VPD rollout. The yellow point shows that a planned push of 0.125 m is the best option when the goal is to move the box 0.125 m, and the costs of the other push plans are ordered consistently by this cost function.
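As a hedged illustration of how such a cost function could rank candidate pushes, the sketch below scores a set of planned push distances against a 0.125 m goal. The stubbed rollout and the squared-error cost are assumptions for exposition, not the exact cost used to produce the figure.

```python
import numpy as np

def rollout_box_displacement(planned_push_m):
    """Stand-in for rolling out HD-VPD on a push plan and measuring how far the
    box's particles moved. Here an ideal model moves the box by the planned amount."""
    return planned_push_m

def plan_cost(planned_push_m, goal_displacement_m=0.125):
    """Hypothetical cost: squared error between predicted and desired box displacement."""
    predicted = rollout_box_displacement(planned_push_m)
    return (predicted - goal_displacement_m) ** 2

candidates = np.linspace(0.0, 0.25, 11)          # candidate push distances in meters
best = min(candidates, key=plan_cost)            # picks ~0.125 m, as in the figure
print(best)
```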

Grasp planning

Grasp offset from the object center vs. HD-VPD's predicted plan cost. With sufficiently large offsets, the grasps miss the can and fail. The HD-VPD latent dynamics understands that the points of the Coke can should lift up along the grasp trajectory for well-planned grasps, and that the can should not lift upwards for poorly planned grasps that fail in real-world execution.
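In the same spirit, a grasp-plan cost could reward predicted lift of the can's particles along the grasp trajectory. The snippet below is a hypothetical version of such a cost; the particle selection, desired lift, and squared-error form are all assumptions, not the paper's exact formulation.

```python
import numpy as np

def grasp_cost(can_xyz_before, can_xyz_pred_after, desired_lift_m=0.10):
    """Low cost when HD-VPD predicts the can's particles rise with the gripper."""
    lift = np.mean(can_xyz_pred_after[:, 2] - can_xyz_before[:, 2])
    return (lift - desired_lift_m) ** 2

# A grasp whose rollout lifts the can ~10 cm scores better than one where the
# predicted particles stay on the table (a missed grasp).
before = np.random.rand(500, 3)
print(grasp_cost(before, before + np.array([0.0, 0.0, 0.10])))  # near-zero cost
print(grasp_cost(before, before))                               # high cost
```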

Architecture

Fig. 2: Interlacer dynamics. The input point clouds from each timestep are processed by the neighbor-attender layers, followed by the Performer layers (see Appendix A for details). In the HD-VPD model, a separate third channel is reserved for processing the kinematic particles describing the robot's actions; these are preprocessed by a regular PCT layer. All of the preprocessed point clouds are then merged, and after one more neighbor-attender and Performer layer, the model predicts particles' displacements as well as deltas of their corresponding feature vectors. See Fig. 1 for how the Interlacer is integrated with the HD-VPD model.
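To make the layer ordering in Fig. 2 concrete, the sketch below spells out one Interlacer step with placeholder (identity) layers standing in for the learned neighbor-attender, Performer, and PCT blocks. The function names and shapes are illustrative assumptions.

```python
import numpy as np

def neighbor_attender(xyz, feats):
    return feats  # placeholder for the local, graph-based neighborhood attention

def performer(feats):
    return feats  # placeholder for the global linear-attention Performer layer

def pct_layer(feats):
    return feats  # placeholder for the regular PCT layer on kinematic particles

def interlacer_step(cloud_t0, cloud_t1, kinematic):
    """cloud_*: (xyz, feats) for the two conditioning timesteps; kinematic: (xyz, feats)."""
    processed = []
    for xyz, feats in (cloud_t0, cloud_t1):
        feats = neighbor_attender(xyz, feats)   # local geometry first
        feats = performer(feats)                # then global mixing
        processed.append((xyz, feats))
    kin_xyz, kin_feats = kinematic
    kin_feats = pct_layer(kin_feats)            # third channel for the robot's action

    # Merge all preprocessed point clouds into one particle set.
    xyz = np.concatenate([processed[0][0], processed[1][0], kin_xyz])
    feats = np.concatenate([processed[0][1], processed[1][1], kin_feats])

    # One more neighbor-attender + Performer, then predict deltas.
    feats = performer(neighbor_attender(xyz, feats))
    delta_xyz = np.zeros_like(xyz)              # learned displacement head in practice
    delta_feats = np.zeros_like(feats)          # learned feature-delta head in practice
    return xyz + delta_xyz, feats + delta_feats

cloud = lambda n: (np.random.rand(n, 3), np.random.rand(n, 8))
xyz_pred, feats_pred = interlacer_step(cloud(1000), cloud(1000), cloud(64))
```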

The Neighbor-Attender Layer

The Neighbor-Attender is designed to provide each particle with information about the geometry and features of its immediate neighborhood as efficiently as possible. A simple approach would find the k nearest neighbors of each particle and extract a summary of those neighbors' features and relative positions, but this would require a forward pass on kN particle-neighbor pairs. The Neighbor-Attender instead computes such neighborhood features only on a small, uniformly sampled subset of particles we call anchor particles, then uses the anchor particles to update the rest of the particles. This reduces the bottleneck step to only kN/r pairs for a subsampling rate r, allowing us to control memory consumption at will. The Neighbor-Attender layer consists of six steps, explained in detail below; steps 1-4 aggregate neighborhood features onto the anchor particles, and steps 5-6 use those features to update the remaining particles.
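A minimal NumPy sketch of this anchor-based scheme is shown below. The mean-pooling aggregation, nearest-anchor broadcast, and brute-force neighbor search are simplifying assumptions standing in for the learned attention operations and an efficient kNN implementation.

```python
import numpy as np

def neighbor_attender(xyz, feats, k=16, r=64, rng=np.random.default_rng(0)):
    n = xyz.shape[0]
    # Steps 1-4 (sketched): sample ~N/r anchor particles uniformly at random and
    # aggregate features from each anchor's k nearest neighbors, i.e. only
    # k*N/r particle-neighbor pairs instead of k*N.
    anchor_idx = rng.choice(n, size=max(1, n // r), replace=False)
    d_anchor = np.linalg.norm(xyz[anchor_idx, None] - xyz[None], axis=-1)   # (A, N)
    knn = np.argsort(d_anchor, axis=1)[:, :k]                               # (A, k)
    rel = xyz[knn] - xyz[anchor_idx, None]                                  # (A, k, 3)
    anchor_feats = np.concatenate([feats[knn], rel], axis=-1).mean(axis=1)  # (A, F+3)

    # Steps 5-6 (sketched): broadcast each anchor's neighborhood summary back to
    # every particle via its nearest anchor and update the particle features.
    nearest_anchor = np.argmin(d_anchor, axis=0)                            # (N,)
    update = anchor_feats[nearest_anchor][:, :feats.shape[1]]
    return feats + update

xyz = np.random.rand(4096, 3)
feats = np.random.rand(4096, 8)
out = neighbor_attender(xyz, feats)   # (4096, 8)
```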

Below, we provide a visualization of the Neighbor-Attender module of the Interlacer.