Relational Neural Expectation Maximization

Bouncing Balls with Variable Mass

Comparison of the physical dynamics learned by an RNN and by R-NEM. Both methods have been trained on sequences of 30 timesteps with 4 bouncing balls and are evaluated on sequences of 500 timesteps with the same number of balls. The sequences shown constitute a random subset of the test sequences.

Legend:

  • Top frame constitutes the ground truth sequence
  • Middle frame constitutes the prediction by the model
  • Bottom frame constitutes the grouping of pixels to components (in case of R-NEM)

Observations:

  • The RNN models the physical dynamics poorly (even when no interactions take place). In particular, notice how the individual balls jitter and don't move in a straight line. Moreover, notice that when a collision occurs, balls frequently swallow one another and stick together.
  • R-NEM accurately captures most of the physical dynamics in the environment. Balls move in straight trajectories, and most collisions are accurately captured. Notice that in most cases the grouping is perfect.
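The grouping shown in the bottom frame can be read as a soft assignment of every pixel to one of K components. The following is a minimal sketch of such an assignment, not the authors' implementation: it assumes a Gaussian pixel likelihood with a fixed variance, and the names `group_pixels`, `mu` and `sigma` are hypothetical.

```python
import numpy as np

def group_pixels(x, mu, sigma=0.25):
    """Soft-assign each pixel to one of K components.

    x:  (H, W) observed frame with values in [0, 1]
    mu: (K, H, W) per-component predicted pixel means (assumed input)
    Returns gamma with shape (K, H, W); gamma[:, i, j] sums to 1.
    """
    # Log-likelihood of each pixel under each component
    # (Gaussian with fixed variance; an assumption of this sketch).
    log_lik = -0.5 * ((x[None] - mu) / sigma) ** 2
    # Normalise across components with a numerically stable softmax.
    log_lik = log_lik - log_lik.max(axis=0, keepdims=True)
    gamma = np.exp(log_lik)
    return gamma / gamma.sum(axis=0, keepdims=True)

# Two components, each predicting one bright pixel in a 1x4 frame.
x = np.array([[1.0, 0.0, 0.0, 1.0]])
mu = np.array([[[1.0, 0.0, 0.0, 0.0]],
               [[0.0, 0.0, 0.0, 1.0]]])
gamma = group_pixels(x, mu)
hard_grouping = gamma.argmax(axis=0)  # component index per pixel
```

A grouping is "perfect" in this sense when every pixel of a ball is claimed by a single component and no component claims pixels from two balls.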

An RNN evaluated on 10 test sequences (trained on sequences with four balls)

R-NEM evaluated on 10 test sequences (trained on sequences with four balls)

Extrapolating learned physical dynamics to environments with more balls

Comparison of the physical dynamics learned by an RNN and by R-NEM when extrapolating to environments with more balls. Both methods have been trained on sequences of 30 timesteps with 4 bouncing balls and are evaluated on sequences of 500 timesteps with 6-8 balls. The sequences shown constitute a random subset of the test sequences.

Legend:

  • Top frame constitutes the ground truth sequence
  • Middle frame constitutes the prediction by the model
  • Bottom frame constitutes the grouping of pixels to components (in case of R-NEM)

Observations:

  • The behavior of the RNN on this environment is a more extreme version of what was observed on sequences with 4 balls
  • R-NEM is able to generalize most of the learned physical dynamics to sequences with an increased number of balls

An RNN evaluated on 10 test sequences (trained on sequences with four balls)

R-NEM evaluated on 10 test sequences (trained on sequences with four balls)

Bouncing Balls with an Invisible Curtain

Physical dynamics learned by R-NEM in a bouncing balls environment in which a curtain (spawned at a random location) occludes the balls. R-NEM has been trained on sequences of 30 timesteps with 3 bouncing balls and is evaluated on sequences of 500 timesteps with the same number of balls. The sequences shown constitute a random subset of the test sequences.

Legend:

  • Top frame constitutes the ground truth sequence
  • Middle frame constitutes the prediction by the model
  • Bottom frame constitutes the grouping of pixels to components (in case of R-NEM)

Observations:

  • R-NEM accurately captures the physical dynamics in this environment, even when a collision occurs behind the curtain
  • Notice that there is a one-to-one correspondence between components and objects, i.e. the component that models a ball that has been completely occluded is the same one that was modeling the ball before it was occluded.
  • This suggests that the system exhibits some degree of object-permanence, and that the representation of the ball in that component is meaningful, even when the ball is occluded
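One way to make the one-to-one correspondence concrete is to check, frame by frame, which component covers each ball and whether that component ever changes. The sketch below is purely illustrative and not part of the original evaluation: it assumes access to a soft grouping `gamma` and to ground-truth masks that mark each ball's true location, including frames in which the ball is hidden behind the curtain. The function names are hypothetical.

```python
import numpy as np

def component_per_object(gamma, masks):
    """Return, per timestep and object, the component covering most of it.

    gamma: (T, K, H, W) soft grouping of pixels to K components
    masks: (T, O, H, W) boolean ground-truth object masks (assumed input)
    Returns an integer array of shape (T, O).
    """
    # Total responsibility each component places on each object's pixels.
    overlap = np.einsum("tkhw,tohw->tok", gamma, masks.astype(gamma.dtype))
    return overlap.argmax(axis=-1)

def keeps_same_component(assignments):
    """True if every object is assigned the same component at all timesteps."""
    return bool((assignments == assignments[0]).all())

# Toy data: 3 timesteps, 2 components, 2 objects on a 1x2 frame.
T, K = 3, 2
gamma = np.zeros((T, K, 1, 2))
gamma[:, 0, 0, 0] = 1.0  # component 0 always owns the left pixel
gamma[:, 1, 0, 1] = 1.0  # component 1 always owns the right pixel
masks = np.zeros((T, 2, 1, 2), dtype=bool)
masks[:, 0, 0, 0] = True  # object 0 sits on the left pixel
masks[:, 1, 0, 1] = True  # object 1 sits on the right pixel

assignments = component_per_object(gamma, masks)
stable = keeps_same_component(assignments)
```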

R-NEM evaluated on 10 test sequences

Simulating the Bouncing Balls Environment

Comparison of the simulation quality of the bouncing balls environment by an RNN and by R-NEM. Both models have been trained on sequences of 30 timesteps with 4 bouncing balls and are evaluated on sequences of 50 timesteps, followed by (after the flash) a simulation of 100 timesteps with the same number of balls. The sequences shown constitute a random subset of the test sequences.

Legend:

  • The flash indicates the start of simulation
  • Top frame constitutes the ground truth sequence
  • Middle frame constitutes the prediction by the model
  • Bottom frame constitutes the grouping of pixels to components (in case of R-NEM)

Observations:

  • The RNN relies too much on feedback from the environment, so the moment simulation begins, the balls grow in size, slow down and no longer resemble the actual dynamics of the environment
  • The simulation quality of R-NEM is much better, especially when the grouping is accurate and the simulation runs for about 10-30 timesteps.
  • Notice that when the grouping is not perfect, only one of the balls that have been grouped into the same component remains
  • Moreover, when simulating for a much larger number of timesteps (>30-100) we observe deviations, including surreal cases in which the big balls shrink to the size of the small balls (and adopt their physical dynamics), which we attribute to a combination of noise and approximation of the E-step
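The difference between the observed phase and the simulation phase can be sketched as a two-stage rollout. This is a minimal illustration under the assumption that simulation means feeding the model's previous prediction back in as its next input; `rollout`, `step` and `toy_step` are hypothetical names, and `toy_step` is a stand-in for a trained model rather than R-NEM or the RNN.

```python
import numpy as np

def rollout(step, frames, n_observed, n_simulated):
    """Predict from ground-truth input, then continue on the model's own output.

    step:   function (state, frame) -> (state, predicted_next_frame)
    frames: (T, ...) ground-truth frames with T >= n_observed >= 1
    """
    state, preds = None, []
    # Observed phase: the model receives the true frame at every step.
    for t in range(n_observed):
        state, pred = step(state, frames[t])
        preds.append(pred)
    # Simulation phase: the previous prediction replaces the true frame,
    # so any error in a prediction is carried into all later steps.
    for _ in range(n_simulated):
        state, pred = step(state, preds[-1])
        preds.append(pred)
    return np.stack(preds)

# Toy stand-in for a trained model: shifts a 1-D frame one pixel to the right.
def toy_step(state, frame):
    return state, np.roll(frame, 1)

# Ground truth: a single bright pixel moving right by one pixel per timestep.
frames = np.stack([np.roll(np.eye(8)[0], t) for t in range(5)])
preds = rollout(toy_step, frames, n_observed=5, n_simulated=3)
```

Because errors are never corrected by a true frame during the second loop, small inaccuracies can compound, which is consistent with the deviations observed for long simulations.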

An RNN evaluated on 10 test sequences (trained on sequences with four balls)

R-NEM evaluated on 10 test sequences (trained on sequences with four balls)