Convolutional State Space Models for Long-Range Spatiotemporal Modeling
Jimmy T.H. Smith, Shalini De Mello, Jan Kautz, Scott W. Linderman, Wonmin Byeon
Below are randomly sampled trajectories from each model on each dataset. We show 16 samples per model and compare them to the ground truth. The green box marks frames in the context window used for conditioning (these are identical to the ground truth), while the red box marks generated frames. For the best comparison, we recommend setting the video players to 1080p resolution.
Moving-MNIST Samples
1200 frames generated conditioned on 100
DMLab Samples
156 frames generated conditioned on 144 (action-conditioned)
264 frames generated conditioned on 36 (no action-conditioning)
Minecraft Samples
156 frames generated conditioned on 144 (action-conditioned)
264 frames generated conditioned on 36 (action-conditioned)
Habitat Samples
156 frames generated conditioned on 144 (action-conditioned)
264 frames generated conditioned on 36 (no action-conditioning)