Below on the left, we visualize 300-frame videos, with 264 frames generated conditioned on the first 36, without action conditioning.
On the right, we visualize the 3D scenes produced by these videos. Auxiliary 3D data is used only for evaluation; generation uses only RGB frames.
TECO (ours)
Latent LDM
Perceiver-AR
CW-VAE
FitVid
Below on the left, we visualize 300-frame videos, with 156 frames generated conditioned on the first 144, with action conditioning.
On the right, we visualize the 3D mazes constructed from our video predictions, with 264 frames generated conditioned on the first 36, without action conditioning. Video prediction uses only RGB frames.