Unsupervised Learning for Physical Interaction through Video Prediction

Supplementary Appendix: https://goo.gl/G0ZIr4

Dataset coming soon.

Robot pushing evaluation
Below are example video predictions from the models in our evaluation on the robot interaction dataset. The ground truth video shows all ten time steps, whereas the other videos show only the eight generated time steps (conditioned only on the first two ground truth images and on all of the robot actions).
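As a concrete illustration of this conditioning scheme, here is a minimal rollout sketch: the model is warmed up on the ground truth context frames and then fed its own predictions. This is not the paper's implementation; the model.predict_next(frame, action, state) interface, variable names, and shapes are assumptions for illustration only.

```python
import numpy as np

def rollout(model, context_frames, actions, num_predictions=8):
    """Autoregressively predict future frames from context frames and actions (sketch).

    context_frames: ground truth conditioning frames, e.g. shape (2, H, W, 3).
    actions:        the full robot action sequence, one vector per time step.
    Each predicted frame is fed back as the model's input for the next step.
    """
    state = None
    # Warm up the recurrent state on the ground truth context frames.
    for frame, action in zip(context_frames, actions):
        prediction, state = model.predict_next(frame, action, state)  # assumed interface

    predictions = [prediction]
    # Continue the rollout from the model's own predictions.
    for action in actions[len(context_frames):len(context_frames) + num_predictions - 1]:
        prediction, state = model.predict_next(prediction, action, state)
        predictions.append(prediction)
    return np.stack(predictions)  # (num_predictions, H, W, 3)
```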

Sample video predictions - seen objects
ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]
[videos]

Sample video predictions - novel objects
ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]
[videos]
Note how the ConvLSTM model predicts motion less accurately than the CDNA model and degrades the background (e.g., the left edge of the table).

Changing the action
CDNA, novel objects
0x action // 0.5x action // 1x action // 1.5x action
[videos]
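The videos above reuse the same conditioning frames but scale every commanded action by a constant factor before the rollout. A minimal sketch of that manipulation, reusing the hypothetical rollout helper and names from the sketch above (actions are assumed to be numpy vectors):

```python
# Re-run the same rollout with the commanded actions scaled by a constant factor.
for scale in (0.0, 0.5, 1.0, 1.5):
    scaled_actions = [scale * a for a in actions]  # 0x, 0.5x, 1x, 1.5x action magnitude
    predictions = rollout(model, context_frames, scaled_actions)
```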

Visualized masks
CDNA, seen objects, masks 0 (background), 2, and 8
[videos]

CDNA, novel objects, masks 0 (background), 2, and 8
[videos]
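For context on what these masks are: the CDNA model predicts a set of transformed images plus a static background candidate, along with one compositing mask per candidate (mask 0 weights the background, and a channel-wise softmax makes the mask weights sum to one at each pixel); the predicted frame is the per-pixel, mask-weighted sum of the candidates. Below is a minimal numpy sketch of that compositing step; the function and array names are assumptions for illustration, not the paper's code.

```python
import numpy as np

def composite(background, transformed, masks):
    """Composite candidate images into one predicted frame (CDNA-style sketch).

    background:  (H, W, 3)     static background candidate (weighted by mask 0)
    transformed: (N, H, W, 3)  transformed image candidates (weighted by masks 1..N)
    masks:       (N + 1, H, W) per-pixel weights that sum to 1 across the first axis
    """
    candidates = np.concatenate([background[None], transformed], axis=0)  # (N + 1, H, W, 3)
    return np.sum(masks[..., None] * candidates, axis=0)                  # (H, W, 3)
```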


Human3.6M evaluation
Below are example video predictions from the models in our evaluation on the Human3.6M dataset, with a held-out human subject. The ground truth video shows ten time steps, whereas the other videos show the ten generated time steps (conditioned only on the first ten ground truth images, which are not shown).

Sample video predictions
ground truth // DNA // FF multiscale [14] // FC LSTM [17]
[videos]

Erratum
An early version of the paper included an error in the optical flow plots in Figure 4. The corrected plots are below: