Unsupervised Learning for Physical Interaction through Video Prediction

Links:
- Full Dataset (including training and test sets)

Robot pushing evaluation
Below are example video predictions from various models in our evaluation on the robot interaction dataset. The ground truth video shows all ten time steps, whereas all other videos show only the eight generated time steps (conditioned on only the first two ground truth images and all robot actions).

Qualitative comparison across models - seen objects
ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]

Qualitative comparison across models - novel objects
ground truth // CDNA // ConvLSTM, with skip // FF multiscale [14] // FC LSTM [17]
Note how the ConvLSTM model predicts motion less accurately than the CDNA model and degrades the background (e.g., the left edge of the table).
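The background stability of CDNA comes from its mask-based compositing: each predicted kernel transports pixels from the previous frame, and per-pixel masks convexly combine the results with an untouched copy of the previous frame. The following is an illustrative NumPy sketch under our reading of the model; the function name `cdna_composite` and the array layout are our own, not the released code.

```python
import numpy as np

def cdna_composite(prev_frame, kernels, masks):
    """Sketch of CDNA-style frame compositing (illustrative, not the paper's code).

    prev_frame: (H, W, C) previous image
    kernels:    (K, k, k) normalized motion kernels predicted by the network
    masks:      (K + 1, H, W) per-pixel compositing masks that sum to one;
                mask 0 passes the previous frame through as static background
    """
    H, W, C = prev_frame.shape
    K, k, _ = kernels.shape
    pad = k // 2
    padded = np.pad(prev_frame, ((pad, pad), (pad, pad), (0, 0)), mode="edge")

    # Each kernel transports pixels from the previous frame.
    transformed = [prev_frame]  # channel 0: untouched background
    for kern in kernels:
        out = np.zeros_like(prev_frame)
        for dy in range(k):
            for dx in range(k):
                out += kern[dy, dx] * padded[dy:dy + H, dx:dx + W]
        transformed.append(out)

    # Masks sum to one at every pixel, so the prediction is a convex
    # combination of the transformed frames -- static regions covered by
    # the background mask are copied through unchanged.
    pred = np.zeros_like(prev_frame)
    for m, t in zip(masks, transformed):
        pred += m[..., None] * t
    return pred
```

Because unmoving regions can be routed through the background mask, the model never has to regenerate them, which is one plausible explanation for the sharper table edge in the CDNA predictions above.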

Changing the action
CDNA, novel objects
0x action // 0.5x action // 1x action // 1.5x action
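The 0x/0.5x/1x/1.5x comparison simply rescales the commanded action sequence before it is fed to the predictor at each rollout step. A minimal sketch, assuming a hypothetical one-step `model(frame, action)` callable (not the released code):

```python
def rollout(model, context_frames, actions, action_scale=1.0):
    """Roll out an action-conditioned predictor (illustrative sketch).

    Conditions on the given context frames, then feeds the model's own
    predictions back in; scaling `actions` reproduces the 0x-1.5x
    comparison above.  `model` is a hypothetical one-step predictor.
    """
    frames = list(context_frames)
    preds = []
    for a in actions:
        nxt = model(frames[-1], a * action_scale)  # scaled action input
        frames.append(nxt)
        preds.append(nxt)
    return preds
```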

Randomly-sampled predictions, novel objects
ground truth // CDNA


Visualized masks
CDNA, seen objects, masks 0 (background), 2, and 8

CDNA, novel objects, masks 0 (background), 2, and 8

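The visualized masks are normalized per pixel: a channel-wise softmax makes them non-negative and sum to one at every location, so each pixel of the prediction is a convex combination of the transformed frames. A small sketch of that normalization (the function name is our own):

```python
import numpy as np

def compositing_masks(logits):
    """Turn raw network outputs into per-pixel compositing masks (sketch).

    logits: (K + 1, H, W) unnormalized mask activations; channel 0 is
    conventionally the static-background mask in our reading of the model.
    """
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stable softmax
    return e / e.sum(axis=0, keepdims=True)  # masks sum to 1 at each pixel
```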

Human3.6M evaluation
Below are example video predictions from various models in our evaluation on the Human3.6M dataset, with a held-out human subject. The ground truth video shows ten ground truth time steps, whereas all other videos show only the ten generated time steps (conditioned on the first ten ground truth images, which are not shown).

Sample video predictions
ground truth // DNA // FF multiscale [14] // FC LSTM [17]