PIP: PHYSICAL INTERACTION PREDICTION VIA MENTAL SIMULATION WITH SPAN SELECTION

Institute for Infocomm Research, A*STAR1, Singapore University of Technology and Design2, Nanyang Technological University, Singapore3

Abstract

Accurate prediction of physical interaction outcomes is a crucial component of human intelligence and is important for the safe and efficient deployment of robots in the real world. While there are existing vision-based intuitive physics models that learn to predict physical interaction outcomes, they mostly focus on generating short sequences of future frames based on physical properties (e.g. mass, friction and velocity) extracted from visual inputs or a latent space. However, there is a lack of intuitive physics models that are tested on long physical interaction sequences with multiple interactions among different objects. We hypothesize that selective temporal attention during approximate mental simulation helps humans predict physical interaction outcomes. With these motivations, we propose a novel scheme: Physical Interaction Prediction via Mental Simulation with Span Selection (PIP). It utilizes a deep generative model to approximate mental simulation by generating future frames of physical interactions, and then employs selective temporal attention in the form of span selection to predict physical interaction outcomes. To evaluate our model, we further propose the large-scale SPACE+ dataset of synthetic videos containing long sequences of three prime physical interactions in a 3D environment. Our experiments show that PIP outperforms humans as well as baseline and related intuitive physics models that utilize mental simulation. Furthermore, PIP's span selection module effectively identifies the frames indicating key physical interactions among objects, allowing for added interpretability.

SPACE+ Dataset

The SPACE+ dataset is an improved extension of the SPACE dataset [2]. It covers three fundamental physical interactions in a 3D environment: stability, contact and containment. The dataset comprises 57,057 synthesized videos with over 8 million frames in total for seen object classes, plus an additional 11,411 videos with over 1.7 million frames for unseen object classes. SPACE+ further introduces the unseen object classes U = {Suzanne, Truck, Airplane} on top of the previously seen object classes O = {Cylinder, Cone, Inverted Cone, Cube, Torus, Sphere, Flipped Cylinder}. To benchmark PIP against human performance, we use 1,000 scenes per task from the SPACE+ dataset and split them into 60% for training, 20% for validation and 20% for testing. Each video also comes with additional visual data attributes for each frame, such as a segmentation map, optical flow map, depth map and surface normal vector map.
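A minimal sketch of the per-task train/validation/test split described above is shown below. The 60/20/20 ratios and the 1,000 scenes per task come from the dataset description; the function name, random seed, and the use of plain scene indices are illustrative assumptions rather than the released SPACE+ tooling.

import random

def split_scenes(scene_ids, train_frac=0.6, val_frac=0.2, seed=0):
    """Split scene identifiers into 60/20/20 train/val/test subsets."""
    ids = list(scene_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * train_frac)
    n_val = int(len(ids) * val_frac)
    return {
        "train": ids[:n_train],
        "val": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],
    }

# 1,000 scenes per task, as used for the human benchmark comparison
splits = split_scenes(range(1000))
print({k: len(v) for k, v in splits.items()})  # {'train': 600, 'val': 200, 'test': 200}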

Data distribution of the SPACE+ dataset used for training and testing in terms of physical interaction outcome.

SPACE+ dataset analysis.

Experiments

PIP model architecture. (A) Data inputs: the original data inputs for our physical interaction prediction task comprise the first M frames, the first M target object masks and the task description. The task description differs for each of the three fundamental tasks to facilitate multi-task learning on the combined task (the task description for the stability task is shown in this diagram). (B) Mental simulation: the first M frames are fed into the mental simulation module, which consists of a ConvLSTM that generates the next N frames. (C) Span selection: the original data inputs and the generated N frames are fed into the span selection module, where pretrained models encode them into features before classification. All models are trained.
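The following is a minimal PyTorch-style sketch of the pipeline described in this caption: a ConvLSTM-based mental simulation module that rolls the first M observed frames forward by N generated frames, followed by a span selection module that scores every frame and classifies the physical interaction outcome from the weighted frame features. The module names, feature dimensions, single-cell ConvLSTM, and the soft frame-scoring used in place of the exact span selection head are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

class PIPSketch(nn.Module):
    def __init__(self, img_ch=3, hid_ch=32, feat_dim=128, n_classes=2):
        super().__init__()
        self.cell = ConvLSTMCell(img_ch, hid_ch)
        self.decode = nn.Conv2d(hid_ch, img_ch, 3, padding=1)    # hidden state -> next frame
        self.encode = nn.Sequential(                              # per-frame feature encoder
            nn.Conv2d(img_ch, feat_dim, 3, stride=2, padding=1),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.span_score = nn.Linear(feat_dim, 1)                  # frame-level selection score
        self.classifier = nn.Linear(feat_dim, n_classes)          # physical outcome prediction

    def forward(self, frames, n_future):
        # frames: (B, M, C, H, W) observed frames
        B, M, C, H, W = frames.shape
        h = frames.new_zeros(B, self.cell.hid_ch, H, W)
        state = (h, h.clone())
        for t in range(M):                                        # warm up on observed frames
            _, state = self.cell(frames[:, t], state)
        generated, x = [], frames[:, -1]
        for _ in range(n_future):                                 # mental simulation rollout
            h, state = self.cell(x, state)
            x = torch.sigmoid(self.decode(h))
            generated.append(x)
        seq = torch.cat([frames, torch.stack(generated, dim=1)], dim=1)
        feats = self.encode(seq.flatten(0, 1)).reshape(B, M + n_future, -1)
        weights = torch.softmax(self.span_score(feats).squeeze(-1), dim=1)
        pooled = (weights.unsqueeze(-1) * feats).sum(dim=1)       # soft span selection
        return self.classifier(pooled), weights, seq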

Human trial setup on physical interaction prediction tasks. Trial structure for familiarization trials (top) and test trials (bottom) with the observed frames, task queries and ground-truth frames.

Results

Accuracy results for seen (left) and unseen (right) object scenarios for all four physical interaction outcome prediction tasks. PIP outperforms most of the baselines and ablations for both seen and unseen object classes.

(A) Average test prediction accuracy and standard deviation for seen (left) and unseen object (right) scenarios across all models and seeds. (B) PIP’s frame selection frequencies on the test set for seen object scenarios across all seed runs.

An example of PIP’s generation and span selection corresponding to the first window of peak span selection frequencies in the stability task. For visualizations of key physical interaction moments in the other tasks, refer to our supplementary material.

An example of frame generation by PIP.

Questions?

Contact duanjiafei@hotmail.sg for more information about the project.