Learning Predictive Models From Observation and Interaction

Learning predictive models from interaction with the world allows an agent, such as a robot, to learn how the world works, and then use this learned model to plan coordinated sequences of actions that bring about desired outcomes. However, learning a model that captures the dynamics of complex skills is a major challenge: if the agent needs a good model to perform these skills, it may never be able to collect, on its own, the experience required to learn these delicate and complex behaviors. Instead, we can augment the training set with observational data of other agents, such as humans. Such data is likely more plentiful, but represents a different embodiment. For example, videos of humans might show a robot how to use a tool, but (i) they are not annotated with suitable robot actions, and (ii) they contain a systematic distributional shift due to the embodiment differences between humans and robots.

We address the first challenge by formulating the corresponding graphical model and treating the action as an observed variable for interaction data and as an unobserved (latent) variable for observation data, and the second challenge by using a domain-dependent prior. Our method is able to leverage additional passive observation videos in a driving dataset and in a dataset of robotic manipulation videos. A robotic planning agent equipped with our method can learn to use tools in a tabletop manipulation setting by observing humans, without ever seeing a robotic video of tool use.

Our system learns from action-observation sequences collected through interaction, such as robotic manipulation or autonomous vehicle data, as well as observations of another demonstrator agent, such as data from a human or a dashboard camera. By combining interaction and observation data, our model is able to learn to generate predictions for complex tasks and new environments without costly expert demonstrations.

We learn a predictive model of visual dynamics (solid lines in the figure) that predicts the next frame xt+1 conditioned on the current frame xt and an action representation zt. We optimize the likelihood of the interaction data, for which the actions are available, and of the observation data, for which the actions are missing. Our model leverages joint training on the two kinds of data by learning a latent representation z that corresponds to the true action.
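As a sketch of the training objective under the model above, and assuming a per-step latent prior p(z_t) with inference distribution q(z_t | x_t, x_{t+1}) (our notation; this is the standard conditional variational bound, written here for one prediction step):

```latex
\log p(x_{t+1} \mid x_t)
\;\ge\;
\mathbb{E}_{q(z_t \mid x_t,\, x_{t+1})}
\big[ \log p(x_{t+1} \mid x_t, z_t) \big]
\;-\;
D_{\mathrm{KL}}\!\big( q(z_t \mid x_t, x_{t+1}) \,\big\|\, p(z_t) \big)
```

For interaction data, the true action additionally constrains z_t through the action encoder; for observation data, z_t is inferred from the frames alone.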

Network architecture. To optimize the ELBO for the model above, we predict the latent action zt from xt and xt+1 using the inverse model qinv. When the true actions are available, we additionally predict the latent action from the true action at using the action encoder qact, and encourage the predictions from qact and qinv to be similar with a Jensen-Shannon divergence loss. The next frame is predicted from zt and xt.

Video Prediction Results

Robotic Manipulation

[Video panels: SAVP [2] vs. ours, on the sequences that are best for ours, best for SAVP, and at the median difference.]

Example predictions on the robotic dataset. We compare our model to a SAVP baseline trained with random robot data. The compared models had access to robot interaction data and action-free observation data of human tool use. We order sequences by the improvement in MSE of our method over SAVP [2], and select the sequences with the largest, the median, and the smallest improvement. By leveraging action-free data of human demonstrations, our model generates better predictions and tracks pushed objects more accurately.
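The sequence selection described in the caption can be reproduced in a few lines; here is a minimal sketch (function and variable names are ours), given per-sequence MSE values for both models:

```python
import numpy as np

def select_comparison_sequences(mse_ours, mse_baseline):
    """Return the indices of the sequences with the largest, median, and
    smallest improvement of 'ours' over the baseline, where the
    improvement on a sequence is (baseline MSE - our MSE)."""
    improvement = np.asarray(mse_baseline) - np.asarray(mse_ours)
    order = np.argsort(improvement)      # ascending improvement
    best_for_baseline = order[0]         # smallest improvement
    median = order[len(order) // 2]      # middle of the ranking
    best_for_ours = order[-1]            # largest improvement
    return best_for_ours, median, best_for_baseline
```

Reporting the best, median, and worst sequence for a method gives a less cherry-picked picture of qualitative results than showing only its best outputs.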

Driving

[Video panels: SAVP [2] vs. ours, on the sequences that are best for ours, best for SAVP, and at the median difference.]

Example predictions on the Singapore portion of the nuScenes dataset [1]. The compared models had access to interaction data from Boston and action-free observation data from Singapore. We order sequences by the improvement in MSE of our method over SAVP [2], and select the sequences with the largest, the median, and the smallest improvement. Our method produces higher-quality images than SAVP, which is unable to leverage the action-free data, and maintains the shapes of predicted objects for longer.

Action Prediction Results

Action predictions on human and robot data. The image sequences show the ground truth observations, while the arrows show the action in the (x, y) plane between each pair of frames. The blue arrow is the ground truth action, the green arrow is the action decoded from the output of the action encoder, and the red arrow is the action decoded from the output of the inverse model. The human data only has actions generated by the inverse model. Our model infers plausible actions in both domains, despite never seeing ground truth human actions.

Manipulation Results

Examples of a robot using our model to successfully complete tool use tasks. The robot must move the objects specified by the red symbols to the locations of the corresponding green symbols. We use visual model predictive control with our model to plan a trajectory. The robot is able to use a tool to move several objects to their goal locations simultaneously, because our model leverages observations of humans using tools.
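Visual model predictive control is typically implemented with a sampling-based planner such as the cross-entropy method (CEM): sample action sequences, score them with the learned predictive model, and refit a Gaussian to the best-scoring sequences. The sketch below is a generic CEM planner under our own simplified assumptions (a stand-in `predict` function instead of the learned video model, and a Euclidean goal cost), not the paper's implementation:

```python
import numpy as np

def plan_actions_cem(predict, start, goal, horizon=5, action_dim=2,
                     n_candidates=200, n_elites=20, n_iters=3, seed=0):
    """Cross-entropy-method planner, as used in visual MPC.

    `predict(state, actions)` is a stand-in for the learned model: it
    maps a start state and an action sequence to a predicted final state."""
    rng = np.random.default_rng(seed)
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    for _ in range(n_iters):
        # Sample candidate action sequences from the current Gaussian.
        actions = mean + std * rng.standard_normal(
            (n_candidates, horizon, action_dim))
        # Cost: distance between the predicted final state and the goal.
        costs = np.array([np.linalg.norm(predict(start, a) - goal)
                          for a in actions])
        # Refit the Gaussian to the lowest-cost ('elite') sequences.
        elites = actions[np.argsort(costs)[:n_elites]]
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean  # best action sequence found

# Toy stand-in dynamics: each action directly displaces the object.
toy_model = lambda state, actions: state + actions.sum(axis=0)
plan = plan_actions_cem(toy_model, np.zeros(2), np.array([1.0, -0.5]))
```

In the real system, the cost is computed from the predicted video frames (for example, the predicted position of a designated pixel relative to its goal), and only the first planned action is executed before replanning.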

Human Tool Use Dataset

We also present a dataset of humans pushing objects using tools, which is used in our experiments. Details and a download link are coming soon.

References

[1] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.

[2] Alex X. Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.