Learning Intuitive Physics with Multimodal Generative Models

Sahand Rezaei-Shoshtari,1,2 Francois R. Hogan,1 Michael Jenkin,1,3 David Meger,1,2 Gregory Dudek1,2

1 Samsung AI Center Montreal, 2 McGill University, 3 York University


In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI 2021).

TL;DR: Predicting motions of objects using vision and touch

Abstract

Predicting the future interaction of objects when they come into contact with their environment is key for autonomous agents to take intelligent and anticipatory actions. This paper presents a perception framework that fuses visual and tactile feedback to make predictions about the expected motion of objects in dynamic scenes. Visual information captures object properties such as 3D shape and location, while tactile information provides critical cues about interaction forces and the resulting object motion when an object makes contact with the environment. Utilizing a novel See-Through-your-Skin (STS) sensor that provides high-resolution multimodal sensing of contact surfaces, our system captures both the visual appearance and the tactile properties of objects. We interpret the dual stream signals from the sensor using a Multimodal Variational Autoencoder (MVAE), allowing us to capture both modalities of contacting objects and to develop a mapping from visual to tactile interaction and vice versa. Additionally, the perceptual system can be used to infer the outcome of future physical interactions, which we validate through simulated and real-world experiments in which the resting state of an object is predicted from given initial conditions.

Motivation

  • To extend the synergies between the senses of vision and touch to dynamic prediction: touch enables physical reasoning and direct measurement of the 3D contact surface, while vision provides a holistic view of the scene's projected appearance.

  • To predict the most informative and stable elements of a motion trajectory, since predicting the full trajectory of an object in motion is often challenging and unnecessary.

Approach Summary

  • We present a multimodal generative perceptual system that integrates visual, tactile, and 3D pose information using the Multimodal Variational Autoencoder (MVAE) (Wu and Goodman 2018) to predict the outcome of dynamic interactions.

  • The MVAE uses a Product of Experts (PoE) to approximate the joint posterior over all modalities as the product of the individual posteriors of each modality. This formulation scales efficiently with the number of modalities and inherently handles missing ones (see the sketch after this list).

  • We develop a novel visuotactile sensor, See-Through-Your-Skin (STS), that renders dual stream high-resolution images of the contact geometry and the external world using a semi-transparent surface and regulated internal lighting conditions.
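
As a concrete illustration, the snippet below sketches PoE fusion of Gaussian experts: each modality's inference network outputs a mean and log-variance, and the joint posterior combines the available experts with a unit-Gaussian prior expert by adding precisions and precision-weighting the means. The variable names, latent dimension, and random encoder outputs are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def product_of_experts(mus, logvars):
        # Fuse Gaussian experts N(mu_i, var_i) into one Gaussian by
        # multiplying densities: precisions add, means are precision-weighted.
        mus, logvars = np.asarray(mus), np.asarray(logvars)
        precisions = np.exp(-logvars)              # 1 / var_i
        joint_var = 1.0 / precisions.sum(axis=0)
        joint_mu = joint_var * (mus * precisions).sum(axis=0)
        return joint_mu, np.log(joint_var)

    # Illustrative usage: a unit-Gaussian prior expert plus per-modality
    # experts (the encoder outputs here are random stand-ins).
    latent_dim = 8
    prior = (np.zeros(latent_dim), np.zeros(latent_dim))            # N(0, I)
    visual = (np.random.randn(latent_dim), np.random.randn(latent_dim))
    tactile = (np.random.randn(latent_dim), np.random.randn(latent_dim))

    # All modalities observed: fuse every expert.
    mu_all, logvar_all = product_of_experts(
        [prior[0], visual[0], tactile[0]],
        [prior[1], visual[1], tactile[1]])

    # Tactile input missing: drop its expert and fuse the rest, which is
    # how the PoE formulation handles missing modalities.
    mu_vis, logvar_vis = product_of_experts(
        [prior[0], visual[0]], [prior[1], visual[1]])

Because the fusion is simply a product over whichever experts are present, handling a missing modality amounts to omitting its expert at inference time.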

Experiments

Results

  • Models that exploit multimodal sensing outperform those relying on the visual or tactile modality alone. Importantly, we find that tactile information improves prediction accuracy, reported below as binary cross-entropy (BCE; see the sketch after this list), by reasoning about interaction forces and the geometry of contact.

  • In dynamic scenarios where the intermediate states are not of interest, we can learn to predict the final outcome with higher accuracy without explicitly reasoning about intermediate steps.
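
The tables referenced below report prediction error as the binary cross-entropy (BCE) between predicted and ground-truth images. The following is a minimal sketch of such a per-pixel BCE score; the image shape and [0, 1] normalization are assumptions rather than the paper's exact evaluation code.

    import numpy as np

    def bce_error(pred, target, eps=1e-7):
        # Mean per-pixel binary cross-entropy between a predicted image
        # and its ground truth, both assumed normalized to [0, 1].
        pred = np.clip(pred, eps, 1.0 - eps)
        return float(-(target * np.log(pred)
                       + (1.0 - target) * np.log(1.0 - pred)).mean())

    # Illustrative call on random stand-ins for a predicted and a
    # ground-truth image (the 64x64x3 shape is an assumption).
    prediction = np.random.rand(64, 64, 3)
    ground_truth = np.random.rand(64, 64, 3)
    print(bce_error(prediction, ground_truth))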

Prediction error (BCE) for the simulation experiments.

Prediction error (BCE) for the real-world experiment.