Human brains represent a concept by grounding it on multiple sensory stimulus. So, we can imagine a person's face from his voice and also can search a scissor in a drawer only by exploring the inside with hands. For embodied agents to infer representations of the underlying 3D physical world they inhabit, they should efficiently combine multisensory cues from numerous trials, e.g., by looking at and touching objects. Despite its importance, multisensory 3D scene representation learning has received less attention compared to the unimodal setting. In this paper, we propose the Generative Multisensory Network (GMN) for learning latent representations of 3D scenes which are partially observable through multiple sensory modalities. We also introduce a novel method, called the Amortized Product-of-Experts, to improve the computational efficiency and the robustness to unseen combinations of modalities at test time. Experimental results demonstrate that the proposed model can efficiently infer robust modality-invariant 3D-scene representations from arbitrary combinations of modalities and perform accurate cross-modal generation. To perform this exploration, we also develop the Multisensory Embodied 3D-Scene Environment (MESE).
Combining Multisensory Cues to infer 3D structure
Grasping/manipulating objects with robot hand is one of the most interesting robotics tasks; for example, take a look at OpenAI et al., 2018. It is well known that efficient design of task-relevant representation is important in this task. As it is not easy to learn such representation from data, many works focus on controls, while using feature extractors trained via supervised learning (Pinto & Gupta 2016; OpenAI et al., 2018). However, what makes difficult to learn task-relevant information here?
This example task requires some challenging properties for their task-relevant representations. (1) Representations are supposed to abstract 3D information. (2) The environment or data acquisition process may have intrinsic stochasticity. (3) Agents commonly need to infer such 3d information from embedded camera or hands. This means that robots' sensors may not contain 3D information by themselves. In common conditions, those sensors may observe only small parts of the entire situation (partially observable). (4) Moreover, any representation of such environment needs to be sensory-agnostic; thus, it should contains multisensory information. For example, haptic and vision.
Can we learn representations to satisfy the desiderata? Can we infer 3D structure from haptics once we have such representations? What additional difficulties are associated in multisensory settings? The current work attempts to tackle these questions and proposes a method to learn sensory agnostic 3D representations using partially observable multisensory inputs.
Can we infer the shape of an object without seeing it?
Imagine if you have one hand and one eye. When someone helps you to touch a cup with your hand before seeing it, can you visualize how it might look like? Can you tell how you might perceive using your hand only by seeing it? This simple scenario encapsulates the aforementioned requirements.
Regarding these, we build a simulation environment, called Multisensory Embodied 3D-Scene Environment (MESE). Instead of a cup, we adopt Shepard-Metzler mental rotation objects from a recent work of DeepMind (Eslami et al., 2018). These objects have non-trivial 3D shapes, consisting of multiple cubes. Especially when each cube of single object randomly colored, it is not easy to infer the shape or colors of the object using partially observable image or haptics. For simulating hand, we employ the MPL hand model from the Mujoco HAPTIX library (Kumar & Todorov, 2015). In this environment, we (1) randomly generate single Shepard-Metzler object and (2) simulate the visual or haptic interactions and their resulting data. For more details about the environment's design, please find our paper.
An example simulation scenario is illustrated below. Imagine you have a single randomly sampled object. If a camera looks at this object from one viewpoint, the camera will see 2D image. If a hand grasps the object from one position using a predefined policy, the hand will obtain haptics information. MESE is designed to simulate this process and we generate approximately 1M of such objects and corresponding interactions in order to learn modality-invariant representations.
Multisensory Embodied 3D-Scene Environment (MESE)
Example single object multisensory scene in MESE
Generative Multisensory Network (GMN)
Our goal is now to learn a modality-invariant representation of the 3D object through visual and haptic interactions. Remember the cup example. You can experience a cup on a table only by touching or grabbing it from some hand poses. Someone may ask if you can visually imagine the appearance of the cup. We can define a generative model for such scenario as follows. Imagine you have some previous interaction experience (e.g. context in the below figure). If we assume to have a representation (scene representation) that abstracts all previous experience, we may able to predict what this object would look like from behind (observation senses) using such representation. It is also possible that our guess is not correct as previous experience is not sufficient. Using simulation data generated using MESE, we can train this conditional generative model to maximize likelihoods via variational method. Note that this way of formulating 3D scene representation is originally proposed with GQN model by DeepMind (Eslami et al., 2018). For more details about the models and their training, please take a look at our paper!
Generation process of GMN
Infer Latent 3D Representation from Multimodal Sensory Inputs
How our models would behave? Can they combine well haptic cues as well as visuals to infer 3D structure? We can get a glimpse of their properties by observing how the trained models predict the objects.
For instance, let's take a look at the left figure from the below. We provide one visual experience and multiple haptics to a model. Here the visual experience is not sufficient enough to guess the shape of the object. After that, the model is asked to predict 2D images from predefined positions. For specifically, our scene representation ("z") is a random variable, we can sample it multiple times. Four z values are sampled. Even with different representation values, all representation corresponds to the same shapes! It looks good. However, we sample this z by conditioning only on a single image with a particular angle. This results in different colors when we sampled different z. On the other hand, we can also observe that the prediction is relatively consistent where color is seen, compared to other parts.
Cross-modal inference using scene representation (1)
(a) visual and (b) haptic context. (c) generated image observation given image queries. (d) ground truth image observations.
Cross-modal inference using scene representation (2)
(a) x-axis: indices of haptic-query pairs in context. (b) ground truth image observations for given queries.
We can also show how the prediction improves along with the number of contexts. See the right figure above. Assume we have a similar setting. This time we will sample images with varying context conditions. Here, x-axis indicates indices of haptic interaction shown in the upper row. We only sample one z per each column. The first column is conditioned on this one visual query-sense pair and none of other context. As we don’t have much of evidence, it generates a random shape. The second column is conditioned on one more haptic experience, as well as the visual one from the previous. The next column is conditioned on more haptic evidences. We can see that the sample starts to fill up some parts where the hand touches!
Training with Missing Modalities
One prominent characteristic of multisensory is that data may not be always jointly observable, especially during training (missing modalities). For example, we see many new objects without any haptic interaction, but still we can guess how it would feel like when we grasp them. This is related to how we can aggregate multiple context experience to infer our scene representation z. One simple choice of such aggregation is summation (See "baseline" from below). This means you might have some hidden vectors from each experience and sum them up! An important benefit of this summation is that it is an order-invariant operation, meaning that the resulting sums are supposed to be the same regardless of the order of experiences. As long as, an encoder to read this representation is powerful enough, our model may able to infer 3D structure properly.
However, this simple solution may have potential drawback for missing modality training scenario. If my model didn't see certain combinations of different modalities during training, encoder may not able to handle that very new combination for test. Regarding this, Product-of-Experts (PoE) has been shown to provide a good solution in such scenarios (Wu & Goodman, 2018). For example, haptic encoder represent probability of an object with some amount of uncertainty. Additional visual information who can also have different types of uncertainty. But we need to represent the combined distribution as a product of them, always. This let each encoder to learn the uncertainty independently as well as train them from their product. During training even though there is no input from one sensory, the missing sensory would have its own uncertainty on 3d worlds. The rest of sensors would work independently as well.
In this study, we further observe that standard Product-of-Experts implementations require large memory and computations, especially for relatively large-scale models. In order to deal with the computational complexity, we introduce to use amortization, i.e. Amortized Product-of-Experts (APoE). Learn a single model that does everything!
Baseline model, Product-of-Experts (PoE), and Amortized PoE
In our paper, we demonstrate that the model with APoE learns better modality-agnostic representations, as well as modality-specific ones. We also show various experiments to compare the characteristics of the baseline model and APoE (and PoE). They are very interesting to look into :) Please find our paper for the details.
In this study, we propose the Generative Multisensory Network (GMN) for understanding 3D scenes via modality-invariant representation learning. In GMN, we introduce the Amortized Product-of-Experts (APoE) in order to deal with the problem of missing-modalities while resolving the space complexity problem of standard Product-of-Experts. In experiments on 3D scenes with blocks of different shapes and a human-like hand, we show that GMN can generate any modality from any context configurations. We also show that the model with APoE learns better modality-agnostic representations, as well as modality-specific ones. To the best of our knowledge this is the first exploration of multisensory representation learning with vision and haptics for generating 3D objects. Furthermore, we have developed a novel multisensory simulation environment, called the Multisensory Embodied 3D-Scene Environment (MESE), that is critical to performing these experiments.
On the other hand, there are many unanswered questions. For instance, it is important to know how this model would perform in more complex environments. It is also interesting to ask if learned representation is actually beneficial in downstream tasks, such as robot grasping task. Someone might be interested in a setting where we jointly learn the proposed model while robotics arms do their works.