Training with Missing Modalities
One prominent characteristic of multisensory is that data may not be always jointly observable, especially during training (missing modalities). For example, we see many new objects without any haptic interaction, but still we can guess how it would feel like when we grasp them. This is related to how we can aggregate multiple context experience to infer our scene representation z. One simple choice of such aggregation is summation (See "baseline" from below). This means you might have some hidden vectors from each experience and sum them up! An important benefit of this summation is that it is an order-invariant operation, meaning that the resulting sums are supposed to be the same regardless of the order of experiences. As long as, an encoder to read this representation is powerful enough, our model may able to infer 3D structure properly.
However, this simple solution may have potential drawback for missing modality training scenario. If my model didn't see certain combinations of different modalities during training, encoder may not able to handle that very new combination for test. Regarding this, Product-of-Experts (PoE) has been shown to provide a good solution in such scenarios (Wu & Goodman, 2018). For example, haptic encoder represent probability of an object with some amount of uncertainty. Additional visual information who can also have different types of uncertainty. But we need to represent the combined distribution as a product of them, always. This let each encoder to learn the uncertainty independently as well as train them from their product. During training even though there is no input from one sensory, the missing sensory would have its own uncertainty on 3d worlds. The rest of sensors would work independently as well.
In this study, we further observe that standard Product-of-Experts implementations require large memory and computations, especially for relatively large-scale models. In order to deal with the computational complexity, we introduce to use amortization, i.e. Amortized Product-of-Experts (APoE). Learn a single model that does everything!