Sound2Sight: Generating Visual Dynamics from Sound and Context

Learning associations across modalities is critical for robust multimodal reasoning, especially when one modality is missing. Exploiting such associations can help with occlusion reasoning, "seeing around corners", and aiding the hearing impaired, among other applications.


In this work, we consider the task of generating future video frames given the accompanying audio and the visual past. Towards this end, we propose a deep neural network with the following three components:

  • Prediction Network

  • Multimodal Stochastic Network

  • Multimodal Discriminator Network

As illustrated in the figure above, the prediction network generates the future frames one frame at a time. To do so, it samples a stochastic vector from the multimodal stochastic network at every time step; the stochastic network learns the distribution from which this sampling takes place. Finally, a multimodal discriminator judges the realism of the generated frames, as well as the smoothness of object motion and its synchrony with the audio channel. A minimal sketch of how these components fit together is given below.
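The following PyTorch sketch illustrates one way the three components could be wired together. All module names, feature dimensions, and the use of LSTM cells are illustrative assumptions; this is not the exact architecture from the paper.

```python
# Illustrative sketch of the three-component layout; dimensions and wiring are assumptions.
import torch
import torch.nn as nn


class StochasticNet(nn.Module):
    """Multimodal stochastic network: predicts a Gaussian over the latent z_t
    from frame and audio features, then samples via reparameterization."""
    def __init__(self, feat_dim=128, z_dim=32):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim * 2, 256)   # fuse video + audio features
        self.to_mu = nn.Linear(256, z_dim)
        self.to_logvar = nn.Linear(256, z_dim)

    def forward(self, frame_feat, audio_feat, state):
        h, c = self.rnn(torch.cat([frame_feat, audio_feat], dim=-1), state)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # sample z_t
        return z, mu, logvar, (h, c)


class PredictionNet(nn.Module):
    """Prediction network: generates the next frame feature one step at a time,
    conditioned on the previous frame feature, the audio feature, and z_t."""
    def __init__(self, feat_dim=128, z_dim=32):
        super().__init__()
        self.rnn = nn.LSTMCell(feat_dim * 2 + z_dim, 256)
        self.decode = nn.Linear(256, feat_dim)       # stand-in for a frame decoder

    def forward(self, prev_frame_feat, audio_feat, z, state):
        h, c = self.rnn(torch.cat([prev_frame_feat, audio_feat, z], dim=-1), state)
        return self.decode(h), (h, c)


class MultimodalDiscriminator(nn.Module):
    """Multimodal discriminator: scores (frame sequence, audio) pairs, standing in
    for the checks on realism, motion smoothness, and audio-visual synchrony."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feat_dim * 2, 128), nn.LeakyReLU(0.2), nn.Linear(128, 1))

    def forward(self, frame_feats, audio_feats):
        # frame_feats, audio_feats: (batch, time, feat_dim)
        joint = torch.cat([frame_feats, audio_feats], dim=-1)
        return self.score(joint).mean(dim=1)         # one score per clip


# Example one-step rollout with random feature vectors (batch of 4).
B, F, Z = 4, 128, 32
stoch, pred, disc = StochasticNet(F, Z), PredictionNet(F, Z), MultimodalDiscriminator(F)
frame, audio = torch.randn(B, F), torch.randn(B, F)
z, mu, logvar, s_state = stoch(frame, audio, None)
next_frame, p_state = pred(frame, audio, z, None)
score = disc(next_frame.unsqueeze(1), audio.unsqueeze(1))
```

At inference time such a rollout would be repeated step by step, feeding each generated frame back in as the previous frame for the next time step.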


Code Repository: Link

Datasets: We conducted experiments on three audiovisual datasets. The details may be found in the supplementary. Below we provide download links for all of them, along with the corresponding STFT and MFCC features; a sketch of how such features can be computed is shown after the links.

M3SO: Link

Audioset Drums: Link

YouTube Painting: Link
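For reference, the sketch below shows one way such STFT and MFCC audio features can be computed with librosa. The function name and all parameter values (sampling rate, FFT size, hop length, number of MFCCs) are assumptions and may differ from those used to produce the released features.

```python
# Illustrative audio feature extraction; parameters are assumptions, not the released settings.
import numpy as np
import librosa


def audio_features(wav_path, sr=16000, n_fft=512, hop_length=256, n_mfcc=20):
    y, sr = librosa.load(wav_path, sr=sr)
    # STFT magnitude spectrogram: (n_freq_bins, n_frames)
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))
    # MFCCs: (n_mfcc, n_frames)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
    return stft, mfcc
```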

Publication:

  • M. Chatterjee and A. Cherian, “Sound2Sight: Generating Visual Dynamics from Sound and Context”, European Conference on Computer Vision (ECCV), 2020.