Learning Audio-Visual Dynamics Using Scene Graphs

Intelligent systems need to draw meaningful deductions about objects in a scene by associating their visual appearance and motion with their audio signatures.

In this work, we consider the task of visually guided audio source separation and use this separated audio (derived from the visual conditioning information) to coarsely estimate the direction of motion of the sound source. Towards this end, we propose a deep neural network, Audio Separator and Motion Predictor (ASMP), with the following three architectural components:

* Visual Conditioning Module

* Audio-Separator Network

* Direction Prediction Network

As illustrated in the figure above, ASMP begins by leveraging its Visual Conditioning Module to create graph embeddings of potential auditory sources and their context nodes. To this end, the module employs Graph Attention Networks and Edge Convolutions, and importantly, the graph construction encodes scene geometry information. Next, the graph embeddings are used to condition a U-Net style network, the Audio Separator Network, which performs the audio source separation. Finally, this conditionally separated output is passed through the Direction Prediction Network to estimate the direction of motion of the sound source (one of 28 classes).
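To make the three-stage pipeline concrete, below is a minimal, hypothetical PyTorch / PyTorch Geometric sketch of the ASMP architecture. All layer sizes, the mean-pooled graph readout, and the concatenation-based conditioning at the U-Net bottleneck are illustrative assumptions and not the released implementation; please refer to the code repository linked below for the actual model.

```python
# Hypothetical sketch of the ASMP pipeline; module internals are assumptions.
import torch
import torch.nn as nn
from torch_geometric.nn import GATConv, EdgeConv


class VisualConditioningModule(nn.Module):
    """Graph embedding of candidate sound sources and context nodes."""
    def __init__(self, node_dim=512, hid_dim=256, out_dim=128):
        super().__init__()
        self.gat = GATConv(node_dim, hid_dim, heads=4, concat=False)
        self.edge_conv = EdgeConv(
            nn.Sequential(nn.Linear(2 * hid_dim, out_dim), nn.ReLU()),
            aggr="max",
        )

    def forward(self, node_feats, edge_index):
        # node_feats: (num_nodes, node_dim) visual features;
        # edge_index encodes scene-geometry-aware connectivity.
        h = torch.relu(self.gat(node_feats, edge_index))
        h = self.edge_conv(h, edge_index)
        return h.mean(dim=0)  # pooled graph embedding, shape (out_dim,)


class AudioSeparatorNetwork(nn.Module):
    """U-Net style mask predictor conditioned on the graph embedding."""
    def __init__(self, cond_dim=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(64 + cond_dim, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, 2, 1), nn.Sigmoid(),
        )

    def forward(self, mixture_spec, cond):
        # mixture_spec: (B, 1, F, T) magnitude spectrogram of the mixture.
        z = self.enc(mixture_spec)
        cond = cond.view(1, -1, 1, 1).expand(z.size(0), -1, z.size(2), z.size(3))
        mask = self.dec(torch.cat([z, cond], dim=1))
        return mask * mixture_spec  # separated source spectrogram


class DirectionPredictionNetwork(nn.Module):
    """Coarse direction-of-motion classifier over the separated audio."""
    def __init__(self, num_classes=28):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_classes),
        )

    def forward(self, separated_spec):
        return self.net(separated_spec)  # logits over direction classes
```

The sketch mirrors the description above: the graph embedding conditions the separator, and only the conditionally separated spectrogram is seen by the direction classifier.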

Code Repository: Link

ASIW Dataset: Link

ASIW Displacement Information: Link

Paper Link: Link

Poster Link: Link

Publication: