Learning Audio-Visual Dynamics Using Scene Graphs
Intelligent systems must reason about objects in a scene by associating their visual appearance and motion with their audio signatures.
In this work, we consider the task of visually guided audio source separation and use the separated audio (derived from the visual conditioning information) to coarsely estimate the direction of motion of the sound source. To this end, we propose a deep neural network, Audio Separator and Motion Predictor (ASMP), with the following three architectural components:
* Visual Conditioning Module
* Audio-Separator Network
* Direction Prediction Network
As illustrated in the figure above, ASMP begins by leveraging its Visual Conditioning module to create graph embeddings of potential auditory sources and their context nodes; this module employs Graph Attention Networks and Edge Convolutions, and, importantly, the graph construction encodes scene geometry information. Next, the graph embeddings condition a U-Net style network, the Audio Separator Network, which performs the audio source separation. Finally, this conditionally separated output is passed through the Direction Prediction Network to estimate the direction of motion of the sound source (one of 28 classes).
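To make the visual-conditioning step more concrete, below is a minimal NumPy sketch of a single-head graph-attention pass of the kind used to embed source and context nodes. This is an illustrative toy, not the paper's implementation: the node features, edge list, weight shapes, and the `graph_attention` helper are all hypothetical, and the actual model additionally uses edge convolutions and geometry-aware graph construction.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def graph_attention(node_feats, edges, W, a):
    """One single-head, GAT-style attention pass (illustrative sketch).

    node_feats: (n, d_in) array of per-node features.
    edges:      list of (src, dst) directed edges; dst attends to src.
    W:          (d_in, d_out) shared projection matrix.
    a:          (2 * d_out,) attention parameter vector.
    Returns (n, d_out) updated node embeddings.
    """
    h = node_feats @ W                     # project all node features
    leaky = lambda x: np.where(x > 0, x, 0.2 * x)
    out = np.zeros_like(h)
    for i in range(len(node_feats)):
        # Neighbors sending messages to node i, plus a self-loop.
        nbrs = [s for (s, d) in edges if d == i] + [i]
        # Unnormalized attention score for each neighbor.
        scores = np.array([leaky(a @ np.concatenate([h[i], h[j]]))
                           for j in nbrs])
        alpha = softmax(scores)            # normalize over the neighborhood
        out[i] = sum(w * h[j] for w, j in zip(alpha, nbrs))
    return out

# Toy scene graph: 3 potential-source nodes, 1 context node.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
edges = [(0, 3), (1, 3), (2, 3), (3, 0), (3, 1), (3, 2)]
W = rng.normal(size=(8, 16))
a = rng.normal(size=(32,))
emb = graph_attention(feats, edges, W, a)  # (4, 16) graph embeddings
```

In the full model, embeddings of this kind would then condition the U-Net separator (e.g. by injecting them at the bottleneck) rather than being used directly.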
Publication:
M. Chatterjee, N. Ahuja, A. Cherian, “Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation”, Advances in Neural Information Processing Systems 2022 (NeurIPS 2022).