Audio Visual Scene Graph Segmenter
Associating the visual appearance of real-world objects with their auditory signatures is critical for holistic AI systems, with practical applications in tasks such as audio denoising and musical instrument equalization.
In this work, we consider the task of visually guided audio source separation. Towards this end, we propose a deep neural network, Audio Visual Scene Graph Segmenter (AVSGS), with the following two components:
Visual Conditioning Module
Audio Separator Network
As illustrated in the figure above, AVSGS begins by leveraging its Visual Conditioning Module to create dynamic graph embeddings of potential auditory sources and their context nodes. To do so, the module employs Graph Attention Networks and edge convolutions. The resulting graph embeddings then condition a U-Net-style network, the Audio Separator Network, which performs the audio source separation.
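To make the conditioning idea concrete, here is a minimal NumPy sketch, not the authors' implementation: a single edge-convolution step (in the style of EdgeConv) produces context-aware node embeddings, which are pooled into a conditioning vector and applied to a hypothetical U-Net bottleneck via feature-wise (FiLM-style) modulation. All names, shapes, and the toy graph are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_conv(x, edges, w):
    """One EdgeConv-style step (illustrative): for each node i, max-pool a
    linear map of [x_i, x_j - x_i] over its outgoing edges (i, j), then ReLU."""
    n = x.shape[0]
    out = np.full((n, w.shape[1]), -np.inf)
    for i, j in edges:
        feat = np.concatenate([x[i], x[j] - x[i]])  # local + relative feature
        out[i] = np.maximum(out[i], feat @ w)
    out[np.isinf(out)] = 0.0          # nodes with no outgoing edges get zeros
    return np.maximum(out, 0.0)       # ReLU

# Toy scene graph: 3 nodes (e.g. "guitar", "hand", "amp") with directed edges.
x = rng.normal(size=(3, 4))           # initial node features
edges = [(0, 1), (1, 0), (0, 2)]
w = rng.normal(size=(8, 6))           # maps [x_i, x_j - x_i] (dim 8) to dim 6

node_emb = edge_conv(x, edges, w)     # (3, 6) context-aware node embeddings
cond = node_emb.mean(axis=0)          # pooled conditioning vector

# FiLM-style conditioning of a hypothetical U-Net bottleneck feature map
# (channels x frequency x time), standing in for the Audio Separator Network:
bottleneck = rng.normal(size=(6, 16, 16))
gamma, beta = 1.0 + 0.1 * cond, 0.05 * cond
conditioned = gamma[:, None, None] * bottleneck + beta[:, None, None]
```

In the actual model the embeddings are produced by attention-based graph layers over the visual scene graph; the sketch only shows how per-node context aggregation and audio-network conditioning fit together.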
Publication:
M. Chatterjee, J. Le Roux, N. Ahuja, A. Cherian, "Visual Scene Graphs for Audio Source Separation", International Conference on Computer Vision 2021 (ICCV 2021).