Audio Visual Scene-Graph Segmenter


Associating the visual appearance of real-world objects with their auditory signatures is critical for holistic AI systems, and finds practical use in several tasks such as audio denoising, musical instrument equalization, etc.

In this work, we consider the task of visually guided audio source separation. Towards this end, we propose a deep neural network, Audio Visual Scene Graph Segmenter (AVSGS), with the following two components:

  • Visual Conditioning Module

  • Audio-Separator Network

As illustrated in the figure above, AVSGS begins by leveraging its Visual Conditioning Module to create dynamic graph embeddings of potential auditory sources and their context nodes. To do so, this module employs Graph Attention Networks and Edge Convolution. The resulting graph embeddings are then used to condition a U-Net style network, called the Audio Separator Network, which performs the audio source separation.
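The two-stage pipeline described above can be sketched in miniature. The snippet below is an illustrative toy example, not the actual AVSGS implementation: it runs one GAT-style attention layer over a tiny scene graph of a source node and its context nodes, pools the node embeddings into a conditioning vector, and injects that vector at the bottleneck of a toy encoder-decoder that predicts a soft separation mask over a mixture spectrogram. All dimensions, weights, and the concatenation-based conditioning scheme are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(H, A, W, a):
    """One GAT-style layer: each node attends over its graph neighbors.

    H: (n, d) node features, A: (n, n) adjacency (1 = edge),
    W: (d, d') projection, a: (2*d',) attention vector.
    """
    Z = H @ W                                   # project node features
    n = Z.shape[0]
    logits = np.full((n, n), -1e9)              # mask out non-edges
    for i in range(n):
        for j in range(n):
            if A[i, j]:
                logits[i, j] = leaky_relu(a @ np.concatenate([Z[i], Z[j]]))
    alpha = softmax(logits, axis=1)             # attention over neighbors
    return np.maximum(alpha @ Z, 0.0)           # aggregate + ReLU

# --- toy visual scene graph: 1 potential source node + 2 context nodes ---
d_in, d_hid = 8, 6
H = rng.standard_normal((3, d_in))              # per-node visual features
A = np.ones((3, 3))                             # fully connected, with self-loops
W = rng.standard_normal((d_in, d_hid)) * 0.1
a = rng.standard_normal(2 * d_hid) * 0.1

node_emb = gat_layer(H, A, W, a)
graph_emb = node_emb.mean(axis=0)               # pooled conditioning vector

# --- toy audio separator: conditioning injected at the bottleneck ---
F_bins, T = 16, 10
spec = np.abs(rng.standard_normal((F_bins, T))) # mixture magnitude spectrogram
W_enc = rng.standard_normal((F_bins, d_hid)) * 0.1
W_dec = rng.standard_normal((2 * d_hid, F_bins)) * 0.1

bottleneck = np.maximum(W_enc.T @ spec, 0.0)    # (d_hid, T) encoded audio
cond = np.tile(graph_emb[:, None], (1, T))      # broadcast embedding over time
fused = np.concatenate([bottleneck, cond], 0)   # simple concat conditioning
mask = 1.0 / (1.0 + np.exp(-(W_dec.T @ fused))) # sigmoid soft mask in (0, 1)
separated = mask * spec                         # masked source estimate
```

In the real model the separator is a full U-Net and the graph embeddings are produced per candidate source, but the same pattern holds: visual context flows into the audio branch only through the conditioning vector.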


Code Repository: Link

ASIW Dataset: Link

Publication: