In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systematically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.
The Audio-Visual Event (AVE) dataset contains 4143 videos covering 28 event categories. Videos in AVE are temporally labeled with audio-visual event boundaries.
Tasks: (a) audio-visual event localization. (b) Cross-modality localization in both directions, V2A and A2V.
Networks: (a) Audio-visual event localization framework with audio-guided visual attention and multimodal fusion. Only one timestep is illustrated; the fusion network and fully connected (FC) layers are shared across all timesteps. (b) Audio-visual distance learning network.
Audio-guided visual attention: the attention network adaptively learns which visual regions in each video segment to attend to in order to find the corresponding sounding object or activity.
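As a minimal sketch of how such an attention step could work, the NumPy code below scores each visual region against the audio feature of the same segment, normalizes the scores with a softmax, and returns the weighted sum of region features. The projection matrices W_v, W_a, the scoring vector w_f, the tanh scoring function, and all dimensions (49 regions, 512-d visual, 128-d audio, 64-d hidden) are illustrative assumptions and are not taken from the text above.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def audio_guided_attention(visual, audio, W_v, W_a, w_f):
    """Attend over visual regions using one audio feature vector.

    visual: (k, d_v) features for k spatial regions of one segment
    audio:  (d_a,)   audio feature of the same segment
    W_v, W_a, w_f: hypothetical learned parameters (random here)
    """
    # Project both modalities into a shared space; the audio vector is
    # broadcast to every region.
    hidden = np.tanh(visual @ W_v + audio @ W_a)   # (k, d)
    scores = hidden @ w_f                          # (k,)
    weights = softmax(scores)                      # attention over regions
    attended = weights @ visual                    # (d_v,)
    return attended, weights

# Assumed sizes, for illustration only.
rng = np.random.default_rng(0)
k, d_v, d_a, d = 49, 512, 128, 64
visual = rng.standard_normal((k, d_v))
audio = rng.standard_normal(d_a)
W_v = 0.01 * rng.standard_normal((d_v, d))
W_a = 0.01 * rng.standard_normal((d_a, d))
w_f = rng.standard_normal(d)

attended, weights = audio_guided_attention(visual, audio, W_v, W_a, w_f)
print(attended.shape, weights.shape)
```

In a trained model the parameters would be learned jointly with the localization objective, so that regions containing the sounding object receive larger weights; with random parameters, as here, the weights carry no meaning.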
Dual Multimodal Residual Fusion: given audio and visual features from LSTMs, the fusion network computes the updated audio and visual features. The update strategy both preserves useful information in the original modality and adds complementary information from the other modality.
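The residual update described above can be sketched as follows. Each modality keeps its own feature through an identity (residual) path and adds a shared term computed from both modalities. The averaged linear form of the shared term, the tanh nonlinearity, and the matrices U_a and U_v are assumptions made for illustration, not a statement of the exact network.

```python
import numpy as np

def dual_residual_fusion(h_a, h_v, U_a, U_v):
    """One residual fusion step for a single timestep.

    h_a, h_v: (d,) audio and visual hidden states (e.g. LSTM outputs)
    U_a, U_v: (d, d) hypothetical learned projection matrices
    """
    # Shared fusion term that mixes information from both modalities.
    fused = 0.5 * (h_a @ U_a + h_v @ U_v)
    # Residual update: original feature is preserved, fused term is added.
    h_a_new = np.tanh(h_a + fused)
    h_v_new = np.tanh(h_v + fused)
    return h_a_new, h_v_new

rng = np.random.default_rng(0)
d = 8
h_a = rng.standard_normal(d)
h_v = rng.standard_normal(d)
U_a = 0.1 * rng.standard_normal((d, d))
U_v = 0.1 * rng.standard_normal((d, d))

h_a_new, h_v_new = dual_residual_fusion(h_a, h_v, U_a, U_v)
print(h_a_new.shape, h_v_new.shape)
```

With zero projection matrices the fused term vanishes and each output reduces to the nonlinearity applied to its own input, which illustrates the sense in which the original modality is preserved.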
Event localization prediction accuracy (%) on AVE dataset. A, V, V-att, A+V, A+V-att denote that these models use audio, visual, attended visual, audio-visual and attended audio-visual features, respectively. W-models are trained in a weakly supervised manner.
Event localization prediction accuracy (%) of different feature fusion methods on AVE dataset. These methods all use the same audio and visual features as inputs. Our DMRN model in the late fusion setting achieves better performance than all compared methods.
Accuracy on cross-modality localization. A2V: visual localization from audio segment query; V2A: audio localization from visual segment query. Our AVDL outperforms DCCA by a large margin on A2V (44.8 vs. 34.8) and by a smaller margin on V2A (35.6 vs. 34.1).
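To make the A2V and V2A setting concrete, the sketch below localizes a query segment from one modality inside a sequence of per-segment embeddings from the other modality. It slides a window of the query's length over the candidates and picks the window with the smallest summed distance. The sliding-window search, the Euclidean distance, and the function name localize_query are assumptions for illustration; in practice the embeddings would come from a learned distance network, whereas here they are random vectors.

```python
import numpy as np

def localize_query(query, candidates):
    """Find where a query segment best matches in the other modality.

    query:      (l, d) embeddings of the query (e.g. audio) segments
    candidates: (T, d) embeddings of all segments in the other modality
    Returns the start index of the best window and the cost of every window.
    """
    l, T = len(query), len(candidates)
    costs = np.array([
        # Sum of per-segment Euclidean distances for the window at s.
        np.linalg.norm(query - candidates[s:s + l], axis=1).sum()
        for s in range(T - l + 1)
    ])
    return int(np.argmin(costs)), costs

# Toy check: the query is copied from positions 3..5 of the candidates,
# so the best window should start at index 3.
rng = np.random.default_rng(0)
candidates = rng.standard_normal((10, 16))
query = candidates[3:6].copy()

start, costs = localize_query(query, candidates)
print(start)
```

A prediction would then count as correct when the returned window coincides with the ground-truth segment; the exact matching criterion behind the accuracies above is not stated here.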
This work was supported by NSF BIGDATA 1741472. We gratefully acknowledge the gift donations from Markable, Inc. and Tencent, and the support of NVIDIA Corporation with the donation of the GPUs used for this research. This article solely reflects the opinions and conclusions of its authors and not those of NSF, Markable, Tencent, or NVIDIA.