MarginNCE: Robust Sound Localization with a Negative Margin
Abstract
The goal of this work is to localize sound sources in visual scenes with a self-supervised approach. Contrastive learning for sound source localization leverages the natural correspondence between audio and visual signals: audio-visual pairs from the same source are assumed to be positives, while randomly selected pairs are treated as negatives. However, this assumption introduces noisy correspondences; for example, a positive pair whose audio and visual signals are unrelated to each other, or a negative pair that contains samples semantically similar to the positive one. Our key contribution is to show that using a less strict decision boundary in contrastive learning can alleviate the effect of noisy correspondences in sound source localization. We propose a simple yet effective approach that slightly modifies the contrastive loss with a negative margin. Extensive experimental results show that our approach performs on par with or better than state-of-the-art methods. Furthermore, we demonstrate that introducing a negative margin into existing methods results in a consistent improvement in performance.
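As a rough illustration of the idea in the abstract, the sketch below applies a margin to the positive logit of an InfoNCE-style loss. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `margin_nce_loss`, the similarity-matrix input, and the default values `margin=-0.2` and `temperature=0.07` are all hypothetical and chosen only for illustration.

```python
import math


def margin_nce_loss(sim, margin=-0.2, temperature=0.07):
    """Mean InfoNCE-style loss with a margin on the positive logit.

    sim[i][j] is an assumed similarity score between audio i and image j;
    the diagonal holds the positive pairs. The margin is subtracted from
    the positive score, so a negative margin raises the positive logit
    and relaxes the decision boundary. Default values are illustrative.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        # Negatives keep their raw similarity; only the positive is shifted.
        logits = [sim[i][j] / temperature for j in range(n)]
        logits[i] = (sim[i][i] - margin) / temperature
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / n


if __name__ == "__main__":
    scores = [[0.9, 0.1], [0.2, 0.8]]
    for m in (-0.2, 0.0, 0.2):
        print(m, margin_nce_loss(scores, margin=m))
```

In this sketch, setting `margin=0.0` recovers a plain InfoNCE-style loss, and a negative margin always yields a smaller loss for the same similarities, which is one way to read "a less strict decision boundary".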
Overview
Qualitative results
- Sound localization results on VGG-SS
Top left: Input image; Top right: Localization result (Ours);
Bottom left: Localization result (LVS); Bottom right: Localization result (EZVSL);
Man plays slide guitar
Train
Man plays a flute
Lawn mowers
Playing a guitar
Lawn mowers
Lawn mowers
Playing a piano
Typing on keyboard
Helicopter
Playing erhu
Wind chime
- Sound localization results on SoundNet-Flickr
Top left: Input image; Top right: Localization result (Ours);
Bottom left: Localization result (LVS); Bottom right: Localization result (EZVSL);
Crowd
Engine sound
Crowd
Train
Crowd
Speech
Speech
Engine sound
Speech
Engine sound
Playing flutes
Crowd
- Failure cases
Left: Input image; Right: Localization result (Ours);
Since most motorboat images contain water, it is difficult to detect the motorboat alone.
For certain birds, many images include cages, so the cage bars are often detected along with the bird.
If the object is small or quiet, or if there are many ambient sounds whose sources do not appear in the image, the sounding object cannot be properly detected.
Motorboat
Motorboat
Canary calling
Small object / Quiet sound
Small object / Many ambient sounds
Canary calling
Publication
"MarginNCE: Robust Sound Localization with a Negative Margin" [pdf]
Sooyoung Park*, Arda Senocak*, and Joon Son Chung
(*: equal contribution)
Bibtex
@article{park2022marginnce,
  title={MarginNCE: Robust Sound Localization with a Negative Margin},
  author={Sooyoung Park and Arda Senocak and Joon Son Chung},
  journal={arXiv preprint arXiv:2211.01966},
  year={2022}
}