Welcome to the companion website of "Singer Representation Learning Using Self-Supervised Techniques."
The website provides additional content not present in the paper. Code is available at github.com/SonyCSLParis/ssl-singer-identity.
We present:
Pipeline diagrams
Average similarity analysis between singers
Visualization of the singer embeddings in a lower-dimensional space (3D PCA and 2D t-SNE)
Please select a test dataset below to see the visualizations:
Below is a summary of the training pipeline. The two network branches operate in a Siamese (shared-weights) configuration, except for BYOL (see below).
Each technique regularizes the projection embedding space in a different way during training:
For BYOL [1], training differs slightly: we add a predictor layer on top of the first branch and update the second branch with a slowed-down (exponential moving average) copy of the first branch's weights.
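The slowed-down weight update above can be sketched as an exponential moving average (EMA). This is an illustrative sketch, not the paper's implementation; `ema_update` and `tau` are hypothetical names, and BYOL's actual target decay schedule may differ.

```python
# Hypothetical sketch of BYOL's target-branch update.
# The second (target) branch is not trained by gradient descent; each of its
# weights instead tracks an exponential moving average of the first (online)
# branch's corresponding weight.

def ema_update(online_weights, target_weights, tau=0.99):
    """Move each target weight a small step (1 - tau) toward its online counterpart."""
    return [tau * t + (1.0 - tau) * o
            for o, t in zip(online_weights, target_weights)]

# With tau = 0.99 the target moves only ~1% toward the online weights per step,
# which is what makes the second branch a "slowed-down" copy of the first.
online = [1.0, 2.0]
target = [0.0, 0.0]
target = ema_update(online, target)  # approximately [0.01, 0.02]
```

A high `tau` keeps the target branch stable, which is what prevents the two branches from collapsing to a trivial solution without needing negative pairs.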
References for the SSL techniques used to train the different models:
BYOL: Trained on BYOL configuration [1]
CONT: Trained in a SimCLR [2] setup, optimizing the decoupled contrastive loss [3]
VICReg: Trained on the variance, invariance, and covariance losses [4]
CONT-VC: Added variance and covariance regularization to CONT
UNIF: Trained on Uniformity-Alignment loss [5]
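As an illustration of how one of these objectives regularizes the embedding space, here is a sketch of the alignment and uniformity losses of [5] in NumPy. Function and variable names (`alignment_loss`, `uniformity_loss`, `t`) are illustrative and not taken from the paper's code.

```python
import numpy as np

def normalize(x):
    """Project embeddings onto the unit hypersphere (row-wise L2 normalization)."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def alignment_loss(z1, z2):
    """Mean squared distance between the embeddings of positive pairs:
    low when two views of the same singer map close together."""
    z1, z2 = normalize(z1), normalize(z2)
    return np.mean(np.sum((z1 - z2) ** 2, axis=1))

def uniformity_loss(z, t=2.0):
    """Log of the average pairwise Gaussian potential: low (more negative)
    when embeddings are spread uniformly over the hypersphere."""
    z = normalize(z)
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    off_diag = sq_dists[~np.eye(len(z), dtype=bool)]  # drop self-distances
    return np.log(np.mean(np.exp(-t * off_diag)))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 4))
z2 = z1 + 0.1 * rng.normal(size=(8, 4))  # slightly perturbed "positive" views
print(alignment_loss(z1, z2))  # small positive value: positives stay close
print(uniformity_loss(z1))     # negative value: points are spread out
```

Training on the sum of the two terms pulls positive pairs together (alignment) while pushing the whole batch to cover the hypersphere (uniformity), which avoids collapse without an explicit contrastive denominator.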
Selected baselines:
H/ASP: Pre-trained supervised speaker verification model which achieves 1.01% EER on VoxCeleb [6]
Wav2vec 2.0 pre-trained models: base [8] and XLSR-53 [9]
[1] J.-B. Grill et al., “Bootstrap your own latent - A new approach to self-supervised learning,” in NeurIPS, 2020.
[2] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020.
[3] C.-H. Yeh et al., “Decoupled contrastive learning,” in ECCV, 2022.
[4] A. Bardes, J. Ponce, and Y. LeCun, “VICReg: Variance-invariance-covariance regularization for self-supervised learning,” in ICLR, 2022.
[5] T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in ICML, 2020.
[6] Y. Kwon et al., “The ins and outs of speaker recognition: lessons from VoxSRC 2020,” in ICASSP, 2021.
[7] A. Saeed et al., “Contrastive learning of general-purpose audio representations,” in ICASSP, 2021.
[8] A. Baevski et al., “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in NeurIPS, 2020.
[9] A. Conneau et al., “Unsupervised cross-lingual representation learning for speech recognition,” 2020.