ANALYSIS OF SELF-SUPERVISED SPEECH MODELS ON CHILDREN’S SPEECH AND INFANT VOCALIZATIONS
Jialu Li1,2, Mark Hasegawa-Johnson1,2, Nancy L. McElwain2,3
1Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign
2Beckman Institute for Advanced Science and Technology, University of Illinois Urbana-Champaign
3Department of Human Development and Family Studies, University of Illinois Urbana-Champaign
This site contains supplementary materials for the original paper, including:
complete sets of canonical correlation analysis (CCA) scores for Wav2Vec 2.0 (W2V2) and HuBERT models
sample t-SNE plots of individual phonemes across layers for fine-tuned W2V2 models
repository for code and model weights
CCA scores
CCA scores for three pretrained self-supervised learning models
W2V2-LL4300h
W2V2-Base
HuBERT-Base
CCA scores for four fine-tuned W2V2 models
W2V2-Libri100h
W2V2-MyST
W2V2-Libri100h-Pro
W2V2-MyST-Pro
CCA scores for four fine-tuned HuBERT models
HuBERT-Libri100h
HuBERT-MyST
HuBERT-Libri100h-Pro
HuBERT-MyST-Pro
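For readers unfamiliar with CCA scores, the following is a minimal sketch of a plain (unweighted) mean canonical correlation between two sets of features. It is an illustration only: the function name `cca_score`, the synthetic data, and the choice of plain CCA are assumptions, and the paper may use a different CCA variant, so please refer to the paper and the released code for the exact procedure.

```python
import numpy as np


def cca_score(x, y):
    """Mean canonical correlation between views x (n, dx) and y (n, dy).

    Hypothetical helper for illustration. Assumes n is larger than dx and dy
    and that both views have full column rank after centering.
    """
    # Center each feature dimension.
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    # Orthonormal bases for the column spaces of the centered views.
    qx, _ = np.linalg.qr(x)
    qy, _ = np.linalg.qr(y)
    # The singular values of qx^T qy are the canonical correlations.
    rho = np.linalg.svd(qx.T @ qy, compute_uv=False)
    # Clip to guard against tiny numerical overshoot above 1.
    return float(np.clip(rho, 0.0, 1.0).mean())


# Usage with synthetic data (not data from the paper).
rng = np.random.default_rng(0)
layer_features = rng.standard_normal((2000, 4))
other_features = rng.standard_normal((2000, 4))
print(cca_score(layer_features, other_features))
```

A score near 1 means the two views span nearly the same linear subspace, while a score near 0 means they share little linear structure.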
t-SNE Plots
Methods:
We randomly selected 200 samples for each phone per dataset to draw the t-SNE plots. Below, we show sample t-SNE plots of layers 1, 4, 8, and 12 for W2V2-Libri100h, W2V2-MyST, and W2V2-MyST-Pro for sample vowels (/i/ and /ɑ/) and consonants (/p/ and /d/). W2V2-Libri100h-Pro follows a pattern similar to that of W2V2-MyST-Pro.
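The sampling and projection steps described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the released code: `sample_per_group`, `tsne_2d`, the `(phone, dataset, feature)` item layout, and the perplexity value are all hypothetical choices, and scikit-learn's `TSNE` stands in for whichever t-SNE implementation was actually used.

```python
import random
from collections import defaultdict


def sample_per_group(items, n_per_group=200, seed=0):
    """Randomly keep at most n_per_group features per (phone, dataset) pair.

    `items` is a list of (phone, dataset, feature_vector) tuples; this layout
    is an assumption made for the example.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for phone, dataset, feature in items:
        groups[(phone, dataset)].append(feature)
    # Sample without replacement; small groups are kept whole.
    return {
        key: rng.sample(features, min(n_per_group, len(features)))
        for key, features in groups.items()
    }


def tsne_2d(features, seed=0):
    """Project feature vectors to 2-D with scikit-learn's t-SNE."""
    # Imported lazily so the sampling helper works without these packages.
    import numpy as np
    from sklearn.manifold import TSNE

    x = np.asarray(features, dtype=np.float32)
    # Perplexity of 30 is an assumed setting; it must be below len(features).
    return TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(x)
```

In this sketch, the features sampled for one phone from each dataset would be concatenated, passed to `tsne_2d` once per layer, and plotted with one color per dataset.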
Observations:
At lower layers, we see clearly distinct phone clusters for adults, older children, and younger children.
As the layers progress toward the top layer in W2V2-Libri100h and W2V2-MyST, the phoneme clusters of younger children become mostly separated from the overlapping phoneme clusters of older children and adults.
In W2V2-MyST-Pro, the phoneme clusters of younger children mostly overlap with those of older children and remain distinct from those of adults.
plots of /i/
plots of /ɑ/
plots of /p/
plots of /d/
Code and model weights
You can find the code and model weights at the following links:
Children's Phoneme Recognition: https://huggingface.co/lijialudew/wav2vec_children_ASR
Infant Vocalization Classification: https://huggingface.co/lijialudew/wav2vec_LittleBeats_LENA
Paper/BibTeX Citation
If you found our paper helpful, please cite it as:
@inproceedings{li2024analysis,
title={Analysis of Self-Supervised Speech Models on Children's Speech and Infant Vocalizations},
author={Li, Jialu and Hasegawa-Johnson, Mark and McElwain, Nancy L},
booktitle={IEEE Workshop on Self-Supervision in Audio, Speech and Beyond (SASB)},
year={2024}
}
Contact
Jialu Li (she, her, hers)
Ph.D. candidate @ Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign
E-mail: jialuli3@illinois.edu
Homepage: https://sites.google.com/view/jialuli/