Below are the speech corpora I helped make freely available to the NLProc community.
This unannotated audio collection corresponds to 671 hours of radio broadcasts in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma.
19 hours of Tamasheq audio from radio broadcasts in Niger, aligned to French text translations. Made available in the context of IWSLT 2022.
A 5,130-sentence dataset from a real language documentation case. Mboshi (or Embosi) is a Bantu language spoken in Congo-Brazzaville. Multilingual extension available here.
A 330-sentence dataset from a real language documentation case. Griko is a dialect spoken in southern Italy that mixes Greek and Italian.
An aligned speech extension of the CMU dataset (monolingual alignments from the Bible). The result is an 8,130-sentence parallel speech dataset.
The eight covered languages are: English, Spanish, French, Hungarian, Romanian, Basque, Russian and Finnish.
mHuBERT-147 pre-trained models [LINK]
The fairseq fork for training [LINK]
Pre-processing and clustering scripts [LINK]
HUTTER: a mHuBERT-147 CommonVoice Prototype [LINK]
LeBenchmark French models (13 models, from 1K to 14K hours) [LINK]
LeBenchmark pre-processing and training scripts [LINK]
Recipe for the winning submission [LINK]
Seminar at CoML research group, ENS, Paris "mHuBERT-147: A Compact and Powerful Multilingual Speech Foundation Model" [SLIDES]
UTTER DAYS 2024 "mHuBERT-147: A Compact Multilingual HuBERT Model" (subset of the slide deck above)
UTTER DAYS 2023 "NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track" [SLIDES]
Seminar at GIPSA-LAB "A brief (low-resource) speech processing adventure in the realm of self-supervised models." [SLIDES] (24/11/22)
Seminar "Attention-based Unsupervised Word Segmentation from Speech" [SLIDES] (distilled version of the PhD defense). Invited seminar at:
NAVER Labs Europe (07/04),
Sheffield University (24/06),
Avignon University (02/07),
TALEP, LIS - Marseille (30/08),
ISCA SIGML special group (29/11),
CoML, ENS Paris (14/12).
PhD defense "Models and Resources for Attention-based Unsupervised Word Segmentation" [SLIDES] (09/07/2021)
Masters Defense "Unsupervised Word Discovery Using Attentional Encoder-Decoder Models" [SLIDES]
Invited panelist for Festival'IA 2023: "IA et multilinguisme : enjeux et état des lieux".
Invited panelist for RJCP 2023: "poursuites après le doctorat".
Invited panelist for SIGUL 2022: towards building technologies for low-resource languages while respecting the sovereignty of the communities.