Resources

Corpora

Below are the speech corpora I helped make freely available to the NLProc community.

Niger-Mali Audio Collection [DATA][PAPER]

This unannotated audio collection corresponds to 671 hours of radio broadcasts in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma.

Tamasheq-French Parallel Corpus [DATA][PAPER]

19 hours of Tamasheq audio from radio broadcasts in Niger, aligned to French text translations. Made available in the context of IWSLT 2022.

Mboshi-French Parallel Speech Corpus [DATA] [PAPER]

A 5,130-sentence dataset from a real language documentation scenario. Mboshi (or Embosi) is a dialect spoken in Congo-Brazzaville. Multilingual extension available here.

Griko-Italian Parallel Speech Corpus [DATA] [PAPER]

A 330-sentence dataset from a real language documentation scenario. Griko is a dialect spoken in southern Italy that mixes Greek and Italian.

MaSS Multilingual Speech-to-Speech Corpus [DATA][PAPER]

An aligned speech extension of the CMU dataset (monolingual alignments from the Bible). The result is a parallel speech dataset of 8,130 sentences.

The eight covered languages are: English, Spanish, French, Hungarian, Romanian, Basque, Russian and Finnish.

Code and Resources

mHuBERT-147 project:

mHuBERT-147 pre-trained models [LINK]
The fairseq fork for training [LINK]
Pre-processing and clustering scripts [LINK]
HUTTER: a mHuBERT-147 CommonVoice Prototype [LINK]

Multilingual DistilWhisper:

Language Experts collection [LINK]
Code for training and inference [LINK]

LeBenchmark and LeBenchmark 2.0 pre-trained models:

LeBenchmark French models (13 models, from 1K to 14K hours) [LINK]
LeBenchmark pre-processing and training scripts [LINK]

IWSLT 2023 - Multilingual Speech Translation:

Recipe for the winning submission [LINK]

IWSLT 2022 - French to Tamasheq Direct Speech Translation:

SpeechBrain recipe for low-resource speech translation [LINK]
wav2vec 2.0 model trained on Tamasheq speech (243 hours) [LINK]
wav2vec 2.0 model trained on the Niger-Mali audio collection (658 hours) [LINK]

Presentations and Seminars

2024:

Seminar at CoML research group, ENS, Paris "mHuBERT-147: A Compact and Powerful Multilingual Speech Foundation Model" [SLIDES]
UTTER DAYS 2024 "mHuBERT-147: A Compact Multilingual HuBERT Model" (subset of the slide deck above)

2023:

UTTER DAYS 2023 "NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track" [SLIDES]

2022:

Seminar at GIPSA-LAB "A brief (low-resource) speech processing adventure in the realm of self-supervised models." [SLIDES] (24/11/22)

2021:

Seminar "Attention-based Unsupervised Word Segmentation from Speech" [SLIDES] (distilled version of the PhD defense). Invited seminar at:
  - NAVER Labs Europe (07/04),
  - Sheffield University (24/06),
  - Avignon University (02/07),
  - TALEP, LIS - Marseille (30/08),
  - ISCA SIGML special group (29/11),
  - CoML, ENS Paris (14/12).
PhD defense "Models and Resources for Attention-based Unsupervised Word Segmentation" [SLIDES] (09/07/2021)

2018:

LIG PhD Day [SLIDES] [POSTER]
LICIA Workshop [SLIDES]

2017:

Masters Defense "Unsupervised Word Discovery Using Attentional Encoder-Decoder Models" [SLIDES]

Panels

2023:

Invited panelist for Festival'IA 2023: "IA et multilinguisme : enjeux et état des lieux".
Invited panelist for RJCP 2023: "poursuites après le doctorat".

2022:

Invited panelist for SIGUL 2022: towards building technologies for low-resource languages while respecting the sovereignty of the communities.
