Resources
Corpora
Below are the speech corpora I helped make freely available to the NLProc community.
This unannotated audio collection corresponds to 671 hours of radio broadcasts in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma.
19 hours of Tamasheq audio from radio broadcasts in Niger, aligned to French text translations. Made available in the context of IWSLT 2022.
A 5,130-sentence dataset from a real language documentation case. Mboshi (or Embosi) is a dialect spoken in Congo-Brazzaville. Multilingual extension available here.
A 330-sentence dataset from a real language documentation case. Griko is a dialect spoken in southern Italy that mixes Greek and Italian.
An aligned speech extension of the CMU dataset (monolingual alignments from the Bible). The result is an 8,130-sentence parallel speech dataset.
The eight covered languages are: English, Spanish, French, Hungarian, Romanian, Basque, Russian and Finnish.
Code and Resources
mHuBERT-147 project:
mHuBERT-147 pre-trained models [LINK]
The fairseq fork for training [LINK]
Pre-processing and clustering scripts [LINK]
HUTTER: a mHuBERT-147 CommonVoice Prototype [LINK]
Multilingual DistilWhisper:
LeBenchmark and LeBenchmark 2.0 pre-trained models:
LeBenchmark French models (13 models, from 1K to 14K hours) [LINK]
LeBenchmark pre-processing and training scripts [LINK]
IWSLT 2023 - Multilingual Speech Translation:
Recipe for the winning submission [LINK]
IWSLT 2022 - French to Tamasheq Direct Speech Translation:
Presentations and Seminars
2024:
ICASSP 2024 "Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts" [POSTER]
2023:
UTTER DAYS 2023 "NAVER LABS Europe's Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track" [SLIDES]
2022:
SIGUL 2022 "Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings" [SLIDES]
LREC 2022 "Speech Resources in the Tamasheq Language" [POSTER]
IWSLT 2022 "ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks" [POSTER]
Seminar at GIPSA-LAB "A brief (low-resource) speech processing adventure in the realm of self-supervised models." [SLIDES] (24/11/22)
2021:
Seminar "Attention-based Unsupervised Word Segmentation from Speech" [SLIDES] (distilled version of the PhD defense). Invited seminar at:
NAVER Labs Europe (07/04),
Sheffield University (24/06),
Avignon University (02/07),
TALEP, LIS - Marseille (30/08),
ISCA SIGML special group (29/11),
CoML, ENS Paris (14/12).
PhD defense "Models and Resources for Attention-based Unsupervised Word Segmentation" [SLIDES] (09/07/2021)
2019:
LIFT 2019 "How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages" [POSTER]
INTERSPEECH 2019 "Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings" [SLIDES]
2018:
LICIA Workshop [SLIDES]
INTERSPEECH 2018 "Unsupervised Word Discovery From Speech With Attention" [SLIDES]
SLTU 2018 Workshop "A small Griko-Italian speech parallel corpus" [SLIDES]
2017:
Panels
2023:
Invited panelist for Festival'IA 2023: "IA et multilinguisme : enjeux et état des lieux".
Invited panelist for RJCP 2023: "poursuites après le doctorat".
2022:
Invited panelist for SIGUL 2022: towards building technologies for low-resource languages while respecting the sovereignty of the communities.