Resources

Corpora

Below are the speech corpora I helped make freely available to the NLProc community.

Niger-Mali Audio Collection [DATA][PAPER]

This unannotated audio collection contains 671 hours of radio broadcasts in five languages: French from Niger, Fulfulde, Hausa, Tamasheq and Zarma.

Tamasheq-French Parallel Corpus [DATA][PAPER]

19 hours of Tamasheq audio from radio broadcasts in Niger, aligned to French text translations. Made available in the context of IWSLT 2022.

Mboshi-French Parallel Speech Corpus [DATA] [PAPER]

A 5,130-sentence data set from a true language documentation case. Mboshi (or Embosi) is a dialect spoken in Congo-Brazzaville. Multilingual extension available here.

Griko-Italian Parallel Speech Corpus [DATA] [PAPER]

A 330-sentence data set from a true language documentation case. Griko is a dialect spoken in southern Italy that mixes Greek and Italian.

MaSS Multilingual Speech-to-Speech Corpus [DATA][PAPER]

An aligned speech extension of the CMU data set (monolingual alignments from the Bible). The result is an 8,130-sentence parallel speech data set.

The eight languages covered are English, Spanish, French, Hungarian, Romanian, Basque, Russian and Finnish.

Code and Resources

mHuBERT-147 project:

Multilingual DistilWhisper:

LeBenchmark and LeBenchmark 2.0 pre-trained models:

IWSLT 2023 - Multilingual Speech Translation:

IWSLT 2022 - French to Tamasheq Direct Speech Translation:

Presentations and Seminars

2024:

2023:

2022:

2021:

2019:

2018: 

2017: 

Panels

2023:

2022: