Research
Publications
2024:
mHuBERT-147: A Compact Multilingual HuBERT Model. [PDF]
Multilingual DistilWhisper: Efficient Distillation of Multi-task Speech Models via Language-Specific Experts. [PDF]
LeBenchmark 2.0: a Standardized, Replicable and Enhanced Framework for Self-supervised Representations of French Speech. [PDF]
2023:
NAVER LABS Europe’s Multilingual Speech Translation Systems for the IWSLT 2023 Low-Resource Track. [PDF]
2022:
A Study of Gender Impact in Self-supervised Models for Speech-to-Text Systems. [PDF]
Unsupervised Word Segmentation from Discrete Speech Units in Low-Resource Settings. [PDF]
Speech Resources in the Tamasheq Language. [PDF]
Findings of the IWSLT 2022 Evaluation Campaign. [PDF]
ON-TRAC Consortium Systems for the IWSLT 2022 Dialect and Low-resource Speech Translation Tasks. [PDF]
Promises and Limitations of Self-supervised Learning for Automatic Speech Processing. [PDF]
LeBenchmark, un référentiel d’évaluation pour le français oral. [PDF] (French only)
Modèles neuronaux pré-appris par auto-supervision sur des enregistrements de parole en français. [PDF] (French only)
2021:
Task Agnostic and Task Specific Self-Supervised Learning from Speech with LeBenchmark. [PDF]
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. [PDF]
Investigating Alignment Interpretability for Low-resource NMT. [PDF]
2020:
Investigating Language Impact in Bilingual Approaches for Computational Language Documentation. [PDF]
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible. [PDF]
2019:
ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task. [PDF]
How Does Language Influence Documentation Workflow? Unsupervised Word Discovery Using Translations in Multiple Languages. [PDF]
Empirical Evaluation of Sequence-to-Sequence Models for Word Discovery in Low-resource Settings. [PDF]
2018:
Unsupervised Word Segmentation From Speech With Attention. [PDF]
A small Griko-Italian speech translation corpus. [PDF]
A very low resource language speech corpus for computational language documentation experiments. [PDF]
2017:
Unwritten Languages Demand Attention Too! Word Discovery with Encoder-Decoder Models. [PDF]
Unsupervised Word Discovery Using Encoder-Decoder Models. [PDF]
2014:
Size does not matter. Frequency does. A study of features for measuring lexical complexity. [PDF]
Uma análise do perfil de entropia das estruturas sintáticas do português. [PDF]
PhD Thesis (2021):
Models and Resources for Attention-based Unsupervised Word Segmentation. [PDF]
Master's Thesis (2017):
Unsupervised Word Discovery Using Attentional Encoder-Decoder Models. [PDF]
Work in Conferences
Scientific Committee: LREC 2020, ACL 2020, SLTU-CCURL 2020, EMNLP 2020, EACL 2021, EMNLP 2021, ACL 2022, LREC 2022, SIGUL 2022, NAACL 2022, EACL 2022, GITT 2023, ILLC-NLP 2024, INTERSPEECH 2024
Website Chair - CoNLL 2019
Social Media and Communications Chair - PROPOR 2018
Local Organization Committee - LTT 2018, TALN 2022, RÉCITAL 2022
External Reviewer - SBAC-PAD 2018
Internal Communications Chair - ACL 2022
Task organizer: IWSLT 2022 (low-resource track).
Co-chair - RÉCITAL 2022, SASB 2023, SASB 2024