CREAM Project

Documentation des langues CREoles Assistée par la Machine

Welcome

CREAM is a research project funded by the ANR (Agence Nationale de la Recherche, CE38).

It aims to develop new techniques for language documentation, especially for Creole languages.

Team and Funding

CREAM is a project selected by the Agence Nationale de la Recherche (CE38 - Révolution numérique : rapports au savoir et à la culture [Digital revolution: relationships to knowledge and culture], 2020 generic call for proposals). It is coordinated by Emmanuel Schang (Univ. Orléans).

It gathers researchers from two French teams:

  • Laboratoire Ligérien de Linguistique (UMR 7270, CNRS & Université d'Orléans)

  • LIG-GETALP (Université Grenoble-Alpes)

And associated members from:

  • University of the West Indies (Mona), Jamaica

  • University of Coimbra, Portugal

  • ZAS Berlin, Germany

  • University of Ziguinchor, Senegal

  • University at Buffalo, USA

  • FLA, Université d'Etat d'Haïti


Motivation and Challenges

Most language documentation projects rely on successive phases of recording, transcription and translation. But transcribing is a time-consuming task. Moreover, when the language has a poorly adopted orthography (or no orthography at all), the resulting transcriptions are not reliable enough for developing tools.

These difficulties correspond to the well-known "transcription bottleneck" problem. We propose instead to:

(a) give priority to the translation of the recordings rather than their transcription

(b) focus on a ‘sparse transcription’, where only a subset of the collected data is transcribed, while spoken term detection (or query-by-example) methods make it possible to retrieve repeated terms in a speech collection.

Aim

Therefore, the aim of this project is to pave the way for disruptive methods in language documentation and resource building, with a focus on Creole languages. By using cutting-edge machine-learning technologies, we seek to change the way language documentation is carried out, both in building language resources and in processing spoken corpora. In the CREAM project, we will try to bypass the transcription problem by focusing on spoken term detection and query-by-example technologies.

In concrete terms, we propose to (1) enrich the existing recordings in three different Creoles with spoken translations in the lexifier (dominant) language and (2) develop specific tools for alignment, annotation and keyword spotting on those speech collections.

Credits: The pictures on these pages are the property of E. Schang. The logo was made by the students of the Master in Linguistics (Univ. Orléans). The ANR logo was provided by the ANR.

Contacts: emmanuel dot schang at univ-orleans dot fr