Interspeech 2021

Special Session


SHARED TASK ON AUTOMATIC SPEECH RECOGNITION

FOR NON-NATIVE CHILDREN’S SPEECH


The availability of large amounts of training data and of large computational resources has made Automatic Speech Recognition (ASR) technology usable in many application domains, and recent research has demonstrated that ASR systems can achieve performance levels that match human transcribers for some tasks. However, ASR systems still struggle when applied to speech produced by certain types of speakers, in particular non-native speakers and children.

Several phenomena that regularly occur in non-native speech can greatly reduce ASR performance, including mispronounced words, ungrammatical utterances, disfluencies (such as false starts, partial words, and filled pauses), and code-switched words. ASR for children's speech can be challenging due to differences from adult speech at many linguistic levels (acoustic, prosodic, lexical, morphosyntactic, and pragmatic) caused by physiological differences (e.g., shorter vocal tract lengths), cognitive differences (e.g., different stages of language acquisition), and behavioral differences (e.g., whispered speech). Developing ASR systems for both of these domains is made more challenging by the lack of publicly available databases of non-native speech and of children's speech.

Despite these difficulties, a significant portion of the speech transcribed by ASR systems in practical applications may come from non-native speakers (e.g., newscasts, movies, internet videos, human-machine interactions, human-human conversations in telephone call centers, etc.) and from children (e.g., educational applications, smart speakers, speech-enabled gaming devices, etc.). It is therefore necessary to continue improving ASR systems so that they can accurately process speech from these populations. An additional important application area is the automatic assessment of second-language speaking proficiency, where the difficulties for ASR are compounded by the low proficiency levels of the speakers, especially if they are children. The lack of training data is especially pronounced for this population (non-native children's speech).

With this special session, which follows the one organized at Interspeech 2020, we intend to advance research on ASR technology for non-native children's speech.

To reach this goal, we will distribute a new set of data that, in addition to the data used for the 2020 challenge, contains further training data for English (acquired from speakers with different native languages) as well as data for developing a German ASR system for non-native children. The spoken responses in the data set were produced in the context of both English and German speaking proficiency examinations.

The following data will be released for this shared task:

• ~100 hours of transcribed English speech, to be used as a training set

• ~6 hours of transcribed English speech (3 hours to be used as a development set and 3 hours as a test set)

• ~5 hours of transcribed German speech, to be used as a training set

• ~60 hours of untranscribed German speech, to be used as a training set

• ~2.5 hours of transcribed German speech (1 hour to be used as a development set and 1.5 hours as a test set)

For both languages, English and German, a baseline ASR system and evaluation scripts will be provided. More information can be found in the document at the following link.
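
System performance in this shared task is reported as word error rate (WER), as in the result tables below. The following Python snippet is a minimal illustrative sketch of how WER is typically computed via word-level edit distance; it is not the official evaluation script distributed with the baselines.

    # Minimal illustrative sketch of WER (word error rate) computed as a
    # word-level Levenshtein distance. NOT the official evaluation script
    # provided with the baselines; it only demonstrates the standard metric.

    def wer(reference: str, hypothesis: str) -> float:
        """Return (substitutions + deletions + insertions) / #reference words."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i                      # i deletions
        for j in range(len(hyp) + 1):
            dp[0][j] = j                      # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
                dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
        return dp[-1][-1] / max(len(ref), 1)

    # Example: one deletion against a 6-word reference gives 1/6 = 16.67% WER.
    print(f"{100 * wer('the cat sat on the mat', 'the cat sat on mat'):.2f}%")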

Important Dates

• Release of training data, development data, and baseline systems: February 10, 2021

• Test data released: March 10, 2021

• Submission of results on test set: March 20, 2021

• Test results announced: March 23, 2021

Submission

The shared task will consist of two tracks for each of the two languages, English and German: a closed track and an open track. In the closed track, only the training data distributed as part of the shared task may be used to train the models (note that the English data cannot be used to train the German models and vice versa); in the open track, any additional data may be used to train the models.

How to participate

The resources provided for the challenge are released as the following five compressed packages:

  • ETLT2021_ETS_EN.tgz: transcribed audio data in English provided by ETS;

  • ETLT2021_FBK_EN.tgz: transcribed audio data in English, additional text data and lexica provided by FBK;

  • ETLT2021_CAMBRIDGE_EN_baseline.tgz: Kaldi English baseline, provided by Cambridge University;

  • ETLT2021_FBK_DE.tgz: transcribed audio data in German, additional text data and lexica provided by FBK;

  • ETLT2021_FBK_DE_baseline.tgz: Kaldi German baseline, provided by FBK.


  • Follow this link to download the user license for ETLT2021_ETS_EN.tgz and send the signed license to "amisra001@ets.org".

  • Follow this link to download the user license for both ETLT2021_FBK_EN.tgz and ETLT2021_FBK_DE.tgz and send the signed license to "falavi@fbk.eu".

After submitting the signed licenses, you will receive the links to download the data packages, including the baselines.


EVALUATION DATA ARE NOW AVAILABLE!


Submit results

Use the CodaLab submission portal to register and submit your results, and use the forum in the CodaLab portal to ask the organizers questions. Submissions will open on March 10 and close at midnight on March 17 (midnight anywhere in the world, i.e., 12:00 UTC on March 18). Submissions should include the ASR output produced by the system and a brief description of the system (see the documentation on the CodaLab submission portal for further instructions on formatting and the submission procedure).

Participating teams may make at most one submission per day, up to a maximum of 7 submissions per team. Results will be displayed on a leaderboard throughout the week that the submission site is open.


RESULTS

ENGLISH CLOSED TRACK: total submissions=18, total participants=9

===========================
RANK   %WER   TEAM
  1    25.69  Netease Youdao
  2    29.27  NTNU SMIL
  3    29.74  PAII
  4    31.22  -----
  5    31.36  -----
  6    32.21  DA-IICT_SRI-B
  7    33.18  IDIAP
  8    33.21  baseline
  9    37.05  tal_speech [did not use all available training data]
===========================


ENGLISH OPEN TRACK: total submissions=17, total participants=7

===========================
RANK   %WER   TEAM
  1    23.98  Netease Youdao
  2    29.08  NTNU SMIL
  3    29.58  ------
  4    29.63  PAII
  5    30.61  University of Dublin
  6    33.21  baseline
  7    37.05  tal_speech [used very limited training data]
===========================


GERMAN CLOSED TRACK: total submissions=25, total participants=8

===========================
RANK   %WER   TEAM
  1    23.50  tal_speech
  2    38.55  Netease Youdao
  3    39.98  UCLA SPAPL
  4    40.04  IDIAP
  5    40.51  -----
  6    40.63  -----
  7    43.13  NTNU SMIL
  8    45.21  baseline
===========================


GERMAN OPEN TRACK: total submissions=20, total participants=6

===========================
RANK   %WER   TEAM
  1    23.50  tal_speech
  2    39.98  UCLA SPAPL
  3    40.27  -----
  4    40.69  -----
  5    40.87  Netease Youdao
  6    45.21  baseline
===========================



Organizers

Daniele Falavigna, Fondazione Bruno Kessler, falavi@fbk.eu

Roberto Gretter, Fondazione Bruno Kessler

Marco Matassoni, Fondazione Bruno Kessler

Abhinav Misra, Educational Testing Service, amisra001@ets.org

Chee Wee Leong, Educational Testing Service

Kate Knill, Cambridge University

Linlin Wang, Cambridge University