Interspeech 2020 Special Session

Shared Task on Automatic Speech Recognition for Non-Native Children’s Speech

This shared task will help advance the state-of-the-art in automatic speech recognition (ASR) by considering a challenging domain for ASR: non-native children's speech. A new data set containing English spoken responses produced by Italian students will be released for training and evaluation. The spoken responses in the data set were produced in the context of an English speaking proficiency examination. The following data will be released for this shared task: training set of 49 hours of transcribed speech, development set of 2 hours of transcribed speech, test set of 2 hours of speech, and a baseline Kaldi ASR system with evaluation scripts. The shared task will consist of two tracks: a closed track and an open track. In the closed track, only the training data distributed as part of the shared task can be used to train the models; in the open track, any additional data can be used to train the models.

For questions about the shared task, please email

Important Dates

  • Release of training data (initial set), development data, and baseline system: February 7, 2020
  • Release of training data (additional 40 hours): February 14, 2020
  • Test data released and opening of submission site: April 17, 2020
  • Closing of submission site: April 24, 2020 (midnight anywhere on Earth, i.e., 12:00 noon UTC on April 25)
  • Announcement of results: April 27, 2020
  • Interspeech paper submission deadline: May 8, 2020
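
The closing time above depends on a time zone conversion, so a short sketch may help teams check the deadline in their own tooling. It assumes that "midnight anywhere in the world" means the end of April 24 in the UTC-12 zone (often called "Anywhere on Earth", AoE); the function and variable names are illustrative and not part of the shared task materials.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "Anywhere on Earth" (AoE) corresponds to a fixed UTC-12 offset.
AOE = timezone(timedelta(hours=-12), name="AoE")


def aoe_end_of_day_in_utc(year: int, month: int, day: int) -> datetime:
    """Return the UTC instant at which the given calendar day ends in AoE."""
    # The end of the day is 00:00 on the following day, expressed in AoE time.
    end_of_day_aoe = datetime(year, month, day, tzinfo=AOE) + timedelta(days=1)
    return end_of_day_aoe.astimezone(timezone.utc)


# Closing of the submission site: end of April 24, 2020 (AoE).
closing = aoe_end_of_day_in_utc(2020, 4, 24)
print(closing.isoformat())
```

Under this assumption the result is noon UTC on April 25, 2020, which agrees with the closing time listed above. Calling `closing.astimezone()` with no argument converts the same instant to the local time zone of the machine running the code.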

How to Participate

  • Download the training data: Follow this link to access the user license and download the training data for the shared task (including the additional 40 hours)
  • Download the test data: Follow this link to access the test data for the shared task
  • Submit results: Follow this link to access the CodaLab site for the shared task to register and submit your results
    • The CodaLab submission portal will open on April 17 and close at the end of April 24 (midnight anywhere on Earth, i.e., 12:00 noon UTC on April 25)
    • Submissions should include the ASR output produced by the system and a brief description of the system (see the documentation on the CodaLab submission portal for further instructions about formatting and the submission procedure)
    • Participating teams can submit up to one submission per day for a maximum total of 7 submissions per team
    • Results will be displayed on a leaderboard throughout the week that the submission site is open


Organizers

Daniele Falavigna, Fondazione Bruno Kessler

Marco Matassoni, Fondazione Bruno Kessler

Roberto Gretter, Fondazione Bruno Kessler

Keelan Evanini, Educational Testing Service

Ben Leong, Educational Testing Service

Motivation for the shared task

The availability of large amounts of training data and large computational resources has made Automatic Speech Recognition (ASR) technology usable in many application domains, and recent research has demonstrated that ASR systems can achieve performance levels that match human transcribers for some tasks. However, ASR systems still show deficiencies when applied to speech produced by specific types of speakers, in particular non-native speakers and children.

Several phenomena that regularly occur in non-native speech can greatly reduce ASR performance, including mispronounced words, ungrammatical utterances, disfluencies (including false starts, partial words, and filled pauses), and code-switched words. ASR for children’s speech can be challenging due to linguistic differences from adult speech at many levels (acoustic, prosodic, lexical, morphosyntactic, and pragmatic) caused by physiological differences (e.g., shorter vocal tract lengths), cognitive differences (e.g., different stages of language acquisition), and behavioral differences (e.g., whispered speech). Developing ASR systems for both of these domains is made more challenging due to the lack of publicly available databases for both non-native speech and children’s speech.

Despite these difficulties, a significant portion of the speech transcribed by ASR systems in practical applications may come from both non-native speakers (e.g., newscasts, movies, internet videos, human-machine interactions, human-human conversations in telephone call centers, etc.) and children (e.g., educational applications, smart speakers, speech-enabled gaming devices, etc.). Therefore, it is necessary to continue to improve ASR systems to be able to accurately process speech from these populations. An additional important application area is the automatic assessment of second language speaking proficiency, where the ASR difficulties can be increased by the low proficiency levels of the speakers, especially if they are children. The lack of training data is especially pronounced for this population (non-native children’s speech).

With this special session, we aim to help address these gaps and stimulate research that can advance the present state-of-the-art in ASR for non-native children’s speech. To achieve this aim, we will distribute a new data set containing non-native children’s speech and organize a challenge that will be presented in the special session. The data set consists of spoken responses collected in Italian schools from students between the ages of 9 and 16 in the context of English speaking proficiency assessments. The data that will be released includes both a test set (ca. 4 hours) and an adaptation set (ca. 9 hours), both of which were carefully transcribed by human listeners. In addition, a set of around 90 hours of untranscribed spoken responses will be distributed. A Kaldi baseline system will also be released together with the data, and a challenge web site will be developed for collecting and scoring submissions.

The following points make this session special:

  • Distribution of a unique and challenging (from the ASR perspective) set of spoken language data acquired in schools from students of different ages.
  • Organization of a challenge addressing research topics in several ASR subfields, including:
    • Language models: How to handle grammatically incorrect sentences, false starts and partial words, code-switched words, etc.
    • Lexicon: Generation of multiple pronunciations for non-native accents, training of pronunciation models, etc.
    • Acoustic models: Multilingual model training, transfer learning approaches, model adaptation for non-native children (supervised, unsupervised, lightly supervised), modeling of spontaneous speech phenomena, acoustic models for non-native children, etc.
    • Evaluation: Database acquisition and annotation of non-native speech, performance evaluation for non-native children’s speech
    • Handling low-resource training/adaptation data for less commonly studied populations (non-native speech, children’s speech)
  • Establishing benchmarks for future research.
  • Establishing a common data set for additional future annotations for applications beyond ASR (e.g., computer assisted language learning).
  • The special session will be supported by SIG-CHILD, the ISCA special interest group focusing on multimodal child-computer interaction, and will continue a series of productive events that SIG-CHILD has hosted since 2008 in the area of child-computer interaction and analysis of children’s speech (including the Interspeech 2019 special session entitled Spoken Language Processing for Children’s Speech).

