Hindi-Tamil-English ASR Challenge


Speech Lab, IIT Madras announces an Automatic Speech Recognition (ASR) Challenge in three Indian languages: Hindi, Tamil and Indian-English. This is the third in a planned series of ASR challenges. In this installment, approximately 490 hours of transcribed speech data in the three languages will be made open source. This data subsumes the data released in the previous challenges. The details of the first and the second challenges can be found here and here. These challenges are part of the National Language Translation Mission funded by MeitY and aim to encourage the advancement of ASR in Indian languages. We plan to hold a series of challenges of increasing difficulty in different Indian languages, and to release appropriate data with each challenge. In the first two challenges, we released everything, including source code, so that start-ups, universities and research labs without previous experience in ASR could also participate and become familiar with it.

Challenge Overview

Recent advancements in speech technology have shown that ASR systems can work on par with humans. Building a good ASR system requires large amounts of training data and high-end computational resources.

However, when it comes to Indian languages, not everyone, especially academic institutions and startups, has access to these resources. As part of this challenge, we will release speech data in Hindi, Tamil and Indian-English. Everyone who participates in this challenge will then be free to use this data for research purposes.

Data Set Details

The data set comprises Hindi, Tamil and Indian-English read and conversational speech data along with the corresponding transcriptions. This speech data was collected by Speech Lab IITM and several startups. We will release approximately 490 hours of speech data in this challenge round. The details of the data sets released for this challenge are as follows:

HINDI: 188.1 hours

  • Train set - 178.4 hours

  • Development set - 4.8 hours

  • Evaluation set - 4.9 hours

TAMIL: 112.2 hours

  • Train set - 104.5 hours

  • Development set - 3.9 hours

  • Evaluation set - 3.8 hours

INDIAN ENGLISH: 190.3 hours

  • Train set - 179.5 hours

  • Development set - 5.4 hours

  • Evaluation set - 5.4 hours

A lexicon has also been made available. The lexicon was generated using the Unified-parser (Hindi and Tamil) and the CMU Lexicon tool (Indian-English). The Hindi and English data released in this challenge include the Hindi data released in the first challenge and the "IITM" English data released in the second challenge, respectively. In total, approximately 490 hours + 200 hours (NPTEL data from the second challenge) = 690 hours of transcribed speech data have been released through these three challenges.
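The hour counts above can be cross-checked with a short script. This is only an illustrative sketch: the per-split figures are copied from the lists above, the variable names are our own, and the 200-hour NPTEL figure is the approximate value quoted in the text.

```python
# Hours per split, copied from the data set lists in this announcement.
hours = {
    "Hindi": {"train": 178.4, "dev": 4.8, "eval": 4.9},
    "Tamil": {"train": 104.5, "dev": 3.9, "eval": 3.8},
    "Indian English": {"train": 179.5, "dev": 5.4, "eval": 5.4},
}

# Per-language totals, rounded to one decimal to hide float noise.
totals = {lang: round(sum(splits.values()), 1) for lang, splits in hours.items()}

# Total released in this challenge round.
challenge_total = round(sum(totals.values()), 1)

# Approximate size of the NPTEL English data from the second challenge.
NPTEL_HOURS = 200
overall_total = round(challenge_total + NPTEL_HOURS, 1)

for lang, total in totals.items():
    print(f"{lang}: {total} hours")
print(f"This challenge: {challenge_total} hours")
print(f"All three challenges (approx.): {overall_total} hours")
```

The per-language sums come to 188.1, 112.2 and 190.3 hours, giving 490.6 hours for this round, which matches the "approximately 490 hours" quoted above.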

How to Participate

  • Enroll by registering at this link: Register Now!

  • Registering at the above link provides access to the user license and lets you download the training and test data for the challenge.

  • The download link should be mailed within 24 hours of registering. Please reach out to us if you have not heard from us after 24 hours.

Challenge

  • The participants are expected to submit their results on the evaluation data.

  • The evaluation data will be made available only when the submission portal is opened, i.e., on the 14th of July 2021.

  • The links to download evaluation sets will be mailed to all the registered participants.

  • The challenge will have two streams:

    • Closed Challenge: Only the training data distributed as part of the challenge can be used to train the models (both acoustic and language models). Please do not use the dev set data for training.

    • Open Challenge: You can use any external/additional data to train the acoustic and language models. Please note that the "Hindi data" from our first challenge and the "IITM English data" from our second challenge cannot be used in this open challenge. However, the "200 hour NPTEL English" data from the second challenge can be used.

  • Participants can choose to submit their results to both streams or to either one of them.

Submit results: Use the submission portal to submit your results.

  • The submission portal will open on the 14th of July 2021 and close on the 21st of July 2021 (midnight anywhere in the world, i.e., 12pm UTC on the 21st of July 2021).

  • Submissions should include the ASR output produced by the system and a brief description of the system.

  • The format of the decode files to be submitted will be shared soon.

  • Each participating team can make a maximum of 10 submissions.

  • Results will be displayed on a leaderboard throughout the period that the submission site is open.

Important Dates

  • Release of training data, development data, and lexicon: May 13, 2021

  • Evaluation data release and opening of submission site: July 14th, 2021 (revised from July 7th, 2021)

  • Closing of submission site: July 21st, 2021 (revised from July 14th, 2021; midnight anywhere in the world, i.e., 12pm UTC on July 21st, 2021)

  • Announcement of results: July 22nd, 2021
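To avoid missing the closing time, participants may want to convert the stated deadline to their own time zone. The sketch below is a minimal illustration, assuming that "12pm UTC" means 12:00 noon UTC and that "anywhere in the world" refers to the UTC-12 offset; the zone labels and variable names are our own, and IST is shown only as an example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "12pm UTC on July 21st, 2021" means 12:00 noon UTC.
deadline_utc = datetime(2021, 7, 21, 12, 0, tzinfo=timezone.utc)

# Fixed-offset zones; the labels are illustrative only.
aoe = timezone(timedelta(hours=-12), "AoE")             # "anywhere in the world"
ist = timezone(timedelta(hours=5, minutes=30), "IST")   # Indian Standard Time

# Convert the same instant into each zone.
deadline_aoe = deadline_utc.astimezone(aoe)
deadline_ist = deadline_utc.astimezone(ist)

print("UTC:", deadline_utc.isoformat())
print("AoE:", deadline_aoe.isoformat())
print("IST:", deadline_ist.isoformat())
```

Under these assumptions, the closing time corresponds to 00:00 on July 21st at UTC-12 and to 17:30 on July 21st in IST.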

About Speech Lab IITM

Speech Lab, IIT Madras is headed by Prof. S. Umesh and is part of the Department of Electrical Engineering. Our focus is on building state-of-the-art speech recognition systems, especially in Indian languages. Our research interests are in low-resource modelling, multilingual speech recognition and speaker normalisation.