Dataset

Dataset Description: Participants will receive a subset of the RESPIN dataset, a read-speech corpus developed as part of the RESPIN initiative. For this challenge, approximately 1200 hours of data will be shared, with 150 hours for each of 8 Indian languages: Bengali, Bhojpuri, Chhattisgarhi, Kannada, Magahi, Maithili, Marathi, and Telugu. The data covers 33 dialects and a diverse range of speakers (1000-1700 per language). All training and development data will be balanced across dialects and domains. Depending on the track, participants can use a 30-hour-per-language subset (Tracks 1 & 3) or the full 150 hours per language (Tracks 2 & 4).

Training Set: Either 30 hrs/language (subset) or 150 hrs/language (full set), based on the track.
Development Set: ~2 hrs of read speech data provided for intermediate model validation. Contains only unseen speakers and sentences.
Blind Test Set: Will be released later for final evaluation. Contains balanced and unseen speakers/dialects. Note: the test set will combine utterances from all 8 languages, with no dialect or language information provided, and evaluation will be conducted on this combined set.
Hidden Test Set (Optional): For teams willing to share model weights or APIs for blind evaluation by organizers.
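
As a quick sanity check on the figures above, the following minimal sketch computes the total training hours available in each track. The track-to-hours mapping and the language list come from this page; the names `HOURS_PER_LANGUAGE` and `total_training_hours` are illustrative only and are not part of any official challenge tooling.

```python
# Training hours per language for each track, as stated on this page.
HOURS_PER_LANGUAGE = {1: 30, 2: 150, 3: 30, 4: 150}

# The 8 languages covered by the shared dataset.
LANGUAGES = [
    "Bengali", "Bhojpuri", "Chhattisgarhi", "Kannada",
    "Magahi", "Maithili", "Marathi", "Telugu",
]


def total_training_hours(track: int) -> int:
    """Return the total training hours across all languages for a track."""
    if track not in HOURS_PER_LANGUAGE:
        raise ValueError(f"Unknown track: {track}")
    return HOURS_PER_LANGUAGE[track] * len(LANGUAGES)


if __name__ == "__main__":
    for track in sorted(HOURS_PER_LANGUAGE):
        print(f"Track {track}: {total_training_hours(track)} hours")
```

Under these figures, Tracks 1 & 3 train on 240 hours in total and Tracks 2 & 4 on 1200 hours.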

Baseline recipes and model weights will be made available.

To access the dataset: https://ee.iisc.ac.in/madasrdataset/

The transcripts and relevant preprocessing code will be available at the challenge GitHub repository (coming soon).

Submit your results at this link (coming soon).

Data Split Description

There are no common speakers or sentences between train, dev, and test.
The dev set is representative of the test set.
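
Participants who build their own sub-splits (for example, an internal validation set carved out of the training data) may want to preserve the same speaker and sentence separation. The sketch below shows one way to check it. It assumes each split is available as a list of `(speaker_id, sentence)` pairs, which is a hypothetical in-memory format and not the actual RESPIN metadata layout; `find_overlaps` is likewise an illustrative helper.

```python
from itertools import combinations


def find_overlaps(splits):
    """Report speakers or sentences shared between any two splits.

    `splits` maps a split name to a list of (speaker_id, sentence) pairs.
    This format is an assumption made for illustration only.
    Returns a dict keyed by split-name pairs; empty if splits are disjoint.
    """
    overlaps = {}
    for (name_a, rows_a), (name_b, rows_b) in combinations(splits.items(), 2):
        shared_speakers = {spk for spk, _ in rows_a} & {spk for spk, _ in rows_b}
        shared_sentences = {txt for _, txt in rows_a} & {txt for _, txt in rows_b}
        if shared_speakers or shared_sentences:
            overlaps[(name_a, name_b)] = {
                "speakers": shared_speakers,
                "sentences": shared_sentences,
            }
    return overlaps


if __name__ == "__main__":
    # Toy data with made-up speaker IDs and sentences.
    toy = {
        "train": [("spk1", "sentence a"), ("spk2", "sentence b")],
        "dev": [("spk3", "sentence c")],
    }
    print(find_overlaps(toy) or "No shared speakers or sentences.")
```

An empty result means no speaker or sentence appears in more than one split.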
