Dataset Description: Participants will receive a subset of the RESPIN dataset, a read-speech corpus developed as part of the RESPIN initiative. For this challenge, approximately 1200 hours of data will be shared: 150 hours for each of 8 Indian languages (Bengali, Bhojpuri, Chhattisgarhi, Kannada, Magahi, Maithili, Marathi, and Telugu), covering 33 dialects and 1000 to 1700 speakers per language. All training and development data will be balanced across dialects and domains. Depending on the track, participants can use either a 30-hour-per-language subset (Tracks 1 & 3) or the full 150 hours per language (Tracks 2 & 4).
Training Set: Either 30 hrs/language (subset) or 150 hrs/language (full set), based on the track.
Development Set: ~2 hrs of read-speech data, provided for intermediate model validation. Contains only unseen speakers and sentences.
Blind Test Set: Will be released later for final evaluation. Contains balanced and unseen speakers/dialects. (Note: the test set will be a single combined set of utterances from all 8 languages, with no dialect or language labels provided. Evaluation will be conducted on this combined set.)
Hidden Test Set (Optional): For teams willing to share model weights or APIs for blind evaluation by organizers.
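As a quick sanity check on the figures above, the track-to-training-data mapping can be sketched in a few lines of Python. The names `LANGUAGES`, `TRACK_HOURS_PER_LANGUAGE`, and `total_training_hours` are illustrative assumptions and are not part of any official challenge toolkit; only the hours and track numbers come from the description.

```python
# Illustrative sketch only: these names are hypothetical, not an official API.
LANGUAGES = [
    "Bengali", "Bhojpuri", "Chhattisgarhi", "Kannada",
    "Magahi", "Maithili", "Marathi", "Telugu",
]

# Training hours per language for each track, as stated in the description:
# Tracks 1 & 3 use the 30-hour subset, Tracks 2 & 4 use the full 150 hours.
TRACK_HOURS_PER_LANGUAGE = {1: 30, 2: 150, 3: 30, 4: 150}


def total_training_hours(track: int) -> int:
    """Return the total training hours across all languages for a track."""
    if track not in TRACK_HOURS_PER_LANGUAGE:
        raise ValueError(f"Unknown track: {track}")
    return TRACK_HOURS_PER_LANGUAGE[track] * len(LANGUAGES)


if __name__ == "__main__":
    for track in sorted(TRACK_HOURS_PER_LANGUAGE):
        print(f"Track {track}: {total_training_hours(track)} hours in total")
```

With 8 languages, the full-set tracks come to 8 × 150 = 1200 hours, matching the total quoted above, and the subset tracks come to 8 × 30 = 240 hours.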
Baseline recipes and model weights will be made available.
To access the dataset: https://ee.iisc.ac.in/madasrdataset/
The transcripts and the relevant preprocessing code will be available at the challenge GitHub repository (coming soon).
Submit the results at this link (coming soon).
There are no common speakers or sentences between train, dev, and test.
The dev set is representative of the test set.
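Participants who build their own splits or subsets may want to confirm that the speaker and sentence separation still holds. The following is a minimal sketch, assuming each utterance is available as a `(speaker_id, sentence_text)` pair; this layout and the function name `find_overlaps` are assumptions for illustration and may not match the released metadata format. Since the test set is blind, in practice the check applies to train versus dev only.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

# Assumed layout (hypothetical): one (speaker_id, sentence_text) pair per utterance.
Utterance = Tuple[str, str]


def find_overlaps(
    splits: Dict[str, Iterable[Utterance]],
) -> List[Tuple[str, str, str, Set[str]]]:
    """Return (split_a, split_b, kind, shared_values) for each overlapping pair.

    `kind` is "speaker" or "sentence". An empty list means the splits share
    no speakers and no sentences.
    """
    speakers: Dict[str, Set[str]] = {}
    sentences: Dict[str, Set[str]] = {}
    for name, utterances in splits.items():
        items = list(utterances)  # materialize once so generators are safe
        speakers[name] = {speaker for speaker, _ in items}
        sentences[name] = {text.strip() for _, text in items}

    overlaps = []
    for a, b in combinations(sorted(splits), 2):
        for kind, table in (("speaker", speakers), ("sentence", sentences)):
            shared = table[a] & table[b]
            if shared:
                overlaps.append((a, b, kind, shared))
    return overlaps


if __name__ == "__main__":
    toy_splits = {
        "train": [("spk1", "sentence one"), ("spk2", "sentence two")],
        "dev": [("spk3", "sentence three")],
    }
    print(find_overlaps(toy_splits))
```

Sentences are compared as exact strings after stripping surrounding whitespace; any text normalization beyond that is left out of this sketch.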