CFILT, IIT Bombay, presents the English-Marathi Parallel Corpus Creation for Machine Translation (MPaCT) challenge. This challenge is part of the National Language Translation Mission funded by MeitY. It aims to help and encourage the advancement of Machine Translation technology in Indian languages.
Machine Translation (MT) is arguably one of the most widely used language technologies today, thanks to the popularity of the internet and the globalised economy. Over the last two decades, MT technology has taken significant strides forward due to the adoption and advancement of data-driven approaches to natural language processing. Data is now the key driver of progress in MT. Large volumes of training data, also called parallel corpora, are needed to train the Machine Learning models used in MT. Unfortunately, large parallel corpora are not available for many Indian languages, and this has been the major barrier to progress in MT technology for those languages. To address the data gap for the Marathi language, CFILT, IIT Bombay is setting up the English-Marathi parallel corpus creation challenge and is opening it up to participants from industry and academic institutions. As part of this challenge, a two-pronged approach to creating a high-quality English-Marathi parallel corpus will be taken:
Translation: We will provide text documents in English, and participants are required to produce high-quality Marathi translations.
Community contribution: Participants are encouraged to contribute parallel data from any domain with the goal of collectively building a large multi-domain English-Marathi parallel corpus.
A selection committee set up by CFILT, IIT Bombay will evaluate submissions on the quality and throughput of the translations using well-established evaluation metrics, and participants will be ranked accordingly.
After the successful conclusion of the challenge, a subset of top-ranking participants will be commissioned by CFILT, IIT Bombay to build a large English-Marathi parallel corpus for its MT system. They will be compensated for their services as per the norms of the Government of India. Terms and conditions apply.
Parallel data voluntarily contributed by participants during the challenge will be made available to all contributing participants. Terms and conditions apply.
The data set to be provided to the participants of the IMPaCT challenge comprises text documents in English on various subjects, drawn primarily from the education domain.
English documents: ~10,000 words
Registered participants will be directly sent the link to the data set after verification.
We are thankful to IIT Madras and Prof. Prathap Haridoss for making available the data set for use in the challenge.
Enroll yourself by registering on this link: Register Now!!
Only registered participants will get access to the data
Use the submission portal to upload your submission.
The submission portal will open on August 20, 2020 and close at midnight on August 28, 2020
See the documentation for further instructions about formatting and the submission procedure
Last date for registration: August 12, 2020
Meeting with registered participants: August 16, 2020
Questions submitted by registered participants: questions
Minutes of the meeting: minutes
Release of data set: August 20, 2020
Opening of submission portal: August 20, 2020
Last date for submission: August 28, 2020
Announcement of results: Results have been communicated to the participants.
Read the terms and conditions for participating in the challenge here.