INTRODUCTION
Document-level translation is challenging, especially for low-resource language pairs.
We introduce this shared task to encourage broader interest in document-level translation for low-resource language pairs and to offer a testbed for diverse approaches.
TASK DESCRIPTION
This task focuses on document-level machine translation between English and 11 Indic languages.
We use the Pralekha corpus, a large-scale dataset comprising over 2 million document pairs from domains such as government communications and public broadcasts.
Participants may use pre-trained models such as Llama, Qwen, Gemma, IndicBART, mBART, and mT5, as well as any additional data (except the dev and test data).
The corpus contains English and these Indic languages:
IMPORTANT DATES
The paper submission deadline has been extended to 27 October.
BASELINE
We provide a few-shot prompting baseline: a full pipeline covering data preparation, inference with open large language models such as Llama 3.1 8B, and evaluation. It is publicly available so that participants can get started easily [github repo].
The baseline uses an LLM-based approach to translate between English and the 11 Indic languages. The model used in this implementation is Llama-3.1-8B-Instruct.
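To illustrate what few-shot prompting means here, the sketch below assembles a prompt from a handful of example translation pairs followed by the document to translate. The function name `build_few_shot_prompt`, the instruction wording, and the layout are assumptions made for illustration; the actual prompt template used by the baseline is the one in the baseline repository.

```python
from typing import List, Tuple


def build_few_shot_prompt(
    examples: List[Tuple[str, str]],
    source_doc: str,
    src_lang: str,
    tgt_lang: str,
) -> str:
    """Assemble a few-shot translation prompt (hypothetical template)."""
    parts = [f"Translate the following document from {src_lang} to {tgt_lang}."]
    # Each demonstration shows a source document and its reference translation.
    for src, tgt in examples:
        parts.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    # The final block leaves the target side empty for the model to complete.
    parts.append(f"{src_lang}: {source_doc}\n{tgt_lang}:")
    return "\n\n".join(parts)


prompt = build_few_shot_prompt(
    examples=[("Good morning.", "सुप्रभात।")],
    source_doc="The ministry issued a new notice today.",
    src_lang="English",
    tgt_lang="Hindi",
)
```

The resulting string would then be passed to the chosen model, for example Llama-3.1-8B-Instruct, through whatever inference stack a participant prefers.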
SUBMISSION
Participants are welcome to submit outputs for any translation directions between Indic languages and English.
The submission is a single zipped file containing one result file per translation direction.
Each file name indicates its translation direction, for example hin_eng.txt or hin_2_eng.json.
Each file contains 1,000 lines, with each line in the format ["translation result"]. Please refer to the baseline github repo for the exact format of each line.
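The packaging step could be sketched as follows. This is a minimal illustration that assumes each line is a JSON list holding one translated document, which is one reading of the ["translation result"] format; the helper name `write_submission`, the `.txt` extension, and the archive name are hypothetical, and the baseline repository remains the authority on the exact format.

```python
import json
import zipfile
from pathlib import Path
from typing import Dict, List


def write_submission(
    results: Dict[str, List[str]],
    out_dir: Path,
    zip_name: str = "submission.zip",  # hypothetical archive name
) -> Path:
    """Write one result file per direction into a single zip archive."""
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / zip_name
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for direction, translations in results.items():
            # JSON encoding escapes newlines inside a document, so each
            # translated document stays on exactly one line of the file.
            lines = [json.dumps([t], ensure_ascii=False) for t in translations]
            zf.writestr(f"{direction}.txt", "\n".join(lines) + "\n")
    return zip_path
```

Calling `write_submission({"hin_eng": translations}, Path("out"))` would produce `out/submission.zip` containing `hin_eng.txt`. Encoding each document as JSON keeps multi-paragraph translations from breaking the one-line-per-document layout.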
EVALUATION
Submissions are evaluated with automatic metrics such as document-level BLEU and COMET.
CONTACT
General questions
**wat-organizer -at- googlegroups -dot- com**
Task-specific questions
Haiyue Song: **haiyue.song -at- nict -dot- go -dot- jp**
Sanjay Suryanarayanan: **sanj -dot- ai -at- outlook -dot- com**