English ↔ Indic Document Translation Task (WAT 2025)

[WAT 2025 HOME]

INTRODUCTION

Document-level translation is challenging, especially for low-resource language pairs.

We introduce this shared task to encourage broader interest in document-level translation for low-resource language pairs and to offer a testbed for diverse approaches.

TASK DESCRIPTION

This task focuses on document-level machine translation between English and 11 Indic languages.

We use the Pralekha corpus, a large-scale dataset comprising over 2 million document pairs from domains such as government communications and public broadcasts.

Participants may use pre-trained models such as Llama, Qwen, Gemma, IndicBART, mBART, and mT5, as well as any additional data (except the dev and test data).

Dataset

The corpus is available [here].

Here is the statistical data of the corpus.

The corpus contains English and these Indic languages:

IMPORTANT DATES

The paper submission deadline is extended to 27 October.

BASELINE

We provide a few-shot prompting baseline. It is a full pipeline covering data preparation, inference with open large language models such as Llama 3.1 8B, and evaluation. It is publicly available so that participants can get started easily [github repo].

The baseline uses a Large Language Model-based approach to perform translation between English and 11 Indic languages. The LLM used in this implementation is Llama-3.1-8B-Instruct.
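To illustrate what few-shot prompting means here, the following is a minimal sketch of assembling a translation prompt from a handful of example document pairs. The function name `build_few_shot_prompt`, the instruction wording, and the layout of the prompt are assumptions made for illustration only; the template actually used by the baseline is the one in the GitHub repo.

```python
def build_few_shot_prompt(examples, source_doc, src_lang, tgt_lang):
    """Assemble a plain-text few-shot translation prompt (hypothetical layout).

    examples: list of (source_document, target_document) pairs used as shots.
    source_doc: the document to be translated.
    """
    parts = [f"Translate the following document from {src_lang} to {tgt_lang}."]
    for src, tgt in examples:
        # Each shot shows a source document followed by its reference translation.
        parts.append(f"{src_lang}: {src}\n{tgt_lang}: {tgt}")
    # The final block leaves the target side empty for the model to complete.
    parts.append(f"{src_lang}: {source_doc}\n{tgt_lang}:")
    return "\n\n".join(parts)
```

The resulting string would then be passed to the chosen model (for example Llama-3.1-8B-Instruct) through whatever inference library the participant prefers; that step is omitted here because it depends on the setup.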

SUBMISSION

Participants are welcome to submit outputs for any translation direction between the Indic languages and English.

Please submit the results to **haiyue.song -at- nict -dot- go -dot- jp** in the format below:

Format

The submission should be a single zipped file containing one result file per translation direction.

Each file is named after its translation direction, for example hin_2_eng.jsonl.

Each file contains 1,000 lines, with each line in the format ["translation result"]. Please refer to the baseline GitHub repo for the exact format of each line [Here].
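As a rough sketch of how such a submission could be packaged, the snippet below writes one .jsonl file per direction and zips the files. It assumes that each line is a JSON array holding a single string, as the ["translation result"] pattern suggests; the helper names `write_direction_file` and `zip_submission` are hypothetical, and the baseline GitHub repo remains the authoritative reference for the line format.

```python
import json
import zipfile
from pathlib import Path


def write_direction_file(translations, src, tgt, out_dir):
    """Write one translated document per line to <src>_2_<tgt>.jsonl."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{src}_2_{tgt}.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for text in translations:
            # json.dumps escapes embedded newlines, so each document stays on
            # one line; ensure_ascii=False keeps Indic scripts human-readable.
            f.write(json.dumps([text], ensure_ascii=False) + "\n")
    return path


def zip_submission(files, zip_path):
    """Bundle the per-direction files into a single zip archive."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in files:
            # Store only the file name so the archive has a flat layout.
            zf.write(p, arcname=Path(p).name)
    return Path(zip_path)
```

A call such as `write_direction_file(outputs, "hin", "eng", "submission")` would produce submission/hin_2_eng.jsonl. Before submitting, it is worth checking that each file has the expected 1,000 lines.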

EVALUATION

Submissions are evaluated with automatic metrics such as document-level BLEU and COMET.

CONTACT

General questions

**wat-organizer -at- googlegroups -dot- com**

Task-specific questions

Haiyue Song: **haiyue.song -at- nict -dot- go -dot- jp**

Sanjay Suryanarayanan: **sanj -dot- ai -at- outlook -dot- com**
