SHARED TASK
INTRODUCTION
We are excited to introduce a new shared task for this year’s CoCo4MT workshop! Our aim is to encourage and facilitate research on corpus construction for low-resource machine translation.
Corpus creation for machine translation is typically constrained by the cost and availability of human translators. When a new dataset needs to be created for a low-resource language or a specialized domain, the annotation budget should be used efficiently and any sentences chosen for translation should be of high quality.
In this shared task, we ask participants to develop methods for identifying such examples for a target language without any existing data. Specifically, given a parallel corpus between high-resource languages, the goal is to choose a good subset of the high-resource corpus to be manually translated into the low-resource language, in order to obtain a good machine translation system. The shared task winner will be the team whose selected instances result in the best final system after training.
TASK SETUP
Data
Participants are provided access to 7-way parallel data from the JHU Bible corpus in high-resource languages, as well as in low-resource languages for the evaluation of their data selection algorithms.
High-resource train/dev/test data are provided in English, German, Korean and Indonesian, for training initial MT systems for data selection, if desired by the participants.
We additionally provide train/dev/test data in the following (simulated and real) low-resource languages: Gujarati, Burmese, and French. This data should not be used for instance selection! (That's what the high-resource-language data is for!) Instead, participants can use it to simulate translation of selected English instances and train MT systems on this subset.
The training, development and test splits for all languages can be found at https://github.com/ananyaganesh/coco4mt-shared-task.
Baselines, in the form of selected English instances, are released in the same repository.
Baselines
Our baseline methods select data using simple heuristics, e.g., sentence length and number of unique types, as well as random selection.
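The exact baseline selections are defined by the released files in the repository; as a rough sketch of what such heuristics might look like (function names and ranking details are illustrative, not the organizers' implementation):

```python
import random


def select_by_length(sentences, k):
    """Pick the k longest sentences, ranked by whitespace token count."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(sentences[i].split()),
                    reverse=True)
    return sorted(ranked[:k])


def select_by_unique_types(sentences, k):
    """Pick the k sentences with the most unique token types."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: len(set(sentences[i].split())),
                    reverse=True)
    return sorted(ranked[:k])


def select_randomly(sentences, k, seed=0):
    """Pick k sentence indices uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(len(sentences)), k))
```

Each function returns sorted sentence indices, which map directly onto the line-number-based sentence IDs used in this task.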
We have finetuned the mBART model (https://huggingface.co/docs/transformers/model_doc/mbart#overview-of-mbart) on the heuristically selected English data and the corresponding low-resource-language instances, and provide baseline performance using chrF++ scores. (To be released soon!)
Participants can train MT models on the data selected by their proposed algorithms and compare performance against that of systems trained on our baseline data, allowing them to assess their algorithms' effectiveness before submission. No MT system or system training is needed for submission; all we need from participants is the subset of data their algorithms have selected!
Evaluation
We will evaluate the quality of the chosen instances, indicated by the sentence IDs submitted by participants, using chrF++ scores, as follows:
We will build parallel datasets between English and test-language instances, picking the sentences from the JHU Bible corpus whose lines correspond to the sentence IDs. (The test languages are unknown to the participants; for initial assessments of algorithms, the dev languages should be used here!)
We will finetune an mBART model between English and each test language. The test languages will be the target languages.
The final score of a submission will be the average of the scores of the systems for all test languages.
The set of low-resource languages that we evaluate on (the test languages) will not be revealed until after the completion of the shared task.
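Official scoring will use chrF++; for actual evaluation, a standard implementation such as sacreBLEU should be used. Purely as an illustration of how a character n-gram F-score is computed, here is a simplified sentence-level chrF (character n-grams only, omitting chrF++'s additional word n-gram component and whitespace handling):

```python
from collections import Counter


def chrf(hypothesis, reference, max_order=6, beta=2.0):
    """Simplified chrF: average F-beta over character n-gram
    precision/recall for n = 1..max_order. Returns a 0-100 score."""
    scores = []
    for n in range(1, max_order + 1):
        hyp_ngrams = Counter(hypothesis[i:i + n]
                             for i in range(len(hypothesis) - n + 1))
        ref_ngrams = Counter(reference[i:i + n]
                             for i in range(len(reference) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue  # sentence shorter than n characters
        # Clipped n-gram overlap between hypothesis and reference
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        precision = overlap / sum(hyp_ngrams.values())
        recall = overlap / sum(ref_ngrams.values())
        if precision + recall == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * precision * recall
                          / (beta ** 2 * precision + recall))
    return 100 * sum(scores) / len(scores) if scores else 0.0
```

With beta=2, recall is weighted more heavily than precision, matching the standard chrF definition.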
Submission format
Participants should submit text files with the sentence IDs, corresponding to the sentences' line numbers in the training files, of the best training examples chosen by their algorithm. (For an example, please look at one of the files in the baselines folder.)
Note that only the sentence IDs selected by a method need to be submitted, although we encourage participants to additionally share their algorithms in the form of GitHub codebases.
The size of the selected subset must be <=20% of the original training set.
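A submission file, then, is just one sentence ID per line, capped at 20% of the training set. A minimal sketch of writing such a file with the budget check (the function name is illustrative; whether IDs are 0- or 1-indexed should be verified against the released baseline files):

```python
def write_submission(selected_ids, train_size, path="team_selection.txt"):
    """Write one sentence ID (training-file line number) per line,
    enforcing the <=20% selection budget."""
    budget = int(0.2 * train_size)
    if len(selected_ids) > budget:
        raise ValueError(
            f"Selected {len(selected_ids)} sentences, but the budget "
            f"is {budget} (20% of {train_size}).")
    with open(path, "w") as f:
        for sid in sorted(selected_ids):
            f.write(f"{sid}\n")
```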
Permitted models
Participants are permitted to use pretrained models as well as outside sources of data for identifying effective instances in the high-resource language corpora.
However, participants may not use manually selected instances, nor may they use data from the JHU Bible corpus or other Bible corpora that is reserved for training.
SUBMISSION
If you are interested in participating in this shared task, please
Sign up here: CoCo4MT Shared Task Registration Form.
Send all your submissions to coco4mt.organizers@gmail.com
Use “CoCo4MT Shared Task Submission: <your team name>” as the subject line.
Please also include the names and email addresses of all team members in the message.
[Optional but encouraged] Share a GitHub link to an implementation of your method along with your submission.
NB: Submissions must be .txt files containing the sentence IDs of the selected instances.
System Description Papers:
System description papers must be 4 pages long, excluding references.
Please use the official MT Summit style guide to format your paper (templates found here).
Please include a brief overview of the shared task objective, and a detailed description of your system, covering your method, any outside sources of data used, and validation techniques. For an example of a shared task system description paper, please take a look at https://aclanthology.org/2023.iwslt-1.23.pdf.
Please also include a link to a publicly available GitHub codebase containing an implementation of your method.
IMPORTANT DATES
May 19, 2023 - Release of train, dev and test data
May 25, 2023 - Release of baselines
July 16, 2023 - Deadline to submit results
August 1, 2023 - System description papers due