Parallel Corpus Filtering Task (WAT 2023)

Task Description

In this parallel corpus filtering task, participants score a noisy parallel corpus; we then train models with a fixed setting and evaluate their accuracy. Participants are asked to improve translation accuracy solely by removing training data that may hurt the model. This year, we provide a noisy Japanese-English parallel corpus and evaluate on a general-domain test set.

Difference from the previous year


Updates

Important Dates

Task Details

This task asks you to score each sentence pair in the noisy parallel corpus based on its quality.
You do not need to submit your translation model or its outputs; we only need a score for each sentence pair.
We will then train models on your highest-scored sentences with fixed hyper-parameters and report BLEU scores.
This year, we use JParaCrawl v3.0 as the noisy dataset.
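To make the expected scoring concrete, here is a minimal illustrative scorer. It is our own sketch, not an official baseline: it assigns each sentence pair a score in [0, 1] using only simple length-ratio and emptiness heuristics. Competitive submissions would typically use much stronger signals, such as cross-lingual sentence embeddings or translation-model scores.

```python
def score_pair(src: str, tgt: str) -> float:
    """Return a heuristic quality score in [0, 1] for one sentence pair."""
    src, tgt = src.strip(), tgt.strip()
    if not src or not tgt:
        return 0.0  # an empty side is clearly unusable
    ratio = len(src) / len(tgt)
    if ratio > 1:
        ratio = 1 / ratio  # symmetric length ratio in (0, 1]
    return ratio  # near 1 when the two sides have comparable lengths

pairs = [
    ("Hello, world!", "こんにちは、世界！"),
    ("A very long English sentence that keeps going on and on.", "短い"),
    ("", "空ではない"),
]
for en, ja in pairs:
    print(f"{score_pair(en, ja):.3f}")
```

A real scorer would output one such score per corpus line, in the corpus order.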

The ultimate goal of this shared task is to create a cleaner JParaCrawl corpus.
After this shared task ends, we plan to combine the scores from all participants to build a cleaner corpus.

Noisy Parallel Corpus

Evaluation Dataset

Constraints

Model Training Settings

After the submission deadline, we will train the model with the following settings.

Language direction: English-Japanese and Japanese-English

Number of sentences: We use the top-scored 100k, 1M, and 10M sentence pairs for training models.

Training Details: We will train the models using the Docker environment and training scripts at:
https://github.com/MorinoseiMorizo/wat2022-filtering
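The top-N selection described above can be sketched as follows. This is an assumed illustration of the selection step, not the organizers' actual pipeline: pairs are ranked by the submitted score, descending, and the first N are kept for training.

```python
def select_top_n(pairs, scores, n):
    """Return the n sentence pairs with the highest scores."""
    ranked = sorted(zip(scores, pairs), key=lambda x: x[0], reverse=True)
    return [pair for _, pair in ranked[:n]]

corpus = [("src1", "tgt1"), ("src2", "tgt2"), ("src3", "tgt3")]
scores = [0.2, 0.9, 0.5]
print(select_top_n(corpus, scores, 2))  # → [('src2', 'tgt2'), ('src3', 'tgt3')]
```

In the actual task, n would be 100k, 1M, or 10M, applied to the full 25.7M-pair corpus.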

Q&A

Q. Can we use the train set from WMT to do the filtering?

Yes, you can use the WMT training set for filtering.


Q. Can we use pretrained language or translation models for the task?

Yes, you can use both pre-trained language models and translation models.


Q. Will all the task submissions be made public for research purposes at the end of the workshop?

We plan to release all submissions to the public, but if you do not want us to release yours, please let us know.

Submission

Please submit a file with quality scores, one score per line.
Each score should be in the range 0 to 1; a higher score indicates better quality.
The number of lines must match JParaCrawl v3.0 (25,740,835 lines).
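Before uploading, it may help to sanity-check the file against the format above. The snippet below is our suggestion, not an official validation tool: it verifies the expected line count and that every line parses as a float in [0, 1].

```python
EXPECTED_LINES = 25_740_835  # number of sentence pairs in JParaCrawl v3.0

def validate_scores(path: str, expected_lines: int = EXPECTED_LINES) -> None:
    """Raise ValueError if the submission file is malformed."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            score = float(line)  # raises ValueError on non-numeric lines
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"line {n}: score {score} outside [0, 1]")
    if n != expected_lines:
        raise ValueError(f"expected {expected_lines} lines, found {n}")
```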

Please upload your scores to online storage and provide the download link from the submission form.

Submission Form

Results

N/A

Reference

[1] Morishita et al., "JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus", in Proc. of LREC, 2020.
[2] Kocmi et al., "Findings of the 2022 Conference on Machine Translation (WMT22)", in Proc. of WMT, 2022.

Contact

For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

Organizers