Parallel Corpus Filtering Task (WAT 2022)

Task Description

This year, we hold a parallel corpus filtering task, which asks participants to clean a noisy parallel corpus. Models are then trained on the cleaned data with a fixed setting and evaluated on their translation accuracy. Participants are required to improve translation accuracy only by removing training data that may hurt the model. We provide a noisy Japanese-English parallel corpus.


Updates

Important Dates

Task Details

This task asks you to clean the noisy parallel corpus and adapt it to the scientific paper domain.
Unlike the other translation tasks, we do not ask you to train a model; instead, you send us the cleaned parallel data.
We will then train a model on the submitted parallel data with fixed hyper-parameters and report BLEU scores.
This year, we have prepared JParaCrawl v3.0 as the noisy dataset.
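
As an illustration of what removal-only cleaning can look like, here is a minimal sketch of a rule-based filter. It is not the organizers' method or a recommended baseline. The function name `filter_pairs`, the in-memory list of (English, Japanese) tuples, and both thresholds are assumptions made for this example. The filter only drops pairs and never edits or adds sentences.

```python
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]  # (English sentence, Japanese sentence)


def filter_pairs(
    pairs: Iterable[Pair],
    max_len_ratio: float = 5.0,  # hypothetical threshold, tune on held-out data
    max_chars: int = 500,        # hypothetical threshold, tune on held-out data
) -> Iterator[Pair]:
    """Yield only the pairs that pass simple heuristic checks.

    Pairs are never modified or added, only dropped.
    """
    seen = set()
    for en, ja in pairs:
        en_s, ja_s = en.strip(), ja.strip()
        # Drop pairs with an empty side.
        if not en_s or not ja_s:
            continue
        # Drop pairs where either side is very long.
        if len(en_s) > max_chars or len(ja_s) > max_chars:
            continue
        # Drop pairs whose character lengths differ too much.
        ratio = max(len(en_s), len(ja_s)) / min(len(en_s), len(ja_s))
        if ratio > max_len_ratio:
            continue
        # Drop exact duplicates, keeping the first occurrence.
        key = (en_s, ja_s)
        if key in seen:
            continue
        seen.add(key)
        yield en, ja
```

Character-length ratios are a crude signal for English and Japanese, so a real system would likely combine such rules with model-based scores.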

Translation Dataset

Evaluation Dataset

Constraints

Model Training Settings

After the submission deadline, we will train the model with the following settings.

Language direction: English-Japanese and Japanese-English

Training Details: We will train the model using the following Docker environment and training scripts. (Last update: 2022/06/13)
https://github.com/MorinoseiMorizo/wat2022-filtering

Q&A

Q. Can we use the train set from ASPEC to do the filtering?

Yes, you can use the ASPEC training set for filtering, but you are not allowed to add the sentences in ASPEC to JParaCrawl v3.0.
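
To make the distinction concrete, here is a minimal sketch in which in-domain text is used only to score pairs from the noisy corpus, and the output contains only pairs that were already in the noisy corpus. The function names, the whitespace tokenization, and the `min_score` threshold are assumptions made for this example and are not part of the task definition.

```python
from typing import Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (English sentence, Japanese sentence)


def build_vocab(in_domain_english: Iterable[str]) -> Set[str]:
    """Collect lowercase whitespace tokens from in-domain English sentences."""
    vocab: Set[str] = set()
    for sentence in in_domain_english:
        vocab.update(sentence.lower().split())
    return vocab


def domain_score(sentence: str, vocab: Set[str]) -> float:
    """Fraction of the sentence's tokens found in the in-domain vocabulary."""
    tokens = sentence.lower().split()
    if not tokens:
        return 0.0
    return sum(token in vocab for token in tokens) / len(tokens)


def select_in_domain(
    noisy_pairs: Iterable[Pair],
    vocab: Set[str],
    min_score: float = 0.5,  # hypothetical threshold
) -> List[Pair]:
    """Keep noisy-corpus pairs whose English side looks in-domain.

    The in-domain sentences themselves are never added to the output.
    """
    return [p for p in noisy_pairs if domain_score(p[0], vocab) >= min_score]
```

Vocabulary overlap is a deliberately simple stand-in here; language-model or classifier scores could be substituted in the same place.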


Q. Can we use pre-trained language or translation models for the task?

Yes, you can use both pre-trained language models and translation models.


Q. Will all the task submissions be made public for research purposes at the end of the workshop?

We would like to release all submissions to the public, but if you do not want us to release yours, please let us know.

Submission

Please upload your cleaned data to online storage and provide the download link via the submission form.

Submission Form

Results

You can download the translation results from here.

References

[1] Morishita et al., "JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus", in Proc. of LREC, 2020.
[2] Nakazawa et al., "ASPEC: Asian Scientific Paper Excerpt Corpus", in Proc. of LREC, 2016.

Contact

For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".

Organizers