Parallel Corpus Filtering Task (WAT 2022)
Task Description
This year, we hold a parallel corpus filtering task, which asks participants to clean a noisy parallel corpus. We then train models on the cleaned data with a fixed setting and evaluate their translation accuracy. Participants are required to improve translation accuracy solely by removing training data that may hurt the model. We provide a noisy Japanese-English parallel corpus for this purpose.
Updates
2022/07/25 Updated results.
2022/07/11 Deadline extended.
2022/06/10 Added constraints, updated docker environment.
2022/06/15 Added Q&A.
Important Dates
System submission due on July 18, 2022 (deadline extended from July 11)
System description paper submission due on August 1, 2022
Review feedback of system description papers: August 29, 2022
Camera-ready deadline for system description papers: September 5, 2022
Task Details
This task asks you to clean the noisy parallel corpus and adapt it to the scientific paper domain.
Unlike the other translation tasks, we do not ask you to train a model; instead, you submit the cleaned parallel data.
We will then train the model on the submitted parallel data with fixed hyper-parameters and report BLEU scores.
This year, we prepared JParaCrawl v3.0 as the noisy dataset.
Translation Dataset
JParaCrawl v3.0 (Japanese-English, web-based) [1]
Evaluation Dataset
ASPEC (Japanese-English, scientific paper) [2]
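To make the cleaning step concrete, here is a minimal sketch of a rule-based filter that only removes sentence pairs and never modifies the ones it keeps. It is not the organizers' method or a recommended baseline: the function names, the (English, Japanese) tuple layout, and the thresholds are all assumptions chosen for illustration.

```python
from typing import Iterable, Iterator, Tuple

# Assumed in-memory layout: (English sentence, Japanese sentence).
Pair = Tuple[str, str]


def keep_pair(en: str, ja: str, max_ratio: float = 4.0, max_chars: int = 1000) -> bool:
    """Hypothetical heuristic: return True if the pair looks plausible.

    The thresholds are illustrative assumptions, not tuned values.
    """
    en_s, ja_s = en.strip(), ja.strip()
    if not en_s or not ja_s:
        return False  # one side is empty
    if en_s == ja_s:
        return False  # likely an untranslated copy
    if len(en_s) > max_chars or len(ja_s) > max_chars:
        return False  # overly long segment
    ratio = len(en_s) / len(ja_s)  # character-length ratio
    return 1.0 / max_ratio <= ratio <= max_ratio


def filter_corpus(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Yield the surviving pairs exactly as they were given."""
    for en, ja in pairs:
        if keep_pair(en, ja):
            yield (en, ja)  # emitted unchanged; stripping was only for the check
```

A real submission would likely combine such rules with model-based scores and some notion of closeness to the scientific paper domain; the point of the sketch is only that filtering selects a subset of the original pairs.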
Constraints
You cannot add or edit the sentences in JParaCrawl.
You can only remove the noisy sentences.
You can use other language resources if they are publicly and freely available.
However, you cannot edit or add sentences to the training data as written above.
You cannot use the ASPEC test set in any way.
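Before submitting, the remove-only constraint above can be checked mechanically. The following is a minimal sketch under the assumption that both corpora fit in memory as (English, Japanese) tuples; the function name `find_violations` is hypothetical, and this page does not describe how the organizers themselves validate submissions.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed in-memory layout: (English sentence, Japanese sentence).
Pair = Tuple[str, str]


def find_violations(original: Iterable[Pair], submitted: Iterable[Pair]) -> List[Pair]:
    """Return submitted pairs that are not accounted for in the original corpus.

    Pairs are compared exactly, so any edit (even added whitespace) counts
    as a violation, and a pair cannot appear more often than in the original.
    """
    remaining = Counter(original)  # how many copies of each pair are available
    violations: List[Pair] = []
    for pair in submitted:
        if remaining[pair] > 0:
            remaining[pair] -= 1  # consume one available copy
        else:
            violations.append(pair)  # edited, added, or over-duplicated pair
    return violations
```

An empty result means the submitted data is a sub-multiset of the original corpus, which is what removing sentences without adding or editing any would produce.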
Model Training Settings
After the submission deadline, we will train the model with the following settings.
Language direction: English-Japanese and Japanese-English
Training Details: We will train the model using the following Docker environment and training scripts. (Last update 2022/06/13)
https://github.com/MorinoseiMorizo/wat2022-filtering
Q&A
Q. Can we use the train set from ASPEC to do the filtering?
Yes, you can use the ASPEC training set for filtering, but you are not allowed to add the sentences in ASPEC to JParaCrawl v3.0.
Q. Can we use pretrained language or translation models for the task?
Yes, you can use both pre-trained language models and translation models.
Q. Will all the task submissions be made public for research purposes at the end of the workshop?
We want to release all the submissions to the public, but if you do not want us to release your submission, just let us know.
Submission
Please upload your cleaned data to online storage and provide the download link via the submission form.
Results
You can download the translation results from here.
Reference
[1] Morishita et al., "JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus", in Proc. of LREC, 2020.
[2] Nakazawa et al., "ASPEC: Asian Scientific Paper Excerpt Corpus", in Proc. of LREC, 2016.
Contact
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
Organizers
Makoto Morishita, NTT, Japan
Yusuke Oda, Tohoku University, Japan