Parallel Corpus Filtering Task (WAT 2023)
Previous task(s)
Task Description
In this parallel corpus filtering task, we ask participants to score a noisy parallel corpus; we then train models with a fixed setting and evaluate their accuracy. Participants are expected to improve translation accuracy only by removing training data that may hurt the model. This year, we provide a noisy Japanese-English parallel corpus and evaluate on a general-domain test set.
Difference from the previous year
We ask participants to score all sentence pairs, rather than submitting only the high-scored sentences.
We changed the test set from the science domain to the general domain.
Updates
N/A
Important Dates
System submission deadline: July 7, 2023
System description paper submission deadline: July 14, 2023
Review feedback of system description papers: July 28, 2023
Camera-ready deadline for system description papers: August 4, 2023
Workshop dates: September 4, 2023
Task Details
This task asks you to score each sentence pair in the noisy parallel corpus based on its quality.
You do not need to submit your translation model or its outputs; we only need a score for each sentence pair.
We will then train models on your highest-scored sentences with fixed hyper-parameters and report BLEU scores.
This year, we prepared JParaCrawl v3.0 as the noisy dataset.
The ultimate goal of this shared task is to create a cleaner JParaCrawl corpus.
After the shared task ends, we plan to combine all participants' scores to build a cleaner corpus.
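As an illustration of what a per-pair quality score might look like, here is a minimal sketch based on a character-length-ratio heuristic. This is not an official baseline; real submissions typically combine much stronger signals (language identification, translation model scores, sentence embeddings, etc.), and the function name is our own:

```python
# Illustrative sketch of a per-pair quality score (NOT an official baseline).
# Combines a non-empty check with a simple character-length-ratio heuristic.

def score_pair(src: str, tgt: str) -> float:
    """Return a quality score in [0, 1] for one sentence pair."""
    if not src.strip() or not tgt.strip():
        return 0.0  # an empty side gets the lowest score
    ratio = len(src) / len(tgt)
    # Penalize pairs whose length ratio is far from 1:1; min(r, 1/r)
    # maps any positive ratio into (0, 1], peaking at 1.0 for equal lengths.
    return min(ratio, 1.0 / ratio)
```

Any scoring function works for the task as long as it emits one score in [0, 1] per sentence pair.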
Noisy Parallel Corpus
JParaCrawl v3.0 (Japanese-English, web-based) [1]
Evaluation Dataset
WMT22 General task test set (Japanese<->English) [2]
Constraints
You cannot add or edit the sentences in JParaCrawl.
You can only score the noisy sentences.
You can use other language resources if they are publicly and freely available.
However, you cannot edit or add sentences to the training data as written above.
You cannot use the WMT22 test set in any way.
Model Training Settings
After the submission deadline, we will train the model with the following settings.
Language directions: English-Japanese and Japanese-English
Number of sentences: We will use the top-scored 100k, 1M, and 10M sentence pairs for training models.
Training details: We will train the models using the Docker environment and training scripts below.
https://github.com/MorinoseiMorizo/wat2022-filtering
Q&A
Q. Can we use the train set from WMT to do the filtering?
Yes, you can use the WMT training set for filtering.
Q. Can we use pretrained language or translation models for the task?
Yes, you can use both pre-trained language models and translation models.
Q. Will all the task submissions be made public for research purposes at the end of the workshop?
We want to release all the submissions to the public, but if you do not want us to release your submission, just let us know.
Submission
Please submit a file with quality scores. One score per line.
The quality score should be in the range of 0 to 1. A higher score indicates better quality.
The number of sentences should be the same as JParaCrawl v3.0 (25,740,835 lines).
Please upload your scores to online storage and provide the download link from the submission form.
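Before uploading, it may help to sanity-check the score file against the requirements above. A minimal sketch (the filename and the helper `validate_scores` are our own assumptions, not part of the task):

```python
# Sanity check for a submission file: one float per line, each in [0, 1],
# with exactly 25,740,835 lines (the size of JParaCrawl v3.0).

EXPECTED_LINES = 25_740_835

def validate_scores(path: str, expected_lines: int = EXPECTED_LINES) -> None:
    """Raise ValueError if the file at `path` is not a valid submission."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            score = float(line)  # raises ValueError on a malformed line
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"line {n}: score {score} outside [0, 1]")
    if n != expected_lines:
        raise ValueError(f"expected {expected_lines} lines, found {n}")
```

For example, `validate_scores("scores.txt")` would pass silently on a well-formed file and raise a `ValueError` describing the first problem otherwise.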
Results
N/A
References
[1] Morishita et al., "JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus", in Proc. of LREC, 2020.
[2] Kocmi et al., "Findings of the 2022 Conference on Machine Translation (WMT22)", in Proc. of WMT, 2022.
Contact
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
Organizers
Makoto Morishita, NTT, Japan
Yusuke Oda, Tohoku University, Japan