Parallel Corpus Filtering Task (WAT 2023)
Previous task(s)
Task Description
In this parallel corpus filtering task, we ask participants to score a noisy parallel corpus; we then train models with a fixed setting and evaluate their accuracy. Participants are expected to improve translation accuracy only by removing training data that may hurt the model. This year, we provide a noisy Japanese-English parallel corpus and evaluate on a general-domain test set.
Difference from the previous year
We ask participants to score all sentence pairs, rather than submitting only the high-scored sentences.
We changed the test set from the science domain to the general domain.
Updates
N/A
Important Dates
System submission deadline: July 7, 2023
System description paper submission deadline: July 14, 2023
Review feedback of system description papers: July 28, 2023
Camera-ready deadline for system description papers: August 4, 2023
Workshop dates: September 4, 2023
Task Details
This task asks you to score each sentence pair in the noisy parallel corpus based on its quality.
You do not need to submit your translation model or its outputs; we only need a score for each sentence pair.
We will then train models on your highest-scored sentences with fixed hyper-parameters and report BLEU scores.
This year, we prepared JParaCrawl v3.0 as the noisy dataset.
The ultimate goal of this shared task is to create a cleaner JParaCrawl corpus.
After the shared task ends, we plan to combine all participants' scores to build a cleaner corpus.
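As an illustration of what a per-pair quality score might look like, here is a minimal sketch based on a character-length-ratio heuristic. This is not an official baseline; real submissions typically combine much stronger signals (language identification, translation model scores, sentence embeddings, etc.), and the function name is our own:

```python
# Illustrative sketch of a per-pair quality score (NOT an official baseline).
# Combines a non-empty check with a simple character-length-ratio heuristic.

def score_pair(src: str, tgt: str) -> float:
    """Return a quality score in [0, 1] for one sentence pair."""
    if not src.strip() or not tgt.strip():
        return 0.0  # an empty side gets the lowest score
    ratio = len(src) / len(tgt)
    # Penalize pairs whose length ratio is far from 1:1; min(r, 1/r)
    # maps any positive ratio into (0, 1], peaking at 1.0 for equal lengths.
    return min(ratio, 1.0 / ratio)
```

Any scoring function works for the task as long as it emits one score in [0, 1] per sentence pair.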
Noisy Parallel Corpus
JParaCrawl v3.0 (Japanese-English, web-based) [1]
Evaluation Dataset
WMT22 General task test set (Japanese<->English) [2]
Constraints
You cannot add or edit the sentences in JParaCrawl.
You can only score the noisy sentences.
You can use other language resources if they are publicly and freely available.
However, you cannot edit or add sentences to the training data as written above.
You cannot use the WMT22 test set in any way.
Model Training Settings
After the submission deadline, we will train the model with the following settings.
Language directions: English-Japanese and Japanese-English
Number of sentences: We will use the top-scored 100k, 1M, and 10M sentence pairs for training models.
Training details: We will train the models using the Docker environment and training scripts below.
https://github.com/MorinoseiMorizo/wat2022-filtering
Q&A
Q. Can we use the train set from WMT to do the filtering?
Yes, you can use the WMT training set for filtering.
Q. Can we use pretrained language or translation models for the task?
Yes, you can use both pre-trained language models and translation models.
Q. Will all the task submissions be made public for research purposes at the end of the workshop?
We want to release all the submissions to the public, but if you do not want us to release your submission, just let us know.
Submission
Please submit a file with quality scores. One score per line.
The quality score should be in the range of 0 to 1. A higher score indicates better quality.
The number of sentences should be the same as JParaCrawl v3.0 (25,740,835 lines).
Please upload your scores to online storage and provide the download link from the submission form.
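Before uploading, it may help to sanity-check the score file against the requirements above. A minimal sketch (the filename and the helper `validate_scores` are our own assumptions, not part of the task):

```python
# Sanity check for a submission file: one float per line, each in [0, 1],
# with exactly 25,740,835 lines (the size of JParaCrawl v3.0).

EXPECTED_LINES = 25_740_835

def validate_scores(path: str, expected_lines: int = EXPECTED_LINES) -> None:
    """Raise ValueError if the file at `path` is not a valid submission."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for n, line in enumerate(f, start=1):
            score = float(line)  # raises ValueError on a malformed line
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"line {n}: score {score} outside [0, 1]")
    if n != expected_lines:
        raise ValueError(f"expected {expected_lines} lines, found {n}")
```

For example, `validate_scores("scores.txt")` would pass silently on a well-formed file and raise a `ValueError` describing the first problem otherwise.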
Results
N/A
References
[1] Morishita et al., "JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus", in Proc. of LREC, 2020.
[2] Kocmi et al., "Findings of the 2022 Conference on Machine Translation (WMT22)", in Proc. of WMT, 2022.
Contact
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
Organizers
Makoto Morishita, NTT, Japan
Yusuke Oda, Tohoku University, Japan