Restricted Translation Task (WAT2021)

Updates:

- We published source code in Restricted Translation 2021! [code]
- We presented the task overview at WAT2021. Thank you for your participants in RT task! :) [Slide]
- We updated Final score & human evaluation results. Please see the Results section below.
- The shared task submission deadline was extended. The new submission deadline is May 3, 2021 (11:59pm, UTC-12:00).
- We updated all the vocabulary list files on 03/12/2021. Please make sure that you use the latest version.

Task Description

Consistency is key to clear and accurate translations. It is practically important to have consistent terminology when translating documents related to science or business/marketing that include technical terms and proper nouns (See the examples below, where the tokens colored in red indicate some technical terms, and the boldface shows inconsistent translations among systems).

In this task, we will investigate translation with restricted target vocabularies. We will provide a list of target words as well as source sentences to represent word restriction. Participants are required to submit translations that contain all the target words in each list.

We will evaluate submissions by using two different metrics: 1) usual translation accuracy such as BLEU, and 2) a consistency score which evaluates how many word restrictions were satisfied in submitted translations.

Example: (from ASPEC Corpus [1])

Input: この回路は，入力信号位相の変化により共振周波数がシフトする帰還回路であり，２基のコイルの中央にある物体の磁気特性の変化を，高い感度と分解能で検出することができる。

Reference: This is a feedback circuit shifting resonance frequency by change of input signal phase, which can detect change of magnetic features of an object present at a center of two coils on high sensitivity and resolution.

Target vocabulary list: {magnetic features, resonance frequency, feedback circuit, resolution}

Online system A: This circuit is a feedback circuit whose resonance frequency shifts due to changes in the input signal phase, and can detect changes in the magnetic characteristics of the object in the center of the two coils with high sensitivity and resolution.

Online system B: This circuit is a feedback circuit in which the resonance frequency shifts due to changes in the input signal phase, and it is possible to detect changes in the magnetic properties of objects in the center of two coils with high sensitivity and resolution.

Dataset

Translation Dataset

ASPEC (Japanese-English, scientific paper) [1]

Translation Direction

English → Japanese
Japanese → English

Target Vocabulary list (Updated: 03/12/2021)

Dev set : en / ja
Devtest set : en / ja
Test set : en / ja

Each file contains a vocabulary list with an empty line delimiter. Here is a sample vocabulary list from the English dev set (en). Note that every target vocabulary is shown in random order.

l.1 miniature integrated circuit elements

l.2 high-density information record technology

l.3 next generation semiconductors
l.4 (empty line)

Evaluation

We calculate two distinct metrics in this task.

Usual translation accuracy according to the WAT convention (including BLEU).
A consistency score: the ratio of #sentences satisfying exact match of given constraints over the whole test corpus.

For the "exact match" evaluation, we will conduct the following process:

English: simply lowercase hypotheses and constraints, then judge character level sequence matching (including whitespaces) for each constraint.
Japanese: judge character level sequence matching (including whitespaces) for each constraint without preprocessing.

For the final ranking, we also calculate the combined score of both: calculating BLEU with only the exact match sentences:

Calculates the exact match.
If the translation does not satisfy the constraint, replace the translation with an empty string (this simulates that "the process failed to respond").
Calculates BLEU with modified translations.

Note: in this scenario, the brevity score in BLEU does not figure out a usual meaning, but the n-gram scores maintain their consistency.

We also plan to run a human evaluation that appraises the top-ranked submitted systems by bilingual human annotators.

Submission Format

Each submission file has to be in a format that is used in the BLEU score calculation script (See multi-bleu.perl). We also expect translation outputs to be re-cased and de-tokenized in both English and Japanese.
Submission due on April 26, 2021 May 3, 2021 (11:59pm UTC-12:00).

Results

(Updated: 06/09/2021)

We took the top-ranked systems from each participant for human evaluation. The systems are appraised by bilingual speakers, based on 1) source-based direct assessment (source-based DA) [2, 4] and 2) source-based contrastive assessment (source-based CA) [3,4]. We report the final results as follows.

English → Japanese

final score

NTT 57.2
NHK 33.9
NICTRB 28.8

human eval (source-based DA)

NTT 77.5
NHK 74.1
NICTRB 73.6
(reference 73.4)

human eval (source-based CA)

NTT 79.7
NHK 77.2
NICTRB 77.1
(reference 76.4)

Japanese → English

final score

NTT 44.1
NHK 37.5
NICTRB 31.8
TMU 22.6

human eval (source-based DA)

NTT 75.6
(reference 74.1)
NHK 73.9
NICTRB 72.1
TMU 50.2

human eval (source-based CA)

NTT 74.4
NHK 73.5
(reference 72.9)
NICTRB 71.8
TMU 48.3

Reference

[1] Nakazawa et al., "ASPEC: Asian Scientific Paper Excerpt Corpus", in Proc. of LREC, 2016.

[2] Cettolo et al., "Overview of the IWSLT 2017 evaluation campaign", in Proc. of IWSLT, 2017.

[3] Sakaguchi and Durme, "Efficient Online Scalar Annotation with Bounded Support", in Proc. of ACL, 2018.

[4] Federmann. "Appraise Evaluation Framework for Machine Translation", in Proc. of COLING, 2018. (GitHub)

Contact

For general questions, comments, etc. please email to "wat-organizer -at- googlegroups -dot- com".

Organizer

Akiko Eriguchi, Microsoft, USA
Kaori Abe, Tohoku University, Japan
Yusuke Oda, LegalForce, Japan

Acknowledgements

We are grateful to Microsoft for their giving support of human evaluation.

Page updated

Google Sites

Report abuse