Consistency is key to clear and accurate translations. Consistent terminology is especially important when translating scientific or business/marketing documents that contain technical terms and proper nouns (see the examples below, where the two online systems render the same technical term inconsistently: "magnetic characteristics" vs. "magnetic properties").
In this task, we will investigate translation with restricted target vocabularies. For each source sentence, we provide a list of target words that must appear in the translation. Participants are required to submit translations that contain all the target words in the corresponding list.
We will evaluate submissions with two different metrics: 1) standard translation accuracy such as BLEU, and 2) a consistency score that measures how many of the word restrictions are satisfied in the submitted translations.
Input: この回路は,入力信号位相の変化により共振周波数がシフトする帰還回路であり,2基のコイルの中央にある物体の磁気特性の変化を,高い感度と分解能で検出することができる。
Reference: This is a feedback circuit shifting resonance frequency by change of input signal phase, which can detect change of magnetic features of an object present at a center of two coils on high sensitivity and resolution.
Target vocabulary list: {magnetic features, resonance frequency, feedback circuit, resolution}
Online system A: This circuit is a feedback circuit whose resonance frequency shifts due to changes in the input signal phase, and can detect changes in the magnetic characteristics of the object in the center of the two coils with high sensitivity and resolution.
Online system B: This circuit is a feedback circuit in which the resonance frequency shifts due to changes in the input signal phase, and it is possible to detect changes in the magnetic properties of objects in the center of two coils with high sensitivity and resolution.
Dataset: ASPEC (Japanese-English, scientific papers) [1]
Translation directions:
English → Japanese
Japanese → English
Each file contains one vocabulary list per source sentence, with lists separated by an empty line. Here is a sample vocabulary list from the English dev set (en); the "l.N" prefixes denote line numbers and are not part of the file. Note that the target words in each list appear in random order. A minimal parsing sketch follows the sample.
l.1 miniature integrated circuit elements
l.2 high-density information record technology
l.3 next generation semiconductors
l.4 (empty line)
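As an illustration, here is a minimal Python sketch for parsing such a file into per-sentence constraint lists; the function name and file name are our own, not part of the official tooling:

    def load_vocab_lists(path):
        """Parse a constraint file: one term per line, lists separated by empty lines."""
        lists, current = [], []
        with open(path, encoding="utf-8") as f:
            for raw in f:
                line = raw.rstrip("\n")
                if line:                 # a target term for the current sentence
                    current.append(line)
                else:                    # empty line closes the current list
                    lists.append(current)
                    current = []
        if current:                      # in case the file lacks a trailing empty line
            lists.append(current)
        return lists

    # e.g. load_vocab_lists("dev-vocab.en")[0]
    # -> ["miniature integrated circuit elements", ...]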
We calculate two distinct metrics in this task.
Standard translation accuracy following the WAT convention (including BLEU).
A consistency score: the ratio of sentences whose translations exactly match all of the given constraints to the total number of sentences in the test corpus, as formalized below.
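In other words, over a test corpus of N sentences with hypotheses h_i and constraint lists C_i, the consistency score can be written as

    \text{consistency} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\big[\, \forall c \in C_i :\ c \text{ occurs in } h_i \,\big]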
For the "exact match" evaluation, we will conduct the following process:
English: simply lowercase hypotheses and constraints, then judge character level sequence matching (including whitespaces) for each constraint.
Japanese: judge character level sequence matching (including whitespaces) for each constraint without preprocessing.
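As a concrete illustration, a minimal Python sketch of this check and of the consistency score; the function names are our own:

    def satisfies_constraints(hypothesis: str, constraints: list[str], lang: str) -> bool:
        """True iff every constraint occurs verbatim in the hypothesis.

        English: lowercase both sides first; Japanese: match without preprocessing.
        Matching is plain character-level substring search, whitespace included.
        """
        if lang == "en":
            hypothesis = hypothesis.lower()
            constraints = [c.lower() for c in constraints]
        return all(c in hypothesis for c in constraints)

    def consistency_score(hypotheses, constraint_lists, lang):
        """Ratio of sentences whose translations satisfy all of their constraints."""
        ok = sum(satisfies_constraints(h, cs, lang)
                 for h, cs in zip(hypotheses, constraint_lists))
        return ok / len(hypotheses)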
For the final ranking, we also calculate a combined score of both metrics, computing BLEU over only the exact-match sentences (see the sketch after the note below):
Check the exact match for each translation.
If a translation does not satisfy all of its constraints, replace it with an empty string (this simulates that "the process failed to respond").
Calculate BLEU over the modified translations.
Note: in this scenario, the brevity penalty in BLEU loses its usual meaning, but the n-gram precision scores remain consistent.
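A minimal sketch of this combined score, reusing satisfies_constraints from the sketch above and assuming the sacrebleu package for BLEU (the official evaluation may use a different BLEU implementation):

    import sacrebleu  # assumed BLEU implementation; the official script may differ

    def combined_score(hypotheses, references, constraint_lists, lang):
        """BLEU after emptying every translation that violates its constraints."""
        modified = [h if satisfies_constraints(h, cs, lang) else ""
                    for h, cs in zip(hypotheses, constraint_lists)]
        # Empty hypotheses contribute no n-gram matches, so only exact-match
        # sentences can raise the score; the brevity penalty is distorted.
        return sacrebleu.corpus_bleu(modified, [references]).score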
We also plan to run a human evaluation in which bilingual annotators appraise the top-ranked submitted systems.
Each submission file has to follow the format used by the BLEU score calculation script (see multi-bleu.perl): one sentence per line. We also expect translation outputs to be re-cased and de-tokenized in both English and Japanese.
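For illustration only, one way to produce such a file in Python; sacremoses and all names here are our assumptions, not part of the official pipeline:

    from sacremoses import MosesDetokenizer  # assumed English detokenizer

    # Hypothetical tokenized system outputs, one token list per test sentence.
    tokenized_hypotheses = [
        ["This", "circuit", "is", "a", "feedback", "circuit", "."],
    ]

    md = MosesDetokenizer(lang="en")
    with open("submission.txt", "w", encoding="utf-8") as out:
        for tokens in tokenized_hypotheses:
            # One re-cased, detokenized sentence per line, as multi-bleu.perl expects.
            out.write(md.detokenize(tokens) + "\n")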
Submission due: May 3, 2021 (11:59pm UTC-12:00), extended from April 26, 2021.
Final results (updated: 06/09/2021)

English → Japanese

Combined score:
    NTT          57.2
    NHK          33.9
    NICTRB       28.8

Human evaluation (1):
    NTT          77.5
    NHK          74.1
    NICTRB       73.6
    (reference)  73.4

Human evaluation (2):
    NTT          79.7
    NHK          77.2
    NICTRB       77.1
    (reference)  76.4

Japanese → English

Combined score:
    NTT          44.1
    NHK          37.5
    NICTRB       31.8
    TMU          22.6

Human evaluation (1):
    NTT          75.6
    (reference)  74.1
    NHK          73.9
    NICTRB       72.1
    TMU          50.2

Human evaluation (2):
    NTT          74.4
    NHK          73.5
    (reference)  72.9
    NICTRB       71.8
    TMU          48.3
[1] Nakazawa et al., "ASPEC: Asian Scientific Paper Excerpt Corpus", in Proc. of LREC, 2016.
[2] Cettolo et al., "Overview of the IWSLT 2017 evaluation campaign", in Proc. of IWSLT, 2017.
[3] Sakaguchi and Van Durme, "Efficient Online Scalar Annotation with Bounded Support", in Proc. of ACL, 2018.
[4] Federmann, "Appraise Evaluation Framework for Machine Translation", in Proc. of COLING, 2018. (GitHub)
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
Akiko Eriguchi, Microsoft, USA
Kaori Abe, Tohoku University, Japan
Yusuke Oda, LegalForce, Japan
We are grateful to Microsoft for their support of the human evaluation.