Updates
(03/06/2023) Published the source code for the Restricted Translation task 2022! [code]
(10/17/2022) We presented the task overview at WAT2022. Thank you for your participation in the RT task! :) [Slide]
(06/04/2022) The system submission deadline for the Restricted Translation task has been extended (July 11 -> July 15, 2022 AoE).
(06/01/2022) Released Zh<>Ja dictionary lists with annotators' scores
The Restricted Translation Task will be held at the Workshop on Asian Translation (WAT2022), co-located with COLING 2022 on October 12-17, 2022.
System submission due on July 15, 2022 (AoE; extended from July 11, 2022)
Consistency is key to clear and accurate translations. Consistent terminology is practically important when translating scientific or business/marketing documents that include technical terms and proper nouns (see the examples below, where the tokens colored in red indicate technical terms, and boldface shows translations that are inconsistent across systems).
In this task, we will investigate translation with restricted target vocabularies. We will provide a list of target words alongside each source sentence to represent the word restrictions. Participants are required to submit translations that contain all the target words in each list.
We will evaluate submissions using two different metrics: 1) the usual translation accuracy metrics such as BLEU, and 2) a consistency score that evaluates how many word restrictions are satisfied in the submitted translations.
Input: この回路は ,入力信号位相の変化により共振周波数がシフトする帰還回路であり,2基のコイルの中央にある物体の磁気特性の変化を,高い感度と分解能で検出することができる。
Reference: This is a feedback circuit shifting resonance frequency by change of input signal phase, which can detect change of magnetic features of an object present at a center of two coils on high sensitivity and resolution.
Target vocabulary list: {magnetic features, resonance frequency, feedback circuit, resolution}
Online system A: This circuit is a feedback circuit whose resonance frequency shifts due to changes in the input signal phase, and can detect changes in the magnetic characteristics of the object in the center of the two coils with high sensitivity and resolution.
Online system B: This circuit is a feedback circuit in which the resonance frequency shifts due to changes in the input signal phase, and it is possible to detect changes in the magnetic properties of objects in the center of two coils with high sensitivity and resolution.
ASPEC (scientific papers) [1]
English → Japanese
Japanese → English
Chinese → Japanese (NEW!)
Japanese → Chinese (NEW!)
Dev set : en / ja | For Zh-Ja: zh / ja / scores | For Ja-Zh: zh / ja / scores
Devtest set : en / ja | For Zh-Ja: zh / ja / scores | For Ja-Zh: zh / ja / scores
Test set : en / ja | For Zh-Ja: zh / ja / scores | For Ja-Zh: zh / ja / scores
Each file contains vocabulary lists delimited by empty lines. Here is a sample vocabulary list from the English dev set (en). Note that the target vocabulary entries are shown in random order.
l.1 miniature integrated circuit elements
l.2 high-density information record technology
l.3 next generation semiconductors
l.4 (empty line)
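As a minimal sketch, the empty-line-delimited format above can be parsed as follows (the function name and file handling are illustrative, not part of the official tooling):

```python
def load_vocab_lists(path):
    """Parse an empty-line-delimited constraint file.

    Each non-empty line holds one constraint; a blank line closes the
    vocabulary list belonging to one source sentence.
    """
    lists, current = [], []
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line:
                current.append(line)
            else:
                lists.append(current)
                current = []
    if current:  # the file may end without a trailing blank line
        lists.append(current)
    return lists
```

For the sample above, this yields one list of three constraints per source sentence.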
We also provide two direct-assessment scores in the range [0, 100] for each Zh<>Ja dictionary item, where a score of 100 indicates a translation pair rated most highly by bilingual human annotators. We provide the Zh<>Ja dictionary lists whose average scores are >= 50.
We calculate two distinct metrics in this task.
Usual translation accuracy according to the WAT convention (including BLEU).
A consistency score: the ratio of sentences satisfying an exact match of the given constraints over the whole test corpus.
For the "exact match" evaluation, we will conduct the following process:
English: simply lowercase both hypotheses and constraints, then judge character-level sequence matching (including whitespace) for each constraint.
Japanese: judge character-level sequence matching (including whitespace) for each constraint, without preprocessing.
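The exact-match rules above can be sketched as follows (a minimal illustration with our own function names, not the official evaluation script):

```python
def satisfies_constraints(hypothesis, constraints, lang):
    # English: lowercase both sides first; Japanese: no preprocessing.
    if lang == "en":
        hypothesis = hypothesis.lower()
        constraints = [c.lower() for c in constraints]
    # Character-level sequence match (whitespace included) per constraint.
    return all(c in hypothesis for c in constraints)

def consistency_score(hypotheses, constraint_lists, lang):
    # Ratio of sentences whose constraints are all satisfied.
    satisfied = sum(
        satisfies_constraints(h, cs, lang)
        for h, cs in zip(hypotheses, constraint_lists)
    )
    return satisfied / len(hypotheses)
```

For example, a hypothesis containing "feedback circuit" and "resolution" verbatim satisfies the list {feedback circuit, resolution}, while a paraphrase such as "feedback loop" does not.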
For the final ranking, we also calculate a combined score of both metrics: BLEU computed over only the exact-match sentences:
Calculate the exact match for each translation.
If a translation does not satisfy its constraints, replace it with an empty string (this simulates that "the process failed to respond").
Calculate BLEU with the modified translations.
Note: in this scenario, the brevity penalty in BLEU no longer carries its usual meaning, but the n-gram scores remain consistent.
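The masking step of the combined score can be sketched as follows; the masked hypotheses would then be scored with the usual corpus BLEU tooling such as sacrebleu (the helper names here are ours):

```python
def satisfies_constraints(hypothesis, constraints, lang):
    # Same exact-match rule as the consistency score:
    # English lowercases both sides; Japanese uses no preprocessing.
    if lang == "en":
        hypothesis = hypothesis.lower()
        constraints = [c.lower() for c in constraints]
    return all(c in hypothesis for c in constraints)

def mask_unsatisfied(hypotheses, constraint_lists, lang):
    # Replace any hypothesis that misses a constraint with an empty
    # string ("the process failed to respond"); the masked list is
    # then scored with corpus BLEU (e.g. sacrebleu.corpus_bleu).
    return [
        h if satisfies_constraints(h, cs, lang) else ""
        for h, cs in zip(hypotheses, constraint_lists)
    ]
```

Because failed sentences become empty strings, they contribute no n-gram matches and shrink the hypothesis length, which is why the brevity penalty loses its usual interpretation.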
We also plan to run a human evaluation in which bilingual human annotators appraise the top-ranked submitted systems.
Each submission file has to be in the format used by the BLEU score calculation script (see sacrebleu). We also expect translation outputs to be re-cased and de-tokenized for both English and Japanese.
Please see the WAT'22 official submission page for system submission. After submitting your systems, please fill in the form so that we can keep track of them.
System submission due on July 15, 2022 (Anywhere on Earth; extended from July 11, 2022)
System description paper submission due on August 1, 2022 (11:59pm UTC-12)
Review feedback of system description papers: August 29, 2022
Camera-ready deadline for system description papers: September 5, 2022 (11:59pm UTC-12)
(Updated: 09/20/2022)
baseline (LeCA): ASPEC first 2M, Transformer + data aug + ptrnet
proposed (LeCA+LevT): ASPEC first 2M, Transformer + data aug + ptrnet + LevT (mix)
proposed_ensemble (LeCA+LevT ensemble): ASPEC first 2M, Transformer + data aug + ptrnet + 5-model ensemble + LevT (mix)
TMU (proposed_ensemble) 52.7
TMU (proposed) 50.5
TMU (baseline) 37.6
TMU (baseline) 4.24
TMU (proposed) 4.19
TMU (proposed_ensemble) 4.18
TMU (proposed) 76.6*
TMU (proposed_ensemble) 76.4*
TMU (baseline) 74.9
--------
(Human Reference) 76.6
* indicates that the systems cannot be statistically distinguished from HREF.
TMU (proposed_ensemble) 42.1
TMU (proposed) 39.3
TMU (baseline) 23.8
TMU (proposed_ensemble) 4.31
TMU (baseline) 4.22
TMU (proposed) 4.14
TMU (proposed_ensemble) 74.1*
TMU (proposed) 72.0
TMU (baseline) 73.3
--------
(Human Reference) 74.7
* indicates that the systems cannot be statistically distinguished from HREF.
[1] Nakazawa et al., "ASPEC: Asian Scientific Paper Excerpt Corpus", in Proc. of LREC, 2016.
[2] Cettolo et al., "Overview of the IWSLT 2017 evaluation campaign", in Proc. of IWSLT, 2017.
[3] Sakaguchi and Van Durme, "Efficient Online Scalar Annotation with Bounded Support", in Proc. of ACL, 2018.
[4] Federmann. "Appraise Evaluation Framework for Machine Translation", in Proc. of COLING, 2018. (GitHub)
Akiko Eriguchi, Microsoft, USA
Kaori Abe, Tohoku University, Japan
Yusuke Oda, Inspired Cognition, Japan
For general questions, comments, etc., please email "wat-organizer -at- googlegroups -dot- com".
We are grateful to Microsoft for their support of the human evaluation and annotation.