HAT19 - APE at scale and its Implications on MT Evaluation Biases

Markus Freitag: APE at scale and its Implications on MT Evaluation Biases

Abstract:

In this work, we train an Automatic Post-Editing (APE) model for Neural Machine Translation (NMT) and use it to reveal biases in standard MT evaluation procedures. The goal of our APE model is to correct typical errors introduced by the translation process and to convert "translationese" output into natural text. The model is trained on monolingual data that has been round-trip translated through English, which mimics the kinds of errors that NMT introduces. We apply our model to the output of existing NMT systems and demonstrate that, while human-judged quality improves in all cases, BLEU scores drop on forward-translated test sets. We verify these results on the WMT18 English-to-German, WMT14 English-to-French, and WMT16 English-to-Romanian news translation tasks. Furthermore, we selectively apply our APE model to the output of the top submissions of the most recent WMT evaluation campaigns.

Bio:

Markus Freitag is a Senior Software Engineer at Google Translate in Mountain View. His research focuses on Machine Translation and table-to-text problems. Before joining Google, he worked as a Research Staff Member at IBM in Yorktown Heights, NY. Markus received his PhD in Computer Science from RWTH Aachen University in 2015, under the supervision of Prof. Dr. Hermann Ney.