Are Automatically Translated Data Good Enough for Multilingual NLP?

How important is the quality of translated data for training models in other languages?


Abstract


With the growing interest in NLP, the number of available corpora and datasets has also increased greatly, but new resources focus mainly on English and other high-resource languages. Machine-translating annotated data into lesser-resourced languages can compensate for this gap, but translation errors can degrade the quality of the resulting models. In this project, you will explore how automatic estimation of translation quality can improve the performance of models trained on translated data.


Description


Despite the fast-growing interest in NLP, most annotated resources are available only in English and other high-resource languages, perpetuating the marginalization of some communities and accentuating inequalities in access to language technologies worldwide.

While in recent years there has been growing interest in multilingual NLP and in cross-lingual transfer of knowledge in NLP models, these approaches are still in their infancy. A common technique for creating new resources in other languages is translating the ones available in English. Since human translation costs are prohibitive at this scale, machine translation (MT) is usually used instead. While this has proven effective, it comes at the cost of incorporating the MT system's biases into the data and, most importantly, makes the effectiveness of the process highly dependent on translation quality, as shown by Bonifacio et al. (2021). While the quality of machine translations is usually evaluated by comparing a translated sentence to its gold reference, quality estimation (QE) has gained traction as an automatic, reference-free way of assessing the quality of MT outputs, recently showing results comparable to reference-based metrics.
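As a concrete illustration, the sketch below shows how a reference-free QE score could be obtained with the Unbabel COMET library (Rei et al., 2020; 2021). The checkpoint name and the example sentence pairs are assumptions, not part of the project materials; consult the COMET documentation for the QE models actually available.

```python
# Minimal sketch of reference-free quality estimation with the Unbabel COMET
# library (pip install unbabel-comet). The checkpoint name below is an assumed
# example; substitute any reference-free QE model listed in the COMET docs.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt20-comet-qe-da")  # assumed QE checkpoint
model = load_from_checkpoint(model_path)

# QE models score (source, machine translation) pairs without a gold reference.
data = [
    {"src": "The cat sat on the mat.", "mt": "De kat zat op de mat."},
    {"src": "A man is playing the guitar.", "mt": "Een man speelt gitaar."},
]
prediction = model.predict(data, batch_size=8, gpus=0)
print(prediction.scores)  # one estimated quality score per sentence pair
```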


In this project, you will explore the use of machine-translated data for building a natural language inference (NLI) model. NLI is the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". You are provided with a large NLI corpus in which all text fields have been automatically translated from English to Dutch by a small pretrained Transformer model. Every translated sentence is associated with two quality estimation scores produced by two multilingual QE systems from the Unbabel COMET framework. The first is trained to predict normalized Direct Assessment (DA) scores (see Section 3.1, "WMT", of Freitag et al., 2021) from past editions of the Workshop on Machine Translation. The second was pre-trained on DA and then further fine-tuned to predict Multidimensional Quality Metrics (MQM) scores (see Section 3.3, "MQM", of the same paper) on the data released by Freitag et al. (2021). These scores are approximations of the translation quality of each sentence pair.
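For concreteness, a minimal sketch of loading and inspecting the corpus with the Hugging Face datasets library is shown below. The dataset identifier is the one mentioned in the evaluation note further down (GroNLP/ik-nlp-22_transqe); the exact column layout is not specified here, so the code only prints the column names and one instance rather than assuming specific fields.

```python
# Minimal sketch: load the translated NLI corpus from the Hugging Face Hub and
# inspect one instance. Check the dataset card for the actual field names.
from datasets import load_dataset

dataset = load_dataset("GroNLP/ik-nlp-22_transqe", split="train")
print(dataset.column_names)  # premise/hypothesis fields, label, DA and MQM scores
print(dataset[0])            # a single premise-hypothesis pair with its QE scores
```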


Your main goal is to assess how the estimated quality of translated instances affects the performance of models trained on them, for example by training only on a subset of high-quality translated instances (see the sketch below). Is there a tradeoff between the amount of training data available and the model's performance? You must test the performance of your models on the Dutch SICK-NL corpus and its English counterpart, SICK, since these were manually created by human annotators. Do not base your evaluation on the test set provided with the GroNLP/ik-nlp-22_transqe dataset on the Hub, since it was automatically translated alongside the rest of the data.
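One possible way to operationalize this, sketched below under assumed column names (da_premise, da_hypothesis), is to keep only instances whose translated fields exceed a DA-score threshold and to compare models trained on the filtered and unfiltered data, evaluating on SICK-NL (and SICK for English) rather than on the automatically translated test split.

```python
# Minimal sketch of selecting a high-quality training subset by thresholding the
# DA quality-estimation scores. The column names "da_premise" and "da_hypothesis"
# are assumptions; adapt them to the actual fields of the dataset.
from datasets import load_dataset

train = load_dataset("GroNLP/ik-nlp-22_transqe", split="train")

def is_high_quality(example, threshold=0.5):
    # Keep an instance only if both translated fields are estimated to be good.
    return (float(example["da_premise"]) >= threshold
            and float(example["da_hypothesis"]) >= threshold)

filtered = train.filter(is_high_quality)
print(f"Kept {len(filtered)} of {len(train)} training instances")
# Models (e.g. a fine-tuned BERTje) trained on the full vs. filtered data can
# then be compared on the manually curated SICK-NL test set.
```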


Ideas for research directions

Materials

References

Joshi, Pratik et al. “The State and Fate of Linguistic Diversity and Inclusion in the NLP World.” ACL (2020).

Ruder, Sebastian. “Why You Should Do NLP Beyond English.” Blog post (2020).

Vries, Wietse de et al. “BERTje: A Dutch BERT Model.” ArXiv abs/1912.09582 (2019).


Rei, Ricardo et al. “COMET: A Neural Framework for MT Evaluation.” EMNLP (2020).


Rei, Ricardo et al. “Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task.” WMT (2021).


Potts, Christopher. “Natural Language Inference.” Stanford University (2019).


Freitag, Markus et al. “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation.” Transactions of the Association for Computational Linguistics 9 (2021): 1460-1474.


Wijnholds, Gijs Jasper and Michael Moortgat. “SICK-NL: A Dataset for Dutch Natural Language Inference.” EACL (2021).


Kocyigit et al. “Better Quality Estimation for Low Resource Corpus Mining.” ACL (2022).