Are Automatically Translated Data Good Enough for Multilingual NLP?
How important is the quality of translated data when training models in other languages?
Abstract
With the growing interest in NLP, the number of available corpora and datasets has also increased greatly, but new resources focus mainly on English and other high-resource languages. While machine-translating annotated data into lesser-resourced languages can compensate for this gap, translation errors can affect the final quality of the model. In this project, you will explore how automatic estimation of translation quality can improve the performance of models trained on translated data.
Description
Despite the fast-growing interest in NLP, most annotated resources are available only in English and other high-resource languages, perpetuating the marginalization of some communities and accentuating inequalities in access to language technologies worldwide.
While recent years have seen growing interest in multilingual NLP and cross-lingual transfer of knowledge in NLP models, these approaches are still in their infancy. A common technique for creating new resources in other languages is to translate those already available in English. Since human translation costs are prohibitive at this scale, machine translation (MT) is usually used instead. While this has proven effective, it comes at the cost of incorporating the MT system's biases into the data and, most importantly, makes the effectiveness of the process highly reliant on translation quality, as shown by Bonifacio et al. (2021). Machine translation quality is usually evaluated by comparing a translated sentence to a gold reference, but quality estimation (QE) has gained traction as an automatic, reference-free way of assessing MT outputs, recently showing results comparable to reference-based metrics.
In this project, you will explore the use of machine-translated data for building a model of natural language inference (NLI), the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise". You are provided with a large NLI corpus in which all text fields have been automatically translated from English to Dutch by a small pretrained Transformer model. Every translated sentence is associated with two quality estimation scores produced by two multilingual QE systems from the Unbabel COMET framework. The first is trained to predict normalized Direct Assessment (DA) scores (see Section 3.1, "WMT", of Freitag et al., 2021) from past editions of the Workshop on Machine Translation. The second was pretrained on DA and then further fine-tuned to predict Multidimensional Quality Metrics (MQM) scores (see Section 3.3, "MQM", of Freitag et al., 2021) on the data of Freitag et al. (2021). These scores approximate the translation quality of each sentence pair.
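To make the setup concrete, the snippet below sketches what a single instance of such a corpus might look like. The field names and values are assumptions for illustration only; the actual schema is documented on the dataset card on the Hub.

```python
# Hypothetical row from the translated NLI corpus. Field names are
# assumptions based on the project description; check the dataset card
# for the real schema. Label encoding (0 = entailment, 1 = neutral,
# 2 = contradiction) is likewise an assumption.
example = {
    "premise_en": "A man is playing a guitar.",
    "premise_nl": "Een man speelt gitaar.",
    "hypothesis_en": "Someone is making music.",
    "hypothesis_nl": "Iemand maakt muziek.",
    "label": 0,
    "da_premise": 0.61,      # DA-based QE score for the premise translation
    "mqm_premise": 0.13,     # MQM-based QE score for the premise translation
    "da_hypothesis": 0.58,
    "mqm_hypothesis": 0.12,
}

# Each instance carries QE scores for both translated fields, so any
# per-instance notion of quality must aggregate over premise and
# hypothesis -- for example, taking the weaker of the two translations:
instance_quality = min(example["da_premise"], example["da_hypothesis"])
```

Aggregating with `min` is one conservative choice (an instance is only as good as its worst translation); averaging the two fields is an equally reasonable alternative.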
Your main goal is to assess how the estimated quality of translated instances affects the performance of models trained on them, in particular by training on a subset of high-quality translated instances. Is there a tradeoff between the amount of training data available and the model's performance? You must test the performance of your models on the Dutch SICK-NL corpus and its English counterpart, SICK, since these have been manually created by human annotators. Do not base your evaluation on the test set provided with the GroNLP/ik-nlp-22_transqe dataset on the HuggingFace Hub, since it was automatically translated alongside the rest of the data.
Ideas for research directions
Present the distribution of quality scores for each field and perform an in-depth analysis of how the performance of models of your choice is affected when filtering training examples by QE score at different thresholds. In your analysis, discuss how effective each of the two QE scores is as a filtering criterion, why that might be, and, if it makes sense in this context, devise a principled way to combine them.
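A minimal sketch of threshold-based filtering, using toy per-instance scores (assumed to be already aggregated over premise and hypothesis). Since DA and MQM scores live on different scales, one simple way to combine them is to standardize each before averaging; all values below are illustrative, not taken from the actual dataset.

```python
from statistics import mean, pstdev

# Toy per-instance QE scores; values and scales are illustrative only.
da  = [0.70, 0.55, -0.20, 0.81, 0.10, 0.62]
mqm = [0.12, 0.10,  0.01, 0.14, 0.05, 0.11]
instances = list(range(6))  # stand-ins for training examples

def zscore(xs):
    """Standardize so DA and MQM become comparable before combining."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]

# One simple combination: average of the standardized scores.
combined = [(a + b) / 2 for a, b in zip(zscore(da), zscore(mqm))]

def filter_by_quality(scores, data, threshold):
    """Keep only instances whose combined QE score clears the threshold."""
    return [x for s, x in zip(scores, data) if s >= threshold]

# Higher thresholds trade training-set size for estimated quality.
for t in (-1.0, 0.0, 0.5):
    kept = filter_by_quality(combined, instances, t)
    print(f"threshold {t:+.1f}: {len(kept)} / {len(instances)} instances kept")
```

Sweeping the threshold and retraining at each point gives the quantity/quality curve the research question asks about.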
Since the dataset has been automatically translated from an existing English one, having an English baseline trained with the same or a similar model would be very helpful. You might also be interested in evaluating how the performance of a Dutch model on the translated data compares to multilingual ones such as mBERT, mT5, mMiniLM, and mDeBERTa. Are multilingual models more robust to bad translations when using the full data for training?
Do sentences with low QE scores exhibit some common properties or linguistic phenomena? How does QE-based data selection affect the performance of NLI-trained models on such sentences? Conduct a thorough error analysis and present your findings.
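One simple probe for such an analysis is to correlate a surface property of the source sentence (e.g. token length) with its QE score. The sketch below uses toy data chosen purely for illustration; any apparent trend in it is constructed, not an empirical claim about the dataset.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient, a first-pass error-analysis probe."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Toy data: source-sentence lengths (in tokens) and their QE scores.
lengths = [5, 8, 12, 20, 25, 33]
qe      = [0.80, 0.74, 0.70, 0.55, 0.52, 0.40]

r = pearson(lengths, qe)
print(f"length vs. QE score: r = {r:.2f}")
```

The same probe extends to other properties (negation, rare words, idioms); a strongly negative correlation for some property would suggest the MT system struggles with it.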
[Challenge 🏆] We can imagine a trade-off between data quantity and data quality in driving training (fine-tuning) performance. After having tested filtering out low-quality instances, modify the classification model to use a weighted loss during fine-tuning, using the QE scores as weights so that examples with poor translation quality contribute less. How does this affect the final performance with respect to the original setting?
Materials
A HuggingFace dataset containing the original and translated data has been created and is available on the Dataset Hub.
Refer to the dataset card on the Dataset Hub for all information related to available features and an example from the dataset.
You can find the SICK-NL dataset here. The original SICK corpus can be found here.
References
Joshi, Pratik et al. "The State and Fate of Linguistic Diversity and Inclusion in the NLP World." ACL (2020).
Ruder, Sebastian. Why You Should Do NLP Beyond English. Blog post (2020).
Vries, Wietse de et al. “BERTje: A Dutch BERT Model.” ArXiv abs/1912.09582 (2019).
Rei, Ricardo et al. “COMET: A Neural Framework for MT Evaluation.” EMNLP (2020).
Rei, Ricardo et al. “Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task.” WMT (2021).
Potts, Christopher. "Natural Language Inference." Stanford University (2019).
Freitag, Markus et al. “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation.” Transactions of the Association for Computational Linguistics 9 (2021): 1460-1474.
Wijnholds, Gijs Jasper and Michael Moortgat. “SICK-NL: A Dataset for Dutch Natural Language Inference.” EACL (2021).
Kocyigit et al. "Better Quality Estimation for Low Resource Corpus Mining." ACL (2022).