Abstract
With the growing interest in NLP, the number of available corpora and datasets has grown accordingly, but mostly for English and other high-resource languages. Machine-translating annotated data into lesser-resourced languages can compensate for this gap, but translation errors can affect the final quality of the model. In this project, you will explore how automatic estimates of translation quality can be used to potentially improve the performance of models trained on translated data.
Description
Despite the fast-growing interest in NLP, most annotated resources are available only in English and other high-resource languages, a fact that perpetuates the marginalization of some communities and accentuates inequalities in access to language technologies.
While in recent years there has been growing interest in multilingual NLP and in cross-lingual transfer of knowledge between NLP models, these approaches are still in their infancy. A common technique for creating resources in other languages is simply to translate the ones already available in English. Since human translation is prohibitively expensive at this scale, machine translation (MT) is usually used instead. While this has proven effective, it comes at the cost of incorporating the MT system's biases into the data and, most importantly, it makes the effectiveness of the process highly reliant on translation quality, as shown by Bonifacio et al. (2021). While the quality of machine translations is usually evaluated by comparing a translated sentence to a gold reference, quality estimation (QE) has gained traction as an automatic, reference-free way of assessing the quality of MT outputs, recently showing results comparable to reference-based metrics.
In this project, you will explore the use of machine-translated data for building a model of natural language inference (NLI), the task of determining whether a "hypothesis" is true (entailment), false (contradiction), or undetermined (neutral) given a "premise".
You’re provided with a large NLI corpus in which all text fields have been automatically translated from English to Dutch by a small pretrained Transformer model. Every translated sentence is associated with two quality estimation scores produced by two multilingual QE systems from the Unbabel COMET framework. The first is trained to predict normalized Direct Assessment (DA) scores (see Section 3.1, "WMT", of Freitag et al. (2021)) from past editions of the Workshop on Machine Translation. The second was pre-trained on DA and then further fine-tuned to predict Multidimensional Quality Metrics (MQM) scores (see Section 3.3, "MQM") on the data released by Freitag et al. (2021). These scores are approximations of the translation quality of each sentence pair.
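To make the setup concrete, a single training instance could look roughly like the sketch below. The field names and label encoding are illustrative assumptions; the actual schema is documented on the dataset card.

```python
# Illustrative only: field names and label encoding are assumptions,
# not the actual schema (see the dataset card for the real one).
example = {
    "premise": "A man is playing a guitar.",              # original English premise
    "premise_nl": "Een man speelt gitaar.",               # machine-translated premise
    "hypothesis": "A man is playing an instrument.",
    "hypothesis_nl": "Een man bespeelt een instrument.",
    "label": 0,             # e.g. 0 = entailment, 1 = neutral, 2 = contradiction
    "premise_da": 0.74,     # DA-based QE score for the premise translation
    "premise_mqm": 0.11,    # MQM-based QE score for the premise translation
    "hypothesis_da": 0.69,
    "hypothesis_mqm": 0.08,
}
```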
Your main goal is to assess how the estimated quality of translated instances affects the performance of models trained on them, notably by considering a subset of high-quality translated training instances. Is there a trade-off between the amount of training data available and the performance of the model? You are required to test the performance of your models on the Dutch SICK-NL corpus and on its English counterpart SICK, since these have been manually created by human annotators.
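As a starting point, such a high-quality subset can be built by thresholding the QE scores with the `datasets` library. A minimal sketch, assuming placeholder dataset and column names (use the real ones from the dataset card):

```python
from datasets import load_dataset

# Placeholder dataset identifier and column names; check the dataset card for the actual ones.
train = load_dataset("your-org/translated-nli-nl", split="train")

DA_THRESHOLD = 0.5  # arbitrary cutoff; the project asks you to compare several thresholds

def is_high_quality(ex):
    # Keep an instance only if both translated fields exceed the DA threshold.
    return ex["premise_da"] >= DA_THRESHOLD and ex["hypothesis_da"] >= DA_THRESHOLD

high_quality_train = train.filter(is_high_quality)
print(f"Kept {len(high_quality_train)} of {len(train)} training instances")
```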
Ideas for research directions:
Present the distribution of quality scores for each field and perform an in-depth analysis of how the performance of some models of your choice is affected when filtering examples on QE scores at different thresholds. In your analysis, discuss how effective each of the two QE scores is when used for filtering, why that could be, and, if it makes sense in this context, come up with a smart way to combine them.
Since the dataset has been automatically translated from an existing English one, an English baseline trained with the same or a similar model would be very helpful. You might also be interested in evaluating how the performance of a Dutch model on the translated data compares to that of multilingual ones such as mBERT, mT5, mMiniLM, and mDeBERTa. Are multilingual models more robust to bad translations when using the full data for training?
Do sentences with low QE scores exhibit common properties or linguistic phenomena? How does QE-based data selection affect the performance of the trained NLI models on such sentences? Conduct a thorough error analysis and present your findings.
[Challenge 🏆] We can imagine a trade-off between data quantity and data quality in driving training performance. After having tested filtering out low-quality instances, modify the classification model to use a weighted loss during training, using the QE scores as weights so that examples with poor translation quality receive a low weight (see the sketch after this list). How does this affect the final performance with respect to the original setting?
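For the challenge, one possible implementation (a sketch, not a prescribed setup) is to derive a per-example weight from the QE scores, for instance the mean of the premise and hypothesis DA scores rescaled to [0, 1], and multiply it into an unreduced cross-entropy loss:

```python
import torch
import torch.nn.functional as F

def weighted_nli_loss(logits, labels, qe_weights):
    """Cross-entropy in which each example contributes proportionally to its QE-derived weight."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    return (qe_weights * per_example).mean()

# Toy batch: 4 examples, 3 NLI classes; in practice the logits come from the classifier.
logits = torch.randn(4, 3, requires_grad=True)
labels = torch.tensor([0, 2, 1, 0])
# Hypothetical weights, e.g. the rescaled mean of premise and hypothesis DA scores.
qe_weights = torch.tensor([0.9, 0.3, 0.7, 0.5])

loss = weighted_nli_loss(logits, labels, qe_weights)
loss.backward()
```

With the HuggingFace `Trainer`, this would typically amount to keeping the QE-derived weight as an extra column and overriding `compute_loss`; in a plain PyTorch training loop, the function above can simply replace the standard cross-entropy.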
Materials
A HuggingFace dataset containing both the original and the translated data has been created and is available on the Dataset Hub.
Refer to the dataset card on the Dataset Hub for all information related to available features and an example from the dataset.
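For reference, the dataset can be loaded and inspected as follows; the identifier below is a placeholder for the actual name on the Dataset Hub:

```python
from datasets import load_dataset

# Placeholder identifier; replace it with the actual dataset name from the Dataset Hub.
dataset = load_dataset("your-org/translated-nli-nl")

print(dataset)                    # available splits and number of instances
print(dataset["train"].features)  # available fields, including the DA and MQM QE scores
print(dataset["train"][0])        # a single example
```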
You can find the SICK-NL dataset here. The original SICK corpus can be found here.
References
Ruder, Sebastian. "Why You Should Do NLP Beyond English." Blog post (2020).
Vries, Wietse de et al. “BERTje: A Dutch BERT Model.” ArXiv abs/1912.09582 (2019).
Rei, Ricardo et al. “COMET: A Neural Framework for MT Evaluation.” EMNLP (2020).
Rei, Ricardo et al. “Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task.” WMT (2021).
Potts, Christopher. "Natural Language Inference." Stanford University (2019).
Freitag, Markus et al. “Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation.” Transactions of the Association for Computational Linguistics 9 (2021): 1460-1474.
Wijnholds, Gijs Jasper and Michael Moortgat. “SICK-NL: A Dataset for Dutch Natural Language Inference.” EACL (2021).
Kocyigit et al. "Better Quality Estimation for Low Resource Corpus Mining." ACL (2022).