Understanding Context Usage in Machine Translation
Can machine translation models use context in a human-plausible way when translating multi-sentence texts? Interpret the inner workings of these systems to find out.
Abstract
Establishing whether language models use context information in a reasonable way during generation is fundamental to ensure their safe adoption in real-world settings. Recent work showed that inspecting the internals of machine translation (MT) models can help trace a connection between specific parts of the input context and model predictions. In this project, you will extend previous analyses to identify (un)reasonable cases of context usage in MT models across various languages and see how they relate to translation mistakes identified by human annotators.
Description
Additional context is often necessary to resolve ambiguities in translation; however, MT models are typically trained to translate one sentence at a time. For example, when translating from English into French, the word "your" can be translated differently depending on factors such as the age of the interlocutor:
The child said to the grandfather: thanks for your help => L'enfant a dit au grand-père: merci pour votre aide [Formal => Often used for elderly people]
The grandfather said to the child: thanks for your help => Le grand-père dit à l'enfant: merci pour ton aide [Informal => Often used for younger people]
But how can we know whether, in the case above, the translation model is actually using the context words 'child' and 'grandfather' during translation? The PECoRe method (Sarti et al. 2023) is based on the intuition that we can examine what happens "under the hood" of the translation model to detect when the context plays an influential role in generation. In short, the method is composed of two steps:
Context-sensitive Token Identification (CTI): Aimed at identifying which generated tokens were most influenced by the presence of context. This is done by comparing, for example, output probabilities for the given generation with and without input context (in jargon, a contrastive comparison). In the example above, this corresponds to detecting 'votre' or 'ton' as context-sensitive.
Contextual Cues Imputation (CCI): After finding one or more context-sensitive tokens, this step uses feature attribution on each token to identify context tokens contributing to its prediction. In the example above, this corresponds to detecting the influence of 'child' or 'grandfather' on the prediction of 'ton' or 'votre'.
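To make the first step concrete, here is a minimal, self-contained sketch of the contrastive comparison behind CTI. All names and the toy probabilities are illustrative: in practice the two probability lists would come from two forward passes of the same MT model, with and without the preceding context, and PECoRe supports several contrastive metrics beyond the simple log-probability difference used here.

```python
import math

def context_sensitive_tokens(tokens, p_with_ctx, p_without_ctx, threshold=0.5):
    """Flag generated tokens whose probability shifts most when context is removed.

    Uses the difference of log-probabilities as a simple contrastive metric:
    a large positive score means the token was much more likely with context.
    """
    flagged = []
    for tok, p_ctx, p_noctx in zip(tokens, p_with_ctx, p_without_ctx):
        score = math.log(p_ctx) - math.log(p_noctx)
        if score > threshold:
            flagged.append((tok, score))
    return flagged

# Toy numbers for the "votre"/"ton" example above: only the formality-marked
# pronoun changes substantially when the context sentence is dropped.
tokens = ["merci", "pour", "votre", "aide"]
p_with = [0.95, 0.90, 0.80, 0.97]
p_without = [0.94, 0.89, 0.20, 0.96]
print(context_sensitive_tokens(tokens, p_with, p_without))  # flags 'votre'
```

The CCI step would then run feature attribution on each flagged token to locate the responsible context words (e.g. 'child' or 'grandfather').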
The paper above explored a narrow set of applications for PECoRe, focusing on gender agreement and lexical choice phenomena in English-to-French translation. In this project, your core task will be to extend this analysis to other translation directions and linguistic phenomena using multilingual MT systems. More specifically, you will translate a small set of sentences augmented with preceding context in a language of your choice and use PECoRe to investigate whether the model's use of context matches your intuition.
Ideas for research directions:
Is context usage similar across different languages? Datasets like DivEMT and FLORES-200 contain the same sentences translated into several languages. This allows you to evaluate whether the contextual phenomena you identify are language-specific or if you can find the same patterns in other translation directions.
Can (failed) context usage predict translation errors? The DivEMT dataset contains machine translations produced by mBART 1-to-50 (task_type = pe2) and post-edited by human translators. The tags in column mt_wmt22_qe correspond to tokens in mt_tokens and mark whether each token was kept (OK) or changed (BAD) by the translator during post-editing. Select a moderately sized set of sentences in a language you are comfortable with, and use PECoRe to investigate whether these errors are due to an unreasonable usage of context information. Are context-related errors the majority in the DivEMT dataset?
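One hedged way to quantify the overlap described above, assuming mt_tokens and mt_wmt22_qe are parallel lists as in DivEMT and that PECoRe has already produced a set of context-sensitive token indices (the helper name and toy data below are illustrative, not part of either library):

```python
def error_overlap(mt_tokens, qe_tags, context_sensitive_idx):
    """Fraction of BAD (post-edited) tokens that were also flagged as
    context-sensitive, i.e. candidate context-related translation errors."""
    bad = [i for i, tag in enumerate(qe_tags) if tag == "BAD"]
    if not bad:
        return 0.0
    hits = sum(1 for i in bad if i in context_sensitive_idx)
    return hits / len(bad)

# Toy example: the single post-edited token is also context-sensitive.
mt_tokens = ["merci", "pour", "votre", "aide"]
qe_tags = ["OK", "OK", "BAD", "OK"]
print(error_overlap(mt_tokens, qe_tags, {2}))  # 1.0
```

Aggregating this ratio over your selected sentences gives a rough estimate of how many post-editing interventions are plausibly context-related.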
[Challenge 🏆] Automatic extraction of context-dependent sentences: A downside of the PECoRe approach is that it requires pre-selecting sentences where context plays a relevant role in order to evaluate whether the model behavior is appropriate. The Multilingual Discourse-Aware Benchmark (MuDA) is a set of language-specific taggers that can mitigate this issue. Specifically, MuDA can be used to detect sentences matching specific context-dependent translation phenomena (e.g. gender, formality, lexical cohesion) in large collections of documents. Select one or more language pairs from the IWSLT 2017 corpus, run MuDA on them to identify contextual phenomena, and extend your PECoRe experiments to verify whether your initial observations hold on this different dataset.
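To give an intuition for what such taggers do, here is a deliberately simplified, toy illustration of a MuDA-style formality tagger (the real MuDA taggers use morphological analysis and are far more sophisticated): it flags French target sentences containing formality-marked second-person pronouns, which signal that preceding context may be needed to translate correctly.

```python
# Toy word lists; real taggers rely on linguistic resources per language.
FORMALITY_MARKERS = {"tu", "vous", "ton", "ta", "tes", "votre", "vos"}

def needs_formality_context(target_sentence):
    """Return True if the target sentence contains a formality-marked pronoun."""
    tokens = [t.strip(".,!?") for t in target_sentence.lower().split()]
    return any(tok in FORMALITY_MARKERS for tok in tokens)

print(needs_formality_context("merci pour votre aide"))     # True
print(needs_formality_context("il fait beau aujourd'hui"))  # False
```

Sentences flagged this way (here, or by the actual MuDA taggers) are natural candidates for a follow-up PECoRe analysis.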
Materials
The DivEMT dataset (Sarti et al. 2022), containing translations from English into six typologically diverse languages (Dutch, Italian, Vietnamese, Turkish, Ukrainian, Arabic), is available on the Hugging Face Hub. Refer to the dataset card for information on the available features and an example from the dataset. If you don't know any of these languages, you can also use the FLORES-200 corpus, which spans 200 languages (use the URL field to identify sentences belonging to the same document, which can serve as context). The DivEMT Explorer can facilitate the exploration and qualitative analysis of the examples in the DivEMT corpus.
The PECoRe method for attributing context usage can be used through the Inseq interpretability library (documentation, demo, and example usage are available). The method is compatible with any generative language model from the Hugging Face Transformers library, but we suggest you use the NLLB 600M model (many-to-many across 200 languages) or mBART 1-to-50 (English to 50 languages), since their size is manageable on typical computational resources.
The MuDA repository contains taggers for several languages to automatically extract context-dependent phenomena from sentences translated at a document level. The IWSLT 2017 corpus contains translations in various languages with document IDs to mark which segments were part of the same document (needed by MuDA).
References
Sarti, Gabriele et al. (2023). Quantifying the Plausibility of Context Reliance in Neural Machine Translation. ICLR 2024.
Yin, Kayo and Neubig, Graham (2022). Interpreting Language Models with Contrastive Explanations. EMNLP 2022.
Fernandes, Patrick et al. (2023). When Does Translation Require Context? A Data-driven, Multilingual Exploration. ACL 2023.
Fernandes, Patrick et al. (2021). Measuring and Increasing Context Usage in Context-Aware Machine Translation. ACL 2021.
Sarti, Gabriele et al. (2023). Inseq: An Interpretability Toolkit for Sequence Generation Models. ACL 2023.
Sarti, Gabriele et al. (2022). DivEMT: Neural Machine Translation Post-Editing Effort Across Typologically Diverse Languages. EMNLP 2022.
NLLB Team (2022). No Language Left Behind: Scaling Human-Centered Machine Translation. arXiv preprint.
Mohammed, Wafaa and Niculae, Vlad (2024). On Measuring Context Utilization in Document-Level MT Systems. arXiv preprint.