The Universal Word Inflector
Can you build an efficient model to inflect words in any language?
Abstract
The vast majority of the world's languages have a richer morphology than English, meaning that they build words from small meaning-bearing units (morphemes) through a more elaborate system of word formation mechanisms. Within morphology, word inflection is the task of mapping a lemma (or canonical form) and a target morphological tag to the corresponding surface form. For instance, in English:
lemma=girl tag=N;NOM;PL => girls
lemma=child tag=N;NOM;PL => children
or in Turkish:
lemma=guakamole tag=N;NOM;PL => guakamoleler
In this project, you'll build a Transformer-based multilingual model of the word inflection task and explore several approaches to improve accuracy across languages. From the machine learning perspective, this project is a nice opportunity to explore efficient approaches (e.g. adapters, soft prompts) for handling multilingual adaptation challenges. On the linguistic side, it offers the chance to delve into interesting and less-studied morphological phenomena.
Description
The computational modeling of morphology is a lively research field with direct applications to various NLP tasks. The application of modern neural networks to morphology has led to substantial accuracy gains in the task of word inflection; however, some languages still suffer from low or very low accuracy even with state-of-the-art models.
The SIGMORPHON 2023 shared task (Part 0) focused on morphological inflection for a large and typologically diverse set of languages. Specifically, data was released for 26 languages from 9 language families, covering a broad variety of linguistic phenomena. Some language families are represented by a single language (e.g. Japanese, Turkish), while others include multiple related languages (e.g. Italian/French/Spanish or Arabic/Hebrew).
Your main task is to fine-tune HuggingFace's implementation of ByT5, a variant of the popular T5 Transformer encoder-decoder model operating at the level of characters (or more precisely, bytes), and to compare your results to those of the official shared task (Goldman et al., 2023). The ByT5 model (Xue et al., 2022) was pre-trained on the massively multilingual mC4 corpus, allowing for an interesting comparison against the models trained from scratch in the context of SIGMORPHON shared tasks.
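To get started, here is a minimal fine-tuning sketch. The serialization of language code, tag, and lemma into a single input string is our own assumption (any consistent format should work); the checkpoint name "google/byt5-small" is the standard HuggingFace release.

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    # ByT5 reuses the T5 architecture with a byte-level tokenizer
    tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # assumed input format: language code, morphological tag, and lemma
    src = "tur N;NOM;PL guakamole"
    tgt = "guakamoleler"

    inputs = tokenizer(src, return_tensors="pt")
    labels = tokenizer(tgt, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # cross-entropy loss for one example

    # inference: generate the inflected form byte by byte
    pred = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(pred[0], skip_special_tokens=True))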
You will then explore one or more of the following research directions:
Language relatedness: Focus on a subset of languages for which at least one related language exists in the shared task (see Table 1 in Goldman et al., 2023). Explore whether fine-tuning on the related languages helps compared to (i) training only on the target language or (ii) training on other, non-related languages. Is training only on the target language always the best solution? If so, simulate a low-resource setting by downsampling the target language's training data (a minimal sketch follows), and study at what point cross-lingual training becomes useful.
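A minimal downsampling sketch, assuming the training data is loaded as a list of examples; the helper name and the fractions are illustrative.

    import random

    def downsample(examples, fraction, seed=0):
        # keep a random fraction of the target language's training data
        # to simulate a low-resource setting (hypothetical helper)
        rng = random.Random(seed)
        k = max(1, int(len(examples) * fraction))
        return rng.sample(examples, k)

    # e.g. sweep over increasingly severe low-resource conditions:
    # for frac in (1.0, 0.5, 0.1, 0.01):
    #     subset = downsample(target_lang_train, frac)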
Parameter-efficient fine-tuning (PEFT): As models get larger and larger, fine-tuning can be very computationally expensive. Moreover, storing and deploying models that are fine-tuned independently on many tasks requires a lot of memory (each fine-tuned model has the same size as the original one). To address this, PEFT techniques, such as LoRA, fine-tune only a small number of (extra) model parameters. Can you use LoRA to reduce the computational costs of fine-tuning while maintaining inflection accuracy?
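A minimal sketch of wrapping ByT5 with a LoRA adapter via the HuggingFace PEFT library; the rank and the other hyperparameters below are assumptions worth tuning.

    from peft import LoraConfig, TaskType, get_peft_model
    from transformers import T5ForConditionalGeneration

    model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # assumed hyperparameters; "q" and "v" are T5's attention query/value projections
    config = LoraConfig(
        task_type=TaskType.SEQ_2_SEQ_LM,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q", "v"],
    )
    model = get_peft_model(model, config)
    model.print_trainable_parameters()  # typically well below 1% of all weights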
[Challenge 🏆] Positive transfer vs negative interference: Training NLP models on multiple languages can be a blessing (when useful knowledge transfers across languages) but also a curse (when language-specific knowledge of other languages interferes with the target-language task). Look for the best trade-off by combining a task-level LoRA adapter (i.e. trained on all languages) with a language-level LoRA adapter (i.e. trained on a single language) using multi-adapter inference in the HuggingFace PEFT library.
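A minimal sketch of the combination step, assuming two LoRA adapters of the same rank have already been trained and saved; the adapter paths and the 0.5/0.5 mixing weights are placeholders (those weights are exactly the trade-off to explore).

    from peft import PeftModel
    from transformers import T5ForConditionalGeneration

    base = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

    # hypothetical checkpoints: a task-level adapter (all languages)
    # and a language-level adapter (here, Turkish only)
    model = PeftModel.from_pretrained(base, "adapters/task-all", adapter_name="task")
    model.load_adapter("adapters/lang-tur", adapter_name="tur")

    # combine the two adapters into a new one and activate it
    model.add_weighted_adapter(
        adapters=["task", "tur"],
        weights=[0.5, 0.5],
        adapter_name="task_plus_tur",
        combination_type="linear",
    )
    model.set_adapter("task_plus_tur")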
[Challenge 🏆] Difficult languages: Some languages in the shared task (like Navajo, Ancient Greek, Sanskrit, Belarusian, Sami, and French) turned out to be particularly difficult to inflect, for reasons that remain to be explained. Can you predict and explain inflection difficulty? This can be done at two levels:
Across languages: Previous work has tried to explain why some languages are harder to model than others with neural language models (e.g. Mielke et al., 2019). What about inflection difficulty? Does inflection accuracy across languages correlate with known typological properties? For instance, we could imagine that having many grammatical genders, many cases, or many declension classes makes the task of inflection more difficult. To find out, select salient properties from a typological database like WALS and compute their correlation with the language-level inflection accuracies of your model, or with those of the best models that took part in the shared task (see the correlation sketch after this section). Good to know: WALS includes a set of ten morphological features (20A to 29A); however, not all of them are specified for every language.
or
At the single-language level: Pick a difficult language and collect the inflections generated by your model for the dev set. Can you train a simple linear model to predict when your model makes inflection mistakes (see the error-prediction sketch after this section)? For example, some morphological features (e.g. plural) or combinations of features (e.g. accusative plural) may be harder to inflect than others in a given language. Also, some parts of speech, e.g. verbs, might be harder to inflect than others, e.g. nouns. Word frequency could also be a predictor. Inspect the weights learnt by the linear model to explain which phenomena of the language are hard to model. Want to go the extra mile? Use the Inseq library to extract feature attribution maps and probabilities from your ByT5 inflection model, and look for systematic patterns that match (or contradict) your linguistic intuition.
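The correlation sketch mentioned in the cross-language direction, assuming you have collected one accuracy and one WALS-derived property per language; all numeric values below are invented placeholders.

    from scipy.stats import spearmanr

    # placeholder values: replace with your per-language accuracies and a
    # property extracted from WALS (e.g. the number of grammatical cases)
    accuracy = {"tur": 0.92, "fra": 0.71, "heb": 0.85, "jpn": 0.95}
    n_cases = {"tur": 6, "fra": 2, "heb": 2, "jpn": 8}

    langs = sorted(accuracy)
    rho, p = spearmanr([n_cases[l] for l in langs],
                       [accuracy[l] for l in langs])
    print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")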
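The error-prediction sketch for the single-language direction: a logistic-regression probe over per-item morphological features. The feature dictionaries and labels below are toy illustrations; in practice there would be one record per dev-set item.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # one feature dict per dev item, plus a binary label marking whether
    # the inflection model got that item wrong (toy values shown here)
    records = [
        {"pos": "V", "number": "PL", "case": "ACC"},
        {"pos": "N", "number": "SG", "case": "NOM"},
    ]
    errors = [1, 0]

    vec = DictVectorizer()
    X = vec.fit_transform(records)
    probe = LogisticRegression().fit(X, errors)

    # features with large positive weights are associated with mistakes
    for name, weight in sorted(zip(vec.get_feature_names_out(), probe.coef_[0]),
                               key=lambda t: -t[1]):
        print(f"{name}: {weight:+.2f}")

For the Inseq extra mile, the entry point might look as follows ("saliency" is one of Inseq's built-in attribution methods; the input string assumes the serialization format sketched earlier):

    import inseq

    model = inseq.load_model("google/byt5-small", "saliency")
    out = model.attribute(input_texts="tur N;NOM;PL guakamole")
    out.show()  # visualize byte-level attribution scores over the input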
Choice of languages:
You can choose to work with all languages, or a subset (at least 2) depending on your interests and research questions. In any case, your choice should be well-motivated and discussed with the instructors before starting the experiments.
Materials
SIGMORPHON 2023 shared task repository. Focus on the part called "Typologically Diverse Morphological (Re-)Inflection".
The World Atlas of Language Structures (WALS)
Lists of word frequencies in many languages, collected from various corpora
The Inseq library and its documentation (paper available below)
References
[Goldman et al., 2023] SIGMORPHON–UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection
[Xue et al., 2022] ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models
[Hu et al., 2021] LoRA: Low-Rank Adaptation of Large Language Models
[Mielke et al., 2019] What Kind of Language Is Hard to Language-Model?
[Sarti et al., 2023] Inseq: An Interpretability Toolkit for Sequence Generation Models