

The Universal Word Inflector

Can you build an efficient model to inflect words in any language?


Abstract


The vast majority of the world's languages have a richer morphology than English, meaning that they build words through a richer system of word-formation mechanisms operating on small meaning units (or morphemes). Within morphology, word inflection is the task of mapping a lemma (or canonical form) and a target morphological tag to the corresponding surface form. For instance, in English:


lemma=girl tag=N;NOM;PL => girls

lemma=child tag=N;NOM;PL => children


or in Turkish:


lemma=guakamole tag=N;NOM;PL => guakamoleler


In this project, you'll build a Transformer-based multilingual model of the word inflection task and explore several approaches to improve accuracy across languages. From a machine learning perspective, the project is a nice opportunity to explore parameter-efficient approaches (e.g. adapters, soft prompts) to multilingual adaptation. On the linguistic side, it is a chance to delve into interesting and less-studied morphological phenomena.
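

To give a concrete flavor of the parameter-efficient direction, the snippet below sketches how ByT5 could be wrapped with LoRA adapters (Hu et al., 2021) using HuggingFace's peft library; the rank, dropout, and target modules are illustrative assumptions rather than a prescribed configuration.

# A minimal sketch of parameter-efficient adaptation with LoRA via the peft
# library; hyperparameters and target modules are illustrative assumptions.
from transformers import T5ForConditionalGeneration
from peft import LoraConfig, TaskType, get_peft_model

base = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,  # encoder-decoder generation task
    r=16,                             # low-rank dimension (assumed value)
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q", "v"],        # T5/ByT5 attention projection names
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()    # only the small LoRA matrices are trainable

With a setup like this, one lightweight adapter could be trained per language (or per language family) on top of a shared frozen backbone.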


Description


The computational modeling of morphology is a lively research field with direct applications to various NLP tasks. The application of modern neural networks to morphology has led to substantial accuracy gains in the task of word inflection; however, some languages still suffer from low or very low accuracy even with state-of-the-art models.
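

Inflection systems are typically scored by exact-match accuracy over the predicted surface forms; a minimal evaluation sketch is shown below (whether to apply Unicode normalization before comparison is an assumption to verify against the official evaluation script).

# A minimal sketch of exact-match accuracy for word inflection; the NFC
# normalization step is an assumption, not part of any official script.
import unicodedata

def exact_match_accuracy(predictions, references):
    """Fraction of predicted surface forms identical to the gold forms."""
    assert len(predictions) == len(references)
    normalize = lambda s: unicodedata.normalize("NFC", s.strip())
    correct = sum(normalize(p) == normalize(r)
                  for p, r in zip(predictions, references))
    return correct / len(references)

print(exact_match_accuracy(["girls", "childs"], ["girls", "children"]))  # 0.5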


The SIGMORPHON 2023 shared task (part 0) focused on morphological inflection for a large and typologically diverse set of languages. Specifically, data was released for 26 languages from 9 language families, covering a broad variety of linguistic phenomena. Some language families are represented by a single language (e.g. Japanese, Turkish), while others include multiple related languages (e.g. Italian/French/Spanish or Arabic/Hebrew).


Your main task is to fine-tune HuggingFace's implementation of ByT5, a variant of the popular T5 Transformer encoder-decoder model operating at the level of characters (or, more precisely, bytes), and to compare your results to those of the official shared task (Goldman et al., 2023). The ByT5 model (Xue et al., 2022) was pre-trained on a large multilingual corpus with an unsupervised objective, allowing for an interesting comparison against the models trained from scratch in the context of SIGMORPHON shared tasks.
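

As a starting point, the sketch below shows how an inflection example could be cast into ByT5's text-to-text format with the transformers library; the input template (language code, lemma, and tag concatenated into one string) and the choice of the small checkpoint are assumptions you are free to revise.

# A minimal sketch of fine-tuning inputs for ByT5 on word inflection; the
# source-string template is an assumption, not the shared task's format.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

def to_text_pair(lemma, tag, form, lang="eng"):
    """Serialize one (lemma, tag, form) triple as source/target strings."""
    source = f"{lang} {lemma} {tag}"   # a language code supports multilingual training
    return source, form

src, tgt = to_text_pair("child", "N;NOM;PL", "children")
batch = tokenizer([src], text_target=[tgt], padding=True, return_tensors="pt")
loss = model(**batch).loss             # training objective for this batch

# At inference time, the surface form is generated byte by byte:
pred_ids = model.generate(batch["input_ids"], max_new_tokens=32)
print(tokenizer.batch_decode(pred_ids, skip_special_tokens=True))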


You will then explore one or more of the following directions:


Ideas for research directions:


Choice of languages:


You can choose to work with all languages or with a subset (at least two), depending on your interests and research questions. In any case, your choice should be well motivated and discussed with the instructors before starting the experiments.


Materials

References


[Goldman et al., 2023] SIGMORPHON–UniMorph 2023 Shared Task 0: Typologically Diverse Morphological Inflection

[Xue et al., 2022] ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

[Hu et al., 2021] LoRA: Low-Rank Adaptation of Large Language Models

[Mielke et al., 2021] What Kind of Language Is Hard to Language-Model?

[Sarti et al., 2023] Inseq: An Interpretability Toolkit for Sequence Generation Models