AfricaNLP Workshop

Strengthening African NLP. EACL 2021, Virtual Event

April 19, 2021

About the Workshop

Africa has over 2,000 languages, yet they are among the least represented in NLP research.

The rise in ML community efforts on the African continent has led to a growing interest in Natural Language Processing, particularly for African languages which are typically low-resourced languages. This interest is manifesting in the form of national, regional, continental and even global collaborative efforts to build corpora, as well as the application of the aggregated corpora to various NLP tasks.

This workshop therefore has several aims:

  • to showcase work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low-resource languages

  • to promote multidisciplinarity within the African NLP community with the goal of creating a holistic participatory NLP community that will produce NLP research and technologies that value fairness, ethics, decolonial theory, and data sovereignty

  • to provide a platform for the groups involved with the various projects to meet, interact, share and forge closer collaboration

  • to provide a platform for junior researchers to present papers and solutions, and to begin interacting with the wider NLP community

  • to present an opportunity for more experienced researchers to further publicize their work and inspire younger researchers through keynotes and invited talks

This workshop follows the successful previous edition in 2020, co-located with ICLR.

It will take place ONLINE, co-located with EACL 2021, on Monday, April 19.

No paper will be desk-rejected :)

Invited Speakers

Interdisciplinarity is key to achieving progress in African NLP. As a result, we have invited speakers from many spheres: philosophy, law, engineering, literature, and research.

Carr Center for Human Rights Policy / Bantucracy
Sabelo Mhlambi is the founder of Bantucracy, a public interest organization that focuses on ubuntu ethics and technology, a Technology & Human Rights Fellow at the Carr Center for Human Rights Policy, and a Fellow at the Berkman Klein Center for Internet & Society. Mhlambi's work is at the intersection of human rights, ethics, culture, and technology and emphasizes global south perspectives in AI policy.

CIPIT, Strathmore Law School
Dr Melissa Omino holds an LLB (University of Fort Hare), an LLM (Stellenbosch University) and an LLD (University of Fort Hare), and practices law as a partner at MJD Associates LLP in Nairobi, Kenya. Melissa has a special interest in intellectual property, digital trade, and data governance in Africa. She is also co-founder of the IPCheckIn, a monthly meeting of IP enthusiasts, including patent examiners, attorneys, professors, musicians and law students, who offer their services in IP awareness and knowledge dissemination pro bono in Kenya.

The Brick House, NaijaNLP
Kọla Túbọsún is a Nigerian linguist, editor, travel writer, and scholar. His work has been published in African Writer, Aké Review, Brittle Paper, International Literary Quarterly, Jalada, Popula, Saraba Magazine, and elsewhere. In 2016, he became the first African to be awarded the Premio Ostana, a prize given for work in indigenous language advocacy. Túbọsún is the brain behind YorubaName.com, the first crowdsourced multimedia dictionary of Yorùbá names.

Google Research
Sara Hooker is a research scholar at Google Brain. Her research interests include interpretability, model compression and security in deep neural networks. In 2014, she founded Delta Analytics, a non-profit dedicated to building technical capacity to help communities and non-profits across the world use machine learning for good. She grew up in Mozambique, Lesotho, Swaziland, South Africa, and Kenya and currently resides in California.

Masakhane, Saarland University
David Adelani is a doctoral student in computer science at Saarland University, Saarbrücken, Germany. His current research focuses on the security and privacy of users’ information in dialogue systems and online social interactions. Originally from Nigeria, he is also actively involved in the development of natural language processing datasets and tools for low-resource languages, with a special focus on African languages.

Praekelt Consulting
Arshath Ramkilowan is a Natural Language Understanding Scientist at Praekelt Consulting, where he trains and deploys African NLP systems for use in a variety of African-centric contexts. He has an MSc in Physics from the University of KwaZulu-Natal in South Africa. He also contributes to research on South African languages as part of Masakhane.

Schedule

Accepted Papers

Poster Session #1: 12:10 - 13:10 CET

Low-Resource Neural Machine Translation for Southern African Languages paper video
Authors: Evander Nyoni and Bruce Bassett

"Low-resource African languages have not fully benefited from the progress in neural machine translation because of a lack of data. Motivated by this challenge we compare zero-shot learning, transfer learning and multilingual learning on three Bantu languages (Shona, isiXhosa and isiZulu) and English. Our main target is English-to-isiZulu translation for which we have just 30,000 sentence pairs, 28% of the average size of our other corpora. We show the importance of language similarity on the performance of English-to-isiZulu transfer learning based on English-to-isiXhosa and English-to-Shona parent models whose BLEU scores differ by 5.2. We then demonstrate that multilingual learning surpasses both transfer learning and zero-shot learning on our dataset, with BLEU score improvements relative to the baseline English-to-isiZulu model of 9.9, 6.1 and 2.0 respectively. Our best model also improves the previous SOTA BLEU score by more than 10. "

Crowdsourced Phrase-Based Tokenization for Low-Resourced Neural Machine Translation: The case of Fon Language paper video
Authors: Bonaventure Dossou, Chris Emezue

Building effective neural machine translation (NMT) models for very low-resourced and morphologically rich African indigenous languages is an open challenge. Besides the issue of finding available resources for them, a lot of work is put into preprocessing and tokenization. Recent studies have shown that standard tokenization methods do not always adequately deal with the grammatical, diacritical, and tonal properties of some African languages. That, coupled with the extremely low availability of training samples, hinders the production of reliable NMT models. In this paper, using Fon language as a case study, we revisit standard tokenization methods and introduce Word-Expressions-Based (WEB) tokenization, a human-involved super-words tokenization strategy to create a better representative vocabulary for training. Furthermore, we compare our tokenization strategy to others on the Fon-French and French-Fon translation tasks.


AfriVEC: Word Embedding Models for African Languages. Case Study of Fon and Nobiin paper video
Authors: Bonaventure F. P. Dossou, Mohammed Sabry

From Word2Vec to GloVe, word embedding models have played key roles in current state-of-the-art results in Natural Language Processing. Designed to give significant and unique vectorized representations of words and entities, these models have proven to efficiently extract similarities and establish relationships reflecting semantic and contextual meaning among words and entities. African languages, representing more than 31% of the world's spoken languages, have recently been the subject of much research. However, to the best of our knowledge, there are currently very few, if any, word embedding models for the words and entities of those languages, and none for the languages studied in this paper. After describing the functionality of GloVe, Word2Vec, and Poincaré embeddings, we build Word2Vec and Poincaré word embedding models for Fon and Nobiin, which show promising results. We test the applicability of transfer learning between these models as a landmark for African languages to jointly mitigate the scarcity of their resources, and attempt to provide linguistic and social interpretations of our results. Our main contribution is to arouse more interest in creating word embedding models proper to African languages, ready for use, that can significantly improve the performance of downstream Natural Language Processing tasks on them. The official repository and implementation are at https://github.com/bonaventuredossou/afrivec
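
For readers who want to try these embedding families, below is a minimal sketch using the gensim library. The corpus file, hyperparameters, and toy relation pairs are our own illustrations, not the authors' settings; note that Poincaré embeddings are trained on relation pairs rather than raw sentences.

    # Minimal sketch (not the authors' code): Word2Vec and Poincaré models
    # in gensim. File name, hyperparameters and toy data are illustrative.
    from gensim.models import Word2Vec
    from gensim.models.poincare import PoincareModel

    # One whitespace-tokenized sentence per line, e.g. a small Fon corpus.
    sentences = [line.split() for line in open("fon_corpus.txt", encoding="utf-8")]

    # Skip-gram Word2Vec: a dense Euclidean vector for each word.
    w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=2, sg=1)
    print(w2v.wv.most_similar(sentences[0][0], topn=5))

    # Poincaré embeddings live in hyperbolic space and are trained on
    # (child, parent) relation pairs; this is a toy hierarchy.
    relations = [("dog", "animal"), ("cat", "animal"), ("animal", "being")]
    poincare = PoincareModel(relations, size=50)
    poincare.train(epochs=50)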


Design and Implementation of English To Yorùbá Verb Phrase Machine Translation System paper video
Authors: Safiriyu Eludiora, Ajibade Benjamin

The Yorùbá language has diverse speakers across the world, so translating it to other widely spoken languages must be emphasized. We aim to develop an English to Yorùbá machine translation system that can translate English verb phrase text to its Yorùbá equivalent. Words from both the source and target languages were collected for the verb phrase group in the home domain. The lexical translation is done by assigning values of the matching word in the dictionary. The syntax of the two languages was realized using context-free grammar, and we validated the rewrite rules with finite state automata. A human evaluation method was used and expert opinions scored. The evaluation shows the system performed better than sampled Google translations, with over 70% of the responses matching the system's output.


OkwuGbé: End-to-End Speech Recognition for Fon and Igbo paper video
Authors: Bonaventure F. P. Dossou, Chris C. Emezue

Language is inherent and compulsory for human communication. Whether expressed in a written or spoken way, it ensures understanding between people of the same and different regions. With the growing awareness and effort to include more low-resourced languages in NLP research, African languages have recently been a major subject of research in machine translation and other text-based areas of NLP. However, there is still very little comparable research in speech recognition for African languages. Interestingly, some of the unique properties of African languages affecting NLP, like their diacritical and tonal complexities, have a major root in their speech, suggesting that careful speech interpretation could provide more intuition on how to deal with the linguistic complexities of African languages for text-based NLP. OkwuGbé is a step towards building speech recognition systems for African low-resourced languages. Using Fon and Igbo as our case study, we conduct a comprehensive linguistic analysis of each language and describe the creation of end-to-end, deep neural network-based speech recognition models for both languages. We present a state-of-the-art ASR model for Fon, as well as benchmark ASR model results for Igbo. Our linguistic analyses (for Fon and Igbo) provide valuable insights and guidance into the creation of speech recognition models for other African low-resourced languages, as well as guide future NLP research for Fon and Igbo. The source code for the Fon and Igbo models has been made publicly available.


Impacts of Homophone Normalization for Amharic Natural Language Processing paper video
Authors: Tadesse Destaw, Seid Muhie, Abinew Ayele, Getie Gelaye, Chris Biemann

Amharic is the second most spoken Semitic language after Arabic and serves as the official working language of the government of Ethiopia. In Amharic/Ge’ez writing, there are different characters with the same sound but different shapes and meanings, which are called homophones. Even though there are rules and regulations for Amharic writing, the online community tends to use homophone characters randomly. This means the first character in the word ሃብታም (rich) can be replaced with one of the characters ሀ(hā), ሐ(ḥā), ሓ(ḥa), ኀ(ḫā), ኃ(ḫā), or ኻ(ẖa). To study the usage of homophone characters, we collected and analyzed around 5m sentences and built Word2Vec and FastText embedding models. Our analysis shows that 1) the usage of homophone characters in Amharic text is mostly random, and 2) the normalization of homophone characters has a negative impact when computing word similarity with the embedding models.

---

" አማርኛ ከአረብኛ ቀጥሎ በሁለተኛነት በስፋት የሚነገር ሴማዊ ቋንቋ ሲሆን የኢትዮጵያ ፌዴራላዊ ዴሞክራሲያዊ ሪፑብሊክ መንግሥት ሕጋዊ (ይፋዊ) የሥራ ቋንቋ ነው። በአማርኛ ወይም በግእዝ ሥርዓተ ጽሕፈት ውስጥ በቅርጽ እና በትርጕም የተለያዩ ሆነው ተመሳሳይ ድምፅ ያላቸው ብዙ ፊደላት አሉ። እነዚህ ፊደላት “ሞክሼ ሆሄያት” በመባል ይታወቃሉ። የአማርኛ ቋንቋ የሥርዓተ ጽሕፈት ሕግ ያለው ቢሆንም የ”ኦንላይን” ተጠቃሚዎች ግን እነዚህን ሞክሼ ሆሄያት እንደፈለጋቸው የመጠቀም ዝንባሌ ያሳያሉ። ይህም ማለት ለምሳሌ፣ ሃብታም በሚለው ቃል ውስጥ ያለው የመጀመሪያው ሆሄ ወይም ፊደል ከሚከተሉት ሆሄያት በአንዱ ሊተካ ይችላል (ሀ፣ ሐ፣ ሓ፣ ኀ፣ ኃ፣ ኻ)። የሞክሼ ሆሄያትን አጠቃቀም ለማጥናት ወደ 5 ሚሊዮን የሚደርሱ ዐረፍተ ነገሮችን ሰብስበን በመተንተን Word2Vec እና FastText የተሰኙትን የ“ኢምቤዲንግ ሞዴሎችን” ገንብተናል። ይህ የጥናት እና የምርምር ሙከራችን የሚከተሉትን ውጤቶች ያሳያል፡-

1. የሞክሼ ሆሄያት አጠቃቀም በአብዛኛው በዘፈቀደ የሚጻፉ መሆኑ፣

2. ሞክሼ ሆሄያትን በአንድ ፊደል ብቻ እንዲወከሉና እንዲለመድ የሚደረገው ጥረት በኢምቤዲንግ ሞዴሎች ላይ የቃላት ቅመራ ሲደረግ አሉታዊ ተጽዕኖ አለው።"
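
To make the normalization step concrete, here is a small illustrative sketch (ours, not the authors' code) that collapses the homophone family above to a single canonical character; the paper's finding is that applying such a mapping can hurt embedding-based word similarity.

    # Illustrative homophone normalization for Amharic (not the authors' code):
    # collapse one homophone family to a canonical character.
    HOMOPHONES = {"ሐ": "ሀ", "ሓ": "ሀ", "ኀ": "ሀ", "ኃ": "ሀ", "ኻ": "ሀ", "ሃ": "ሀ"}

    def normalize_homophones(text: str) -> str:
        return "".join(HOMOPHONES.get(ch, ch) for ch in text)

    print(normalize_homophones("ሃብታም"))  # -> "ሀብታም"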

Extended Parallel Corpus for Amharic-English Machine Translation paper video
Authors: Andargachew Mekonnen Gezmu, Andreas Nürnberger, Tesfaye Bayu Bati

This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of a low-resource language, Amharic. The corpus is larger than previously compiled corpora and is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.


Sexism detection: The first corpus in Algerian dialect with code-switching in Arabic/French and English paper video
Authors: Imane Guellil, Ahsen Adeel, Faical Azouaou, Mohamed Boubred, Yousra Houichi and Akram Abdelhaq Moumna

In this paper, an approach for detecting hate speech against women in the Arabic community on social media (e.g. YouTube) is proposed. In the literature, similar work has been presented for other languages such as English. However, to the best of our knowledge, not much work has been conducted for the Arabic language. A new hate speech corpus (Arabic_fr_en) is developed using three different annotators. For corpus validation, three different machine learning algorithms are used: a deep Convolutional Neural Network (CNN), a long short-term memory (LSTM) network and a bi-directional LSTM (Bi-LSTM) network. Simulation results demonstrate the best performance of the CNN model, which achieved an F1-score of up to 86% on the unbalanced corpus, compared to the LSTM and Bi-LSTM.


An Amharic News Text classification Dataset paper video
Authors: Israel Abebe Azime, Nebil Mohammed

In NLP, text classification is one of the primary problems we try to solve, and its uses in language analysis are indisputable. The lack of labeled training data makes it harder to do these tasks in low-resource languages like Amharic. The task of collecting, labeling, annotating, and making this kind of data valuable will encourage junior researchers, schools, and machine learning practitioners to implement existing classification models in their language. In this short paper, we introduce an Amharic text classification dataset that consists of more than 50k news articles categorized into 6 classes. The dataset is made available with easy baseline performances to encourage further studies and experiments.

Contextual Text Embeddings for Twi paper video
Authors: Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah and James Ben Hayfron-Acquah

Transformer-based language models have been changing the modern Natural Language Processing (NLP) landscape for high-resource languages such as English, Chinese, Russian, etc. However, this technology does not yet exist for any Ghanaian language. In this paper, we introduce the first of such models for Twi or Akan, the most widely spoken Ghanaian language. The specific contribution of this research work is the development of several pretrained transformer language models for the Akuapem and Asante dialects of Twi, paving the way for advances in application areas such as Named Entity Recognition (NER), Neural Machine Translation (NMT), Sentiment Analysis (SA) and Part-of-Speech (POS) tagging. Specifically, we introduce four different flavours of ABENA -- A BERT model Now in Akan that is fine-tuned on a set of Akan corpora, and BAKO - BERT with Akan Knowledge only, which is trained from scratch. We open-source the model through the Hugging Face model hub and demonstrate its use via a simple sentiment classification example.
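
As a hedged illustration of how such a hub-hosted model might be used, the sketch below loads a masked-language-model pipeline; the model identifier is an assumed placeholder, so check the NLP Ghana page on the Hugging Face hub for the names actually published.

    # Sketch of loading an ABENA-style masked language model from the
    # Hugging Face hub. The model id is an assumed placeholder; substitute
    # the checkpoint actually published by NLP Ghana.
    from transformers import pipeline

    MODEL_ID = "Ghana-NLP/abena-base-akuapem-twi-cased"  # assumed id
    unmasker = pipeline("fill-mask", model=MODEL_ID)
    print(unmasker("Me dɔ [MASK]."))  # top completions for a toy Twi prompt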

English-Twi Parallel Corpus for Machine Translation paper video
Authors: Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah and James Ben Hayfron-Acquah

We present a parallel machine translation training corpus for English and Akuapem Twi of 25,421 sentence pairs. We used a transformer-based translator to generate initial translations in Akuapem Twi, which were later verified and corrected where necessary by native speakers to eliminate any occurrence of translationese. In addition, 697 higher quality crowd-sourced sentences are provided for use as an evaluation set for downstream Natural Language Processing (NLP) tasks. The typical use case for the larger human-verified dataset is for further training of machine translation models in Akuapem Twi. The higher quality 697 crowd-sourced dataset is recommended as a testing dataset for machine translation of English to Twi and Twi to English models. Furthermore, the Twi part of the crowd-sourced data may also be used for other tasks, such as representation learning, classification, etc. We fine-tune the transformer translation model on the training corpus and report benchmarks on the crowd-sourced test set.

MENYO-20k: A Multi-domain English-Yorùbá Corpus for Machine Translation and Domain Adaptation paper video
Authors: David I. Adelani, Dana Ruiter, Jesujoba O. Alabi, Damilola Adebonojo, Adesina Ayeni, Mofe Adeyemi, Ayodele Awokoya, Cristina España-Bonet

Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus for the low-resource Yorùbá--English (yo--en) language pair with standardized train-test splits for benchmarking. We provide several neural MT (NMT) benchmarks on this dataset and compare them to the performance of popular pre-trained (massively multilingual) MT models, showing that, in almost all cases, our simple benchmarks outperform the pre-trained MT models. A major gain of BLEU +9.9 and +8.6 (en2yo) is achieved in comparison to Facebook's M2M-100 and Google multilingual NMT respectively when we use MENYO-20k to fine-tune generic models.

MasakhaNER: Named Entity Recognition for African Languages paper video
Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, Salomey Osei

We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.

Fast Development of ASR in African Languages using Self Supervised Speech Representation Learning paper video
Authors: Jama Mohamud, Lloyd Thompson, Aissatou Ndoye, Laurent Besacier

"This paper is an informal collaboration launched during the African Master of Machine Intelligence (AMMI) program in June 2020. After a series of lectures and labs on speech data collection using mobile applications and on self-supervised representation learning from speech, a small group of students and the lecturer continued working on automatic speech recognition (ASR) projects for three languages: Wolof, Ga, and Somali.

This paper describes the data collection process and ASR systems developed with a small amount (1h) of transcribed speech as training data. In these low resource conditions, pretraining a model on a large amount of raw speech is fundamental for developing an efficient ASR system."

Sentiment Classification in Swahili Language Using Multilingual BERT paper video
Authors: Gati Martin, Medard Mswahili, Young-Seob Jeong

"The evolution of the Internet has increased the amount of information that is expressed by people on different platforms.

This information can be product reviews, discussions on forums, or social media platforms.

Accessibility of these opinions and people’s feelings open the door to opinion mining and sentiment analysis.

As language and speech technologies become more advanced, many languages have been used and the best models have been obtained.

However, due to linguistic diversity and lack of datasets, African languages have been left behind.

In this study, by using the current state of the art model, multilingual BERT, we perform sentiment classification on Swahili datasets created by extracting and annotating 8.2k reviews and comments on different social media platforms.

The data were classified as either positive or negative.

The model was fine-tuned and achieve the best accuracy of 87.59\%."
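
A minimal fine-tuning sketch in the spirit of this setup, using the Hugging Face transformers and datasets libraries; the CSV file, column names, and hyperparameters are illustrative assumptions, not the authors' configuration.

    # Sketch (assumptions noted above): fine-tune multilingual BERT for
    # binary Swahili sentiment classification.
    from datasets import load_dataset
    from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)

    name = "bert-base-multilingual-cased"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForSequenceClassification.from_pretrained(name, num_labels=2)

    # Hypothetical CSV with "text" and "label" (0 = negative, 1 = positive).
    ds = load_dataset("csv", data_files="swahili_reviews.csv")["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length", max_length=128),
                batched=True)

    Trainer(model=model,
            args=TrainingArguments(output_dir="mbert-swahili",
                                   num_train_epochs=3),
            train_dataset=ds).train()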

Graph Convolutional Network for Swahili News Classification paper video
Authors: Alexandros Kastanos, Tyler Martin

This work empirically demonstrates the ability of Text Graph Convolutional Network (Text GCN) to outperform traditional natural language processing benchmarks for the task of semi-supervised Swahili news classification. In particular, we focus our experimentation on the sparsely-labelled semi-supervised context which is representative of the practical constraints facing low-resourced African languages. We follow up on this result by introducing a variant of the Text GCN model which utilises a bag of words embedding rather than a naive one-hot encoding to reduce the memory footprint of Text GCN whilst demonstrating similar predictive performance.

Ìtàkúròso: DialoGPT for Natural Language Generation of Yorùbá Dialog paper video
Authors: Tosin Adewumi, Aremu Anuoluwapo, Ahmed Baruwa, Olubukola Peters, Tolulope Ogunremi, Foteini Liwicki, Marcus Liwicki

In this work, we perform an empirical study of natural language generation (NLG) of dialogues for a low-resource language. We do so by fine-tuning DialoGPT-medium, a state-of-the-art (SotA) model pre-trained on English, with Yorùbá as the target language. Two variants of the Yorùbá language are evaluated (diacritized and undiacritized). The results, in terms of low perplexity and human evaluation, show that good performance can be achieved for Yorùbá when fine-tuning the DialoGPT model, even though it was pre-trained on English. As a further contribution to the research community, we make the dialogue dataset available, given that data, particularly for low-resource languages, can be hard to come by.
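
For context, the following sketch shows the usual DialoGPT generation loop with the English-pretrained checkpoint; after fine-tuning on Yorùbá turns as the paper does, the same code applies (the prompt here is only a toy example).

    # Sketch: one-turn reply generation with DialoGPT-medium.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("microsoft/DialoGPT-medium")
    model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-medium")

    ids = tok.encode("Bawo ni?" + tok.eos_token, return_tensors="pt")  # toy prompt
    out = model.generate(ids, max_length=64, pad_token_id=tok.eos_token_id)
    print(tok.decode(out[0, ids.shape[-1]:], skip_special_tokens=True))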

BembaSpeech: A Speech Recognition Corpus for the Bemba Language paper video
Authors: Claytone Sikasote, Antonios Anastasopoulos

"We present a preprocessed, ready-to-use automatic speech recognition corpus, BembaSpeech, consisting over 24 hours of read speech in the Bemba language, a written but low-resourced language spoken by over 30% of the population in Zambia. To assess its usefulness for training and testing ASR systems for Bemba, we train an end-to-end Bemba

ASR system by fine-tuning a pre-trained DeepSpeech English model on the training portion

of the BembaSpeech corpus. Our best model achieves a word error rate (WER) of 54.78%.

The results show that the corpus can be used for building ASR systems for Bemba."

---

"Ili ipepala lilelanda pamashiwi mu Cibemba na ifyebo fyalembwa ifyabikwa pamo nga mashiwi yakopwa elyo na yalembwa ukupanga iileitwa BembaSpeech. iikwete amashiwi ayengabelengwa ukufika kuma awala amakumi yabili na yane mu lulimi lwa Cibemba, ululandwa na impendwa ya bantu ba mu Zambia ukufika cipendo ca 30%. Pakufwaisha ukumona ubukankala bwakubomfiwa mu mukupanga ifya mibombele ya ASR mu Cibemba, tupanga imibombele ya ASR iya mu Cibemba ukufuma pantendekelo ukufika na pampela, kubomfya elyo na ukuwaminisha icilangililo ca mibomfeshe yamashiwi na ifyebo ifyabikwa pamo mu Cisungu icitwa DeepSpeech na ukupangako iciputulwa ca mashiwi na ifyebo fyalembwa mu Cibemba (BembaSpeech corpus). Imibobembele yesu iyakunuma ilangisha icipimo ca kupusa nelyo ukulufyanya kwa mashiwi ukwa 54.78%. Ifyakufumamo filangisha ukuti ifyalembwa kuti fyabomfiwa ukupanga imibombele ya ASR mu Cibemba.

"

Manually Annotated Spelling Error Corpus for Amharic paper video
Authors: Andargachew Mekonnen Gezmu, Binyam Ephrem Seyoum, Tirufat Tesifaye Lema, Andreas Nürnberger

This paper presents a manually annotated spelling error corpus for Amharic, a lingua franca in Ethiopia. The corpus is designed to be used for the evaluation of spelling error detection and correction. The misspellings are tagged as non-word and real-word errors. In addition, the contextual information available in the corpus makes it useful for dealing with both types of spelling errors.

NaijaNER: Comprehensive Named Entity Recognition for 5 Nigerian Languages paper video
Authors: Wuraola Fisayo Oyewusi, Olubayo Adekanmbi, Ifeoma Okoh, Vitus Onuigwe, Mary Idera Salami, Opeyemi Osakuade, Sharon Ibejih, Usman Abdullahi Musa

Most common applications of Named Entity Recognition (NER) are in English and other highly available languages. In this work, we present our findings on Named Entity Recognition for 5 Nigerian languages (Nigerian English, Nigerian Pidgin English, Igbo, Yoruba and Hausa). These languages are considered low-resourced, and very little openly available Natural Language Processing work has been done in most of them. Individual NER models were trained and metrics recorded for each of the languages. We also worked on a combined model that can handle NER for any of the five languages. The combined model works well for NER on each of the languages, with better performance than individual NER models trained specifically on annotated data for the specific language. The aim of this work is to share what we learned about how information extraction using Named Entity Recognition can be optimized for the listed Nigerian languages for inclusion, ease of deployment in production, and reusability of models. Models developed during this project are available on GitHub (https://git.io/JY0kk) and as an interactive web app (https://nigner.herokuapp.com/).

Congolese Swahili Machine Translation for Humanitarian Response paper video
Authors: Alp Öktem, Eric DeLuca, Rodrigue Bashizi, Eric Paquin, Grace Tang

In this paper we describe our efforts to make a bidirectional Congolese Swahili (SWC) to French (FRA) neural machine translation system with the motivation of improving humanitarian translation workflows. For training, we created a 25,302-sentence general domain parallel corpus and combined it with publicly available data. Experimenting with low-resource methodologies like cross-dialect transfer and semi-supervised learning, we recorded improvements of up to 2.4 and 3.5 BLEU points in the SWC–FRA and FRA–SWC directions, respectively. We performed human evaluations to assess the usability of our models in a COVID-domain chatbot that operates in the Democratic Republic of Congo (DRC). Direct assessment in the SWC–FRA direction demonstrated an average quality ranking of 6.3 out of 10 with 75% of the target strings conveying the main message of the source text. For the FRA–SWC direction, our preliminary tests on post-editing assessment showed its potential usefulness for machine-assisted translation. We make our models, datasets containing up to 1 million sentences, our development pipeline, and a translator web-app available for public use.

AfriKI: Machine-in-the-Loop Afrikaans Poetry Generation paper video
Authors: Imke van Heerden, Anil Bas

This paper proposes a generative language model called AfriKI. Our approach is based on an LSTM architecture trained on a small corpus of contemporary fiction. With the aim of promoting human creativity, we use the model as an authoring tool to explore machine-in-the-loop Afrikaans poetry generation. To our knowledge, this is the first study to attempt creative text generation in Afrikaans.

Poster Session #2: 14:00 - 15:00 CET

NLP for Ghanaian Languages paper video
Authors: Paul Azunre, Salomey Osei, Salomey Addo, Lawrence Asamoah Adu-Gyamfi, Stephen Moore, Bernard Adabankah, Bernard Opoku, Clara Asare-Nyarko, Samuel Nyarko, Cynthia Amoaba, Esther Dansoa Appiah, Felix Akwerh, Richard Nii Lante Lawson, Joel Budu, Emmanuel Debrah, Wisdom Ofori, Edwin Buabeng-Munkoh, Franklin Adjei, Isaac Kojo Essel Ampomah, Joseph Otoo, Reindorf Borkor, Standylove Birago Mensah, Lucien Mensah, Mark Amoako Marcel, Anokye Acheampong Amponsah and James Ben Hayfron-Acquah

NLP Ghana is an open-source non-profit organisation aiming to advance the development and adoption of state-of-the-art NLP techniques and digital language tools for Ghanaian languages and problems. In this paper, we first present the motivation and necessity for the efforts of the organisation by introducing some popular Ghanaian languages and presenting the state of NLP in Ghana. We then present the NLP Ghana organisation and outline its aims, scope of work, some of the methods employed, and contributions made thus far to the NLP community in Ghana.

Impacts of Homophone Normalization for Amharic Natural Language Processing paper video
Authors: Tadesse Destaw, Seid Muhie, Abinew Ayele, Getie Gelaye, Chris Biemann

(Abstract and Amharic summary listed under Poster Session #1 above.)

Translating the Unseen? Yorùbá-English Machine Translation (MT) in Low-Resource, Morphologically-Unmarked Settings paper video
Authors: Ife Adebara, Muhammad Abdul-Mageed, Miikka Silfverberg

"Translating between languages where certain features are marked morphologically in one but absent or marked contextually in the other is an important test case for machine translation. When translating into English which marks (in)definiteness morphologically, from Yorùbá which uses bare nouns but mark these features contextually, ambiguities arise. In this work, we perform fine-grained analysis on how an SMT system compares with two NMT systems (BiLSTM and Transformer) when translating bare nouns in Yorùbá into English. We investigate how the systems what extent they identify BNs, correctly translate them, and compare with human translation patterns. We also analyze the type of errors each model makes and provide a linguistic description of these errors. We glean insights for evaluating model performance in low-resource settings. In translating bare nouns, our results show the transformer model outperforms the SMT and BiLSTM models for 4 categories, the BiLSTM outperforms the SMT model for 3 categories while the SMT outperforms the NMT models for 1 category."

Mining Wikidata for Name Resources for African Languages paper video
Authors: Jonne Sälevä, Constantine Lignos

This work supports further development of language technology for the languages of Africa by providing a Wikidata-derived resource of name lists corresponding to common entity types (person, location, and organization). While we are not the first to mine Wikidata for name lists, our approach emphasizes scalability and replicability and addresses data quality issues for languages that do not use Latin scripts. We produce lists containing approximately 1.9 million names across 28 African languages. We describe the data, the process used to produce it, and its limitations, and provide the software and data for public use. Finally, we discuss the ethical considerations of producing this resource and others of its kind.
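
As an illustration of the general idea (not the authors' pipeline), one can pull person labels in a given language from the public Wikidata SPARQL endpoint:

    # Illustrative sketch (not the authors' pipeline): query the public
    # Wikidata SPARQL endpoint for labels of people in a given language.
    import requests

    QUERY = """
    SELECT ?personLabel WHERE {
      ?person wdt:P31 wd:Q5 .                # instance of: human
      ?person rdfs:label ?personLabel .
      FILTER(LANG(?personLabel) = "yo")      # Yoruba-language labels
    } LIMIT 100
    """
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": QUERY, "format": "json"})
    names = [b["personLabel"]["value"]
             for b in r.json()["results"]["bindings"]]
    print(names[:10])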

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets paper video
Authors: Isaac Caswell, Julia Kreutzer, Lisa Wang, Ahsan Wahab, Daan van Esch, Nasanbayar Ulzii-Orshikh, Allahsera Tapo, Nishant Subramani, Artem Sokolov, Claytone Sikasote, Monang Setyawan, Supheakmungkol Sarin, Sokhar Samb, Benoît Sagot, Clara Rivera, Annette Rios, Isabel Papadimitriou, Salomey Osei, Pedro Ortiz Suárez, Iroro Orife, Kelechi Ogueji, Rubungo Andre Niyongabo, Toan Nguyen, Mathias Müller, André Müller, Shamsuddeen Hassan Muhammad, Nanda Muhammad, Ayanda Mnyakeni, Jamshidbek Mirzakhalov, Tapiwanashe Matangira, Colin Leong, Nze Lawson, Sneha Kudugunta, Yacine Jernite, Mathias Jenny, Orhan Firat, Bonaventure F. P. Dossou, Sakhile Dlamini, Nisansa de Silva, Sakine Çabuk Ballı, Stella Biderman, Alessia Battisti, Ahmed Baruwa, Ankur Bapna, Pallavi Baljekar, Israel Abebe Azime, Ayodele Awokoya, Duygu Ataman, Orevaoghene Ahia, Oghenefego Ahia, Sweta Agrawal, Mofetoluwa Adeyemi

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, web-mined text datasets covering hundreds of languages. However, to date there has been no systematic analysis of the quality of these publicly available datasets, or whether the datasets actually contain content in the languages they claim to represent. In this work, we manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4), and audit the correctness of language codes in a sixth (JW300). We find that lower-resource corpora have systematic issues: at least 15 corpora are completely erroneous, and a significant fraction contains less than 50% of sentences of acceptable quality. Similarly, we find 82 corpora that are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-speakers of the languages in question, and supplement the human judgements with automatic analyses. Inspired by our analysis, we recommend techniques to evaluate and improve multilingual corpora and discuss the risks that come with low-quality data releases.

An Exploration of Data Augmentation Techniques for Improving English to Tigrinya Translation paper video
Authors: Lidia Kidane, Sachin Kumar, Yulia Tsvetkov

It has been shown that the performance of neural machine translation (NMT) drops starkly in low-resource conditions, often requiring large amounts of auxiliary data to achieve competitive results. An effective method of generating auxiliary data is back-translation of target language sentences. In this work, we present a case study of Tigrinya where we investigate several back-translation methods to generate synthetic source sentences. We find that in low-resource conditions, back-translation by pivoting through a higher-resource language related to the target language proves most effective, resulting in substantial improvements over baselines.
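
Schematically, pivot-based back-translation looks like the sketch below; the translate_* callables are hypothetical stand-ins for trained MT systems, since the paper does not publish an API.

    # Schematic sketch of back-translation through a related pivot language.
    # The translate_* functions are hypothetical stand-ins for trained models.
    def back_translate_via_pivot(target_sentences, translate_target_to_pivot,
                                 translate_pivot_to_source):
        """Turn monolingual target-side text into synthetic (source, target) pairs."""
        pairs = []
        for tgt in target_sentences:
            pivot = translate_target_to_pivot(tgt)   # e.g. Tigrinya -> Amharic
            src = translate_pivot_to_source(pivot)   # e.g. Amharic -> English
            pairs.append((src, tgt))                 # extra NMT training data
        return pairs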

Towards a parallel corpus of Portuguese and the Bantu language Emakhuwa of Mozambique paper video
Authors: Felermino Ali, Andrew Caines, Jaimito Malavi

Major advancement in the performance of machine translation models has been made possible in part thanks to the availability of large-scale parallel corpora. But for most languages in the world, the existence of such corpora is rare. Emakhuwa, a language spoken in Mozambique, is, like most African languages, low-resource in NLP terms. It lacks both computational and linguistic resources and, to the best of our knowledge, few parallel corpora including Emakhuwa exist. In this paper we describe the creation of the Emakhuwa-Portuguese parallel corpus, a collection of texts from the Jehovah's Witness website and a variety of other sources including the African Story Book website, the Universal Declaration of Human Rights, and Mozambican legal documents. The dataset contains 47,415 sentence pairs, amounting to 699,976 word tokens in Emakhuwa and 877,595 word tokens in Portuguese. Once the remaining normalization processes are complete, the corpus will be made freely available for research use.

Amharic Text Clustering Using Encyclopedic Knowledge with Neural Word Embedding paper video
Authors: Dessalew Yohannes, Yaregal Assabie

In this digital era, people in almost every discipline use automated systems that generate information represented in document format in different natural languages. As a result, there is a growing interest in better solutions for finding, organizing and analyzing these documents. In this paper, we propose a system that clusters Amharic text documents using Encyclopedic Knowledge (EK) with neural word embedding. EK enables the representation of related concepts, and neural word embedding allows us to handle the contexts of their relatedness. During the clustering process, all the text documents pass through preprocessing stages. Enriched text document features are extracted from each document by mapping to EK and the word embedding model, and a TF-IDF-weighted vector of the enriched features is generated. Finally, text documents are clustered using the popular spherical K-means algorithm. The proposed system is tested with an Amharic text corpus and Amharic Wikipedia data. Test results show that the use of EK with word embedding for document clustering improves the average accuracy over the use of EK alone. Furthermore, changing the class size has a significant effect on accuracy.
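
As a sketch of the clustering step described above: spherical K-means can be approximated as standard K-means over L2-normalized TF-IDF vectors, since unit-length rows make Euclidean distance monotone in cosine similarity (the file name and cluster count are illustrative, not the paper's settings).

    # Sketch: TF-IDF features + (approximate) spherical K-means via
    # L2 normalization. File name and cluster count are illustrative.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.preprocessing import normalize
    from sklearn.cluster import KMeans

    docs = open("amharic_docs.txt", encoding="utf-8").read().splitlines()
    X = normalize(TfidfVectorizer(max_features=20000).fit_transform(docs))
    labels = KMeans(n_clusters=10, n_init=10).fit_predict(X)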

Collaborative construction of lexicographic and parallel datasets for African languages: first assessment paper video
Authors: MBONING TCHIAZE Elvis

Faced with a considerable lack of resources in African languages for work in Natural Language Processing (NLP), Natural Language Understanding (NLU) and artificial intelligence, the research teams of the NTeALan association have set themselves the objective of building open-source platforms for the collaborative construction of lexicographic data in African languages. In this article, we present our first assessment after two years of collaborative construction of lexicographic resources useful for African NLP tools.

Text Normalization for Low-Resource Languages of Africa paper video
Authors: Andrew Zupon, Evan Crew, Sandy Ritchie

Training data for machine learning models can come from many different sources, which can be of dubious quality. For resource-rich languages like English, there is a lot of data available, so we can afford to throw out the dubious data. For low-resource languages where there is much less data available, we can’t necessarily afford to throw out the dubious data, in case we end up with a training set which is too small to train a model. In this study, we examine the effects of text normalization and data set quality for a set of low-resource languages of Africa—Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe our text normalizer which we built in the Pynini framework, a Python library for finite state transducers, and our experiments in training language models for African languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
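
Since the study trains language models with NLTK, here is a minimal sketch of that toolkit's n-gram LM API; the two-sentence toy corpus is our own illustration, not the study's data.

    # Sketch: a bigram language model with NLTK's lm module on a toy corpus.
    from nltk.lm import MLE
    from nltk.lm.preprocessing import padded_everygram_pipeline

    corpus = [["habari", "yako"], ["habari", "zenu"]]  # toy Swahili greetings
    train, vocab = padded_everygram_pipeline(2, corpus)
    lm = MLE(2)
    lm.fit(train, vocab)
    print(lm.score("yako", ["habari"]))  # P(yako | habari)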

The Role of Social Desirability Bias in Community Profiling: Modelling Interview Settings in Rural Uganda paper video
Authors: Costanza Conforti (Stanze), Stephanie Hirmer

Social desirability bias is a type of response bias which arises when the answers to a survey or interview are influenced by what the interviewee believes to be more socially acceptable. The effect of social desirability bias may be more pronounced in traditional communities in Low-Income Countries, where social pressures can be rife. While this type of bias has been well understood within the social sciences, it has not yet been systematically studied within the Natural Language Processing (NLP) field. In this paper, we make a first attempt to quantify the impact of social desirability on people's utterances in different contexts. To achieve this, we build on top of an existing corpus of annotated interviews from rural Ugandan communities. We then implement and test a range of BERT-based architectures which leverage information on the specific setting in which a given interview was collected. We find that models which are exposed to those variables tend to perform better. This suggests that the interview setting plays a relevant role in influencing what is being said, and opens interesting new research directions.

---

Social desirability bias kitegeeza okudibwamu okuvudde mukunonnyereza okukoledwa abakugu naye nga abawa endowooza bagobelela ebyo abantu abasinga byebakiririzame. Endowooza eno ey'okukiririza mwebyo abikiririzibwa abantu abasinga obungyi eri nyo mu mawanga aga kyakula olwesonga nti twesigama nnyo kwe'bbyo abalala byebakirirrizaamu. Wadde nga abasoma ebikyata nku bbela za'bantu bafubye okutegera nti waliwo ebbela ya abantu okwagala endowooza zabye obutawukana nazabalala, etabbi lya Natural Language Processing (NLP) telinakitekako nnyo sila. Okunonyereza kuno tukwesigamiza nnyo kumbeera z'omubantu ababera mu byalo mu Uganda, nga kino tukizimbidde kundowooza azatuweebwa okuva eri abantu abo. Endowooza za'bantu ze'kebegyezebwa mu mitedela egyengyawulo nga twesigama nku BERT-architectures ezituyamba okunnyonnyoka okwenjawulo nku bitundu abyo okunonnyereza mwekwakolebwa. Okwekebeja kwetwakola kukakasa nti ebyuma ebitendekebwa n'okumanya okwenjawulo ku bitundu abyo okunonnyereza mwekukolebwa bituyamba okola okusalawo okusinga kyebyo ebitandekwedwa n'okumanya kunno. Kino kitegeeza nti ebitundu okunonyereza mwekukolebwa biyina kinene kye bizanya kugeli abantu gyebawa endowooza zabwe.

Misinformation detection in Luganda-English code-mixed social media text paper video
Authors: Peter Nabende, David Kabiito, Hewitt Tusiime, Claire Babirye, Joyce Nakatumba-Nabende

"The increasing occurrence, forms, and negative effects of misinformation on social media platforms has necessitated more misinformation detection tools. Currently, work is being done addressing COVID-19 misinformation however, there are no misinformation detection tools for any of the 40 distinct indigenous Ugandan languages. This paper addresses this gap by presenting basic language resources and a misinformation detection data set based on code-mixed Luganda-English messages sourced from the Facebook and Twitter social media platforms. Several machine learning methods are applied on the misinformation detection data set to develop classification models for detecting whether a code-mixed Luganda-English message contains misinformation or not. A 10-fold cross validation evaluation of the classification methods in an experimental misinformation detection task shows that a Discriminative Multinomial Naive Bayes (DMNB) method achieves the highest accuracy and F-measure of 78.19% and 77.90% respectively. Also, Support Vector Machine and Bagging ensemble classification models achieve comparable results. These results are promising since the machine learning models are based on n-gram features from only the misinformation detection data set."

Investigating the utility of custom ASR architectures for existing African language corpora paper video
Authors: Ethan Morris, Robbie Jimerson, Emily Prud'hommeaux

The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases in ASR word error rates, enabling the use of this technology for interacting with smart phones and personal home assistants in high-resource languages like English or Mandarin. Developing ASR models of this caliber, however, requires thousands of hours of transcribed speech recordings, which presents challenges to the vast majority of the world's languages. In this paper, we apply a non-neural and a simple neural ASR architecture, as well as a fully convolutional ASR architecture originally developed for an endangered Native American language, to two under-resourced but widely spoken African languages, Wolof and Amharic. We explore the impact of typological features, audio recording quality, and speaker diversity on the accuracy of ASR output using these two architectures in order to learn ways in which the development of a low-resource ASR system can be customized to fit the characteristics of a particular corpus.

Domain-specific MT for Low-resource Languages: The case of Bambara-French paper video
Authors: Allahsera Auguste Tapo, Michael Leventhal, Sarah Luger, Christopher M. Homan, Marcos Zampieri

Translating to and from low-resource languages is a challenge for machine translation (MT) systems due to a lack of parallel data. In this paper we address the issue of domain-specific MT for Bambara, an under-resourced Mande language spoken in Mali. We present the first domain-specific parallel dataset for MT of Bambara into and from French. We discuss challenges in working with small quantities of domain-specific data for a low-resource language and we present the results of machine learning experiments on this data.

Did they direct the violence or admonish it? A cautionary tale on contronomy, androcentrism and back-translation foibles paper video
Authors: Vinay Prabhu, Ryan Teehan, Eniko Srivastava, Abdul Nimeri

"The recent raft of high-profile gaffes involving neural machine translation technology has brought to light the unreliability and brittleness of this fledgling technology. These revelations have worryingly coincided with two other developments: The rise of back-translated text being increasingly used to augment training data in so termed low-resource natural language processing scenarios (such as those in the African context) and the emergence of 'AI-enhanced legal-tech' as a panacea that promises 'disruptive democratization' of access to legal services. In the backdrop of these quandaries, we present this cautionary tale where we shed light on the specifics of the risks surrounding cavalier deployment of this technology by exploring two specific failings: Androcentrism and Enantiosemy.

In this regard, we empirically investigate the fate of pronouns and a list of contronyms when subjected to back-translation using the state-of-the-art Google translate API. Through this, we seek to highlight the prevalence of the defaulting-to-the-masculine phenomenon in the context of gendered profession-related translations and also empirically demonstrate the scale and nature of threats pertaining to contronymous phrases covering both current-affairs and legal issues."

AI4D - African Language Program paper video
Authors: Kathleen Siminyu, Godson Kalipe, Davor Orlic, Jade Abbott, Vukosi Marivate, Sackey Freshia, Prateek Sibal, Bhanu Neupane, David I. Adelani, Amelia Taylor, Jamiil Toure ALI, Kevin Degila, Momboladji Balogoun, Thierno Ibrahima DIOP, Davis David, Chayma Fourati, Hatem Haddad, Malek Naski

Advances in speech and language technologies enable tools such as voice search, text-to-speech, speech recognition and machine translation. These are, however, only available for high-resource languages like English, French or Chinese. Without foundational digital resources for African languages, which are considered low-resource in the digital context, these advanced tools remain out of reach. This work details the AI4D - African Language Program, a 3-part project that 1) incentivised the crowd-sourcing, collection and curation of language datasets through an online quantitative and qualitative challenge, 2) supported research fellows for a period of 3-4 months to create datasets annotated for NLP tasks, and 3) hosted competitive Machine Learning challenges on the basis of these datasets. Key outcomes of the work so far include 1) the creation of 9+ open-source African-language datasets annotated for a variety of ML tasks, and 2) the creation of baseline models for these datasets through the hosting of competitive ML challenges.

Phoneme Recognition through Fine-Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties paper video
Authors: Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David Mortensen, Michael R. Marlo, Graham Neubig

"Models pre-trained on multiple languages have shown significant promise for improving speech recognition, particularly for low-resource languages.

In this work, we focus on phoneme recognition using Allosaurus, a method for multilingual recognition based on phonetic annotation, which incorporates phonological knowledge through a language-dependent allophone layer that associates a universal narrow phone-set with the phonemes that appear in each language.

To evaluate in a challenging real-world scenario, we curate phone recognition datasets for Bukusu and Saamia, two varieties of the Luhya language cluster of western Kenya and eastern Uganda. To our knowledge, these datasets are the first of their kind. We carry out similar experiments on the dataset of an endangered Tangkhulic language, East Tusom, a Tibeto-Burman language variety spoken mostly in India.

We explore both zero-shot and few-shot recognition by fine-tuning using datasets of varying sizes (10 to 1000 utterances). We find that fine-tuning of Allosaurus, even with just 100 utterances, leads to significant improvements in phone error rates."
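
For orientation, the allosaurus toolkit exposes a small Python API for universal phone recognition, sketched below under the assumption that its documented interface is used; the audio file is a placeholder, and fine-tuning follows the toolkit's own documentation rather than this snippet.

    # Sketch: zero-shot universal phone recognition with the allosaurus
    # package; "sample.wav" is a placeholder audio file.
    from allosaurus.app import read_recognizer

    model = read_recognizer()             # loads the default universal model
    print(model.recognize("sample.wav"))  # sequence of predicted phones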

Canonical and Surface Morphological Segmentation for Nguni Languages paper video
Authors: Tumi Moeng, Sheldon Reay, Aaron Daniels, Jan Buys

Morphological segmentation involves decomposing words into morphemes, the smallest meaning-bearing units of language. This is an important NLP task for morphologically-rich agglutinative languages such as the Southern African Nguni language group. In this paper, we investigate supervised and unsupervised models for two variants of morphological segmentation: canonical and surface segmentation. We train sequence-to-sequence models for canonical segmentation, where the underlying morphemes may not be equal to the surface form of the word, and Conditional Random Fields (CRF) for surface segmentation. Transformers outperform LSTMs with attention on canonical segmentation, obtaining an average F1 score of 72.5% across 4 languages. Feature-based CRFs outperform bidirectional LSTM-CRFs to obtain an average of 97.1% F1 on surface segmentation. In the unsupervised setting, an entropy-based approach using a character-level LSTM language model fails to outperform a Morfessor baseline, while on some of the languages neither approach performs much better than a random baseline. We hope that the high performance of the supervised segmentation models will help to facilitate the development of better NLP tools for Nguni languages.

---

Morfologiese segmentasie is die taak om woorde op te breek in morfeme, die kleinste betekenisdraende eenhede van 'n taal. Dit is 'n belangrike taak vir Natuurlike Taalverwerking (NTV) vir agglutinatiewe tale met 'n ryk morfologie, soos die Suider-Afrikaanse Nguni tale. In hierdie artikel ondersoek ons afgerigte en onafgerigte modelle vir twee formulerings van morfologiese segmentasie: kanonieke en oppervlaksegmentasie. Ons rig ry-na-ry modelle af vir kanonieke segmentasie, waar die onderliggende morfeme nie noodwendig gelyk is aan die oppervlaksvorm van die woord nie, asook Voorwaardelike Kansfelde (VFKs) vir oppervlaksegmentasie. Transformators doen beter as Lang Korttermyn Geheue netwerke (LKTGs) met aandag op kanonieke segmentasie, met 'n gemiddelde F1 van 72.5% oor 4 tale. Kenmerk-gebaseerde VFKs doen beter as twee-rigting LKTG-VFKs, met 'n gemiddelde F1 van 97.1% vir oppervlaksegmentasie. In die onafgerigte opstelling slaag 'n entropie-gebaseerde benadering met 'n karakter LKTG taalmodel nie daarin om beter te doen as 'n Morfessor basismodel nie, terwyl vir sommige tale nie een van hierdie benaderings beter doen as 'n kansbenadering nie. Ons hoop dat die hoë akkuraatheid van die afgerigte segmentasiemodelle sal help om die ontwikkeling van beter NTV toepassings vir Nguni tale te bevorder.

Low-Resource Language Modelling of South African Languages paper video
Authors: Stuart Mesham, Luc Hayward, Jared Shapiro, Jan Buys

Language models are the foundation of current neural network-based models for natural language understanding and generation. However, research on the intrinsic performance of language models on African languages has been extremely limited, which is made more challenging by the lack of large or standardised training and evaluation sets that exist for English and other high-resource languages. In this paper, we evaluate the performance of open-vocabulary language models on low-resource South African languages, using byte-pair encoding to handle the rich morphology of these languages. We evaluate different variants of n-gram models, feedforward neural networks, recurrent neural networks (RNNs), and Transformers on small-scale datasets. Overall, well-regularized RNNs give the best performance across two isiZulu datasets and one Sepedi dataset. Multilingual training further improves performance on these datasets. We hope that this research will open new avenues for research into multilingual and low-resource language modelling for African languages.
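
As an illustration of the byte-pair-encoding step, here is a minimal sentencepiece sketch; the file names, vocabulary size, and example sentence are our assumptions, not the paper's settings.

    # Sketch: training and applying a BPE subword model with sentencepiece.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(input="isizulu_train.txt",
                                   model_prefix="zul_bpe",
                                   vocab_size=8000, model_type="bpe")

    sp = spm.SentencePieceProcessor(model_file="zul_bpe.model")
    print(sp.encode("ngiyabonga kakhulu", out_type=str))  # subword pieces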

Best Paper Awards 🏆

Three papers have been selected to receive a best paper award of $100 each, sponsored by Naver Labs Europe.


MasakhaNER: Named Entity Recognition for African Languages paper video
Authors: David Ifeoluwa Adelani, Jade Abbott, Graham Neubig, Daniel D'souza, Julia Kreutzer, Constantine Lignos, Chester Palen-Michel, Happy Buzaaba, Shruti Rijhwani, Sebastian Ruder, Stephen Mayhew, Israel Abebe Azime, Shamsuddeen Muhammad, Chris Chinenye Emezue, Joyce Nakatumba-Nabende, Perez Ogayo, Anuoluwapo Aremu, Catherine Gitau, Derguene Mbaye, Jesujoba Alabi, Seid Muhie Yimam, Tajuddeen Gwadabe, Ignatius Ezeani, Rubungo Andre Niyongabo, Jonathan Mukiibi, Verrah Otiende, Iroro Orife, Davis David, Samba Ngom, Tosin Adewumi, Paul Rayson, Mofetoluwa Adeyemi, Gerald Muriuki, Emmanuel Anebi, Chiamaka Chukwuneke, Nkiruka Odu, Eric Peter Wairagala, Samuel Oyerinde, Clemencia Siro, Tobius Saul Bateesa, Temilola Oloyede, Yvonne Wambui, Victor Akinode, Deborah Nabagereka, Maurice Katusiime, Ayodele Awokoya, Mouhamadane MBOUP, Dibora Gebreyohannes, Henok Tilaye, Kelechi Nwaike, Degaga Wolde, Abdoulaye Faye, Blessing Sibanda, Orevaoghene Ahia, Bonaventure F. P. Dossou, Kelechi Ogueji, Thierno Ibrahima DIOP, Abdoulaye Diallo, Adewale Akinfaderin, Tendai Marengereke, Salomey Osei

Did they direct the violence or admonish it? A cautionary tale on contronomy, androcentrism and back-translation foibles paper video
Authors: Vinay Prabhu, Ryan Teehan, Eniko Srivastava, Abdul Nimeri

Congolese Swahili Machine Translation for Humanitarian Response paper video
Authors: Alp Öktem, Eric DeLuca, Rodrigue Bashizi, Eric Paquin, Grace Tang

Organizers

Kathleen Siminyu

AI4D Africa

Julia Kreutzer

Google Research

Hady Elsahar

Naver Labs Europe

Vukosi Marivate

University of Pretoria

Nishant Subramani

Intel Labs

Jade Abbott

Retro Rabbit

Bernardt Duvenhage

Praekelt Consulting

Program Committee

Big thanks to these folks for their appreciative, kind and constructive reviews!

  • Muhammad Abdul-Mageed

  • David Adelani

  • Tosin Adewumi

  • Orevaoghene Ahia

  • Adewale Akinfaderin

  • Ali Alavi

  • Eleftherios Avramidis

  • Laurent Besacier

  • Jan Buys

  • Isaac Caswell

  • Ernie Chang

  • Sunipa Dev

  • Bonaventure F. P. Dossou

  • Nouha Dziri

  • Angela Fan

  • Spandana Gella

  • Yacine Jernite

  • Armand Joulin

  • Nishant Kambhatla

  • Surafel Melaku Lakew

  • Constantine Lignos

  • Kosisochukwu Madukwe

  • Khalil Mrini

  • Shamsuddeen Muhammad

  • Mathias Müller

  • Wilhelmina Nekoto

  • Vassilina Nikoulina

  • Kelechi Ogueji

  • Arshath Ramkilowan

  • Machel Reid

  • Alex Rudnick

  • Marzieh Saeidi

  • Rachael Tatman

  • Alicia Tsai

  • Francis Tyers

  • Olamilekan Wahab

  • Seid Muhie Yimam

  • Alp Öktem

  • Ignatius Ezeani

  • Orhan Firat

Slack Workspace

You're invited to join our Slack workspace. Meet other participants, and find collaborators, mentors and advice there. Organizers will be available on Slack to answer questions regarding submissions, format, topics, etc.

If you have any doubt about whether you can contribute to this workshop (e.g. you have never written a paper, are new to NLP, do not have any collaborators, or do not know LaTeX), please join the Slack and contact us there as well.

Sponsors