AfricaNLP 2023 Workshop
African NLP in the Era of Large Language Models
(Co-located with ICLR 2023, 5th May 2023)
Radisson Blu Hotel and Convention Center, Kigali, Rwanda, Africa
About the Workshop
Over 1 billion people live in Africa, and its residents speak more than 2,000 languages. But those languages are among the least represented in NLP research, and work on African languages is often sidelined at major venues. In 2022, the wave of large language models built through collaborative networks and large investments in compute came to the shores of African languages, with the release of large multilingual models such as BLOOM and NLLB-200 for machine translation. While those models have been publicly open-sourced, their impact on the community of African NLP researchers is yet to be assessed and deserves wider discussion. This has inspired the theme for the 2023 workshop: African NLP in the Era of Large Language Models.
The workshop has several aims:
to invite a variety of speakers from industry, research networks, and academia to share their perspectives on the development of large language models and on how African languages have and have not been represented in this work
to provide a venue to discuss the benefits and potential harms of these language models for the speakers of African languages and for African researchers
to enable positive interaction between academic, industry, and independent researchers around this theme, and to encourage collaboration and engagement for the benefit of the African continent
to foster further relationships between the African linguistics and NLP communities, since linguistic input about African languages is key to the evaluation and development of African models
to showcase work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low-resource languages
to promote multidisciplinarity within the African NLP community, with the goal of creating a holistic, participatory NLP community that will produce NLP research and technologies that value fairness, ethics, decolonial theory, and data sovereignty
to provide a platform for the groups involved with the various projects to meet, interact, share, and forge closer collaboration
to provide a platform for junior researchers to present papers and solutions, and to begin interacting with the wider NLP community
to present an opportunity for more experienced researchers to further publicize their work and inspire younger researchers through keynotes and invited talks
This workshop follows the previous successful editions in 2020, 2021, and 2022. It will be hybrid and co-located with ICLR 2023. No paper will be automatically desk-rejected :).
Important Dates
Submission Deadline: 5th February, 2023 (AoE)
Acceptance Notifications: 3rd March, 2023 (AoE)
Camera-ready: 15th April, 2023 (AoE)
Workshop date: 5th May, 2023 in Kigali, Rwanda & Virtual
Speakers
Perez Ogayo
Perez Ogayo is a master's student at Carnegie Mellon University's Language Technologies Institute (LTI). Prior to her studies at Carnegie Mellon, she received her BSc in Computer Science from African Leadership University, Rwanda. Perez's research pursuits lie in the realm of multilingual and low-resource natural language processing (NLP), where she focuses on machine translation, speech synthesis and recognition, and NLP for endangered languages. Additionally, she is interested in the efficient deployment of NLP models on smaller devices, as she recognizes the importance of accessibility and sustainability in the field. Alongside her studies at Carnegie Mellon, Perez currently serves as a researcher at Masakhane, where she works on the Luo, Swahili, and Suba languages.
Elizabeth Salesky
Elizabeth Salesky is a Ph.D. student at Johns Hopkins University, advised by Philipp Koehn and Matt Post. Her research primarily focuses on language representations for machine translation and multilinguality, including how to create models that are more data-efficient and robust to variation across languages and data sources.
Dr. Seid Muhie Yimam
Dr. Seid Muhie Yimam is currently a technical lead at HCDS and a research associate at the Language Technology Group, under the supervision of Prof. Chris Biemann. At HCDS, he mostly leads and consults on digital humanities research involving big-data processing of textual content. He continues to teach NLP and data science courses in-house while supervising students on interdisciplinary AI and data science research topics. He is currently participating in the development of a research data and knowledge management project, an intersectional project spanning knowledge management, AI, and library science. The project is envisioned to automatically ingest metadata from research reports and projects from diverse sources and to present the outcomes using appealing visualization components.
He has been working as a postdoctoral researcher at the Language Technology Group, UHH, since January 2020. He received his Ph.D. degree from the Universität Hamburg, with a specialization in the integration of adaptive machine learning models into annotation tools and NLP applications. From January 2020 to March 2022, he worked on multiple research topics, including social media NLP (hate speech detection, fake news identification, and sentiment analysis) and low-resource NLP research, mostly for the Ethiopian language Amharic, covering named entity recognition, semantic models, hate speech detection, and sentiment analysis. He has been teaching NLP courses and supervising Master's projects and theses in the group.
Paco Guzman
Paco is a Research Scientist Manager supporting translation teams at Meta AI (FAIR). He works in the field of machine translation with a focus on low-resource translation (e.g. NLLB, FLORES) and the aim of breaking language barriers. He joined Meta in 2016. His research has been published in top-tier NLP venues such as ACL and EMNLP.
He served as Research co-chair at AMTA (2020-2022). He has organized several research competitions focused on low-resource translation (including the WMT2022 shared task on African Languages) and data filtering. Paco obtained his PhD from the ITESM in Mexico, was a visiting scholar at the LTI-CMU from 2008-2009, and participated in DARPA's GALE evaluation program. Paco was a post-doc and scientist at the Qatar Computing Research Institute in Qatar from 2012 to 2016.
Laurent Besacier
Laurent Besacier is a principal scientist and Natural Language Processing (NLP) research team lead at Naver Labs Europe. Before that, he was a professor at the University Grenoble Alpes (UGA) from 2009, where he led the GETALP group (natural language and speech processing). Laurent is still affiliated with UGA.
His main research expertise and interests lie in the field of natural language processing, automatic speech recognition, machine translation, under-resourced languages, machine-assisted language documentation and the evaluation of NLP systems.
Asmelash Teka Hadgu
Asmelash Teka Hadgu is the Co-founder and CTO of Lesan and a fellow at the Distributed AI Research Institute (DAIR). At Lesan, he has built state-of-the-art machine translation systems to and from Amharic, Tigrinya, and English. Prior to Lesan, Asmelash did his Ph.D. at the Leibniz University Hannover, where his research focused on applied machine learning for applications in scholarly communication, crisis communication, and natural language processing in low-resource settings. Currently, as part of the Lesan-DAIR partnership, he is working on language technologies for Ge'ez-based languages such as Tigrinya and Amharic.
Audace Niyonkuru
Audace Niyonkuru is the Chief Executive Officer of Digital Umuganda, an AI and open data company focused on democratising access to information in African languages through the creation of open, publicly available datasets to spur AI research and innovation on the continent. He is also a member of the United Nations Internet Governance Forum Multistakeholder Advisory Group.
Schedule
Best Papers
We are pleased to announce the three best papers awarded at the AfricaNLP workshop 2023. The papers are:
David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Atnafu Lambebo Tonja, Christine Mwase et al. MasakhaNEWS: News Topic Classification for African languages
Shester Gueuwou, Kate Takyi, Mathias Müller, Marco Stanley Nyarko, Richard Adade, Rose-Mary Owusuaa Mensah Gyening. AfriSign: Machine Translation for African Sign Languages
Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, et al. AfriSenti: A Benchmark Twitter Sentiment Analysis Dataset for African Languages
Accepted Papers
Natural Language Understanding for African Languages [paper] [video]
Authors: Pierrette Mahoro Mastel, Ester Namara, Aime Munezero, Richard Kagame, Zihan Wang, Allan Anzagira, Akshat Gupta, Jema David Ndibwile
Abstract: Natural Language Understanding (NLU) is a fundamental building block of goal-oriented conversational AI. In NLU, the two key tasks are predicting the intent of the user's query and the corresponding slots. Most available NLU resources are for high-resource languages like English. In this paper, we address the limited availability of NLU resources for African languages, most of which are considered Low Resource Languages (LRLs), by presenting the first extension of one of the most widely used NLU datasets, the Airline Travel Information Systems (ATIS) dataset, to Swahili and Kinyarwanda. We perform baseline experiments using BERT, mBERT, RoBERTa, and XLM-RoBERTa under zero-shot settings and achieve promising results. We release the datasets and the annotation tool used for utterance slot labeling to the community to further NLU research for African languages.
AfriSign: Machine Translation for African Sign Languages [paper] [video]
Authors: Shester Gueuwou, Kate Takyi, Mathias Müller, Marco Stanley Nyarko, Richard Adade, Rose-Mary Owusuaa Mensah Gyening
Abstract: Sign language translation is an active area of research with the main goal of bridging the communication gap between deaf and hearing individuals. In Natural Language Processing (NLP), there is a growing interest in this task, leading to new datasets and research on translation approaches. But while there has been significant progress for sign languages from high-income countries, minimal research has been conducted on African sign language translation. In this paper, we curate a novel dataset of African sign languages, with a focus on machine translation as the main application. The dataset contains English Bible verses and videos with translations into six different African sign languages. Using this dataset, we report experiments on African sign language machine translation, including baseline Transformer systems, multilingual training and cross-lingual transfer learning.
Kinyarwanda TTS: Using a multi-speaker dataset to build a Kinyarwanda TTS model [paper] [video]
Authors: Samuel Rutunda, Kleber Kabanda, Adriana Stan
Abstract: The field of text-to-speech (TTS) technology has been rapidly advancing in recent years and has become an increasingly important part of our lives. This presents an opportunity for Africa, especially in facilitating access to information for many vulnerable socio-economic groups. However, the lack of high-quality datasets is a major hindrance. In this work, we create a dataset based on recordings of the Bible. Using an existing Kinyarwanda speech-to-text model, we were able to segment and align the speech and the text, and we then created a multi-speaker Kinyarwanda TTS model.
Improving African Language Identification with Multi-task Learning [paper] [video]
Authors: Ife Adebara, AbdelRahim A. Elmadany, Muhammad Abdul-Mageed
Abstract: We present AfroLIDv2.0, a multi-task neural language identification toolkit for 517 African languages and varieties. The languages covered by AfroLIDv2.0 belong to 14 language families spoken across 50 African countries. To ensure the robustness of AfroLIDv2.0, we employ a multi-domain, multi-script dataset. Compared to a previous version of the tool (AfroLID), AfroLIDv2.0 is trained with a multi-task learning objective exploiting language family information: it performs language identification as the main task and language family identification as an auxiliary task. We demonstrate how our multi-task learning setup yields better performance than all previous work, allowing AfroLIDv2.0 to reach 96.44 F1 on our blind test set. Language identification is a core technology in NLP, and we hope that AfroLIDv2.0 will be a valuable contribution to multilingual NLP in general and African NLP in particular.
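For readers unfamiliar with this kind of multi-task setup, the sketch below illustrates the general pattern (it is not the AfroLIDv2.0 implementation): a shared encoder feeds two classifier heads, and the auxiliary family-identification loss is added to the main language-identification loss. Vocabulary size, dimensions, and the 0.3 loss weight are illustrative assumptions.

```python
# Minimal sketch of multi-task language identification (not AfroLIDv2.0):
# shared encoder, language-ID head (main task), family-ID head (auxiliary).
import torch
import torch.nn as nn

class MultiTaskLID(nn.Module):
    def __init__(self, vocab_size=5000, dim=128, n_langs=517, n_families=14):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # bag-of-ngrams encoder
        self.lang_head = nn.Linear(dim, n_langs)       # main task head
        self.family_head = nn.Linear(dim, n_families)  # auxiliary task head

    def forward(self, token_ids):
        h = self.embed(token_ids)                      # (batch, dim)
        return self.lang_head(h), self.family_head(h)

model = MultiTaskLID()
loss_fn = nn.CrossEntropyLoss()
tokens = torch.randint(0, 5000, (8, 20))  # toy batch of 8 token sequences
lang_y = torch.randint(0, 517, (8,))      # toy language labels
fam_y = torch.randint(0, 14, (8,))        # toy family labels
lang_logits, fam_logits = model(tokens)
# Joint objective: main-task loss plus a down-weighted auxiliary loss
# (the 0.3 weight is an arbitrary illustrative choice, not from the paper).
loss = loss_fn(lang_logits, lang_y) + 0.3 * loss_fn(fam_logits, fam_y)
loss.backward()
```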
Error Analysis of Tigrinya–English Machine Translation Systems [paper] [video]
Authors: Nuredin Ali Abdelkadir, Negasi Haile Abadi, Asmelash Teka Hadgu
Abstract: Machine translation (MT) is an important language technology that can democratize access to information. In recent years, we have seen some progress in the development and production deployment of MT systems for a handful of African languages. Evaluating the quality of such systems is fundamental to accelerating progress in this area. Tigrinya is a language spoken by more than 10 million native speakers, mainly in Tigray, Ethiopia and in Eritrea. In this work, we evaluated the current status of state-of-the-art MT systems that support the translation of Tigrinya to and from English: Google Translate, Microsoft Translator, and Lesan. We systematically collected a dataset for evaluating Tigrinya MT systems across four domains: Arts and Culture, Business and Economics, Politics, and Science and Technology. The dataset contains snippets from 806 articles gathered from diverse sources. We performed an in-depth analysis of the errors current systems make using the MQM-DQF standard error typology. We found that Mistranslation and Omission are the most frequent translation issues. We believe this work provides a methodology for evaluating other machine translation systems for low-resource languages, and we offer practical suggestions to improve current Tigrinya–English MT systems.
Tigrinya Dialect Identification [paper] [video]
Authors: Asfaw Gedamu Haileslasie, Asmelash Teka Hadgu, Solomon Teferra Abate
Abstract: Dialect Identification is an important topic of research in Natural Language Processing (NLP) as it has broad implications for many real-world applications such as machine translation, speech recognition, and chatbots, to name a few. In this work, we investigate Tigrinya dialect identification using machine learning techniques. To that end, we have identified three Tigrinya dialects, namely Z, L, and D. We then systematically collected datasets for each dialect. Finally, we performed experiments using classical machine learning and deep learning methods to quantify the effectiveness of current methods on the problem of Tigrinya dialect identification. The highest overall accuracy of 92.98% was achieved using character-level Convolutional Neural Networks (CNNs).
HausaNLP at SemEval-2023 Task 12: Leveraging African Low Resource Tweet Data for Sentiment Analysis [paper] [video]
Authors: Saheed Salahudeen Abdullahi, Falalu Ibrahim Lawan, Ahmad Mustapha Wali, Amina Abubakar Imam, Aliyu Rabiu Shuaibu, Yusuf Aliyu, Nur Bala Rabiu, Musa Bello, Shamsuddeen Umar Adamu, Saminu Mohammad Aliyu, Murja Sani Gadanya, Sanah Abdullahi Muaz, Mahmoud Said Ahmad, Abdulkadir Abdullahi, Abdulmalik Yusuf Jamoh
Abstract: We present the findings of SemEval-2023 Task 12, a shared task on sentiment analysis for low-resource African languages using Twitter datasets. The task featured three subtasks: subtask A is monolingual sentiment classification with 12 tracks, one per language; subtask B is multilingual sentiment classification using the tracks in subtask A; and subtask C is zero-shot sentiment classification. We present the results and findings of subtasks A, B, and C, and release our code on GitHub. Our goal is to leverage low-resource tweet data using the pre-trained Afro-xlmr-large, AfriBERTa-Large, Bert-base-arabic-camelbert-da-sentiment (Arabic-camelbert), Multilingual-BERT (mBERT), and BERT models for sentiment analysis of 14 African languages. The datasets for these subtasks consist of gold-standard, multi-class-labeled Twitter data for these languages. Our results demonstrate that the Afro-xlmr-large model performed better than the other models on most of the language datasets. Similarly, the Nigerian languages Hausa, Igbo, and Yoruba achieved better performance than the other languages, which can be attributed to the larger volume of data available for them.
IgboNER 2.0: Expanding Named Entity Recognition Datasets via Projection [paper] [video]
Authors: Chiamaka Ijeoma Chukwuneke, Paul Rayson, Ignatius Ezeani, Mo El-Haj, Doris Chinedu Asogwa, Chidimma Lilian Okpalla, Chinedu Emmanuel Mbonu
Abstract: Since the inception of state-of-the-art neural network models for natural language processing research, the major challenge faced by low-resource languages has been the lack or insufficiency of annotated training data. The named entity recognition (NER) task is no exception. The need for an efficient data creation and annotation process, especially for low-resource languages, cannot be over-emphasized. In this work, we leverage an existing NER tool for English in a cross-language projection method that automatically creates a mapping dictionary of entities in a source language and their translations in the target language using a parallel English-Igbo corpus. The resultant mapping dictionary, which was manually checked and corrected by human annotators, was used to automatically generate and format an NER training dataset from the Igbo monolingual corpus, thereby saving a lot of annotation time for the Igbo NER task. The generated dataset was also included in the training process, and our experiments show improved performance results over previous works.
MphayaNER: Named Entity Recognition for Tshivenda [paper] [video]
Authors: Rendani Mbuvha, David Ifeoluwa Adelani, Tendani Mutavhatsindi, Tshimangadzo Rakhuhu, Aluwani Mauda, Tshifhiwa Joshua Maumela, Andisani Masindi, Seani Rananga, Vukosi Marivate, Tshilidzi Marwala
Abstract: Named Entity Recognition (NER) plays a vital role in various Natural Language Processing tasks such as information retrieval, text classification, and question answering. However, NER can be challenging, especially in low-resource languages with limited annotated datasets and tools. This paper adds to the effort of addressing these challenges by introducing MphayaNER, the first Tshivenda NER corpus in the news domain. We establish NER baselines by fine-tuning state-of-the-art models on MphayaNER. The study also explores zero-shot transfer between Tshivenda and other related Bantu languages, with Setswana, chiShona and Kiswahili showing the best results. Augmenting MphayaNER with Setswana data was also found to improve model performance significantly. Both MphayaNER and the baseline models are made publicly available.
Fine-tuning Multilingual Pretrained African Language Models [paper] [video]
Authors: Rozina Lucy Myoya, Fiskani Banda, Vukosi Marivate, Abiodun Modupe
Abstract: With the recent increase in low-resource African language text corpora, there have been advancements that have led to the development of multilingual pre-trained language models (PLMs) based on African languages. These PLMs include AfriBerta, Afro-XLMR, and AfroLM, which perform significantly well. The downstream tasks of these models range from text classification and named-entity recognition to sentiment analysis. By exploring the idea of fine-tuning the different PLMs, these models can be trained on different African language datasets, which could lead to multilingual models that perform well on new data for the required downstream classification task. This leads to the question we attempt to answer: can these PLMs be fine-tuned to perform similarly well on different African language data?
Breaking the Low-Resource Barrier for Dagbani ASR: From Data Collection to Modeling [paper] [video]
Authors: Paul Azunre, Naafi Dasana Ibrahim
Abstract: Developing Automatic Speech Recognition (ASR) systems requires large amounts of high-quality speech data. However, for low-resourced African languages, collecting and annotating such data is challenging due to acute data scarcity and limited funding. As a result, building ASR technologies for these languages remains a daunting task. This paper addresses this challenge for Dagbani by presenting a data collection pipeline and process for a transcribed Dagbani audio dataset. Dagbani is an African language spoken predominantly in Ghana and in parts of northern Togo. We then apply the data to build the world's first Automatic Speech Recognition (ASR) system for Dagbani. We hope this methodology can serve as a blueprint or guideline for other similar efforts.
Yoruba and Unicode: An Overview of a Problem [paper] [video]
Authors: Kolawole Olatubosun
Abstract: There is a recurring problem in the writing of Yorùbá on the internet (or on the computer) that has proven intractable over the years. This problem applies to Igbo and other African languages that depend on diacritics for disambiguation, and it has to do with not just the application of diacritics themselves but the way words are eventually rendered on the screen after said diacritics have been applied or, in most cases, after such writings have been transferred to a platform different from the one where the original writing was done, e.g. from Microsoft Word to PDF. The work of Unicode has been fingered as having something to do with this problem -- a belief that has now been borne out by some fact and public confirmation -- but it also appears that the issue is more nuanced than just Unicode = bad. This paper discusses the problem with personal and public examples, in books and on the internet, to argue for a more holistic response. The paper discusses Yorùbá tonal ambiguities, covering the history of Yorùbá orthography from Ajayi Crowther through the work of Ayo Bamgbose to modern times. It covers the technology paradox through which solutions designed to provide inclusion have come to create more problems. It then examines the role of Unicode from its inception to date, and how it currently affects underserved languages like Yorùbá. The paper shows examples of books and web pages where these technology problems have caused misunderstandings and unintended consequences for intelligibility. It examines Unicode's explanations of its role in these problems, and weighs them against its work on emojis and other languages. It then mentions current solutions and interventions by others in the field, and concludes with suggestions on the way forward for languages like Yorùbá, which depend on diacritics and well-functioning tone-marking software for intelligibility.
African Substrates Rather Than European Lexifiers to Augment African-diaspora Creole Translation [paper] [video]
Authors: Nathaniel Romney Robinson, Matthew Dean Stutzman, Stephen D. Richardson, David R Mortensen
Abstract: Machine translation (MT) model training is difficult for low-resource languages, such as African-diaspora Creole languages, because of data scarcity. Cross-lingual data augmentation with knowledge transfer from related high-resource languages is a common technique to overcome this disadvantage; for instance, practitioners may transfer knowledge from a language in the same family as the low-resource language of interest. African-diaspora Creole languages are low-resource and have simultaneous relationships with multiple language groups. These languages, such as Haitian and Jamaican, are typically lexified by colonial European languages but are structurally similar to African languages. We explore the advantages of transferring knowledge from the European lexifier language versus the phylogenetic and typological relatives of the African substrate languages. We analysed Haitian and Jamaican MT, both controlling tightly for data properties across compared transfer languages and later allowing use of all the data we collected. Our inquiry demonstrates a significant advantage to using African transfer languages in some settings.
Multilingual Automatic Speech Recognition for Kinyarwanda, Swahili, and Luganda: Advancing ASR in Select East African Languages [paper] [video]
Authors: Moayad Elamin, Yonas Chanie, Paul Ewuzie, Samuel Rutunda
Abstract: This paper presents a multilingual Automatic Speech Recognition (ASR) model for three East African languages: Kinyarwanda, Swahili, and Luganda. The Common Voice project's African language datasets were used to produce a curated code-switched dataset of 3,900 hours on which the ASR model was trained. The work included validating the Kinyarwanda dataset and developing a model that achieves a 17.57 Word Error Rate (WER) on the language. The Kinyarwanda model was then fine-tuned across all three languages and achieved a WER of 21.91 on the three curated datasets, with a WER of 25.48 for Kinyarwanda, 17.22 for Swahili, and 21.95 for Luganda. The paper emphasizes the necessity of considering the African environment when developing effective ASR systems and the significance of supporting many languages when developing ASR for languages with limited resources.
Speech Recognition Datasets for Low-resource Congolese Languages [paper] [video]
Authors: Ussen Abre Kimanuka, Ciira wa Maina, Osman Büyük
Abstract: Large pre-trained Automatic Speech Recognition (ASR) models have begun to perform better on low-resource languages as a result of data availability and transfer learning. However, only a small number of languages have sufficient resources to benefit from transfer learning. This paper contributes to expanding speech recognition resources for under-represented languages. We release two new datasets to the research community: the Lingala Read Speech Corpus, consisting of 4 hours of labeled audio clips, and the Congolese Speech Radio Corpus, containing 741 hours of unlabeled audio in 4 major languages spoken in the Democratic Republic of the Congo. Additionally, we obtain state-of-the-art results for Congolese wav2vec2. We observe an average decrease of 2% in WER when a Congolese multilingual pre-trained model is used for fine-tuning on Lingala. Importantly, our study is the first attempt at benchmarking speech recognition systems for Lingala and the first-ever multilingual model for 4 Congolese languages spoken by a combined 65 million people. Our data and models will be publicly available, and we hope they help advance research in ASR for low-resource languages.
Multilingual Model and Data Resources for Text-To-Speech in Ugandan Languages [paper] [video]
Authors: Isaac Owomugisha, Benjamin Akera, Ernest Tonny Mwebaze, John Quinn
Abstract: We present new resources for text-to-speech in Ugandan languages. Studio-grade recordings in Luganda and English were captured for 2,413 and 2,437 utterances respectively (totaling 4,850 utterances representing 5 hours of speech). We show that this is sufficient to train high-quality TTS models which can generate natural-sounding speech in either language, or in combinations of both with code-switching. We also present results on training TTS for Luganda using crowdsourced recordings from Common Voice. Additional data collection is currently underway for the Acholi, Ateso, Lugbara and Runyankole languages. The data we describe is an extension of the SALT dataset, which already contains multi-way parallel translated text in six languages. The dataset and models described are publicly available at https://github.com/SunbirdAI/salt
Lexicon and Rule-based Word Lemmatization Approach for Somali Language [paper] [video]
Authors: Shafie Abdi Mohamed, Muhidin A. Mohamed
Abstract: Lemmatization is a Natural Language Processing (NLP) technique used to normalize text by changing morphological derivations of words to their root forms. It is used as a core pre-processing step in many NLP tasks including text indexing, information retrieval, and machine learning for NLP, among others. This paper pioneers the development of text lemmatization for the Somali language, a low-resource language with very limited or no prior effective adoption of NLP methods and datasets. We develop a lexicon and rule-based lemmatizer for Somali text, which is a starting point for a full-fledged Somali lemmatization system for various NLP tasks. Taking the language's morphological rules into consideration, we have developed an initial lexicon of 1247 root words and 7173 derivationally related terms, enriched with rules for lemmatizing words not present in the lexicon. We have tested the algorithm on 120 documents of various lengths, including news articles, social media posts, and text messages. Our initial results demonstrate that the algorithm achieves an accuracy of 57% for relatively long documents (e.g. full news articles), 60.57% for news article extracts, and a high accuracy of 95.87% for short texts such as social media messages.
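As an illustration of the lexicon-plus-rules pattern the abstract describes (not the authors' actual lexicon or rules, which are specific to Somali morphology), a minimal sketch might look like this: look the word up in a lexicon first, and fall back to ordered suffix-stripping rules otherwise.

```python
# Illustrative sketch of a lexicon- and rule-based lemmatizer (not the
# paper's implementation). Lexicon entries and suffix rules below are
# hypothetical placeholders, not real Somali morphology.

# Hypothetical lexicon: surface form -> root form
LEXICON = {
    "buugaag": "buug",    # placeholder entry
    "gabdho": "gabadh",   # placeholder entry
}

# Hypothetical (suffix, replacement) rules, tried in order
SUFFIX_RULES = [
    ("yaal", ""),
    ("ado", "o"),
    ("ta", ""),
]

def lemmatize(word: str) -> str:
    """Return the lemma via lexicon lookup, then rule fallback."""
    if word in LEXICON:
        return LEXICON[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a minimum stem length so short words are left alone
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word  # no rule matched: assume the word is already a root

print(lemmatize("buugaag"))  # lexicon hit -> "buug"
```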
AfriSenti: A Benchmark Twitter Sentiment Analysis Dataset for African Languages [paper] [video]
Authors: Shamsuddeen Hassan Muhammad, Idris Abdulmumin, Abinew Ali Ayele, Nedjma Ousidhoum, David Ifeoluwa Adelani, Seid Muhie Yimam, Ibrahim Said Ahmad, Meriem Beloucif, Saif M. Mohammad, Oumaima Hourrane, Pavel Brazdil, Felermino D. M. A. Ali, Davis David, Salomey Osei, Bello Shehu Bello, Falalu Ibrahim Lawan, Tajuddeen Gwadabe, Samuel Rutunda, Tadesse Belay, Bernard Opoku
Abstract: Africa, which is home to over 2000 languages from more than six language families, has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of datasets labeled by native speakers. In this paper, we introduce 14 sentiment-labeled Twitter datasets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yoruba) from four language families (Afro-Asiatic, English Creole, Indo-European, and Niger-Congo). We describe the data collection methodology, the annotation process, and the challenges encountered when curating each dataset. We also build sentiment classification baseline models on the datasets and discuss their usefulness.
Koya: A Recommender System for Large Language Model Selection [paper] [video]
Authors: Abraham Toluwase Owodunni, Chris Chinenye Emezue
Abstract: Pretrained large language models (LLMs) are widely used for various downstream tasks in different languages. However, selecting the best LLM (from a large set of potential LLMs) for a given downstream task and language is challenging and computationally expensive, making the efficient use of LLMs difficult for low-compute communities. To address this challenge, we present Koya, a recommender system built to assist researchers and practitioners in choosing the right LLM for their task and language without ever having to finetune the LLMs. Koya is built with the Koya Pseudo-Perplexity (KPPPL), our adaptation of pseudo-perplexity, and ranks LLMs in order of compatibility with the language of interest, making it easier and cheaper to choose the most compatible LLM. Evaluating Koya with five pretrained LLMs and three African languages (Yoruba, Kinyarwanda, and Amharic), we show an average recommender accuracy of 95%, demonstrating its effectiveness. Koya aims to offer an easy-to-use (through a simple web interface accessible at https://huggingface.co/spaces/koya-recommender/system), cost-effective, fast, and efficient tool to assist researchers and practitioners with low or limited compute access.
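To make the ranking idea concrete, here is a minimal sketch of pseudo-perplexity-based model ranking, in the spirit of (but not identical to) the paper's KPPPL: mask each token in turn, score the true token under the masked language model, and average the log-probabilities. The model names and sentence below are illustrative, not the set evaluated in the paper.

```python
# Minimal pseudo-perplexity sketch for ranking masked language models
# by compatibility with a target-language sentence. Lower is better.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def pseudo_perplexity(model_name: str, sentence: str) -> float:
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForMaskedLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    log_probs = []
    with torch.no_grad():
        for i in range(1, len(ids) - 1):  # skip special start/end tokens
            masked = ids.clone()
            masked[i] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, i]
            log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return float(torch.exp(-torch.tensor(log_probs).mean()))

# Example usage with illustrative model names and sentence.
candidates = ["xlm-roberta-base", "Davlan/afro-xlmr-base"]
sentence = "Ndagukunda cyane."  # example Kinyarwanda sentence
for name in sorted(candidates, key=lambda m: pseudo_perplexity(m, sentence)):
    print(name)
```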
How good are Commercial Large Language Models on African Languages? [paper] [video]
Authors: Jessica Ojo, Kelechi Ogueji
Abstract: Recent advancements in Natural Language Processing (NLP) have led to the proliferation of large pretrained language models. These models have been shown to yield good performance, using in-context learning, even on unseen tasks and languages. They have also been exposed as commercial APIs, as a form of language-model-as-a-service, with great adoption. However, their performance on African languages is largely unknown. We present a preliminary analysis of commercial large language models on two tasks (machine translation and text classification) across eight African languages, spanning different language families and geographical areas. Our results suggest that commercial language models produce below-par performance on African languages. We also find that they perform better on text classification than on machine translation. In general, our findings present a call to action to ensure African languages are well represented in commercial large language models, given their growing popularity.
AfroDigits: A Community-Driven Spoken Digit Dataset for African Languages [paper] [video]
Authors: Chris Chinenye Emezue, Sanchit Gandhi, Lewis Tunstall, Abubakar Abid, Joshua Meyer, Quentin Lhoest, Pete Allen, Patrick Von Platen, Douwe Kiela, Yacine Jernite, Julien Chaumond, Merve Noyan, Omar Sanseviero
Abstract: The advancement of speech technologies has been remarkable, yet their integration with African languages remains limited due to the scarcity of African speech corpora. To address this issue, we present AfroDigits, a minimalist, community-driven dataset of spoken digits for African languages, currently covering 38 African languages. As a demonstration of the practical applications of AfroDigits, we conduct audio digit classification experiments on six African languages [Igbo (ibo), Yoruba (yor), Rundi (run), Oshiwambo (kua), Shona (sna), and Oromo (gax)] using the Wav2Vec2.0-Large and XLS-R models. Our experiments reveal a useful insight into the effect of mixing African speech corpora during finetuning. AfroDigits is the first published spoken digit dataset for African languages, and we believe it will, among other things, pave the way for Afro-centric speech applications such as the recognition of telephone numbers and street numbers.
MasakhaNEWS: News Topic Classification for African languages [paper] [video]
Authors: David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Oluwadara Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo et al.
Abstract: African languages are severely under-represented in NLP research due to the lack of datasets covering several NLP tasks. While there are individual language-specific datasets that are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and the Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as little as 10 examples per label, we achieve more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points) by leveraging the PET approach.
ε kú <MASK>: Integrating Yorùbá Cultural Greetings into Machine Translation [paper] [video]
Authors: Idris Akinade, Jesujoba Oluwadara Alabi, David Ifeoluwa Adelani, Clement Oyeleke Odoje, Dietrich Klakow
Abstract: This paper investigates the performance of massively multilingual neural machine translation (NMT) systems in translating Yorùbá greetings (ε kú <MASK>), which are a big part of Yorùbá language and culture, into English. To evaluate these models, we present IkiniYorùbá, a Yorùbá-English translation dataset containing some Yorùbá greetings and sample use cases. We analysed the performance of different multilingual NMT systems, including Google and NLLB, and show that these models struggle to accurately translate Yorùbá greetings into English. In addition, we trained a Yorùbá-English model by finetuning an existing NMT model on the training split of IkiniYorùbá, and this achieved better performance than the pre-trained multilingual NMT models, even though the latter were trained on much larger volumes of data.
AfriNames: Most ASR models "butcher" African Names [paper] [video]
Authors: Tobi Olatunji, Tejumade Afonja, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Chris Chinenye Emezue, Amina Mardiyyah Rufai, Sahib Singh
Abstract: Useful conversational agents must accurately capture named entities to minimize error for downstream tasks, for example, asking a voice assistant to play a track from a certain artist, initiating navigation to a specific location, or documenting a diagnosis result for a specific patient. However, where named entities such as "Ukachukwu" (Igbo), "Lakicia" (Swahili), or "Ingabire" (Rwandan) are spoken, automatic speech recognition (ASR) models' performance degrades significantly, propagating errors to downstream systems. We model this problem as a distribution shift and demonstrate that such model bias can be mitigated through multilingual pre-training, intelligent data augmentation strategies to increase the representation of African named entities, and fine-tuning multilingual ASR models on multiple African accents. The resulting fine-tuned models show an 86.4% relative improvement compared with the baseline on samples with African named entities.
Evaluating the Robustness of Machine Reading Comprehension Models to Low Resource Entity Renaming [paper] [video]
Authors: Clemencia Siro, Tunde Oluwaseyi Ajayi
Abstract: Question answering (QA) models have shown compelling results in the task of Machine Reading Comprehension (MRC). Recently these systems have proved to perform better than humans on held-out test sets of datasets such as SQuAD, but their robustness is not guaranteed. A QA model's brittleness is exposed by the performance drop seen when it is evaluated on adversarially generated examples. In this study, we explore the robustness of MRC models to entity renaming, with entities from low-resource regions such as Africa. We propose EntSwap, a method for test-time perturbation, to create a test set whose entities have been renamed. In particular, we rename entities of type country, person, nationality, location, organization, and city to create AfriSQuAD2. Using the perturbed test set, we evaluate the robustness of three popular MRC models. We find that, compared to base models, large models perform comparatively well on novel entities. Furthermore, our analysis indicates that the person entity type poses the greatest challenge to model performance.
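As a toy illustration of the test-time entity-renaming idea (not the authors' EntSwap implementation), the sketch below replaces typed entity mentions in a passage with names from low-resource regions; the type-aligned mapping is a hypothetical example.

```python
# Toy test-time entity renaming in the spirit of EntSwap (not the
# paper's code): swap entity mentions for type-aligned African names.
import re

# Hypothetical type-aligned replacements: original -> African entity
ENTITY_MAP = {
    "person":  {"John Smith": "Ingabire Uwase"},
    "country": {"France": "Rwanda"},
    "city":    {"Paris": "Kigali"},
}

def entswap(passage: str) -> str:
    """Rename every mapped entity mention in the passage."""
    for mapping in ENTITY_MAP.values():
        for original, replacement in mapping.items():
            passage = re.sub(re.escape(original), replacement, passage)
    return passage

context = "John Smith moved from Paris, France to study medicine."
print(entswap(context))
# -> "Ingabire Uwase moved from Kigali, Rwanda to study medicine."
```

Note that in a QA setting the same renaming would also have to be applied to the gold answer spans, so question-answer alignment is preserved.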
Adapting to the Low-Resource Double-Bind: Investigating Low-Compute Methods on Low-Resource African Languages [paper] [video]
Authors: Colin Leong, Herumb Shandilya, Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Joel Mathew, Abdul-Hakeem Omotayo, Oreen Yousuf, Zainab Akinjobi, Chris Chinenye Emezue, Shamsudeen Muhammad, Steven Kolawole, Younwoo Choi, Tosin Adewumi
Abstract: Many natural language processing (NLP) tasks make use of massively pretrained language models, which are computationally expensive. However, limited access to high computational resources, added to the issue of data scarcity for African languages, constitutes a real barrier to research experiments on these languages. In this work, we explore the applicability of low-compute approaches such as language adapters in the context of this low-resource double-bind. We intend to answer the following question: do language adapters allow those who are doubly bound by data and compute to practically build useful models? Through fine-tuning experiments on African languages, we evaluate their effectiveness as a cost-effective approach to low-resource African NLP. Using solely free compute resources, our results show that language adapters achieve performances comparable to those of massive pretrained language models that are heavy on computational resources. This opens the door to further experimentation and exploration of the full extent of language adapters' capacities.
Online Threats Detection in Hausa Language [paper] [video]
Authors: Abubakar Yakubu Zandam, Fatima Adam Muhammad, Isa Inuwa-Dutse
Abstract: The Internet is one of the most widely used technological inventions, and it has given rise to the proliferation of online social media platforms such as Twitter and Facebook. These platforms are quite instrumental as a means of socialisation and information exchange among diverse users. The use of online social media to spread information can be both beneficial and harmful; on the positive side, the information can be useful in areas such as security, the economy, and climate change. Motivated by the growing number of online users and the widespread availability of content with the potential to cause harm, this study examines how online content with threatening themes is expressed in the Hausa language. We collected the first Hausa dataset of threatening content from Twitter and developed a classification system to help curtail security risks by informing decisions on tackling insecurity and related challenges. We employ and train four machine learning algorithms, Random Forest (RF), XGBoost, Decision Tree (DT), and Naive Bayes, to classify the annotated dataset. The classification results show an accuracy score of 72% for XGBoost, 71% for RF, and 67% for DT, with Naive Bayes lowest at 57%.
VoxMg: An Automatic Speech Recognition Dataset for Malagasy [paper] [video]
Authors: Falia Ramanantsoa
Abstract: African languages are not well represented in Natural Language Processing (NLP). The main reason is a lack of resources for training models. Low-resource languages, such as Malagasy, cannot benefit from modern NLP methods if no datasets are available. This paper presents the curation and annotation of VoxMg, a speech dataset for Malagasy that consists of 3,873 audio files totaling 10.80 hours. We also train a baseline, the first Automatic Speech Recognition (ASR) model ever built for this language, and obtain a Word Error Rate (WER) of 33%.
[paper] [video]
Authors: Olubayo Adekanmbi, Anthony Soronnadi
Abstract: In this work, we share our methodology and findings from applying named entity recognition (NER) using machine learning to identify behavioural patterns in transcribed family planning client call centre data in Nigeria, based on the Fogg model. The Fogg Behaviour Model (FBM) describes the interaction of three key elements, Motivation (M), Ability (A), and a Prompt (P), and how they combine to produce behavioural change. This work is part of a larger project focused on the practical application of artificial intelligence to analyse and derive insight from large-scale call centre data. The entity recognition model, called Fogg Model Entity Recognition (FMER), was trained using spaCy, an open-source software library for advanced natural language processing, on a total of 11,510 words, achieving an F1 score of 98.5.
The first large scale collection of diverse Hausa language datasets [paper] [video]
Authors: Isa Inuwa-Dutse
Abstract: The Hausa language belongs to the Afroasiatic phylum and has more first-language speakers than any other sub-Saharan African language. With a majority of its speakers residing in the Northern and Southern areas of Nigeria and the Republic of Niger, respectively, it is estimated that over 100 million people speak the language, making it one of the most widely spoken Chadic languages. While Hausa is considered a well-studied and documented language among the sub-Saharan African languages, it is viewed as a low-resource language from the perspective of natural language processing (NLP) due to the limited resources available for NLP-related tasks. This is common to most languages in Africa; thus, it is crucial to enrich such languages with resources that will support and speed the pace of conducting various downstream tasks to meet the demands of modern society. While useful datasets exist, notably from news sites and religious texts, more diversity is needed in the corpus. We provide an expansive collection of curated datasets consisting of both formal and informal forms of the language, from reputable websites and online social media networks, respectively. The collection is larger and more diverse than existing corpora, providing the first and largest set of Hausa social media posts to capture the peculiarities of the language. The collection also includes a parallel dataset, which can be used for tasks such as machine translation, with applications in areas such as the detection of spurious or inciteful online content. We describe the curation process -- from collection and preprocessing to how to obtain the data -- and propose some research problems that could be addressed using the data.
Authors: Gebregziabihier Nigusie, Tesfa Tegegne
Abstract: Text complexity is the level of difficulty of a document for understanding by its target readers. The Amharic language contains complex and unfamiliar words that lead low-literacy readers to misunderstand documents. In addition to challenging human readers, such text complexity also challenges NLP applications like machine translation. To reduce this type of text complexity for Amharic, a low-resourced and morphologically rich language, we developed an Amharic text complexity annotation tool built using 1002 complex Amharic terms. Based on the annotated dataset, we then developed a complexity classification model using machine learning approaches. For the experiments, we used 19k sentences. To vectorize and embed these sentences, we used BOW for classical ML; for the deep learning models and the pre-trained model BERT, we built Word2Vec and BERT embedding layers trained on 9756 vocabulary items. For the complexity classification, we experimented with SVM and RF from classical machine learning, and LSTM, BiLSTM, and BERT from deep and pre-trained models. These models score accuracies of 83.5% (SVM), 80.3% (RF), 87.8% (LSTM), 88.6% (BiLSTM), and 91% (BERT). Based on the experimental results, the BERT model has the best classification accuracy, owing to its ability to handle long-term information dependencies.
AROT-COV23: A Dataset of 500K Original Arabic Tweets on COVID-19 [paper] [video]
Abstract: This paper presents a dataset called AROT-COV23 (ARabic Original Tweets on COVID-19 as of 2023) containing about 500,000 original Arabic COVID-19-related tweets from January 2020 to January 2023. The dataset has been analyzed using a corpus-based approach to identify common themes and trends in the data and to gain insights into the ways in which Arabic Twitter users have discussed the pandemic. The results of the analysis are also presented and discussed in terms of their implications for the field of Natural Language Processing (NLP) in Africa and for understanding the role of Twitter in the spread of COVID-19-related information in the region.
Organizers
David Ifeoluwa Adelani
Research Fellow, UCL
Bonaventure F. P. Dossou
Ph.D. Student, Mila & McGill
Shamsuddeen Muhammad
Ph.D. Student, UPorto
Atnafu Lambebo Tonja
Ph.D. Student, IPN
Hady Elsahar
Research Scientist, Meta AI
Happy Buzaaba
Postdoc, RIKEN Center for AIP
Aremu Anuoluwapo
Linguist, YorubaNames
Salomey Osei
Ph.D. Student, DeustoTech
Tunde Ajayi
Ph.D. Student, Insight Centre, University of Galway
Constantine Lignos
Assistant Professor, Brandeis University
Tajuddeen Rabiu Gwadabe
Project Manager, Masakhane Research Foundation
Clemencia Siro
Ph.D. Student, University of Amsterdam
Everlyn Asiko Chimoto
Ph.D. Student, University of Cape Town, AIMS
Contacts & Slack Workspace
You're invited to join the Masakhane community Slack (channel #africanlp-iclr2023-support). Meet other participants, and find collaborators, mentors, and advice there. Organizers will be available on Slack to answer questions regarding submissions, format, topics, etc. If you have any doubt whether you can contribute to this workshop (e.g. if you have never written a paper, if you are new to NLP, if you do not have any collaborators, if you do not know LaTeX, etc.), please join the Slack and contact us there as well.
To contact the workshop organizers please send an email to: africanlp-ICLR2023@googlegroups.com
Sponsors
Digital Umuganda