AfricaNLP 2024 Workshop
Theme: Adaptation of Generative AI models for African Languages
Co-located with ICLR 2024, May 11th 2024
Messe Wien Exhibition Congress Center, Vienna, Austria
You can find the list of accepted papers on OpenReview at the following link: Accepted papers
About the Workshop
The AfricaNLP workshop has become a core event for the African NLP community and has drawn global attendance and interest from researchers working on African languages, African corpora, and tasks with importance in the African context. AfricaNLP 2024 is being hosted as an ICLR 2024 workshop.
In the contemporary AI landscape, generative AI has rapidly expanded with significant input and innovation from the global research community. This technology enables machines to generate novel content and shows potential across many sectors. However, African languages remain under-represented within this growth. The urgency of addressing this gap inspired the theme for the 2024 workshop: Adaptation of Generative AI Models for African Languages. The workshop aims to bring together experts, linguists, and AI enthusiasts to explore solutions, collaborations, and strategies to amplify the presence of African languages in generative AI models.
The workshop has several aims:
to invite a variety of speakers from industry, research networks and academia to get their perspectives on the development of large language models and how African languages have and have not been represented in this work
to provide a venue to discuss the benefits and potential harms of these language models on the speakers of African languages and African researchers.
to enable positive interaction between academic, industry, and independent researchers around this theme and encourage collaboration and engagement for the benefit of the African continent
to foster further relationships between the African linguistics and NLP communities. It is clear that linguistic input about African languages is key in the evaluation and development of African models
to showcase work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low-resource languages
to promote multidisciplinarity within the African NLP community with the goal of creating a holistic participatory NLP community that will produce NLP research and technologies that value fairness, ethics, decolonial theory, and data sovereignty
to provide a platform for the groups involved with the various projects to meet, interact, share and forge closer collaboration
to provide a platform for junior researchers to present papers, solutions, and begin interacting with the wider NLP community
to present an opportunity for more experienced researchers to further publicize their work and inspire younger researchers through keynotes and invited talks
This workshop follows the successful previous editions in 2020, 2021, 2022, and 2023. It will be hybrid and co-located with ICLR 2024. No paper will be automatically desk-rejected.
Important Dates
Submission Deadline (Extended): February 11th, 2024 (AoE time)
Acceptance Notifications: March 3rd, 2024 (AoE time)
Camera-ready: April 1st, 2024 (AoE time)
Workshop date: May 11th, 2024
Invited Speakers
Graham Neubig
Graham Neubig is an associate professor at the Language Technologies Institute of Carnegie Mellon University. His research focuses on natural language processing, with a particular interest in fundamentals, applications, and understanding of large language models for tasks such as question answering, code generation, and multilingual applications. His ultimate goal is that every person in the world should be able to communicate with each other, and with computers, in their own language. He also contributes to making NLP research more accessible through open publishing of research papers, advanced NLP course materials and video lectures, and open-source software, all of which are available on his website.
Claytone Sikasote
Claytone Sikasote is a Ph.D. student in Computer Science at the Hasso-Plattner Institute (HPI) Research School at the University of Cape Town (UCT) in Cape Town, South Africa, supervised by Jan Buys and Hussein Suleman. He also serves as a research fellow in the Department of Computer Science at the University of Zambia. His research interest is in building data-efficient and robust speech recognition and translation models for under-resourced languages. He is also interested in exploring cost-effective and responsible approaches to creating high-quality NLP data resources for under-resourced languages, especially for Zambia.
Ife Adebara
Ife Adebara is a researcher with over seven years of experience in natural language processing (NLP), linguistics, and language policy. She is a member of the Deep Learning and Natural Language Processing Group at UBC and an associate member of the African Languages Technology Initiative (ALT-i) in Nigeria. For her PhD dissertation, Ife developed deep learning technologies for 517 African languages and engaged in work to make "computers usable in African languages." In her research, Ife advocates for an Afrocentric approach to technology development, to ensure that decisions about what technologies to build and how to build, evaluate, and deploy them are based on the needs of local African communities. She has published her research at top NLP conferences including ACL, EMNLP, COLING, and the LT4ALL conference organized by UNESCO. Ife's work has also been recognized beyond the academic sphere, including media coverage by CBC News, Global News Canada, AMD, and City News Vancouver. Ife's work on AfroLID and Serengeti (a language identification model and a natural language understanding language model for 517 African languages) received the Top 10 Outstanding Global AI Solutions Award from the IRCAI, under the auspices of UNESCO. Ife's work holds profound implications for AI accessibility in Indigenous languages, preserving cultural heritage, and promoting diversity and inclusion in global discourse.
Akintunde Oladipo
Akintunde Oladipo is a Research Assistant at the David Cheriton School of Computer Science, University of Waterloo. His research focuses on multilingual natural language processing and information retrieval for African languages, and he has extensive industry experience in machine learning operations.
Pelonomi Moiloa
Pelonomi is CEO of Lelapa AI, a socially grounded research and product lab developing language technology for African languages. Lelapa AI has a mission to enable a noticeable uptick in the quality of life on the African continent. It aspires to do so by expanding the capacity of the African digital economy with resource-efficient language AI. Pelonomi is also a trustee of a girl scholarship fund and director of a community-based NPC. Pelonomi is drawn to the curiosities of community and connection and how they can inform imaginings of a better future, such that we can assist that future in arriving well. She is an electrical and biomedical engineer by training, one of the TIME 100 most influential people in AI, a TED speaker, and a Bloomberg Catalyst (2023).
Josh Meyer
Dr. Meyer works at the intersection of data, machine learning, and language. Meyer holds a PhD in automatic speech recognition and has been working on voice and language technologies for over a decade. As a Mozilla Fellow, Meyer collaborated with startups, academia, and governments to develop and deploy voice-based technologies for Kinyarwanda and Luganda. In collaboration with Masakhane, he led the BibleTTS project to release TTS data for African languages. Most recently, Meyer has been working on consumer-facing voice technologies at Artie, Coqui, and Rabbit, inc.
Schedule
Accepted Papers
TangaleNLP: Building Po Tangle to English Parallel Corpora and Machine Translation of the Tangle (Tangale) Language [paper]
Authors: Gideon George, Olubayo Adekanmbi, Anthony Soronnadi
Abstract: In a digitally connected world, language barriers are silencing millions, leaving communities like Tangle (Tangale) with limited access to information and online opportunities, and their rich heritage fading. This research offers hope that natural language processing and machine translation can bridge this gap. Our efforts go beyond Po Tangle. We are paving the way for similar systems in other African languages and promoting a more diverse digital space. We have successfully created a Po Tangle-English machine translation system using state-of-the-art AI by fine-tuning the pre-trained M2M100 model using 1150 parallel sentences from the dataset and obtained results showing that the system works and produces translations. The system achieves an evaluation BLEU score of 6.7604 and a prediction BLEU score of 6.0101. This indicates the potential for fluent translations with more substantial data. By building a parallel corpus with native speakers to ensure cultural authenticity, we are discovering much more than just numbers. This empowers communities to take control, enabling socio-economic development and preserving linguistic heritage. Our research is having an impact in the form of more targeted interventions, better education, and more vibrant online communities. It is paving the way for a future where every voice is heard and celebrated, regardless of language. This is a movement towards inclusion and equality, we are breaking down language barriers, celebrating the symphony of human voices, and ensuring that no community is left behind in the digital age.
Authors: Saminu Mohammad Aliyu, Gregory Maksha Wajiga, Muhammad Murtala, Lukman Jibril Aliyu
Abstract: The proliferation of online offensive language necessitates the development of effective detection mechanisms, especially in multilingual contexts. This study addresses the challenge by developing and introducing novel datasets for hate speech detection in three major Nigerian languages: Hausa, Yoruba, and Igbo. We collected data from Twitter and manually annotated it to create datasets for each of the three languages, using native speakers. We used pre-trained language models to evaluate their efficacy in detecting offensive language in our datasets. The best-performing model achieved an accuracy of 90%. To further support research in offensive language detection, we plan to make the dataset and our model publicly available.
Improving Question-Answering Capabilities in Large Language Models Using Retrieval Augmented Generation (RAG): A Case Study on Yoruba Culture and Language [paper]
Authors: Adejumobi Monjolaoluwa Joshua
Abstract: This study addresses the phenomenon of hallucination in large language models (LLMs), particularly in GPT-3.5 turbo, when tasked with processing queries in Yoruba—a low resource language. Hallucination refers to the generation of incorrect information, often occurring due to the model’s unfamiliarity with specific content or languages not extensively covered during its pretraining phase. We propose a novel methodology that incorporates Retrieval-Augmented Generation (RAG) techniques to mitigate this issue. Our method utilizes an exclusive dataset derived from a Yoruba-centric blog, covering an array of subjects from the language’s learning resources to its folklore. By embedding this data into an open-source chroma database, we improve GPT-3.5 turbo’s ability to deliver responses that are not only linguistically and factually correct but also resonate with the cultural nuances of the Yoruba heritage. This enhancement marks a significant step towards the creation of a chatbot aimed at promoting and disseminating knowledge about the Yoruba culture and language.
Authors: Olamide Shogbamu, Olubayo Adekanmbi, Anthony Soronnadi
Abstract: Music is a multifaceted socio-cultural phenomenon, characterized by diverse genres that have evolved in specific geographical regions. This paper delves into the relatively unexplored intersection of geography and music in Nigeria. While previous research has addressed music geography, particularly in the context of the origin, evolution, and diffusion of music, the role of geographical language and lyrical elements in Nigerian music remains understudied. This study focuses on bridging this gap. Utilizing a Geo Semantics and Natural Language Processing approach, we conducted a thorough analysis of Nigerian music content to investigate the impact of geo-location on music success. Our study uncovers a noteworthy relationship, contingent upon other factors, between mentions of geographic locations and potential music success in Nigeria. This research does not seek to limit the creative process of music creation to a rigid framework. Instead, it suggests establishing a framework that can enhance the growth of music. The strategic application of this framework has the potential to illuminate uncharted territories in the Nigerian music landscape and foster boundary-less collaboration among artists, thus contributing to the enrichment of the music industry.
What Happens When Small Is Made Smaller? Exploring the Impact of Compression on Small Data Pretrained Language Models [paper]
Authors: Busayo Awobade, Mardiyyah Oduwole, Steven Kolawole
Abstract: Compression techniques have been crucial in advancing machine learning by enabling efficient training and deployment of large-scale language models. However, these techniques have received limited attention in the context of low-resource language models, which are trained on even smaller amounts of data and under computational constraints, a scenario known as the "low-resource double-bind." This paper investigates the effectiveness of pruning, knowledge distillation, and quantization on an exclusively low-resourced, small-data language model, AfriBERTa. Through a battery of experiments, we assess the effects of compression on performance across several metrics beyond accuracy. Our study provides evidence that compression techniques significantly improve the efficiency and effectiveness of small-data language models, confirming that the prevailing beliefs regarding the effects of compression on large, heavily parameterized models hold true for less-parameterized, small-data models.
Authors: Akindele Michael Olawole, Jesujoba Oluwadara Alabi, Aderonke Busayo Sakpere, David Ifeoluwa Adelani
Abstract: In this work, we present the Yorùbá automatic diacritization (YAD) benchmark dataset for evaluating Yorùbá diacritization systems. In addition, we pre-train a text-to-text transformer (T5) model for Yorùbá and show that this model outperforms several multilingually trained T5 models. Lastly, we show that more data and bigger models are better at diacritization for Yorùbá.
Authors: Anthony Soronnadi, Olubayo Adekanmbi, Chinazo Anebelundu, David Ifeoluwa Adelani
Abstract: This paper reports on an ongoing investigation aimed at reviewing and optimizing Transformer models for processing the African language Igbo, which has limited resources. Creating an effective language model is essential for enhancing NLP applications in this setting, given the specific challenges posed by Igbo's rich morphological structure, tonal system, and limited availability of digital resources. In order to investigate the adaptation and optimization of Transformer models and to improve the models for Igbo language processing, this work takes a critical comparison approach. First efforts have focused on developing a RoBERTa model pre-trained on clean Igbo text corpus, and evaluating its performance on downstream tasks such as named entity recognition, text classification, and sentiment analysis. In our evaluations across the above-mentioned NLP tasks, IgboBERTa demonstrates competitive or superior performance relative to larger models such as XLM-R-large, XLM-R-base, AfriBERTa, and AfroXLMR-base, particularly when considering its efficiency due to its smaller size of only 83.4M parameters. This efficiency makes IgboBERTa particularly appealing for resource-constrained environments common in African NLP applications.
Authors: Ahmad Ibrahim Ismail, Anthony Soronnadi, Olubayo Adekanmbi, Bashirudeen Opeyemi Ibrahim, David Olubukola Akanji
Abstract: In the context of a rapidly evolving global health landscape, this study aims to cast light on the focal points and regional intricacies of medical research in Nigeria. It addresses the critical need to align medical research with health policies, responding to the dynamic health requirements of Nigeria's diverse population. Utilizing a Geo-semantic approach, the research melds Geospatial Analysis with the advanced capabilities of Natural Language Processing. This methodology was applied to analyze and visually interpret Nigerian medical research's thematic and geographic trends based on articles from the PubMed database. The study uncovered distinct regional focuses and collaborative networks in medical research, underscoring the importance of aligning research efforts with the prevalent health challenges. The study found emergent challenges like COVID-19 and epidemiological studies receiving optimum attention, while prevalent health challenges like health insurance and neglected tropical diseases were on the dwindling end of research interest. These findings provide a blueprint for improving the effectiveness of medical research and healthcare policy in Nigeria, offering significant insights for strategic planning and resource allocation in the health sector. Moreover, this innovative approach demonstrates the feasibility and value of integrating NLP and geospatial analysis in medical research. It opens new avenues for low- and middle-income countries to derive insights and enhance their healthcare planning strategies by leveraging data from unstructured sources.
Leveraging Geo-NLP for Enhanced Antiretroviral Drug Distribution in Nigeria: Insights from Social Media and News Data [paper]
Authors: Bashirudeen Opeyemi Ibrahim, Olubayo Adekanmbi, Ahmad Ibrahim Ismail, Anthony Soronnadi, David Olubukola Akanji
Abstract: Faced with over 1.9 million HIV/AIDS cases, Nigeria’s need for efficient antiretroviral therapy (ART) distribution is critical. Conventional assessment methods, restrained by logistical issues and data scarcity, require innovative solutions. This study employs Geographic Natural Language Processing (Geo-NLP) to analyse social media and news content, offering novel insights into public discourse on HIV/AIDS and ART across Nigeria. Using a custom Named-Entity Recognition (NER) model to process data from NairaLand and major newspapers, the research uncovers geographical patterns in HIV/AIDS-related conversations, achieving a significant model performance with an overall F1-Score of 83.27. The findings highlight areas with intense discussions on HIV/AIDS, suggesting urban centres like Bauchi, Jos, and Ibadan as priority sites for targeted ART interventions. This approach promises to refine ART distribution strategies and sets a precedent for employing Geo-NLP in public health planning. Despite its brevity, the study underscores the potential of integrating Geo-NLP with traditional data to enhance healthcare delivery in Nigeria, paving the way for more effective public health interventions against the HIV/AIDS epidemic.
Authors: Aman Kassahun Wassie
Abstract: Machine translation (MT) for low-resource languages such as Ge’ez, an ancient language that is no longer the native language of any community, faces challenges such as out-of-vocabulary words, domain mismatches, and lack of sufficient labeled training data. In this work, we explore various methods to improve Ge’ez MT, including transfer-learning from related languages, optimizing shared vocabulary and token segmentation approaches, finetuning large pre-trained models, and using large language models (LLMs) for few-shot translation with fuzzy matches. We develop a multilingual neural machine translation (MNMT) model based on language relatedness, which brings an average performance improvement of about 4 BLEU compared to standard bilingual models. We also attempt to finetune the NLLB-200 model, one of the most advanced translation models available today, but find that it performs poorly with only 4k training samples for Ge’ez. Furthermore, we experiment with using GPT-3.5, a state-of-the-art LLM, for few-shot translation with fuzzy matches, which leverages embedding similarity-based retrieval to find context examples from a parallel corpus. We observe that GPT-3.5 achieves a remarkable BLEU score of 9.2 with no initial knowledge of Ge’ez, but still lower than the MNMT baseline of 15.2. Our work provides insights into the potential and limitations of different approaches for low-resource and ancient language MT.
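The "few-shot translation with fuzzy matches" idea in the abstract above can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: character trigram counts stand in for the sentence embeddings the abstract mentions, the corpus entries and the `<target-N>` placeholders are hypothetical, the English-to-Ge’ez direction is assumed, and the function names (`retrieve_fuzzy_matches`, `build_prompt`) are invented for this example. The resulting prompt would then be sent to an LLM, a step left out here.

```python
import math
from collections import Counter

# Hypothetical parallel corpus; targets are placeholders, not real Ge'ez text.
PARALLEL_CORPUS = [
    ("the king went to the city", "<target-1>"),
    ("she reads the book", "<target-2>"),
    ("the king returned to the city", "<target-3>"),
    ("water is life", "<target-4>"),
]

def char_ngrams(text, n=3):
    """Bag of character n-grams: a crude stand-in for a sentence embedding."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_fuzzy_matches(query, corpus, k=2):
    """Return the k (source, target) pairs whose source is most similar to the query."""
    query_vec = char_ngrams(query)
    ranked = sorted(
        corpus,
        key=lambda pair: cosine(query_vec, char_ngrams(pair[0])),
        reverse=True,
    )
    return ranked[:k]

def build_prompt(query, examples, src="English", tgt="Ge'ez"):
    """Assemble a few-shot prompt: instruction, retrieved examples, then the query."""
    parts = [f"Translate from {src} to {tgt}."]
    for source, target in examples:
        parts.append(f"{src}: {source}\n{tgt}: {target}")
    parts.append(f"{src}: {query}\n{tgt}:")
    return "\n\n".join(parts)

if __name__ == "__main__":
    query = "the king came to the city"
    examples = retrieve_fuzzy_matches(query, PARALLEL_CORPUS, k=2)
    print(build_prompt(query, examples))
```

In a real system the trigram vectors would be replaced by embeddings from a sentence encoder; the retrieval and prompt-assembly steps would keep the same shape.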
Authors: Walelign Tewabe Sewunetie, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Hellina Hailu Nigatu, Gashaw Kidanu, Zewdie Mossie, Hussien Seid, Eshete Derb, Seid Muhie Yimam
Abstract: While Machine Translation (MT) research has progressed over the years, translation systems still suffer from exhibiting biases, including gender bias. While an active line of research studies the existence and mitigation strategies of gender bias in machine translation systems, there is limited research exploring this phenomenon for low-resource languages. The limited availability of linguistic and computational resources, compounded by the lack of benchmark datasets, makes studying bias for low-resourced languages that much more difficult. In this paper, we construct benchmark datasets for evaluating gender bias in machine translation for three low-resourced languages: Afan Oromo (orm), Amharic (amh), and Tigrinya (tig). Building on prior work, we collected 2400 gender-balanced sentences translated in parallel into the three languages. From our human evaluations on the dataset we collected, we found that about 93% of Afan Oromo, 80% of Tigrinya, and 72% of Amharic sentences exhibited gender bias. In addition to providing benchmarks for improving gender bias mitigation research in the three languages, we hope the careful documentation of our work will help other low-resourced language researchers extend our approach to their languages.
Contextual Evaluation of LLM’s Performance on Primary Education Science Learning Contents in the Yoruba Language [paper]
Authors: Olanrewaju Israel Lawal, Olubayo Adekanmbi, Anthony Soronnadi
Abstract: In an era marked by the rapid evolution of artificial intelligence, large language models (LLMs) such as ChatGPT 3.5, Llama, and PaLM 2 have become instrumental in transforming educational paradigms. Trained mainly on English and a mix of data from other languages, these LLMs have exceptional abilities to understand and generate complex human language constructs, leading to revolutionary applications in education. This raises the possibility of creating enriched and personalized educational experiences. Using LLMs can streamline the instructional design process and focus it on developing the content that students need to progress and content that resonates with the learners’ realities, thereby improving learning outcomes even in the domain of primary science. Also, it has been proven that learning science in the student’s mother tongue significantly boosts learning and assimilation, and this should be encouraged, especially in rural areas. However, the technological advancement of LLMs raises a pivotal question about the inclusivity and effectiveness of these models in catering to low-resource languages, such as Yoruba, particularly in the domain of primary education science. The unique linguistic structures, idiomatic expressions, and cultural references inherent in Yoruba present formidable challenges for models predominantly trained in high-resource languages. This research critically evaluates LLMs’ ability to comprehend and generate contextually relevant science education content in Yoruba and aims to bridge the educational resource gap for Yoruba-speaking primary learners, especially those in underrepresented communities. Our study conducts a thorough evaluation of large language models such as ChatGPT 3.5, Gemini, and PaLM 2 across four tasks using a manually developed primary science dataset in the Yoruba language. This approach allows us to assess the models' abilities to understand and replicate the intricacies of Yoruba primary education science contents without losing the context or meaning of the sentences. We focus on zero-shot learning settings for these LLMs to improve reproducibility. Our extensive experimental results reveal the models’ comparative underperformance in various natural language processing (NLP) tasks in the Yoruba language. This observation underscores the necessity for further research and development of more language-specific and domain-specific technologies, particularly for primary science education in low-resource languages.
Authors: Israel Abebe Azime, Mitiku Yohannes Fuge, Atnafu Lambebo Tonja, Tadesse Destaw Belay, Aman Kassahun Wassie, Eyasu Shiferaw Jada, Yonas Chanie, Walelign Tewabe Sewunetie, Seid Muhie Yimam
Abstract: Large language models (LLMs) have received a lot of attention in natural language processing (NLP) research because of their exceptional performance in understanding and generating human languages. However, low-resource languages are left behind due to the unavailability of resources. In this work, we focus on enhancing the LLAMA-2-Amharic model by integrating task-specific and generative datasets to improve language model performance for Amharic. We compile an Amharic instruction fine-tuning dataset and fine-tune the LLAMA-2-Amharic model. The fine-tuned model shows promising results in different NLP tasks. We open-source our dataset creation pipeline, instruction datasets, trained models, and evaluation outputs to promote language-specific studies on these models.
Authors: Sulaiman Kagumire, Andrew Katumba, Joyce Nakatumba-Nabende, John Quinn
Abstract: Text-to-Speech (TTS) development for African languages such as Luganda is still limited, primarily due to the scarcity of high-quality, single-speaker recordings essential for training TTS models. Prior work has focused on utilizing the Crowdsourced Common Voice Luganda recordings of multiple female speakers aged between 20-49. Although the generated speech is intelligible, it is still of lower quality compared to their model trained on studio-grade recordings. This is due to the insufficient data preprocessing methods applied to improve the quality of the Common Voice recordings. Furthermore, speech convergence is more difficult to achieve due to varying intonations from multiple speakers, as well as background noise in the training samples. In this paper, we show that the quality of Luganda TTS from Common Voice can improve by training on multiple speakers of close intonation in addition to further preprocessing of the training data. Specifically, we selected six female speakers with close intonation determined by subjectively listening and comparing their voice recordings. In addition to trimming out silent portions from the beginning and end of the recordings, we applied a pre-trained speech enhancement model to reduce background noise and enhance audio quality. We also utilized a pretrained, non-intrusive, self-supervised Mean Opinion Score (MOS) estimation model to filter recordings with an estimated MOS over 3.5, indicating high perceived quality. Subjective MOS evaluations from nine native Luganda speakers demonstrate that our TTS model achieves a significantly better MOS of 3.55 compared to the reported 2.5 MOS of the existing model. Moreover, for a fair comparison, our model trained on six speakers outperforms models trained on a single speaker (3.13 MOS) or two speakers (3.22 MOS). This showcases the effectiveness of compensating for the lack of data from one speaker with data from multiple speakers of close intonation to improve TTS quality.
Authors: Anuoluwapo Aremu, Jesujoba Oluwadara Alabi, Daud Abolade, Nkechinyere Faith Aguobi, Shamsuddeen Hassan Muhammad, David Ifeoluwa Adelani
Abstract: In this paper, we create NaijaRC, a new multi-choice Nigerian Reading Comprehension dataset based on high-school RC examinations for three Nigerian national languages: Hausa (hau), Igbo (ibo), and Yorùbá (yor). We provide baseline results by performing cross-lingual transfer using the Belebele training data, which is mostly from the RACE dataset (RACE is based on English exams for middle and high school Chinese students, very similar to our dataset), based on several pre-trained encoder-only models. Additionally, we provide results by prompting large language models (LLMs) like GPT-4.
Authors: Toyib Ogunremi, Serah sessi Akojenu, Anthony Soronnadi, Olubayo Adekanmbi, David Ifeoluwa Adelani
Abstract: This paper introduces AfriHG, an extended multi-lingual corpus compiled from XL-Sum and Masakhanews focusing on 16 languages widely spoken by Africans across 9 language families. We experimented with two seq-2-seq models. We also evaluated our dataset with a massively multilingual instruction-tuned LLM and benchmarked our results in the domain of abstractive summarization for news headline generation.
Authors: Henok Biadglign Ademtew, Mikiyas Girma Birbo
Abstract: African languages are not well-represented in Natural Language Processing (NLP). The main reason is a lack of resources for training models. Low-resource languages, such as Amharic and Ge’ez, cannot benefit from modern NLP methods because of the lack of high-quality datasets. This paper presents AGE, an open-source, tripartite-aligned Amharic, Ge’ez, and English parallel dataset. Additionally, we introduce 1,000 novel Ge’ez-centered sentences sourced from areas such as news and novels. Furthermore, we developed a model from a multilingual pre-trained language model, which achieves scores of 12.29 and 30.66 for English to Ge’ez and Ge’ez to English, respectively, and 9.39 and 12.29 for Amharic-Ge’ez and Ge’ez-Amharic, respectively. Our dataset and models are available at the AGE Dataset repository.
Authors: Rashidat Damilola Sikiru, Olubayo Adekanmbi, Anthony Soronnadi
Abstract: Large language models have seen rapid progress in recent times, and this has resulted in many applications in diverse fields. The ability of LLMs to make use of large-scale text makes them relevant in the financial industry and for financial tasks. With the increasing availability of LLMs, these tasks can be considered easy or may be of use for a person who needs to track their financial lifestyle before visiting financial institutions. This study seeks to investigate the accuracy of LLMs in responding to basic day-to-day financial questions in both the English and Yoruba languages. The results show that ChatGPT 4.0 outperformed ChatGPT 3.5 and Bard (LaMDA) in all three phases. The results also show that these language models can be improved to fit low-resource languages.
EkoHate: Offensive and Hate Speech Detection for Code-switched Political discussions on Nigerian Twitter [paper]
Authors: Comfort Eseohen Ilevbare, Jesujoba Oluwadara Alabi, David Ifeoluwa Adelani, Bakare Firdous Damilola, Abiola Oluwatoyin Bunmi, ADEYEMO Oluwaseyi Adesina
Abstract: Nigerians have a notable online presence and actively discuss political and topical matters. This was particularly evident throughout the 2023 general election, where Twitter was utilized for campaigning, fact-checking and verification, and even positive and negative discourse. However, little or no work has been done on the detection of abusive language and hate speech in Nigeria. In this paper, we curate code-switched Twitter data directed at the three musketeers of the governorship election in Lagos State, the most populous and economically vibrant state in Nigeria, with a view to detecting offensive and hate speech in political discussion. We develop EkoHate, an abusive language and hate speech dataset for political discussions between the three candidates and their followers, using a binary (normal vs offensive) and a fine-grained four-label annotation scheme. We analyse our dataset and provide an empirical evaluation of state-of-the-art methods across both supervised and cross-lingual transfer learning settings. In the supervised setting, our evaluation results for the binary and four-label annotation schemes show that we can achieve 95.1 and 70.3 F1 points respectively. Furthermore, we show that our dataset transfers well to two publicly available offensive datasets (OLID and HateUS2020) with at least 62.7 F1 points.
Geo-parsing and Geo-Visualization of Road Traffic Crash Incident Locations from Print Media for Emergency Response and Planning [paper]
Authors: Patricia Ojonoka Idakwo, Olubayo Adekanmbi, Anthony Soronnadi, Amos DAVID
Abstract: Road traffic crashes (RTC) are a major public health concern across the globe, particularly in Nigeria where road transport is the most common mode of transportation. In this paper, we present an approach to RTC-related geographic information retrieval and visualization from news articles, using the geo-parsing natural language processing technique for emergency response and planning. To capture RTC details with a high degree of accuracy and precision, we created a dataset from RTC-related Nigerian news articles and developed the RTC-NER baseline and RTC-NER custom spaCy-based Named Entity Recognition (NER) models using the RTC dataset. We evaluated and compared their performance using the standard metrics of precision, recall, and f1-score. The RTC-NER model performed better than the RTC-NER baseline model on all three metrics, with a precision of 93.63, recall of 93.61, and f1-score of 93.62. We further used the models for toponym recognition to extract RTC location details, toponym resolution to retrieve the corresponding geographical coordinates, and finally, geo-visualization of the data to display the RTC incident environment for emergency response and planning. Our study showcases the potential of unstructured data for decision-making in RTC emergency response and planning in Nigeria.
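As a quick sanity check on the metrics quoted in the abstract above, the f1-score is conventionally the harmonic mean of precision and recall. The sketch below assumes that standard definition (the abstract does not spell out its formula), and the function name `f1_score` is illustrative rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, on the same scale as the inputs."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the RTC-NER model (percentage scale).
rtc_ner_f1 = f1_score(93.63, 93.61)
print(round(rtc_ner_f1, 2))  # 93.62
```

Under this assumption, the reported precision and recall are consistent with the reported f1-score of 93.62.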
Africa-Centric Self-Supervised Pretraining for Multilingual Speech Representation in a Sub-Saharan Context [paper]
Authors:
Abstract: We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa. On the SSA subset of the FLEURS-102 dataset, our approach, based on a HuBERT$_{base}$ (0.09B) architecture, shows competitive results on the ASR downstream task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient, using 7x less data and 6x fewer parameters. Furthermore, on the LID downstream task, our approach outperforms the FLEURS baseline accuracy by over 22%.
Abstract: Sentiment detection remains a pivotal task in natural language processing, yet its development in Arabic lags due to a scarcity of training materials compared to English. Addressing this gap, we present ArSen-20, a benchmark dataset tailored to propel Arabic sentiment detection forward. ArSen-20 comprises 20,000 professionally labeled tweets sourced from Twitter, focusing on the theme of COVID-19 and spanning the period from 2020 to 2023. Beyond tweet content, the dataset incorporates metadata associated with the user, enriching the contextual understanding. ArSen-20 offers a comprehensive resource to foster advancements in Arabic sentiment analysis and facilitate research in this critical domain.
Current State, Challenges and Opportunities for Natural Language Processing Research and Development in Africa: A Systematic Review [paper]
Authors:
Abstract: Natural language processing (NLP) has recently gained much attention for computationally representing and analyzing human language, as evidenced by the release of sophisticated Large Language Models (LLMs) such as GPT 4, Llama, Claude and Gemini, among others. While NLP has been advancing globally, its progress in Africa tells a different story. This systematic review analyzes the current state, challenges and opportunities for NLP research and development in Africa. We reviewed 20 recently published articles that focus on African NLP. We examined the currently available tools and resources for NLP in African languages. This review also proposes some opportunities for the African NLP ecosystem, together with an inclusion framework.
Decolonizing African NLP: A Survey on Power Dynamics and Data Colonialism in Tech Development [paper]
Abstract: This research paper explores the imperative to decolonize African Natural Language Processing (NLP) by addressing power dynamics and data colonialism within technology development. Grounded in the historical context of colonialism in Africa, the paper examines the pervasive influence of colonial legacies on NLP research and development, highlighting the marginalization of African languages, cultures, and voices. Through an analysis of power dynamics, the paper advocates for diversifying representation within the NLP community, empowering local communities, and challenging Eurocentric frameworks to foster more inclusive and equitable technology development. Additionally, the paper explores the concept of data colonialism and its implications for African NLP, emphasizing the need for data sovereignty, community ownership, and ethical data practices. Case studies and examples illustrate the transformative potential of decolonial approaches within African NLP, while future directions outline pathways for advancing the decolonial agenda through interdisciplinary collaboration, policy advocacy, and community engagement. Ultimately, the paper calls for collective action and solidarity within the NLP community to dismantle colonial legacies and forge a more just and inclusive digital future for Africa and beyond.
Authors: John Trevor Kasule, Sudi Murindanyi, Elvis Mugume, Andrew Katumba
Abstract: The advent of Internet of Things (IoT) technology has generated massive interest in voice-controlled smart homes. While many voice-controlled smart home systems are designed to understand and support widely spoken languages like English, speakers of low-resource languages like Luganda may need more support. This research project aimed to develop a Luganda speech intent classification system for IoT applications to integrate local languages into smart home environments. The project uses hardware components such as Raspberry Pi, Wio Terminal, and ESP32 nodes as microcontrollers. The Raspberry Pi processes Luganda voice commands, while the Wio Terminal is a display device. The ESP32 nodes control the IoT devices. The ultimate objective of this work was to enable voice control using Luganda, which was accomplished through a natural language processing (NLP) model deployed on the Raspberry Pi. The NLP model utilized Mel Frequency Cepstral Coefficients (MFCCs) as acoustic features and a Convolutional Neural Network (Conv2D) architecture for speech intent classification. A dataset of Luganda voice commands was curated for this purpose and this has been made open-source. This work addresses the localization challenges and linguistic diversity in IoT applications by incorporating Luganda voice commands, enabling users to interact with smart home devices without English proficiency, especially in regions where local languages are predominant.
AngOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model [paper]
Authors: Osvaldo Luamba Quinjica, David Ifeoluwa Adelani
Abstract: In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has largely bypassed very low-resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. We survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve over the SOTA baselines AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.
Authors: Francois Meyer, Haiyue Song, Abhisek Chakrabarty, Jan Buys, Raj Dabre, Hideki Tanaka
Abstract: The Nguni languages have over 20 million home language speakers in South Africa. There has been considerable growth in datasets for Nguni languages, but no analysis of performance of NLP models for these languages has been reported across all languages and tasks. In this paper we study pretrained language models for the 4 Nguni languages - isiXhosa, isiZulu, isiNdebele, and Siswati. We compile all publicly available datasets for natural language understanding and generation, spanning 6 tasks and 11 datasets. This benchmark, which we call NGLUEni, is the first centralised evaluation suite for the Nguni languages, allowing us to systematically evaluate the Nguni-language capabilities of PLMs. Besides evaluating existing PLMs, we develop new PLMs for the Nguni languages through multilingual adaptive finetuning. Our models, Nguni-XLMR and Nguni-ByT5, outperform their base models and large-scale adapted models, showing that performance gains are obtainable through limited language group-based adaptation. We also perform experiments on cross-lingual transfer and machine translation. Our models achieve notable cross-lingual transfer improvements in the lower resourced Nguni languages (isiNdebele and Siswati). To facilitate future use of NGLUEni as a standardised evaluation suite for the Nguni languages, we create a web portal to access the collection of datasets and publicly release our models.
Authors: Gebregziabihier Nigusie
Abstract: Amharic is a morphologically rich language in which a single lemma can form a variety of words through inflection or derivation. Generating such word variants manually for second language learners and NLP applications is a challenging task that calls for an automatic morphology generator tool. In this study, we developed a new Amharic morphology generator tool for inflecting lemmas of nouns and verbs to possessive, gender, and number forms. For possessive inflection, nouns can be inflected for both singular and plural forms, while verbs can be inflected only for singular forms. For number inflection, both nouns and verbs can be inflected. To construct these rules, we followed the Amharic word affixation rules described by linguists. Before applying the suffixation and letter series transformation rules, we analyze the word's root form in the sentence, which helps us to accurately apply the inflected word formation rules based on the lemma's POS. Finally, we evaluated the tool by comparing its inflected forms with those generated by linguists; the tool achieves 76.9% accuracy against the linguist-generated results. The results show that suffix inflection forms of Amharic common nouns, mass nouns, and verbs are generated correctly, while the tool treats some proper nouns as common nouns when generating their inflected forms, which needs to be addressed in further studies.
Organizers
David Ifeoluwa Adelani
Research Fellow, UCL
Bonaventure F. P. Dossou
Ph.D. Student, Mila & McGill
Shamsuddeen Muhammad
Ph.D. Student, UPorto
Atnafu Lambebo Tonja
Ph.D. Student, IPN
Hady Elsahar
Research Scientist, Meta AI
Happy Buzaaba
Postdoc, Princeton University
Aremu Anuoluwapo
Linguist, YorubaNames
Salomey Osei
Ph.D. Student, DeustoTech
Perez Ogayo
Master's Student, Language Technologies Institute (LTI), Carnegie Mellon University
Kayode Olaleye
Postdoc, University of Pretoria
Israel Abebe Azime
Ph.D. Student, UdS
Clemencia Siro
PhD Student at the University of Amsterdam
Constantine Lignos
Assistant Professor, Brandeis University
Program Committee
Gebregziabihier Nigusie, Mizan Tepi University
Saheed Salahudeen Abdullahi, Kaduna State University
Antony Ndolo, Karadeniz Technical University
Rashidat Damilola Sikiru, Data Science Nigeria
Tadesse Kebede Guge, Haramaya University
Eric Le Ferrand, Boston College
Walelign Tewabe Sewunetie, University of Miskolc
Abdul-Hakeem Omotayo, University of California, Davis
Brian Roark, Google
Shridhar B Devamane, K.L.E. Institute of Technology, Hubballi, India
Serah Sessi Akojenu, Data Science Nigeria
Victor Jotham Ashioya, Kabarak University
Raj Dabre, NICT
Nathaniel Romney Robinson, Johns Hopkins University
Ifeoma Okoh, University of Ibadan
Busayo Awobade, Federal University of Agriculture Abeokuta
Chester Palen-Michel, Brandeis University
Tajwaa Scott, California State University, Los Angeles
Emmanuel Akanji, IU International University of Applied Sciences
Zemenfes Hailemariam Gebremedhin, Addis Ababa University
Stephen D. Richardson, Brigham Young University
Weiran Lin, Carnegie Mellon University
Sulaiman Kagumire, Makerere University
Anjishnu Mukherjee, George Mason University
Ted Pedersen, University of Minnesota, Duluth
Flore Oladipupo, Federal University of Technology, Akure
Raghavan Muthuregunathan, Columbia University
Seid Muhie Yimam, Universität Hamburg
Naome A Etori, University of Minnesota - Twin Cities
Felermino D. M. A. Ali, Universidade do Porto
Amina Mardiyyah Rufai, Idiap Research Institute
Cari Beth Head, University of Florida
Muhammad Umar Diginsa, Universiti Teknologi Malaysia
Francois Meyer, University of Cape Town
Iroro Orife, Netflix
Mulubrhan Abebe Nerea, Addis Ababa University
Khalid Elmadani, New York University, Abu Dhabi
Adenike Tosin ODEGBILE, Bowen University
Denis Musinguzi, Carnegie Mellon University
Aissam Outchakoucht, Hassan II University of Casablanca
Alp Öktem, Universitat Pompeu Fabra
Hizkiel Mitiku Alemayehu, Universität Paderborn
William Barr Held, Georgia Institute of Technology
Gideon George, Data Science Nigeria
Stephen Edward Moore, University of Cape Coast, Ghana
Cheng Xu, University College Dublin
Osei Manu Kagyah, evolve journal
Börje F. Karlsson, Beijing Academy of Artificial Intelligence (BAAI)
Ignatius Ezeani, Lancaster University
Yonas Chanie, Carnegie Mellon University
Odunayo Ogundepo, University of Waterloo
Tolúlopé Ògúnrèmí, Stanford University
Tadesse Destaw Belay, Instituto Politécnico Nacional, Centro de Investigación en Computación
Henok Biadglign Ademtew, Ethiopian Artificial Intelligence Institute
Jesujoba Oluwadara Alabi, Universität des Saarlandes
Elizabeth Salesky, Johns Hopkins University
Aditi Khandelwal, Microsoft
Bunmi Akinremi, Obafemi Awolowo University Ile-Ife
Michael Andersland, University of Minnesota - Twin Cities
Emmanuel Kigen Chesire, Kabarak University
Emmanuel Dorley, University of Florida
Olamide Shogbamu, Data Science Nigeria
Agam Goyal, University of Wisconsin - Madison
Mikiyas Girma Birbo, Maharishi International University
Victor-Veon Ugwu, The Federal Polytechnic Bida
Nan Yan, CUNY Brooklyn College
Colin Leong, University of Dayton
Van-Thuy Phi, RIKEN
Yueqi Song, Carnegie Mellon University
Jackson Weako, Koç University
Sudhansu Bala Das, National Institute of Technology Rourkela
Moayad Elamin, Carnegie Mellon University
Adejumobi Monjolaoluwa Joshua, University of Agriculture Abeokuta
Lekan Samuel Adesina, Obafemi Awolowo University Ile-Ife
Elodie Gauthier, Orange Innovation
Anaelia Ovalle, University of California, Los Angeles
Sujay S Kumar, Tesla Autopilot
Melaku Lake, Injibara University
Contacts & Slack Workplace
You're invited to join the Masakhane community Slack (channel #africanlp-iclr2024-support). Meet other participants and find collaborators, mentors and advice there. Organizers will be available on Slack to answer questions regarding submissions, format, topics, etc. If you have any doubt about whether you can contribute to this workshop (e.g. if you have never written a paper, if you are new to NLP, if you do not have any collaborators, if you do not know LaTeX, etc.), please join the Slack and contact us there as well.
To contact the workshop organizers please send an email to: africanlp-ICLR2024@googlegroups.com.
Sponsors