Co-located with EACL 2026, March 28th, 2026
Rabat, Morocco
The AfricaNLP workshop has become a core event for the African NLP community and has drawn global attendance and interest from researchers working on African languages, African corpora, and NLP tasks of importance to the African continent.
AfricaNLP invites papers related to any aspect of NLP for African languages.
In the current landscape, large language models (LLMs) have seen widespread use and significant innovation, yet African languages remain underrepresented. To address this disparity, the theme for the 2026 workshop is "Multilingual Multimodal LLMs." We believe this is an especially timely theme as LLMs are becoming more multilingually capable, but their performance across other modalities such as images and speech continues to lag behind. The workshop aspires to bring together a diverse group of researchers to explore solutions, collaborations, and innovation around enhancing LLMs’ capabilities in African languages and ensuring cultural awareness in their applications.
The workshop has several aims:
To invite a variety of speakers from industry, research networks, and academia to get their perspectives on the development of large language models and how African languages have and have not been represented in this work
To provide a venue to discuss the benefits and potential harms of these language models on the speakers of African languages and African researchers
To enable positive interaction between academic, industry, and independent researchers around this theme and encourage collaboration and engagement for the benefit of the African continent
To foster further relationships between the African linguistics and NLP communities. It is clear that linguistic input about African languages is key to the evaluation and development of models for African languages
To showcase work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low-resource languages
To promote multidisciplinarity within the African NLP community to create a holistic participatory NLP community that will produce NLP research and technologies that value fairness, ethics, decolonial theory, and data sovereignty
To provide a platform for the groups involved with the various projects to meet, interact, share, and forge closer collaboration
To provide a platform for junior researchers to present papers and solutions and begin interacting with the wider NLP community
To present an opportunity for more experienced researchers to publicize their work further and inspire younger researchers through keynotes and invited talks
Topics include, but are not limited to:
analyses of African languages by means of computational linguistics
empirical studies reporting results from applying or adapting NLP developed for high-resource languages to African languages
new model architectures tailored for African languages
new resources for African languages
using NLP techniques on African datasets
text generation for African languages
methods addressing out-of-domain generalization for NLP tasks with training data in very limited domains
transfer learning between African languages or from higher-resourced to lower-resourced languages
challenges or solutions for resource gathering for African NLP tasks
crowd-sourcing and open-sourcing software for African NLP
multidisciplinary and participatory research in African NLP
tutorials for African NLP for education or development purposes
new tools/software for African NLP
development of NLP systems for African languages for production
socio-linguistic research for African languages and their decolonization
ethical considerations for African NLP
This workshop follows the successful previous editions in 2020, 2021, 2022, 2023, 2024, and 2025. It will be hybrid and co-located with EACL 2026. No submissions will be automatically desk-rejected.
Important Dates
Workshop date: March 28th, 2026
Francois Meyer is a Lecturer in the Computer Science Department at the University of Cape Town and a co-investigator in the UCT NLP research group. His research is on data-efficient language modelling and linguistically informed subword tokenisation. He completed his PhD at the University of Cape Town and previously obtained a master's in AI at the University of Amsterdam.
Title: Data-Efficient Language Modelling for Low-Resource Languages
Progress in language modelling has been driven by scaling data and model size, but this approach is infeasible for most African languages. In this talk, I will present our work on developing data-efficient language models: architectures and training algorithms that improve performance on limited training data. I will present examples of how linguistically informed modelling, which targets and leverages the linguistic properties of specific languages, can improve sample efficiency. Finally, I will discuss the emerging intersection between low-resource NLP and developmentally inspired NLP, exploring how insights from human language learning can help us build more efficient models.
Felermino Ali is a researcher at MSR Africa and a PhD Candidate at the University of Porto in Portugal, focused on natural language processing (NLP) with a specialization in low-resource African languages. His work centers on building neural machine translation systems for low-resource languages and advancing methods to more effectively evaluate MT performance in low-resource settings.
Title: Beyond Parallel Data: Harnessing External Knowledge for Low‑Resource MT
Translating from high-resource languages into Mozambican languages remains a pressing challenge in African NLP. The scarcity of parallel corpora, orthographic variation across dialects, and the frequent presence of loanwords and code-switching complicate the task of building robust translation systems. In this talk, I will share how we address these barriers through lexicon-guided neural machine translation. By integrating bilingual dictionaries and systematic loanword mappings directly into the training process, we move beyond data scarcity toward structured lexical enrichment. Our approach leverages dictionary entries and loanword mappings to construct sentence-specific glossaries, dynamically incorporated via input augmentation. On FLORES benchmarks, this method demonstrates clear gains: stronger lexical coverage, fewer inconsistencies, and translations that better capture contextual nuance. Beyond the technical improvements, this work points to a broader vision: advancing low-resource machine translation not only by scaling data but by intelligently bridging vocabulary gaps with structured linguistic knowledge. For Mozambican languages, this means opening pathways to more inclusive digital communication, empowering communities, and ensuring that the linguistic richness of African languages is represented in the global NLP landscape.
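As a rough sketch of the input-augmentation idea described in this talk: look up a sentence-specific glossary in a bilingual lexicon and attach it to the source before translation. The lexicon entries and the tagging format below are hypothetical illustrations, not the speaker's actual scheme.

```python
# Hypothetical bilingual lexicon; a real system would draw on the dictionary
# entries and loanword mappings described in the talk.
LEXICON = {
    "teacher": "mwalimu",   # illustrative entries only
    "school": "eskola",
}

def augment_source(sentence: str) -> str:
    """Attach a sentence-specific glossary as an inline hint for the MT model."""
    hits = {w: LEXICON[w] for w in sentence.lower().split() if w in LEXICON}
    if not hits:
        return sentence
    glossary = " ; ".join(f"{src} = {tgt}" for src, tgt in hits.items())
    # An MT model trained on such augmented inputs learns to respect the hints.
    return f"<glossary> {glossary} </glossary> {sentence}"

print(augment_source("the teacher walked to school"))
```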
Atnafu Lambebo Tonja is a Postdoctoral Researcher at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE, where he leads projects on culturally diverse multilingual visual question answering and multimodal machine translation. He earned his PhD in Computer Science from Instituto Politécnico Nacional in Mexico City, focusing on neural machine translation for low-resource languages. His research focuses on advancing NLP for underrepresented languages, particularly African and Ethiopian languages, through the development of multilingual language models and sustainable data curation frameworks. His work on low-resource languages, especially for African and Ethiopian languages, has been published in top NLP venues including ACL, EMNLP, and NAACL.
Title: Towards Multimodal AI for African Languages and Cultures: Lessons from Afri-MCQA
What will it take to develop multimodal AI that truly comprehends African languages and cultures? In this talk, I explore this question through lessons from Afri-MCQA, a benchmark covering 15 African languages across 12 countries. Our evaluation highlighted three major challenges for current models: 1) they are unable to process speech in African languages, 2) they lack cultural context, and 3) they struggle to generate culturally relevant responses, rather than merely recognizing them. I will share these insights and outline a pathway forward, emphasizing the importance of speech-first development, culturally grounded training, and cross-lingual knowledge transfer as critical steps in creating effective multimodal AI for Africa.
Julia Kreutzer is a Senior Research Scientist at Cohere Labs, where she focuses on research around multilingual large language models. She has a background in machine translation, with a PhD from Heidelberg University and prior work experience at Google Translate. She's passionate about advancing NLP technologies for underrepresented languages and has been part of multiple open science initiatives to work towards this goal collaboratively.
Title: The knowns and unknowns of multilingual data augmentation
In this talk I will present recipes for multilingual fine-tuning data augmentation that have been developed to overcome data scarcity in languages beyond English. We will then discuss what the limitations of these approaches are, and what directions are relevant for future research.
Chris Emezue
Researcher and Entrepreneur, Lanfrica
Chris Emezue is a mathematician & computer scientist with a passion for languages and improving people's lives with technology. He has spent half a decade designing inclusive AI technologies that better serve the African population. His research areas cut across the intersection of natural language processing, causality, and reinforcement learning. At Lanfrica, he is working to map the landscape and trajectory of AI in Africa, to accelerate innovation and enable holistic understanding of the continent's AI representation, gaps, and risks.
Title: To be confirmed
Advancing African NLP: UDMorph and flexiPipe
Authors: Maarten Janssen
Abstract: In this paper, we describe various recent efforts to provide base NLP pipelines for African languages. These include an infrastructure called UDMorph to make UD-compatible training data available for resources that do not have dependency relations, and a Python package called flexiPipe for easily running an NLP pipeline in various NLP tools, including the models provided by UDMorph, through a uniform front-end. flexiPipe also provides Unicode normalization, an often overlooked feature that has a significant impact on African NLP. flexiPipe currently provides an NLP pipeline for 33 African languages, a significant increase from the handful of models that are currently easily accessible, and UDMorph is designed to make it easy to provide training data for more languages.
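To make the Unicode normalization point concrete, the minimal example below shows two encodings of the same Yorùbá character sequence that render identically yet compare unequal until normalized, which is exactly the kind of silent mismatch that hurts matching and tokenization in African-language pipelines.

```python
import unicodedata

# Two encodings of Yorùbá "ọ̀": fully decomposed (o + dot below + grave) vs.
# partly precomposed (ọ + grave). They look identical but are different strings.
decomposed = "o\u0323\u0300"
precomposed = "\u1ecd\u0300"
print(decomposed == precomposed)                                 # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True
```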
AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages
Authors: Mardiyyah Oduwole, Prince Mireku, Fatimo Adebanjo, Oluwatosin Olajide, Mahi Aminu Aliyu, Jekaterina Novikova
Abstract: Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across underrepresented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for underrepresented African languages, laying the groundwork for truly inclusive multimodal AI.
AfriNLLB: Efficient Translation Models for African Languages
Authors: Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew
Abstract: In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200. We compress NLLB-200 600M using two approaches: iterative layer pruning and quantization. Our work aims to enable efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being over 20% faster. We release two versions of the AfriNLLB models: a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we curated for fine-tuning the baseline and pruned models to facilitate further research.
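As a sketch of how a CTranslate2 release like this is typically queried, the snippet below follows the standard NLLB-200 + CTranslate2 recipe; the model directory name is a hypothetical placeholder, and the language codes are NLLB conventions rather than confirmed AfriNLLB identifiers.

```python
import ctranslate2
from transformers import AutoTokenizer

# Tokenize with the NLLB tokenizer; "eng_Latn"/"swh_Latn" are NLLB codes.
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn"
)
translator = ctranslate2.Translator("afrinllb-ct2", device="cpu")  # hypothetical path

text = "Good morning, how are you?"
source = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
result = translator.translate_batch([source], target_prefix=[["swh_Latn"]])
target = result[0].hypotheses[0][1:]  # drop the forced target-language token
print(tokenizer.decode(tokenizer.convert_tokens_to_ids(target),
                       skip_special_tokens=True))
```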
AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models
Authors: Yann Le Beux, Oluchi Audu, Oche David Ankeli, Dhananjay Balakrishnan, Melissah Weya, Marie Daniella Ralaiarinosy, Ignatius Ezeani
Abstract: Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged efforts across Senegal, Kenya, and Nigeria, we collected 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augmented the dataset to over 5,000 stereotype–antistereotype pairs. Entries were validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p ≤ 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across age, profession, and gender dimensions. Domain-specific models appeared to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering the AI community key methodologies for building more equitable, context-aware, and globally inclusive NLP technologies.
Building a Conversational AI Assistant for African Travel Services with LLMs and RAG
Authors: Grace Kevine Ngoufo, Shamsuddeen Hassan Muhammad
Abstract: Travel agencies in many African countries face increasing pressure to handle large volumes of customer inquiries with limited staff and either non-existent or outdated rule-based chatbots. To address this challenge, we develop a conversational virtual assistant powered by a Large Language Model (LLM) and enhanced with a Retrieval-Augmented Generation (RAG) pipeline. The system combines LLM reasoning, company-specific knowledge retrieval, and real-time API (Application Programming Interface) integration to deliver accurate, context-aware responses through WhatsApp, the region’s most widely used communication platform. A dedicated web interface enables staff to upload and update internal documents, ensuring that the assistant remains aligned with changing service information. Demonstrations show that the proposed solution improves response speed, enhances user experience, and reduces operational burden.
Can Text-to-Speech Systems enable Inclusive Computer-Based Testing? An Evaluation of Yoruba TTS for Visually Impaired Learners
Authors: Kausar Yetunde Moshood, Victor Tolulope Olufemi, Oreoluwa Boluwatife Babatunde, Emmanuel Bolarinwa, Williams Oluwademilade
Abstract: Text-to-Speech (TTS) technology offers the potential to improve exam accessibility for visually impaired learners, but existing systems often underperform in underrepresented languages like Yoruba. This study evaluates current Yoruba TTS models in delivering standardized exam content to five visually impaired students through a web-based interface. Before testing, four Yoruba TTS systems were compared; only Facebook’s mms-tts-yor and YarnGPT produced intelligible Yoruba speech. Students experienced exam questions delivered by human voice, Braille, and TTS. All preferred Braille for clarity and independence, some valued human narration, while TTS was least favored due to robotic and unclear output. These results reveal a significant gap between TTS capabilities and the needs of users in low-resource languages. The paper highlights the urgency of developing tone-aware, user-centered TTS solutions to ensure equitable access to digital education for visually impaired speakers of underrepresented languages.
Dealing with the Hard Facts of Low-Resource African NLP
Authors: Michael Leventhal, Yacouba Diarra, Nouhoum COULIBALY, Panga Azazia Kamaté, Aymane Dembélé, Madani Amadou Tall, Emmanuel Élisé Koné
Abstract: Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
Developing an English–Efik Corpus and Machine Translation System for Digitization Inclusion
Authors: Offiong Bassey Edet, Mbuotidem Sunday Awak, Emmanuel Ubene Oyo-Ita, Benjamin Okon Nyong, Ita Etim Bassey
Abstract: Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages like Swahili, Yoruba, and Amharic, smaller indigenous languages such as Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of AI-based models in translating between English and Efik, leveraging a small-scale, community-curated parallel corpus. We fine-tuned the mT5 multilingual model on our manually curated 13,865-sentence parallel corpus. The model achieved a BLEU score of 15.61 and a chrF score of 35.04, demonstrating reasonable translation performance for a low-resource language. Our work demonstrates the feasibility of developing practical machine translation tools for low-resource languages and highlights the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
EduNaija AI Tutor: A Multi-Agent Retrieval-Augmented Generation System for Nigerian Curriculum Education
Authors: Israel Olanrewaju Odeajo, Edifon Emmanuel Jimmy
Abstract: Equitable access to quality education remains a critical challenge in Nigeria, where millions of students prepare annually for standardized examinations (WAEC, NECO, JAMB) with limited access to personalized tutoring (Badei et al., 2024). This research presents EduNaija AI Tutor, a multi-agent Retrieval-Augmented Generation (RAG) system designed to democratize educational support through AI-powered tutoring aligned with Nigerian curricula. The system integrates conversational AI with document-based question answering, automated assessment generation, and multilingual support for English, Yoruba, Hausa, and Igbo. Using LangChain for agent orchestration, OpenAI GPT models for natural language processing, and FAISS for vector retrieval, the system enables students to interact with educational content through natural language queries while maintaining cultural relevance through Nigerian-contextualized examples and conventions (Chukwuma et al., 2024). The multi-agent architecture comprises five specialized components: a main orchestrator, explanation agent, quiz generation agent, web search agent, and RAG agent for processing uploaded educational materials. Preliminary evaluation demonstrates the system's capability to provide curriculum-aligned explanations, generate practice assessments, and answer questions from uploaded textbooks and study materials. This work contributes a culturally-aware educational AI framework addressing linguistic diversity and curriculum alignment challenges in African educational contexts, while leveraging open-source tools for reproducibility and accessibility (Shoukat et al., 2025).
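To illustrate the vector-retrieval step such a RAG system relies on, here is a bare-bones FAISS sketch: embed curriculum chunks, index them, and fetch the closest passage for a query. The chunks and embeddings are random placeholders, not the EduNaija implementation.

```python
import numpy as np
import faiss

dim = 384  # embedding dimension (illustrative)
chunks = ["Photosynthesis converts light energy into chemical energy...",
          "JAMB registration requires the following documents..."]
embeddings = np.random.rand(len(chunks), dim).astype("float32")  # placeholders

index = faiss.IndexFlatL2(dim)   # exact L2 nearest-neighbour index
index.add(embeddings)

query = np.random.rand(1, dim).astype("float32")  # embedded student question
_, ids = index.search(query, 1)
print(chunks[ids[0][0]])  # passage handed to the LLM as grounding context
```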
Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Authors: Hiba El Oirghi, Tajuddeen Gwadabe, Marine Carpuat
Abstract: Wikipedia editors undertake the task of editing machine translation (MT) outputs in various languages to disseminate multilingual knowledge from English. But are editors doing more than just translating or fixing MT output? To answer this broad question, we constructed a dataset of 4,335 fine-grained annotated parallel pairs of MT translations and human post-edit (HE) translations for five low-resource African languages: Hausa, Igbo, Swahili, Yoruba, and Zulu. We report on our data selection and annotation methodologies as well as findings from the annotated dataset, the most surprising of which is that annotators mostly preferred the MT translations over their HE counterparts for three out of five languages. We analyze the nature of these "fluency breaking" edits and provide recommendations for the MT post-editing workflows in the Wikipedia domain and beyond.
Eyaa-Tom 26, Yodi-Mantissa and Lom Bench: A Community Benchmark for TTS in Local Languages
Authors: BAKOUBOLO ESSOWE JUSTIN, Catherine Nana Nyaah Essuman
Abstract: Most of the more than 40 languages spoken in Togo lack the data and tools necessary for modern natural language processing (NLP) applications. We present an extension of previous work by introducing new datasets, improved models, and a community-driven evaluation benchmark for text-to-speech (TTS). We expanded the Eyaa-Tom multilingual corpus with additional speech data (e.g. 26.9k recordings, 30.9 hours) across 10 local languages and incorporated Mozilla Common Voice contributions (64.6k clips, 46.6 hours) for Adja, Nawdm, Mina, Tem to strengthen automatic speech recognition (ASR) and speech synthesis. We detail how community contributors (including collaboration with a national TV journalist) helped collect and validate the Kabiyɛ and French text, with an ethical compensation model in place. We also try to compare the performance of a few models in these datasets, we fine-tuned state-of-the-art models in these data for ASR, OpenAI Whisper and faster-whisper were benchmarked achieving improved word error rates after fine-tuning; for machine translation, we fine-tuned Meta's NLLB-200 model in 11 local languages, which produced significant BLEU/METEOR gains especially in Ewɛ, and Kabɩyɛ. To evaluate TTS, we introduce Lom Bench, a new community-based benchmark where native speakers rate synthetic speech. The preliminary results from Lom Bench indicate promising naturalness in Ewɛ and Kabɩyɛ TTS, although further data is needed.
Full Fine-Tuning vs. Parameter-Efficient Adaptation for Low-Resource African ASR: A Controlled Study with Whisper-Small
Authors: Sukairaj Hafiz Imam, Muhammad Yahuza Bello, Hadiza Ali Umar, Tadesse Destaw Belay, Idris Abdulmumin, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad
Abstract: Automatic speech recognition (ASR) for African low-resource languages (LRLs) is often limited by scarce labelled data and the high cost of adapting large foundation models. This study evaluates whether parameter-efficient fine-tuning (PEFT) can serve as a practical alternative to full fine-tuning (FFT) for adapting Whisper-Small with limited labelled speech and constrained compute. We used a 10-hour subset of NaijaVoices covering Hausa, Yorùbá, and Igbo, and we compared FFT with several PEFT strategies under a fixed evaluation protocol. DoRA attains a 22.0% macro-average WER, closely aligning with the 22.1% achieved by FFT while updating only 4M parameters rather than 240M, and this difference remains within run-to-run variation across random seeds. Yorùbá consistently yields the lowest word error rates, whereas Igbo remains the most challenging, indicating that PEFT can deliver near FFT accuracy with substantially lower training and storage requirements for low-resource African ASR.
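For readers unfamiliar with the adapter setup being compared, a minimal sketch of wrapping Whisper-Small with a DoRA adapter via the PEFT library follows; the rank and target modules are illustrative defaults, not the authors' exact configuration.

```python
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    use_dora=True,                        # weight-decomposed LoRA (DoRA)
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # a few M trainable vs. ~240M total
```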
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Authors: Mamadou K. KEITA, Sebastien Diarra, Christopher M Homan, Seydou DIALLO
Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
Authors: Michael Leventhal, Yacouba Diarra, Nouhoum COULIBALY, Panga Azazia Kamate
Abstract: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes the code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tuned Parakeet-based models on a 33.47-hour human-reviewed subset and applied pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
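A toy illustration of the kind of pragmatic transcript normalization the abstract mentions is shown below; the tag and code-switch annotation conventions are hypothetical stand-ins for the Kunkado specification.

```python
import re

def normalize(transcript: str) -> str:
    """Reduce variability in number formatting, tags, and code-switch marks."""
    t = transcript.lower()
    t = re.sub(r"\[(noise|music|laughter)\]", " ", t)   # drop event tags
    t = re.sub(r"<fr>(.*?)</fr>", r"\1", t)             # unwrap code-switch spans
    t = re.sub(r"(\d),(\d{3})\b", r"\1\2", t)           # 1,000 -> 1000
    return re.sub(r"\s+", " ", t).strip()

print(normalize("A ko [noise] <fr>mille</fr> 1,000 sɛbɛn"))
```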
Language Choice in Nigerian Social Media Hate Speech
Authors: Nneoma C Udeze, Rob Voigt
Abstract: Language choice in multilingual societies is rarely arbitrary. In Nigeria, English, Nigerian Pidgin (NP), and indigenous languages are strategically deployed in online discourse, yet little is known about how they function in hostile contexts. Here we conduct the first systematic analysis of NP in online hate speech on two platforms, Twitter and Instagram. Using a linguistically enriched annotation scheme, we label each post for class, targeted group, language variety, and hate type. Our results show that NP is disproportionately used in offensive and hateful discourse, particularly against Hausa, women, and LGBTQ+ groups, and that insults are the dominant hate strategy. Cross-domain evaluation further reveals that classifiers trained on Twitter systematically over-predict hate on Instagram, highlighting challenges of domain transfer. These findings underscore NP’s role as a linguistic resource for hostility and its sociolinguistic salience in amplifying stereotypes and affect. For NLP, the work demonstrates the need for NP-specific resources, sensitivity to figurative strategies, and domain adaptation across platforms. By bridging sociolinguistics and computational modeling, this study contributes new evidence on how language choice shapes online hate speech in a multilingual African context.
Leveraging CoHere Multilingual Embeddings and Inverted Softmax Retrieval for Automatic Parallel Sentence Alignment in Low-Resource Languages
Authors: Abubakar Auwal Khalid, Salisu Musa Borodo, Amina Abubakar Imam
Abstract: We present an improved method for automatic parallel sentence alignment in low-resource languages, using CoHere multilingual embeddings and inverted softmax retrieval. Our technique achieved higher F1-scores of 78.25%, 76.66%, and 73.69% on the MAFAND-MT test, development, and train datasets, respectively, compared to the existing technique’s 54.75%, 49.02%, and 39.24% on the same datasets. Precision and recall show similar performance. We assessed the quality of the extracted data by demonstrating that it outperforms the existing technique in terms of low-resource translation performance.
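For intuition, here is a compact NumPy sketch of inverted softmax retrieval (Smith et al., 2017) over sentence embeddings; random vectors stand in for the CoHere multilingual embeddings, and beta is an illustrative temperature.

```python
import numpy as np

def inverted_softmax_scores(src, tgt, beta=10.0):
    """src: (n, d), tgt: (m, d) L2-normalized embeddings -> (n, m) scores."""
    sims = src @ tgt.T                        # cosine similarities
    # Normalize each target column over ALL source sentences, penalizing
    # "hub" targets that are close to everything.
    exp = np.exp(beta * sims)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
src = rng.normal(size=(5, 8)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(6, 8)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(inverted_softmax_scores(src, tgt).argmax(axis=1))  # best match per source
```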
Linguistically Informed Evaluation of Multilingual ASR for African Languages
Authors: Fei-Yueh Chen, Lateef Adeleke, C. M. Downey
Abstract: Word Error Rate (WER) mischaracterizes ASR models' performance for African languages by collapsing phonological, tonal, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in model performance. We examine this hypothesis by evaluating three speech encoders on two African languages using WER, CER, and FER, with a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistic learning even when word-level accuracy remains low. Our results reveal that the models perform better on segmental features, while tones, especially mid and downstep tones, remain the most challenging features. For Yoruba, a language in the pretraining data, when the WER is as high as 0.788, the CER is 0.305 and the FER is as low as 0.151. Similarly, for Uneme, an endangered language absent from the pretraining data, a model with near-total WER and 0.461 CER achieves a FER as low as 0.267. This indicates that models learn phonological features even when they still struggle with full lexical accuracy, and it provides linguistically meaningful information about model performance and how it relates to speakers' knowledge.
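As a rough illustration of the metric distinction the paper draws, the sketch below computes WER/CER with the jiwer library and a toy feature-level error over hand-made phonological feature vectors; the three-feature table is an illustrative stand-in, not the paper's feature inventory.

```python
import jiwer

ref, hyp = "bá wa lọ", "ba wa lo"
print(jiwer.wer(ref, hyp), jiwer.cer(ref, hyp))  # whole words vs. characters

FEATS = {  # phone -> (high_tone, voiced, round); illustrative only
    "á": (1, 1, 0), "a": (0, 1, 0), "ọ": (0, 1, 1), "o": (0, 1, 1),
}

def feature_error_rate(r: str, h: str) -> float:
    """Fraction of mismatched features across aligned phones (toy FER)."""
    pairs = [(FEATS[x], FEATS[y]) for x, y in zip(r, h)
             if x in FEATS and y in FEATS]
    diffs = sum(a != b for fr, fh in pairs for a, b in zip(fr, fh))
    return diffs / (len(pairs) * 3)

print(feature_error_rate("áọ", "ao"))  # only one tone feature differs
```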
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Authors: Seung Hun Eddie Han, Youssef Mohamed, Mohamed Elhoseiny
Abstract: This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 40+ languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, which is competitive with the English MMMU scores of state-of-the-art models in the same weight class. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Authors: Abdifatah Ahmed Gedi, Shafie Abdi Mohamed, Yusuf Ahmed Yusuf, Muhidin A. Mohamed, Fuad Mire Hassan, Houssein A Assowe
Abstract: Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine learning-based language models. However, a key research challenge for low-resourced languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. For data validation purposes, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus spanning news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
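A toy version of the lexicon-plus-rules design described above might look like the following; the lexicon entries and suffix rules are illustrative Somali-like examples, not the released resource.

```python
LEXICON = {"buugag": "buug", "guryaha": "guri"}  # derived form -> root (illustrative)
SUFFIXES = ["yaal", "ka", "ta", "ha"]            # strip-and-retry fallback rules

def lemmatize(word: str) -> str:
    if word in LEXICON:                  # 1) exact lexicon lookup
        return LEXICON[word]
    for suffix in SUFFIXES:              # 2) rule-based handling of OOV terms
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word                          # 3) fall back to the surface form

print([lemmatize(w) for w in ["buugag", "dukaanka", "nabad"]])
# ['buug', 'dukaan', 'nabad']
```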
Power Asymmetries, Bias, and AI, a Reflection of Society on Low-Resourced Languages - African Languages as Case Study
Authors: Simbiat Ajao
Abstract: In recent times, artificial intelligence (AI) systems have become the primary intermediary to information access, services, and opportunities. Currently, there are growing concerns as to how existing social inequalities are reproduced and amplified through AI. This is significantly evident in language technologies, where a small number of dominant languages (what we'll refer to as big languages) and cultural contexts shape the training, design, and evaluation of models. This paper examines the intersections of power asymmetries, linguistic bias, and cultural representation in AI, with a major focus on African languages and communities. We argue that current Natural Language Processing (NLP) systems reflect a high level of global imbalances in the availability of data, infrastructure, and decision-making power, often marginalizing low-resourced languages and cultural peculiarities. Importantly, how these data are structured largely determines what their outcomes will be. With reference to examples from speech recognition, machine translation, and large language models, we highlight the social and cultural consequences of linguistic exclusion, including reduced accessibility, misinterpretation, and digital invisibility. Finally, we identify and discuss pathways toward more equitable language technologies, emphasizing community-led data practices, interdisciplinary collaboration, and context-aware evaluation frameworks. By foregrounding language as both a technical and political concern, this work advocates for African-centered approaches to NLP that promote fairness, accountability, and linguistic justice in AI development.
Real-Time Spoken Instruction Following and Translation in Ugandan Languages
Authors: Benjamin Akera, Tim Wenjie Hu, Patrick Walukagga, Evelyn Nafula Ouma, Yiga Gilbert, Ernest Mwebaze, John Quinn
Abstract: Many languages are predominantly spoken rather than written, and to bring the benefits of LLMs to speakers of these languages, it is essential that models cater to the voice modality. The typical approach is to cascade ASR, LLM, and TTS models together, though this results in systems with high latency, making them unsuitable for natural, real-time interaction. We describe results from taking the encoder of a Whisper-based model trained to recognise ten languages common in Uganda and using the Ultravox architecture to project its output directly into the input embedding space of a text model based on Qwen 3 32B, also trained to comprehend those languages. The result is a speech LLM with high accuracy and very low latency. For most spoken prompts, we can begin streaming a text response within as little as 50 ms, and a speech audio response within around one second, making real-time spoken interaction with an LLM possible for the first time in these languages. The model is available as open source.
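Conceptually, the projection step can be pictured as a small trainable module that maps speech-encoder states into the LLM's token-embedding space, as in this toy PyTorch sketch; the dimensions are illustrative, and the actual Ultravox projector details may differ.

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Map speech-encoder states into an LLM's token-embedding space."""
    def __init__(self, enc_dim: int = 1280, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # (batch, frames, enc_dim) -> (batch, frames, llm_dim): each audio
        # frame becomes a soft "token" consumed alongside text embeddings.
        return self.proj(encoder_states)

states = torch.randn(1, 150, 1280)        # Whisper-style encoder output
print(AudioProjector()(states).shape)     # torch.Size([1, 150, 5120])
```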
Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Chege Maina, Keshet Ronen, Javier Gonzalez
Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social science measurement lens, we operationalize LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating greater interpretive stability, while smaller open-weight models in our study show reduced stability under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
SALT-31: A Machine Translation Benchmark Dataset for 31 Ugandan Languages
Authors: Solomon Nsumba, Benjamin Akera, Evelyn Nafula Ouma, Medadi Ssentanda, Deo Kawalya, Engineer Bainomugisha, Ernest Mwebaze
Abstract: "We present SALT-31 benchmark dataset, for evaluation of machine translation (MT) models and covering 31 Ugandan languages. Unlike sentence-level evaluation sets, SALT-31 is constructed from short, scenario-driven mini-dialogues designed to preserve discourse context, pragmatics, and culturally grounded communication patterns common in everyday Ugandan settings. The dataset contains 100 English
sentences organized into 20 typical communication scenarios, each represented as a five sentence mini-sequence. It can therefore be used to evaluate both sentence-level and paragraph level machine translation, and includes nearly every language spoken in a country with high linguistic diversity."
Sample-Size Scaling of the African Languages NLI Evaluation
Authors: Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Sopuruchi Nwokocha
Abstract: African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably enhances downstream performance. This study is a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the usual assumption of monotonic improvement with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or decreasing performance with sample size, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, underscoring the need for language-sensitive dataset creation and stronger multilingual modelling strategies.
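The subsampling protocol can be summarized schematically as below, where train_and_eval is a stub standing in for fine-tuning a model on an AfriXNLI subset and evaluating it; the placeholder scoring function exists only to make the sketch runnable.

```python
import random
import statistics

def train_and_eval(train_subset) -> float:
    # Stub: replace with fine-tuning XLM-R/AfroXLM-R on `train_subset` and
    # evaluating on the test split; here, a noisy placeholder accuracy.
    return 0.40 + 0.0001 * len(train_subset) + random.uniform(-0.03, 0.03)

def scaling_curve(pool, sizes=(50, 100, 200, 500), runs=5):
    curve = {}
    for n in sizes:
        scores = [train_and_eval(random.Random(seed).sample(pool, n))
                  for seed in range(runs)]
        curve[n] = (statistics.mean(scores), statistics.stdev(scores))
    return curve  # mean/std per size; non-monotonic curves show up directly

print(scaling_curve(list(range(5000))))
```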
Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect
Authors: HADIA MOHMMEDOSMAN AHMED SAMIL, David Ifeoluwa Adelani
Abstract: In this work, we introduce Sudanese-Flores, an extension of the popular Flores+ machine translation (MT) benchmark to the Sudanese Arabic dialect. We translate both the DEV and DEVTEST splits of the Modern Standard Arabic dataset into the corresponding Sudanese dialect, resulting in a total of 2,009 sentences. While the dialect was recently introduced in Google Translate, no benchmark is available for it despite its being spoken by over 40 million people. Our evaluation of two leading LLMs, GPT-4.1 and Gemini 2.5 Flash, showed that while their English-to-Arabic performance is impressive (more than 23 BLEU), they struggle on the Sudanese dialect (less than 11 BLEU) in zero-shot settings. In the few-shot scenario, we achieved only a slight boost in performance.
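Scoring zero-shot LLM output against a benchmark like this typically takes only a few lines with sacreBLEU; the sentences below are placeholders rather than Sudanese-Flores data.

```python
import sacrebleu

hypotheses = ["model translation one", "model translation two"]  # placeholders
references = [["reference translation one", "reference translation two"]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```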
Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation
Authors: Samuel Gyamfi, Alfred Malengo Kondoro, Yankı Öztürk, Richard Hans Schreiber, Vadim Borisov
Abstract: Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release a dataset together with a reproducible pipeline, providing working material for NLP researchers in low-resource contexts.
The Token Tax: Systematic Bias in Multilingual Tokenization
Authors: Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, Cody Carroll
Abstract: Tokenization inefficiency is associated with structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. In terms of economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
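Token fertility, the quantity the paper builds on, is straightforward to measure with any Hugging Face tokenizer, as in this quick sketch; the tokenizer choice and the Swahili example sentence are illustrative, not the paper's setup.

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")  # illustrative tokenizer

def fertility(sentence: str) -> float:
    """Subword tokens per whitespace-separated word."""
    return len(tok.tokenize(sentence)) / len(sentence.split())

print(fertility("The children are going to school"))  # near 1 for English
print(fertility("Watoto wanaenda shule"))             # typically higher
```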
Using Subword-Embeddings for Bilingual Lexicon Induction in Bantu Languages
Authors: Adrian Breiding, Alan Akbik
Abstract: Bilingual Lexicon Induction (BLI) is a valuable tool in machine translation and cross-lingual transfer learning, but it remains challenging for agglutinative and low-resource languages. In this work, we investigate the use of weighted sub-word embeddings in BLI for agglutinative languages. We further evaluate a graph-matching and Procrustes-based BLI approach on two Bantu languages, assessing its effectiveness in a previously underexplored language family. Our results for Swahili, with an average P@1 score of 51.84% for a 3,000-word dictionary, demonstrate the success of the approach for Bantu languages. Weighted sub-word embeddings perform competitively on Swahili and outperform word embeddings in our experiments with Zulu.
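The Procrustes step at the core of such BLI pipelines is compact enough to show in full; random matrices stand in for the (sub)word embeddings and the seed dictionary.

```python
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))               # source embeddings of seed pairs
W_true = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # hidden orthogonal map
Y = X @ W_true                                # target-side embeddings

R, _ = orthogonal_procrustes(X, Y)            # min ||XR - Y|| over orthogonal R
query = X[0] @ R                              # map a source word to target space
print(np.argmax(Y @ query))                   # 0: retrieves the correct pair
```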
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Authors: Seydou DIALLO, Yacouba Diarra, Panga Azazia Kamaté, Aboubacar Ouattara, Mamadou K. KEITA
Abstract: This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards: the top-performing system achieved a Word Error Rate (WER) of 46.76%, the best Character Error Rate (CER), 13.00%, was set by a different model, and several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound for performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
ÒWE-Voice: An Evaluation of Monolingual and Multilingual ASR Model Using Yoruba Proverb Speech Dataset
Authors: Daud Abolade
Abstract: Given the advancement of various Artificial Intelligence (AI) technologies in the 21st century, Automatic Speech Recognition (ASR) plays a vital role in human-machine interaction and serves as an interface for a wide range of applications. The development of these high-performing, robust, and useful technologies continues to concentrate on high-resource languages, owing to the high availability of language data, market dominance, and access to funding and research initiatives, compared to marginalised low-resource languages. Despite efforts to develop ASR systems for African languages, numerous challenges remain due to limited speech datasets, tonal complexity, and dialectal variation. In this study, we curated a domain-specific speech dataset for one genre of oral Yoruba literature: proverbs, which are deeply culturally grounded. We used the Yoruba recording app developed for the Iroyin-speech project to record 6 hours of Yoruba proverb sentences. We evaluated the NCAIR1/Yoruba-ASR model, fine-tuned from OpenAI Whisper Small, and Massively Multilingual Speech, a multilingual speech model covering low-resource languages including Yoruba, on the recorded proverbs. Evaluation was conducted using Word Error Rate (WER) and Tone Error Rate (TER). Our results show that current ASR systems supporting Yoruba do not capture cultural nuances. These findings highlight an urgent need to curate more robust, culturally embedded speech datasets for low-resource languages, and for Yoruba in particular, in order to build technological tools that preserve African culture, language, and identity.
African Voices Nigeria: 2500 hours of ethically sourced speech data for four Nigerian Languages
Authors: Ife Adebara, Oluwaseun Nifemi, Rashidat Damilola Sikiru, Olanrewaju Israel Lawal, Ololade Anjuwon, Olubayo Adekanmbi, Anthony Soronnadi
Abstract: African languages remain severely underrepresented in large-scale speech resources, particularly for spontaneous, naturally occurring speech that reflects real-world linguistic use. We present African Voices, a large-scale, ethically governed speech dataset covering four Nigerian languages, with approximately 2,500 hours of audio from 2,865 speakers, focusing on spontaneous and scripted speech across diverse sociolinguistic contexts. Unlike existing resources that rely primarily on read or scripted speech, African Voices captures natural variation in accent, dialect, register, and code-switching, accompanied by rich demographic and contextual metadata. We describe the data collection methodology, the transcription process, and a principled governance framework designed to support responsible use of speech data in low-resource settings. We further provide baseline automatic speech recognition results across languages. African Voices enables research on robust and fair ASR and serves as a foundational resource for advancing NLP research in African languages.
AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic
Authors: Israel Abebe Azime, Abenezer Angamo, Hana Mekonen Tamiru, Dagnachew Mekonnen Marilign, Philipp Slusallek, Seid Muhie Yimam, Dietrich Klakow
Abstract: With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a model’s understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce AmharicStoryQA, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.
Analyzing Sentiment Polarity in Amharic Climate Change Discussions Using Large Language Models
Authors: Gebregziabihier Nigusie, Neima Mossa Ahmed, Tesfa Tegegne Asfaw
Abstract: Climate change refers to variations in temperature and weather conditions due to various climate-related factors on Earth. These factors vary across regions, as do people's perceptions of climate change. Analyzing public opinion on climate change at a regional level is crucial for developing targeted solutions, but manually analyzing large volumes of data makes informed decision-making challenging. Applying emerging pre-trained Large Language Models offers a promising solution for efficiently analyzing large datasets and understanding public perspectives on climate change. Amharic is one of the widely spoken African languages, and many of its speakers actively discuss and express their opinions on various topics, including climate change, on social media. Given these increasing discussions, this study focuses on sentiment analysis of Amharic climate texts. We collected 6,013 sentences from social media and news sources; the data was manually annotated with target polarity by native speakers. We conducted experiments using LLMs that include African languages in their pre-training. In this study, the MultilingualBert and AfriBERTa models were employed with hyperparameter tuning to perform sentiment polarity analysis on Amharic climate text. The experimental results show that MultilingualBert outperforms AfriBERTa, achieving an accuracy of 69%. This performance is attributed to MultilingualBert’s enhanced capability to capture token-level semantics by distributing varied attention across tokens, thereby improving its contextual understanding in downstream sentiment classification tasks.
BambaraMLLM: A Unified Multilingual Multimodal Large Language Model for Comprehensive Bambara Language Processing
Authors: Seydou DIALLO, Allahsera Auguste Tapo, Kevin Assogba, Christopher M Homan
Abstract: BambaraMLLM is a unified multilingual multimodal large language model (MMLLM) designed to address the critical lack of digital resources for Bambara, a West African language spoken by over 15 million people. Unlike traditional approaches that rely on task-specific models for different linguistic functions, BambaraMLLM integrates text generation, automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis into a single, transformer-based architecture. This work establishes a scalable, open-source foundation for African language technology, optimizing for both performance and deployment under resource constraints.
LexiMCH: A Bilingual Medical Knowledge Lexicon for Maternal and Child Healthcare in Low-Resource Languages and Healthcare Environments
Authors: Aziza Umer Yibrie, Seid Muhie Yimam, Katrin Schöning‑Stierand, Kaleab Anteneh, Rebecca Ashagire, Robera Habtamu, Rahel Bekele
Abstract: Maternal and child healthcare (MCH) in low-resource contexts faces persistent challenges due to linguistic and cultural barriers to accessing medical information. To address this, we develop a multilingual terminology resource focusing on English and Amharic, using a combination of machine translation, large language models (LLMs), and expert-in-the-loop validation. In this work, we evaluate a subset of 90 terms and definitions across multiple translation models, including Google Translate, NLLB-200, M2M100, and several LLM variants (GPT, LLaMA, Gemma, DeepSeek, Gemini, and Mistral). We use BLEU, chrF, and ROUGE-L metrics to assess translation quality for both terms and definitions. Preliminary results indicate variable performance across models, with DeepSeek-R1 achieving the highest BLEU scores (0.916 for definitions and 0.985 for terms) and LLM-assisted translations generally performing better on definitions than on terms. Ongoing work is extending the evaluation to the full dataset and further refining translation pipelines to produce a comprehensive, open-access, AI-ready resource for maternal and child healthcare in low-resource languages.
May I Read My X-Ray Report? Towards Accessible Radiology in Low-Resource African Contexts
Authors: Aziza Umer Yibrie, Abinew Ali Ayele, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam
Abstract: Healthcare communication in native languages is a critical unmet need for Amharic-speaking populations in Ethiopia and diaspora communities. This study develops a preliminary framework for translating English radiology reports into Amharic using multilingual machine translation systems (Google Translate, NLLB-200, M2M100) and instruction-tuned large language models (GPT-4.1-mini, Gemini-2.0-Flash, and others), combined with human-in-the-loop evaluation. A subset of 100 IU X-Ray reports is translated, with 67 reports manually annotated for systematic assessment. Preliminary evaluation shows that Google Translate achieves the highest overall performance (BLEU 46.17, chrF 48.74, ROUGE-L 42.39), while LLMs such as Gemini-2.0-Flash (chrF 27.55) and GPT-4.1-mini (BLEU 13.14) produce fluent Amharic text but require substantial post-editing to ensure correct clinical terminology. Human annotator analysis emphasizes the importance of expert oversight in achieving terminological accuracy and report completeness. This work establishes an initial benchmark, introduces a scalable workflow, and provides a foundation for developing reliable Amharic radiology report translation systems, with potential applicability to other low-resource languages.
Media Framing Analysis of Ethiopian Conflict: An Approach Combining MAXQDA and NLP for Low-resource Languages
Authors: Adem Chanie Ali, Seid Muhie Yimam
Abstract: This ongoing research explores the application of computational methods to media framing analysis of Ethiopian conflict coverage, focusing on Amharic-language texts. Leveraging MAXQDA24 as a digital qualitative analysis platform, we utilize a computational coding approach to identify and visualize key framing strategies such as moral, demonization, humanitarian, and responsibility across a corpus of 150 articles from Addis Zemen and Addis Standard collected during 2023–2025. Although MAXQDA itself is not a traditional NLP tool, its computational functionalities (code system, memo editor, code frequencies, code relations analysis, visualization, etc.) serve to operationalize qualitative framing constructs within a systematic digital workflow, enhancing transparency and reproducibility in conflict media analysis. This preliminary phase reveals distinct framing patterns aligned with each outlet's institutional role: government narrative emphasizing security and demonization, versus independent framing emphasizing humanitarian concerns and systemic accountability. These findings demonstrate how digital qualitative analysis can serve as a computational proxy for conceptual framing analysis, enabling important insights into politicized discourse, especially in low-resource languages like Amharic. Looking ahead, the project plans to scale this analysis by developing and applying advanced NLP techniques such as machine learning classifiers, transformer-based models, and topic modeling on a larger dataset of approximately 5,000 annotated articles. This dataset has already been collected and is designed to capture a broader spectrum of conflict-related discourse. Combining the present qualitative insights with automated NLP pipelines aims to push towards scalable, semi-automated conflict framing detection tailored for low-resource languages. Our contribution is twofold: first, demonstrating the efficacy of computational-qualitative workflows using widely accessible tools like MAXQDA in challenging linguistic and political contexts; second, paving the way for NLP-driven conflict analysis in underrepresented languages through resource creation and model adaptation. By integrating semantic, lexical, and entity-based features, the planned NLP methods aspire to accurately classify and analyze complex frames such as blame, omission, and moral judgment at scale, opening avenues for real-time monitoring and analysis of conflict narratives. This will yield a computational framing analysis framework grounded in media framing theory in communication research. This research addresses critical challenges in low-resource NLP, including limited annotated data, language-specific morphological complexity, and the sensitive nature of conflict discourse. It underscores the importance of combining communication research approaches with cutting-edge NLP techniques to advance multilingual media analysis in conflict environments.
Probing Gender Bias in Masked Language Models for Low-Web Data Languages
Authors: Bontu Fufa Balcha, Jitu Ewnetu Hailu, Senait Mengesha Yayo, Hellina Hailu Nigatu
Abstract: Low-resourced languages are increasingly included in large multilingual models. While including more languages in pretrained models is a sign of progress, large models still underperform on low-resourced languages. In prioritizing scale over effective processing, we risk 1) deploying language technologies that misrepresent these languages and 2) amplifying gender biases embedded in training corpora. In this paper, we investigate how masked language models encode gender for three low-web-data languages, Afan Oromo, Amharic, and Tigrinya, and how these representations shift after continued pretraining on NLLB data. Using a controlled cloze-style probing setup, we examine prediction patterns. Our findings show consistent gender asymmetries and predictions aligned with stereotypical adjectives and occupations. After continued pretraining, we find that male-gendered predictions reach up to 68% in Amharic, while neutral predictions exceed 60% in Afan Oromo. Our work shows that expanding training data does not guarantee balanced gender representations without careful consideration in data curation.
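As a rough illustration of the cloze-style probing setup, the sketch below queries a masked language model for its top fillers of a gendered slot; the model name and the English template are assumptions chosen for readability, whereas the paper probes Afan Oromo, Amharic, and Tigrinya stimuli.

```python
from transformers import pipeline

# Minimal cloze probe; xlm-roberta-base and the English template are
# illustrative stand-ins for the paper's actual models and templates.
fill = pipeline("fill-mask", model="xlm-roberta-base")

template = "<mask> is a nurse."  # occupation held fixed, gendered slot masked
for pred in fill(template, top_k=5):
    # Inspect whether male- or female-gendered fillers dominate the predictions.
    print(f"{pred['token_str']!r}  p={pred['score']:.3f}")
```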
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Authors: Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev
Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005–2025)
Authors: Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Grigori Sidorov
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and the researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) research questions about the progress of AfricaNLP (publications, NLP topics, and NLP tasks), contributions (data, method, and task), and contributors (authors, affiliated institutions, and funding bodies). We quantitatively examine two decades (2005–2025) of contributions to AfricaNLP research, using a dataset of 1.9K NLP papers, 4.9K contributing authors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions), along with benchmark results. Our dataset and AfricaNLP research explorer tool will provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven research approaches.
Towards Multimodal Cultural Context Modeling for African Languages in Large Language Models
Authors: Mahule Roy, Subhas Roy
Abstract: This preliminary work addresses the critical gap in multimodal Large Language Models (LLMs) for African languages, which remain underrepresented despite their rich multimodal communication traditions. We propose a framework that leverages simulated multimodal data and cross-lingual transfer learning to bootstrap multimodal capabilities. Our initial experiments with Swahili demonstrate that proxy multimodal embeddings can be effectively generated using pre-trained encoders, achieving an average cosine similarity of 0.72 for culturally relevant concepts. We further show that simple fusion methods can effectively combine these embeddings, and that transfer learning from high-resource languages yields a 28% improvement in multimodal alignment over zero-shot approaches. These results validate the feasibility of our approach and provide a foundation for culturally-aware multimodal LLMs in low-resource African language contexts.
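To make the reported alignment measure concrete, here is a minimal sketch of a cosine-similarity check between a Swahili phrase embedding and a proxy embedding from a pre-trained multilingual encoder; the model choice and example phrases are assumptions for illustration, not the paper's actual encoders or data.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative alignment check: embed a Swahili phrase and an English
# stand-in "multimodal" proxy caption, then compare directions.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

text_vec = model.encode("ngoma za asili")                    # Swahili: traditional drums/dances
proxy_vec = model.encode("a traditional drumming ceremony")  # proxy caption

cos = float(np.dot(text_vec, proxy_vec) /
            (np.linalg.norm(text_vec) * np.linalg.norm(proxy_vec)))
print(f"cosine similarity: {cos:.2f}")
```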
Trust but Check: LLM-Assisted Review of Human Translations in African Languages
Authors: Tadesse Destaw Belay, Henok Biadglign Ademtew, Idris Abdulmumin, Sukairaj Hafiz Imam, Abubakar Juma Chilala, Godfred Agyapong, Chinedu Emmanuel Mbonu
Abstract: Large-scale translation projects for African languages increasingly rely on human translators to ensure cultural and linguistic fidelity. However, even professionally produced translations often contain subtle semantic errors, omissions, and terminology inconsistencies that are difficult to detect, particularly across many languages. As a result, manual quality control becomes prohibitively expensive at scale, creating a major bottleneck in the development of high-quality Natural Language Processing (NLP) resources. Recent advances in multilingual large language models (LLMs) offer promising opportunities to support translation quality review, for example, by serving as lightweight assistants that flag potentially problematic segments for further inspection. In this work, we investigate how LLMs can assist translation quality review while preserving human oversight, thereby enabling more efficient and trustworthy translation quality audit pipelines for African languages. We conduct our study on 13 African languages from the MAFAND-MT dataset. Our findings indicate that state-of-the-art LLMs, such as GPT-5.1, can assist in auditing translation errors and suggesting candidate corrections for low-resource languages. However, they remain far from being a stand-alone solution for the automatic correction of human translations in African language datasets. The outputs of this work, including the improved MAFAND-MT test set and the accompanying quality audit annotation tool, provide valuable resources for researchers conducting further machine translation quality analysis and evaluation.
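The following is a minimal sketch of the "flag, do not auto-correct" review step this abstract describes, using the OpenAI chat API; the model name, prompt wording, and English-Hausa segment pair are illustrative assumptions, not the paper's actual pipeline or MAFAND-MT data.

```python
from openai import OpenAI

# Invented placeholder segment pair; note the Hausa draft drops "yesterday",
# the kind of subtle omission the reviewer model should flag.
source = "The council approved the new water project yesterday."
target = "Kansilar ta amince da sabon aikin ruwa."

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",  # stand-in; the paper reports results with GPT-5.1
    messages=[{
        "role": "user",
        "content": (
            "You are reviewing a human translation. Do not rewrite it. "
            "Reply OK if it is faithful, or FLAG plus a one-line reason "
            "(omission, mistranslation, or terminology) if not.\n"
            f"English: {source}\nHausa: {target}"
        ),
    }],
)
print(resp.choices[0].message.content)  # a human reviewer acts on any FLAGs
```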
What Do Prompts Reveal About Model Capabilities in Low-Resource Languages?
Authors: Oluwaseun A. Ajayi
Abstract: Large language models are extremely sensitive to prompt design, a phenomenon that is amplified in multilingual scenarios, especially for low-resource languages, due to low coverage in model training data, orthographic variation, and tokenization issues.
In this work, we evaluate a reflective prompt optimization technique (GEPA) as an inference-time optimization strategy on multiple multilingual benchmarks spanning various African languages. Using a more capable model as the reflection model and operating under strict optimization budgets, we show that reflective prompt optimization enables inference-time improvements through textual policy evolution, yielding consistent gains across different tasks without any weight updates. Our evaluation focuses mostly on closed-source models, and we observe that on some benchmarks, prompt-optimized smaller models can outperform stronger and more recent models, highlighting the importance of instruction design when measuring a language model's capabilities on a given task. A qualitative review of model outputs before and after optimization also shows that prompt evolution not only reinforces a model's ability to perform a particular task but also improves output formatting, which is essential for proper model evaluation. We characterize the resulting prompt-size/latency trade-off by quantifying the optimization cost in terms of prompt token growth; our results show that a modest increase in prompt size can yield substantial gains in performance. Finally, we argue that benchmark evaluations should report prompt-optimized results alongside baseline prompts in order to properly reflect model capabilities for low-resource languages.
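For intuition, here is a toy sketch of a GEPA-style reflective loop (evaluate, reflect, keep improvements) under a fixed budget; both the scorer and the reflection step are invented stand-ins, and a real setup would call the task model and a stronger reflection model through an LLM API.

```python
# Toy reflective prompt-optimization loop: score a prompt, ask a "reflector"
# for a revision, and keep the revision only if it scores higher.

def task_score(prompt: str) -> float:
    """Invented stand-in for running the task model on a dev set and scoring it."""
    return sum(kw in prompt.lower() for kw in ("label", "hausa", "only")) / 3

def reflect(prompt: str) -> str:
    """Invented stand-in for a stronger reflection model rewriting the prompt."""
    return prompt + " Reply only with the label, in Hausa."

def optimize(seed: str, budget: int = 5) -> tuple[str, float]:
    best, best_score = seed, task_score(seed)
    for _ in range(budget):           # strict optimization budget
        candidate = reflect(best)
        score = task_score(candidate)
        if score > best_score:        # textual policy evolution, no weight updates
            best, best_score = candidate, score
    return best, best_score

prompt, score = optimize("Classify the sentiment of this Hausa sentence.")
print(f"{score:.2f}  {prompt}")
```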
Shamsuddeen Hassan Muhammad
Google DeepMind Fellow, Imperial College London
Simbiat Ajao, University of Lagos
Bunmi Akinremi, Obafemi Awolowo University Ile-Ife
Jesujoba Alabi, Universität des Saarlandes
Felermino D. M. A. Ali, Universidade do Porto
Victor Jotham Ashioya, Kabarak University
Tadesse Destaw Belay, Instituto Politécnico Nacional, Centro de Investigación en Computación
Happy Buzaaba, Princeton University
Emmanuel Kigen Chesire, Kabarak University
Emmanuel Dorley, University of Florida
Bonaventure F. P. Dossou, Mila & McGill University
Khalid Elmadani, New York University, Abu Dhabi
Naome A Etori, University of Minnesota - Twin Cities
Eric Le Ferrand, Boston College
Elodie Gauthier, Orange
Gideon George, Data Science Nigeria
Agam Goyal, University of Illinois at Urbana-Champaign
David Guzmán, University of Toronto
Tajuddeen Gwadabe, Masakhane Research Foundation
Cari Beth Head, University of Florida
Raphael Iyamu, University of Florida
Sandeep Kumar Jha, LinkedIn Core AI
Adejumobi Monjolaoluwa Joshua, University of Agriculture Abeokuta
Sulaiman Kagumire, Makerere University
Aditi Khandelwal, Mila & McGill University
Alfred Malengo Kondoro, Hanyang University
Sujay S Kumar, Tesla
Sven Lampe, Carl von Ossietzky Universität Oldenburg
Melaku Lake, Injibara
En-Shiun Annie Lee, Ontario Tech University
Senyu Li, Mila & McGill University
Weiran Lin, Carnegie Mellon University
Elie Mulamba, Université de Kinshasa
Francois Meyer, University of Cape Town
Anjishnu Mukherjee, George Mason University
Mulubrhan Abebe Nerea, University West
Gebregziabihier Nigusie, Mizan Tepi University
Chester Palen-Michel, Brandeis University
Perez Ogayo, Oracle
Kelechi Ogueji, ServiceNow
Odunayo Ogundepo, University of Waterloo
Tolúlopé Ògúnrèmí, Stanford University
Jessica Ojo, Mila & McGill University
Ifeoma Okoh, University of Ibadan
Akintunde Oladipo, University of Waterloo
Flora Oladipupo, Data Science Nigeria
Stephen D. Richardson, Brigham Young University
Nathaniel Romney Robinson, Whiting School of Engineering, JHU
Ted Pedersen, University of Minnesota, Duluth
Elizabeth Salesky, Google DeepMind
Fabian David Schmidt, Bayerische Julius-Maximilians-Universität Würzburg
Tajwaa Scott, California State University, Los Angeles
Walelign Tewabe Sewunetie, African Institute for Mathematical Science, AIMS Rwanda
Olamide Shogbamu, Data Science Nigeria
Rashidat Damilola Sikiru, Obafemi Awolowo University Ile-Ife
Yueqi Song, Carnegie Mellon University
Van-Thuy Phi, RIKEN
Jiayi Wang, University College London
Seid Muhie Yimam, Universität Hamburg
Hao Yu, Mila & McGill University
You are invited to join the Masakhane community Slack (channel #africanlp-acl2026-support). Meet other participants and find collaborators, mentors, and advice there. Organizers will be available on Slack to answer questions regarding submissions, format, topics, etc. If you have any doubts about whether you can contribute to this workshop (e.g., if you have never written a paper, if you are new to NLP, if you do not have any collaborators, if you do not know LaTeX, etc.), please join Slack and contact us there as well.
To reach out to the workshop organizers, please email africanlp-eacl2026@googlegroups.com.