Co-located with EACL 2026, March 28th, 2026
Rabat, Morocco
The AfricaNLP workshop has become a core event for the African NLP community and has drawn global attendance and interest from researchers working on African languages, African corpora, and NLP tasks of importance to the African continent.
AfricaNLP invites papers related to any aspect of NLP for African languages.
In the current landscape, large language models (LLMs) have seen widespread use and significant innovation, yet African languages remain underrepresented. To address this disparity, the theme for the 2026 workshop is "Multilingual Multimodal LLMs." We believe this is an especially timely theme as LLMs are becoming more multilingually capable, but their performance across other modalities such as images and speech continues to lag behind. The workshop aspires to bring together a diverse group of researchers to explore solutions, collaborations, and innovation around enhancing LLMs’ capabilities in African languages and ensuring cultural awareness in their applications.
The workshop has several aims:
To invite a variety of speakers from industry, research networks, and academia to get their perspectives on the development of large language models and how African languages have and have not been represented in this work
To provide a venue to discuss the benefits and potential harms of these language models on the speakers of African languages and African researchers.
To enable positive interaction between academic, industry, and independent researchers around this theme and encourage collaboration and engagement for the benefit of the African continent
To foster further relationships between African linguistics and NLP communities. It is clear that linguistic input about African languages is key in the evaluation and development of African models
To showcase work being done by the African NLP community and provide a platform to share this expertise with a global audience interested in NLP techniques for low-resource languages
To promote multidisciplinarity within the African NLP community to create a holistic participatory NLP community that will produce NLP research and technologies that value fairness, ethics, decolonial theory, and data sovereignty
To provide a platform for the groups involved with the various projects to meet, interact, share, and forge closer collaboration
To provide a platform for junior researchers to present papers and solutions and begin interacting with the wider NLP community
To present an opportunity for more experienced researchers to publicize their work further and inspire younger researchers through keynotes and invited talks
Topics include, but are not limited to:
analyses of African languages by means of computational linguistics
empirical studies reporting results from applying or adapting NLP developed for high-resource languages to African languages
new model architectures tailored for African languages
new resources for African languages
using NLP techniques on African datasets
text generation for African languages
methods addressing out-of-domain generalization for NLP tasks with training data in very limited domains
transfer learning between African languages or from higher-resourced to lower-resourced languages
challenges or solutions for resource gathering for African NLP tasks
crowd-sourcing and open-sourcing software for African NLP
multidisciplinary and participatory research in African NLP
tutorials for African NLP for education or development purposes
new tools/software for African NLP
development of NLP systems for African languages for production
socio-linguistic research for African languages and their decolonization
ethical considerations for African NLP
This workshop follows the successful previous editions in 2020, 2021, 2022, 2023, 2024, and 2025. It will be hybrid and co-located with EACL 2026. No submissions will be automatically desk-rejected.
Important Dates
Workshop date: March 28th, 2026
Barbara Plank is Full Professor and Chair for AI and Computational Linguistics at LMU Munich, Co-director of the Center for Information and Language Processing and Head of the MaiNLP (Munich AI and NLP) lab at LMU. Barbara Plank is an ELLIS Fellow (European Laboratory for Learning and Intelligent Systems) and regularly serves in international organizations and on scientific advisory committees.
Title: The Emergence of Multilingual Representations: Tracing Linguistic Capabilities During Language Model Pretraining
Multilingual large language models exhibit remarkable zero-shot and cross-lingual transfer capabilities. However, most analyses focus on fully trained models, leaving limited understanding of how and when different types of linguistic information emerge, interact, and align within multilingual representation spaces during training.
In this talk, I present a series of studies investigating the training dynamics of linguistic knowledge in language models, tracing how linguistic structure and cross-lingual alignment develop over time. Studying these dynamics requires access to intermediate checkpoints, which are only available to a limited extent. Nevertheless, analyzing emerging representations opens up new avenues for diagnosing and improving multilingual LLMs. Understanding how alignment forms during pretraining is particularly important for models intended to support underrepresented and low-resource languages, where effective transfer and shared representations are crucial for performance.
Francois Meyer is a Lecturer in the Computer Science Department at the University of Cape Town and co-investigator in the UCT NLP research group. His research is on data-efficient language modelling and linguistically informed subword tokenisation. He completed his PhD at the University of Cape Town and previously obtained a master's in AI at the University of Amsterdam.
Title: Data-Efficient Language Modelling for Low-Resource Languages
Progress in language modelling has been driven by scaling data and model size, but this approach is infeasible for most African languages. In this talk, I will present our work on developing data-efficient language models - architectures and training algorithms that improve performance on limited training data. I will present examples of how linguistically informed modelling, which targets and leverages the linguistic properties of specific languages, can improve sample efficiency. Finally, I will discuss the emerging intersection between low-resource NLP and developmentally inspired NLP, exploring how insights from human language learning can help us build more efficient models.
Felermino Ali is a researcher at MSR Africa and a PhD Candidate at the University of Porto in Portugal, focused on natural language processing (NLP) with a specialization in low-resource African languages. His work centers on building neural machine translation systems for low-resource languages and advancing methods to more effectively evaluate MT performance in low-resource settings.
Title: Beyond Parallel Data: Harnessing External Knowledge for Low‑Resource MT
Translating from high-resource languages into Mozambican languages remains a pressing challenge in African NLP. The scarcity of parallel corpora, orthographic variation across dialects, and the frequent presence of loanwords and code-switching complicate the task of building robust translation systems. In this talk, I will share how we address these barriers through lexicon-guided neural machine translation. By integrating bilingual dictionaries and systematic loanword mappings directly into the training process, we move beyond data scarcity toward structured lexical enrichment. Our approach leverages dictionary entries and loanword mappings to construct sentence-specific glossaries, dynamically incorporated via input augmentation. On FLORES benchmarks, this method demonstrates clear gains: stronger lexical coverage, fewer inconsistencies, and translations that better capture contextual nuance. Beyond the technical improvements, this work points to a broader vision: advancing low-resource machine translation not only by scaling data but by intelligently bridging vocabulary gaps with structured linguistic knowledge. For Mozambican languages, this means opening pathways to more inclusive digital communication, empowering communities, and ensuring that the linguistic richness of African languages is represented in the global NLP landscape.
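The sentence-specific glossary idea in the abstract above can be sketched in a few lines. This is a minimal illustration only: the `<glossary>` tag format, the toy lexicon entries, and the function names are hypothetical, not the authors' actual scheme.

```python
# Hypothetical sketch of glossary-guided input augmentation for NMT.
# The tag format and lexicon entries are illustrative placeholders.

def build_glossary(sentence, lexicon):
    """Collect lexicon entries whose source term appears in the sentence."""
    tokens = sentence.lower().split()
    return {t: lexicon[t] for t in tokens if t in lexicon}

def augment_input(sentence, lexicon):
    """Prepend a sentence-specific glossary so the MT model can copy terms."""
    glossary = build_glossary(sentence, lexicon)
    if not glossary:
        return sentence
    hints = " ; ".join(f"{src} = {tgt}" for src, tgt in glossary.items())
    return f"<glossary> {hints} </glossary> {sentence}"

lexicon = {"school": "sukulu", "teacher": "mulongisi"}  # toy entries
print(augment_input("The teacher went to school", lexicon))
```

In practice the glossary terms would come from the bilingual dictionaries and loanword mappings, and the tagged input would be fed to the model during both training and inference.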
Atnafu Lambebo Tonja is a Postdoctoral Researcher at the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE, where he leads projects on culturally-diverse multilingual visual question answering and multimodal machine translation. He earned his PhD in Computer Science from Instituto Politécnico Nacional in Mexico City, focusing on neural machine translation for low-resource languages. His research focuses on advancing NLP for underrepresented languages, particularly African and Ethiopian languages, through the development of multilingual language models and sustainable data curation frameworks. His work on low-resource languages, especially for African and Ethiopian languages, has been published in top NLP venues including ACL, EMNLP, and NAACL.
Title: Towards Multimodal AI for African Languages and Cultures: Lessons from Afri-MCQA
What will it take to develop multimodal AI that truly comprehends African languages and cultures? In this talk, I explore this question through lessons from Afri-MCQA, a benchmark covering 15 African languages across 12 countries. Our evaluation highlighted that current models face major challenges: (1) they are unable to process speech in African languages, (2) they lack cultural context, and (3) they struggle to generate culturally relevant responses rather than merely recognize them. I will share these insights and outline a pathway forward, emphasizing the importance of speech-first development, culturally grounded training, and cross-lingual knowledge transfer as critical steps in creating effective multimodal AI for Africa.
Julia Kreutzer is a Senior Research Scientist at Cohere Labs, where she focuses on research around multilingual large language models. She has a background in machine translation, with a PhD from Heidelberg University and prior work experience at Google Translate. She's passionate about advancing NLP technologies for underrepresented languages and has been part of multiple open science initiatives to work towards this goal collaboratively.
Title: The knowns and unknowns of multilingual data augmentation
In this talk I will present recipes for multilingual fine-tuning data augmentation that have been developed to overcome data scarcity in languages beyond English. We will then discuss what the limitations of these approaches are, and what directions are relevant for future research.
Advancing African NLP: UDMorph and flexiPipe
Authors: Maarten Janssen [paper]
Abstract: In this paper, we present some of our recent efforts to provide base NLP pipelines for African languages. These include an infrastructure called UDMorph to make UD-compatible training data available for resources that do not have dependency relations, and a Python package called flexiPipe to easily run an NLP pipeline in various NLP tools using a uniform front-end, including the models provided by UDMorph. flexiPipe also provides Unicode normalization, an often overlooked feature that has a significant impact on African NLP. flexiPipe currently provides an NLP pipeline for 33 African languages, a significant increase from the handful of models that are currently easily accessible. UDMorph, in turn, is designed to make it easy to provide training data for more languages.
AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages
Authors: Mardiyyah Oduwole, Prince Mireku, Fatimo Adebanjo, Oluwatosin Olajide, Mahi Aminu Aliyu, Jekaterina Novikova [paper]
Abstract: Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages and our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across underrepresented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for underrepresented African languages, laying the groundwork for truly inclusive multimodal AI.
AfriNLLB: Efficient Translation Models for African Languages
Authors: Yasmin Moslem, Aman Kassahun Wassie, Amanuel Gizachew [paper]
Abstract: In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200. We compress NLLB-200 600M using two approaches: iterative layer pruning and quantization. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being over 20% faster. We release two versions of the AfriNLLB models: a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we curated for fine-tuning the baseline and pruned models to facilitate further research.
Building a Conversational AI Assistant for African Travel Services with LLMs and RAG
Authors: Grace Kevine Ngoufo, Shamsuddeen Hassan Muhammad, Kevin Jeff Fogang Fokoa [paper]
Abstract: Travel agencies in many African countries face increasing pressure to handle large volumes of customer inquiries with limited staff and either non-existent or outdated rule-based chatbots. To address this challenge, we develop a conversational virtual assistant powered by a Large Language Model (LLM) and enhanced with a Retrieval-Augmented Generation (RAG) pipeline. The system combines LLM reasoning, company-specific knowledge retrieval, and real-time API (Application Programming Interface) integration to deliver accurate, context-aware responses through WhatsApp, the region’s most widely used communication platform. A dedicated web interface enables staff to upload and update internal documents, ensuring that the assistant remains aligned with changing service information. Demonstrations show that the proposed solution improves response speed, enhances user experience, and reduces operational burden.
Evaluating Yoruba Text-to-Speech Systems for Accessible Computer-Based Testing in Visually Impaired Learners
Authors: Kausar Yetunde Moshood, Victor Tolulope Olufemi, Oreoluwa Boluwatife Babatunde, Emmanuel Bolarinwa, Williams Oluwademilade [paper]
Abstract: Text-to-Speech (TTS) technology offers potential to improve exam accessibility for visually impaired learners, but existing systems often underperform in underrepresented languages like Yoruba. This study evaluates current Yoruba TTS models in delivering standardized exam content to five visually impaired students through a web-based interface. Before testing, four Yoruba TTS systems were compared; only Facebook’s mms-tts-yor and YarnGPT produced intelligible Yoruba speech. Students experienced exam questions delivered by human voice, Braille, and TTS. All preferred Braille for clarity and independence, some valued human narration, while TTS was least favored due to robotic and unclear output. These results reveal a significant gap between TTS capabilities and the needs of users in low-resource languages. The paper highlights the urgency of developing tone-aware, user-centered TTS solutions to ensure equitable access to digital education for visually impaired speakers of underrepresented languages.
Dealing with the Hard Facts of Low-Resource African NLP
Authors: Michael Leventhal, Yacouba Diarra, Nouhoum COULIBALY, Panga Azazia Kamaté, Aymane Dembélé, Madani Amadou Tall, Emmanuel Élisé Koné [paper]
Abstract: Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
Developing an English–Efik Corpus and Machine Translation System for Digitization Inclusion
Authors: Offiong Bassey Edet, Mbuotidem Sunday Awak, Emmanuel Ubene Oyo-Ita, Benjamin Okon Nyong, Ita Etim Bassey [paper]
Abstract: Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English–Efik translation, leveraging a small-scale, community-curated parallel corpus of N = 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English–Efik and 31.21 for Efik–English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
EduNaija AI Tutor: A Multi-Agent Retrieval-Augmented Generation System for Nigerian Curriculum Education
Authors: Israel Olanrewaju Odeajo, Edifon Emmanuel Jimmy [paper]
Abstract: Equitable access to quality education remains a critical challenge in Nigeria, where millions of students prepare annually for standardized examinations (WAEC, NECO, JAMB) with limited access to personalized tutoring (Badei et al., 2024). This research presents EduNaija AI Tutor, a multi-agent Retrieval-Augmented Generation (RAG) system designed to democratize educational support through AI-powered tutoring aligned with Nigerian curricula. The system integrates conversational AI with document-based question answering, automated assessment generation, and multilingual support for English, Yoruba, Hausa, and Igbo. Using LangChain for agent orchestration, OpenAI GPT models for natural language processing, and FAISS for vector retrieval, the system enables students to interact with educational content through natural language queries while maintaining cultural relevance through Nigerian-contextualized examples and conventions (Chukwuma et al., 2024). The multi-agent architecture comprises five specialized components: a main orchestrator, explanation agent, quiz generation agent, web search agent, and RAG agent for processing uploaded educational materials. Preliminary evaluation demonstrates the system's capability to provide curriculum-aligned explanations, generate practice assessments, and answer questions from uploaded textbooks and study materials. This work contributes a culturally-aware educational AI framework addressing linguistic diversity and curriculum alignment challenges in African educational contexts, while leveraging open-source tools for reproducibility and accessibility (Shoukat et al., 2025).
Enhancing Automatic Speech Recognition Models for Maternal and Reproductive Health: Fine-Tuning and Real-World Evaluation in Wolof
Authors: Ertony Basilwango, Yann LE BEUX, Oche David Ankeli, Pierre Herve Berdys, Dhananjay Balakrishnan [paper]
Abstract: Automatic Speech Recognition (ASR) systems perform well for high-resource languages, but most African languages, including Wolof, remain underrepresented, particularly in maternal and reproductive healthcare. This work proposes a domain-specific approach to improving Wolof ASR under low-resource conditions, addressing limited annotated data, orthographic variability, and code-switching. We curated a dataset of 750 validated Wolof utterances covering 250 maternal health keywords and applied data augmentation to increase acoustic diversity. Pretrained models, including wav2vec 2.0 and Whisper, were benchmarked to select candidates for fine-tuning. Using parameter-efficient Low-Rank Adaptation (LoRA), a Whisper model was adapted to the maternal health domain. Evaluation using Word Error Rate (WER), Character Error Rate (CER), and Keyword Error Rate (KER), which measures medically critical term transcription accuracy, shows substantial gains, reducing WER from 46.5% to 23.2% and KER from 17% to 11%. Community-based evaluation on 1,340 real-world utterances reveals a moderate degradation, with WER increasing by 35%. These results demonstrate that lightweight domain adaptation with small, high-quality data can significantly improve ASR for low-resource healthcare applications. This work introduces one of the first Wolof ASR datasets for healthcare and presents a practical framework for developing reliable speech recognition tools in underrepresented languages, improving access to healthcare information and services.
Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Authors: Hiba El Oirghi, Tajuddeen Gwadabe, Marine Carpuat [paper]
Abstract: Wikipedia editors undertake the task of editing machine translation (MT) outputs in various languages to disseminate multilingual knowledge from English. But are editors doing more than just translating or fixing MT output? To answer this broad question, we constructed a dataset of 4,335 fine-grained annotated parallel pairs of MT translations and human post-edit (HE) translations for five low-resource African languages: Hausa, Igbo, Swahili, Yoruba, and Zulu. We report on our data selection and annotation methodologies as well as findings from the annotated dataset, the most surprising of which is that annotators mostly preferred the MT translations over their HE counterparts for three out of five languages. We analyze the nature of these "fluency breaking" edits and provide recommendations for the MT post-editing workflows in the Wikipedia domain and beyond.
Eyaa-Tom 26, Yodi-Mantissa and Lom Bench: A Community Benchmark for TTS in Local Languages
Authors: BAKOUBOLO ESSOWE JUSTIN, Catherine Nana Nyaah Essuman [paper]
Abstract: Most of the more than 40 languages spoken in Togo lack the data and tools necessary for modern natural language processing (NLP) applications. We present an extension of previous work by introducing new datasets, improved models, and a community-driven evaluation benchmark for text-to-speech (TTS). We expanded the Eyaa-Tom multilingual corpus with additional speech data (26.9k recordings, 30.9 hours) across 10 local languages and incorporated Mozilla Common Voice contributions (64.6k clips, 46.6 hours) for Adja, Nawdm, Mina, and Tem to strengthen automatic speech recognition (ASR) and speech synthesis. We detail how community contributors (including collaboration with a national TV journalist) helped collect and validate the Kabiyɛ and French text, with an ethical compensation model in place. We also benchmarked several models on these datasets: for ASR, we fine-tuned OpenAI Whisper and faster-whisper, achieving improved word error rates after fine-tuning; for machine translation, we fine-tuned Meta's NLLB-200 model on 11 local languages, producing significant BLEU/METEOR gains, especially for Ewɛ and Kabɩyɛ. To evaluate TTS, we introduce Lom Bench, a new community-based benchmark where native speakers rate synthetic speech. The preliminary results from Lom Bench indicate promising naturalness in Ewɛ and Kabɩyɛ TTS, although further data is needed.
Full Fine-Tuning vs. Parameter-Efficient Adaptation for Low-Resource African ASR: A Controlled Study with Whisper-Small
Authors: Sukairaj Hafiz Imam, Muhammad Yahuza Bello, Hadiza Ali Umar, Tadesse Destaw Belay, Idris Abdulmumin, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad [paper]
Abstract: Automatic speech recognition (ASR) for African low-resource languages (LRLs) is often limited by scarce labelled data and the high cost of adapting large foundation models. This study evaluates whether parameter-efficient fine-tuning (PEFT) can serve as a practical alternative to full fine-tuning (FFT) for adapting Whisper-Small with limited labelled speech and constrained compute. We used a 10-hour subset of NaijaVoices covering Hausa, Yorùbá, and Igbo, and we compared FFT with several PEFT strategies under a fixed evaluation protocol. DoRA attains a 22.0% macro-average WER, closely aligning with the 22.1% achieved by FFT while updating only 4M parameters rather than 240M, and this difference remains within run-to-run variation across random seeds. Yorùbá consistently yields the lowest word error rates, whereas Igbo remains the most challenging, indicating that PEFT can deliver near FFT accuracy with substantially lower training and storage requirements for low-resource African ASR.
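As background to the PEFT comparison above, the core LoRA update (which DoRA extends with a magnitude/direction reparameterization) can be shown in a few lines. This is an illustrative numerical sketch, not the study's training code, and the dimensions are toy values.

```python
# Minimal sketch of a LoRA-style update on a single linear layer.
# Illustrative only; real adapters wrap attention/MLP projections in Whisper.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                          # hidden size, low rank (r << d)

W = rng.normal(size=(d, d))          # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-init

def lora_forward(x, scaling=1.0):
    # Frozen path plus low-rank trainable update: W x + scaling * B (A x)
    return W @ x + scaling * (B @ (A @ x))

x = rng.normal(size=d)
# With B = 0, the adapted layer matches the frozen layer exactly,
# so training starts from the pretrained behaviour.
assert np.allclose(lora_forward(x), W @ x)
# Trainable parameter count: 2*d*r instead of d*d.
print(2 * d * r, "vs", d * d)
```

This is why the study can report near-FFT accuracy while updating roughly 4M instead of 240M parameters: only the low-rank factors are trained.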
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Authors: Mamadou K. KEITA, Sebastien Diarra, Christopher M Homan, Seydou DIALLO [paper]
Abstract: State-of-the-art large language models (LLMs) still struggle to support effective text generation and chat interfaces for low-resource languages (LRLs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented generation (RAG) n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
Authors: Michael Leventhal, Yacouba Diarra, Nouhoum COULIBALY, Panga Azazia Kamate [paper]
Abstract: We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tuned Parakeet-based models on a 33.47-hour human-reviewed subset and applied pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
Language Choice in Nigerian Social Media Hate Speech
Authors: Nneoma C Udeze, Rob Voigt [paper]
Abstract: Language choice in multilingual societies is rarely arbitrary. In Nigeria, English, Nigerian Pidgin (NP), and indigenous languages are strategically deployed in online discourse, yet little is known about how they function in hostile contexts. Here we conduct the first systematic analysis of NP in online hate speech on two platforms, Twitter and Instagram. Using a linguistically enriched annotation scheme, we label each post for class, targeted group, language variety, and hate type. Our results show that NP is disproportionately used in offensive and hateful discourse, particularly against Hausa, women, and LGBTQ+ groups, and that insults are the dominant hate strategy. Cross-domain evaluation further reveals that classifiers trained on Twitter systematically over-predict hate on Instagram, highlighting challenges of domain transfer. These findings underscore NP’s role as a linguistic resource for hostility and its sociolinguistic salience in amplifying stereotypes and affect. For NLP, the work demonstrates the need for NP-specific resources, sensitivity to figurative strategies, and domain adaptation across platforms. By bridging sociolinguistics and computational modeling, this study contributes new evidence on how language choice shapes online hate speech in a multilingual African context.
Leveraging CoHere Multilingual Embeddings and Inverted Softmax Retrieval for Automatic Parallel Sentence Alignment in Low-Resource Languages
Authors: Abubakar Auwal Khalid, Salisu Musa Borodo, Amina Abubakar Imam [paper]
Abstract: We present an improved method for automatic parallel sentence alignment in low-resource languages, using CoHere multilingual embeddings and inverted softmax retrieval. Our technique achieved a higher F1-score of 78.30% on the MAFAND-MT test set, compared to the existing technique’s 54.75%, with similar gains in precision and recall. We assessed the quality of the extracted data by demonstrating that it outperforms the existing technique in terms of low-resource translation performance.
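Inverted softmax retrieval, as used in the abstract above, scores each candidate pair by normalizing similarities over the source side, which damps "hub" sentences that plain nearest-neighbour retrieval tends to over-select. A rough sketch, with random vectors standing in for the actual multilingual embeddings:

```python
# Sketch of inverted-softmax retrieval over sentence embeddings.
# The embeddings here are random stand-ins, not real Cohere vectors.
import numpy as np

def inverted_softmax_align(src, tgt, beta=10.0):
    """For each source row, pick the target whose similarity, normalized
    over ALL source sentences, is highest."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                          # cosine similarities
    e = np.exp(beta * sim)
    scores = e / e.sum(axis=0, keepdims=True)  # normalize over sources
    return scores.argmax(axis=1)               # best target per source

rng = np.random.default_rng(0)
tgt = rng.normal(size=(5, 16))
src = tgt + 0.05 * rng.normal(size=(5, 16))    # noisy "translations"
print(inverted_softmax_align(src, tgt))
```

With near-identical pairs as here, the alignment recovers the identity mapping; the benefit over a plain argmax shows up when some target embeddings are hubs that attract many source queries.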
Linguistically Informed Evaluation of Multilingual ASR for African Languages
Authors: Fei-Yueh Chen, Lateef Adeleke, C. M. Downey [paper]
Abstract: Word Error Rate (WER) mischaracterizes ASR models' performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models' performance. In this paper, we evaluate three speech encoders on two African languages by complementing WER with CER and FER, and adding a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly, for Uneme (an endangered language absent from pretraining data), a model with near-total WER and 0.461 CER achieves a relatively low FER of 0.267. This indicates model error is often attributable to individual phonetic feature errors, which is obscured by all-or-nothing metrics like WER.
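For reference, the WER and CER figures discussed above follow the standard Levenshtein-based definitions, which a short sketch makes concrete (a generic implementation, not the paper's evaluation code):

```python
# WER and CER via Levenshtein edit distance (standard definitions).

def edit_distance(ref, hyp):
    """Minimum insertions/deletions/substitutions turning ref into hyp."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution
    return d[-1]

def wer(ref, hyp):
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    return edit_distance(list(ref), list(hyp)) / len(ref)

# A toy pair where only the diacritics differ:
print(wer("ní ilé", "ni ile"))  # both words counted fully wrong
print(cer("ní ilé", "ni ile"))  # only 2 of 6 characters differ
```

On such a pair WER is 1.0 while CER is about 0.33, which illustrates the abstract's point: all-or-nothing word matching hides the fact that most of the phonological content was recovered, motivating feature-level metrics like FER and TER.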
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Authors: Seung Hun Eddie Han, Youssef Mohamed, Mohamed Elhoseiny [paper]
Abstract: This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Authors: Abdifatah Ahmed Gedi, Shafie Abdi Mohamed, Yusuf Ahmed Yusuf, Muhidin A. Mohamed, Fuad Mire Hassan, Houssein A Assowe [paper]
Abstract: Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine learning-based language models. However, a key research challenge for low-resourced languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. For data validation purposes, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
Power Asymmetries, Bias, and AI, a Reflection of Society on Low-Resourced Languages - African Languages as Case Study
Authors: Simbiat Ajao [paper]
Abstract: In recent times, artificial intelligence (AI) systems have become the primary intermediary for access to information, services, and opportunities, and there are growing concerns about how existing social inequalities are reproduced and amplified through AI. This is especially evident in language technologies, where a small number of dominant languages (what we refer to as big languages) and cultural contexts shape the training, design, and evaluation of models. This paper examines the intersections of power asymmetries, linguistic bias, and cultural representation in AI, with a major focus on African languages and communities. We argue that current Natural Language Processing (NLP) systems reflect deep global imbalances in the availability of data, infrastructure, and decision-making power, often marginalizing low-resourced languages and cultural peculiarities. How these data are structured largely determines the outcomes such systems produce. Drawing on examples from speech recognition, machine translation, and large language models, we highlight the social and cultural consequences of linguistic exclusion, including reduced accessibility, misinterpretation, and digital invisibility. Finally, we identify and discuss pathways toward more equitable language technologies, emphasizing community-led data practices, interdisciplinary collaboration, and context-aware evaluation frameworks. By foregrounding language as both a technical and political concern, this work advocates for African-centered approaches to NLP that promote fairness, accountability, and linguistic justice in AI development.
Real-Time Spoken Instruction Following and Translation in Ugandan Languages
Authors: Benjamin Akera, Tim Wenjie Hu, Patrick Walukagga, Evelyn Nafula Ouma, Yiga Gilbert, Ernest Tonny Mwebaze, John Quinn [paper]
Abstract: Many languages are predominantly spoken rather than written, and to bring the benefits of LLMs to speakers of these languages, it is essential that models cater to the voice modality. The typical approach is to cascade ASR, LLM, and TTS models, but this results in systems with high latency, making them unsuitable for natural, real-time interaction. We describe results from taking the encoder of a Whisper-based model trained to recognise ten languages common in Uganda, and using the Ultravox architecture to project its output directly into the input embedding space of a text model based on Qwen 3 32B, also trained to comprehend those languages. The result is a speech LLM with high accuracy and very low latency. For most spoken prompts, we can begin streaming a text response within as little as 50 ms, and a speech audio response within around one second, making real-time spoken interaction with an LLM possible for the first time in these languages.
Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Authors: Millicent Ochieng, Anja Thieme, Ignatius Ezeani, Risa Ueno, Samuel Chege Maina, Keshet Ronen, Javier Gonzalez, Jacki O'Neill [paper]
Abstract: Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social science measurement lens, we operationalize LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating greater interpretive stability, while smaller open-weight models in our study show reduced stability under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
SALT-31: A Machine Translation Benchmark Dataset for 31 Ugandan Languages
Authors: Solomon Nsumba, Benjamin Akera, Evelyn Nafula Ouma, Medadi Ssentanda, Deo Kawalya, Engineer Bainomugisha, Ernest Tonny Mwebaze, John Quinn [paper]
Abstract: We present the SALT-31 benchmark dataset for the evaluation of machine translation models covering 31 Ugandan languages. Unlike typical sentence-level evaluation sets, SALT-31 is constructed from short, scenario-driven mini-dialogues designed to preserve discourse context, pragmatics, and culturally grounded communication patterns common in everyday Ugandan settings. The dataset contains 100 English sentences organized into 20 typical communication scenarios, each represented as a five-sentence mini-sequence. It can therefore be used to evaluate both sentence-level and paragraph-level machine translation, and includes nearly every language spoken in a country with high linguistic diversity.
Sample-Size Scaling of the African Languages NLI Evaluation
Authors: Anuj Tiwari, Oluwapelumi Ogunremu, Terry Oko-odion, Jesujuwon Egbewale, Hannah Sopuruchi Nwokocha [paper]
Abstract: African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably improves downstream performance. This study presents a systematic sample-size scaling analysis of natural language inference (NLI) for 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the common assumption that performance increases monotonically with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or performance decreases with increasing sample size, as well as high variance in low-resource regimes. These results indicate that data volume alone does not guarantee stable gains for African NLI, underscoring the need for language-sensitive dataset creation and stronger multilingual modelling strategies.
Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect
Authors: Hadia Mohmmedosman Ahmed Samil, David Ifeoluwa Adelani [paper]
Abstract: In this work, we introduce Sudanese-Flores, an extension of the popular Flores+ machine translation (MT) benchmark to the Sudanese Arabic dialect. We translate both the DEV and DEVTEST splits of the Modern Standard Arabic dataset into the corresponding Sudanese dialect, resulting in a total of 2,009 sentences. While the dialect was recently introduced in Google Translate, no benchmark is available for it, despite its being spoken by over 40 million people. Our evaluation of two leading LLMs, GPT-4.1 and Gemini 2.5 Flash, shows that while performance from English to Arabic is impressive (more than 23 BLEU), both struggle on the Sudanese dialect (less than 11 BLEU) in zero-shot settings. In the few-shot scenario, we achieve only a slight boost in performance.
Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation
Authors: Samuel Gyamfi, Alfred Malengo Kondoro, Yankı Öztürk, Richard Hans Schreiber, Vadim Borisov [paper]
Abstract: Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release the dataset together with a reproducible pipeline, providing working material for NLP researchers in low-resource contexts.
The Token Tax: Systematic Bias in Multilingual Tokenization
Authors: Jessica M. Lundin, Ada Zhang, Nihal Karim, Hamza Louzan, Guohao Wei, David Ifeoluwa Adelani, Cody Carroll [paper]
Abstract: Tokenization inefficiency imposes structural disadvantages on morphologically complex, low-resource languages, inflating compute costs and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy: higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. In terms of economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
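Token fertility, the quantity this abstract correlates with accuracy, is conventionally the average number of subword tokens a tokenizer emits per word. A minimal sketch of that computation (the `tokenize` callable is a stand-in for any real subword tokenizer, not part of the paper):

```python
def token_fertility(texts, tokenize):
    """Average number of subword tokens produced per whitespace-delimited word.

    A fertility of 1.0 means one token per word; morphologically complex or
    under-served languages typically see much higher values.
    """
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words
```

With a real tokenizer, comparing this ratio across languages on parallel text exposes the per-language cost gap the abstract calls the "token tax".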
Using Subword-Embeddings for Bilingual Lexicon Induction in Bantu Languages
Authors: Adrian Breiding, Alan Akbik [paper]
Abstract: Bilingual Lexicon Induction (BLI) is a valuable tool in machine translation and cross-lingual transfer learning, but it remains challenging for agglutinative and low-resource languages. In this work, we investigate the use of weighted sub-word embeddings in BLI for agglutinative languages. We further evaluate a graph-matching and Procrustes-based BLI approach on two Bantu languages, assessing its effectiveness in a previously underexplored language family. Our results for Swahili, with an average P@1 score of 51.84% on a 3,000-word dictionary, demonstrate the success of the approach for Bantu languages. Weighted sub-word embeddings perform competitively on Swahili and outperform word embeddings in our experiments with Zulu.
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Authors: Seydou DIALLO, Yacouba Diarra, Panga Azazia Kamaté, Aboubacar Ouattara, Mamadou K. KEITA, Adam Bouno Kampo [paper]
Abstract: This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards: the top-performing system achieved a Word Error Rate (WER) of 46.76%, the best Character Error Rate (CER) of 13.00% was set by another model, and several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound for performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
ÒWE-Voice: An Evaluation of Monolingual and Multilingual ASR Model Using Yoruba Proverb Speech Dataset
Authors: Daud Abolade [paper]
Abstract: Given the advancement of various Artificial Intelligence (AI) technologies in the 21st century, Automatic Speech Recognition (ASR) plays a vital role in human and machine interaction and serves as an interface for a wide range of applications. The development of these high-performing, robust, and useful technologies continues to focus on high-resource languages, owing to the high availability of language data, market dominance, and access to funding and research initiatives, compared to marginalised low-resource languages. Despite efforts to develop ASR systems for African languages, numerous challenges remain due to limited speech datasets, tonal complexity, and dialectal variation. In this study, we curated a domain-specific speech dataset for proverbs, a form of Yoruba oral literature that is deeply culturally grounded. We used the Yoruba recording app developed for the Iroyin-speech project to record 6 hours of Yoruba proverb sentences. We evaluated the NCAIR1/Yoruba-ASR model, fine-tuned on OpenAI Whisper Small, and Massively Multilingual Speech, a multilingual speech model covering low-resource languages including Yoruba, on the recorded proverbs. Evaluation was conducted using Word Error Rate (WER) and Tone Error Rate (TER). Our results show that current ASR systems that support Yoruba do not capture cultural nuances. These findings highlight an urgent need to curate more robust, culturally embedded speech datasets for low-resource languages, and for Yoruba in particular, in order to build technological tools that preserve African culture, language, and identity.
AfriStereo: A Culturally Grounded Dataset for Evaluating Stereotypical Bias in Large Language Models
Authors: Yann Le Beux, Oluchi Audu, Oche David Ankeli, Dhananjay Balakrishnan, Melissah Weya, Marie Daniella Ralaiarinosy, Ignatius Ezeani [paper]
Abstract: Existing AI bias evaluation benchmarks largely reflect Western perspectives, leaving African contexts underrepresented and enabling harmful stereotypes in applications across various domains. To address this gap, we introduce AfriStereo, the first open-source African stereotype dataset and evaluation framework grounded in local socio-cultural contexts. Through community-engaged efforts across Senegal, Kenya, and Nigeria, we collect 1,163 stereotypes spanning gender, ethnicity, religion, age, and profession. Using few-shot prompting with human-in-the-loop validation, we augment the dataset to over 5,000 stereotype–antistereotype pairs. Entries are validated through semantic clustering and manual annotation by culturally informed reviewers. Preliminary evaluation of language models reveals that nine of eleven models exhibit statistically significant bias in our setup, with Bias Preference Ratios (BPR) ranging from 0.63 to 0.78 (p ≤ 0.05), indicating systematic preferences for stereotypes over antistereotypes, particularly across the age, profession, and gender dimensions. Domain-specific models appear to show weaker bias in our setup, suggesting task-specific training may mitigate some associations. Looking ahead, AfriStereo opens pathways for future research on culturally grounded bias evaluation and mitigation, offering key methodologies for the AI community on building more equitable, context-aware, and globally inclusive NLP technologies.
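A Bias Preference Ratio of the kind reported above can be read as the fraction of stereotype–antistereotype pairs for which a model scores the stereotype higher. A hedged sketch of that counting logic (the `score` callable stands in for a hypothetical model log-likelihood; the exact AfriStereo scoring protocol may differ):

```python
def bias_preference_ratio(pairs, score):
    """Fraction of (stereotype, antistereotype) pairs where the model
    prefers the stereotype, i.e. assigns it the higher score.

    A ratio of 0.5 indicates no systematic preference; values well above
    0.5 indicate stereotype-leaning behavior.
    """
    preferred = sum(1 for stereo, anti in pairs if score(stereo) > score(anti))
    return preferred / len(pairs)
```

Significance testing (e.g. a binomial test against 0.5) would then determine whether an observed ratio reflects systematic bias rather than chance.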
African Voices Nigeria: 2500 hours of ethically sourced speech data for four Nigerian Languages
Authors: Ife Adebara, Oluwaseun Nifemi, Rashidat Damilola Sikiru, Olanrewaju Israel Lawal, Ololade Anjuwon, Olubayo Adekanmbi, Anthony Soronnadi, John Emeka Eze, Ewezu Ngim Ngim [paper]
Abstract: African languages remain severely underrepresented in large-scale speech resources, particularly for spontaneous, naturally occurring speech that reflects real-world linguistic use. We present African Voices, a large-scale, ethically governed speech dataset covering four Nigerian languages, with approximately 2,500 hours of audio and 2,865 speakers, focusing on spontaneous and scripted speech across diverse sociolinguistic contexts. Unlike existing resources that primarily rely on read or scripted speech, African Voices captures natural variation in accent, dialect, register, and code-switching, accompanied by rich demographic and contextual metadata. We describe the data collection methodology, the transcription process, and a principled governance framework designed to support responsible use of speech data in low-resource settings. We further provide baseline automatic speech recognition results across languages. African Voices enables research on robust and fair ASR and serves as a foundational resource for advancing NLP research in African languages.
AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic
Authors: Israel Abebe Azime, Abenezer Angamo, Hana Mekonen Tamiru, Dagnachew Mekonnen Marilign, Philipp Slusallek, Seid Muhie Yimam, Dietrich Klakow [paper]
Abstract: With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a model’s understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce AmharicStoryQA, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.
BambaraMLLM: A Unified Multilingual Multimodal Large Language Model for Comprehensive Bambara Language Processing
Authors: Seydou DIALLO, Allahsera Auguste Tapo, Kevin Assogba, Christopher M Homan [paper]
Abstract: BambaraMLLM is a unified multilingual multimodal large language model (MMLLM) designed to address the critical lack of digital resources for Bambara, a West African language spoken by over 15 million people. Unlike traditional approaches that rely on task-specific models for different linguistic functions, BambaraMLLM integrates text generation, automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis into a single, transformer-based architecture. This work establishes a scalable, open-source foundation for African language technology, optimizing for both performance and deployment under resource constraints.
Can I Read My X-Ray Report? Towards Accessible Radiology Report in Low-Resource African Context
Authors: Aziza Umer Yibrie, Abinew Ali Ayele, Shamsuddeen Hassan Muhammad, Seid Muhie Yimam [paper]
Abstract: Healthcare communication in native languages is a critical unmet need for Amharic-speaking populations in Ethiopia and diaspora communities. This study develops a preliminary framework for translating English radiology reports into Amharic using multilingual machine translation systems (Google Translate, NLLB-200, M2M100) and instruction-tuned large language models (GPT-4.1-mini, Gemini-2.0-Flash, and others), combined with human-in-the-loop evaluation. A subset of 100 IU X-Ray reports is translated, with 67 reports manually annotated for systematic assessment. Preliminary evaluation shows that Google Translate achieves the highest overall performance (BLEU 46.17, chrF 48.74, ROUGE-L 42.39), while LLMs such as Gemini-2.0-Flash (chrF 27.55) and GPT-4.1-mini (BLEU 13.14) produce fluent Amharic text but require substantial post-editing to ensure correct clinical terminology. Human annotator analysis emphasizes the importance of expert oversight in achieving terminological accuracy and report completeness. This work establishes an initial benchmark, introduces a scalable workflow, and provides a foundation for developing reliable Amharic radiology report translation systems, with potential applicability to other low-resource languages.
LexiMCH: A Bilingual Medical Knowledge Lexicon for Maternal and Child Healthcare in Low-Resource Languages and Healthcare Environments
Authors: Aziza Umer Yibrie, Seid Muhie Yimam, Katrin Schöning‑Stierand, Kaleab Anteneh, Rebecca Ashagire, Robera Habtamu, Rahel Bekele, Martin Semmann [paper]
Abstract: Maternal and child healthcare (MCH) in low-resource contexts faces persistent challenges due to linguistic and cultural barriers to accessing medical information. To address this, we develop a multilingual terminology resource focusing on English and Amharic, using a combination of machine translation, large language models (LLMs), and expert-in-the-loop validation. In this work, we evaluate a subset of 90 terms and definitions across multiple translation models, including Google Translate, NLLB-200, M2M100, and several LLM variants (GPT, LLaMA, Gemma, DeepSeek, Gemini, and Mistral). We use BLEU, chrF, and ROUGE-L metrics to assess translation quality for both terms and definitions. Preliminary results indicate variable performance across models, with DeepSeek-R1 achieving the highest BLEU scores (0.916 for definitions and 0.985 for terms) and LLM-assisted translations generally performing better on definitions than on terms. Ongoing work is extending the evaluation to the full dataset and further refining translation pipelines to produce a comprehensive, open-access, AI-ready resource for maternal and child healthcare in low-resource languages.
Media Framing Analysis of Ethiopian Conflict: An Approach Combining MAXQDA and NLP for Low-resource Languages
Authors: Adem Chanie Ali, Seid Muhie Yimam [paper]
Abstract: This ongoing research employs computer-assisted methods and NLP techniques to analyze media framing of the Ethiopian conflict in Amharic texts, in two phases. The first phase uses qualitative frame analysis with systematic coding, thematic grouping, pattern detection, and visualization via MAXQDA. It investigates how Ethiopian media depict the conflict in Amhara and Oromia, focusing on framing strategies and responsibility attribution, analyzing 150 Amharic newspaper articles from Addis Zemen (government-affiliated) and Addis Standard (independent) covering the conflict from 2023 to 2025. The study, grounded in media framing theory, reveals contrasting patterns: Addis Zemen emphasizes peace, responsibility, and demonization, often externalizing blame, while Addis Standard highlights civilian suffering and shared accountability, especially among the government, Fano, and OLA-Shene. Co-occurrence analysis shows connections between responsibility and humanitarian frames, emphasizing their interrelatedness. This demonstrates the effectiveness of digital qualitative methods in complementing traditional framing analysis. Looking ahead, phase two aims to scale this work by developing NLP techniques such as machine learning classifiers, transformer models, and topic modeling on a larger dataset of approximately 5,000 annotated articles. This dataset, already collected, aims to capture a wider spectrum of conflict-related discourse, integrating qualitative insights with automated NLP to enable scalable, semi-automated conflict framing detection for low-resource languages. The project addresses key challenges in low-resource NLP, including limited annotated data, morphological complexity, and the sensitive nature of conflict discourse, highlighting the potential of combining communication research with advanced NLP to improve multilingual media analysis in conflict zones.
Probing Gender Bias in Masked Language Models for Low-Web Data Languages
Authors: Bontu Fufa Balcha, Jitu Ewnetu Hailu, Senait Mengesha Yayo, Hellina Hailu Nigatu [paper]
Abstract: Low-resourced languages are increasingly included in large multilingual models. While including more languages in pretrained models is a sign of progress, large models still underperform on low-resourced languages. In prioritizing scale over effective processing, we risk 1) deploying language technologies that misrepresent these languages and 2) amplifying gender biases embedded in training corpora. In this paper, we investigate how masked language models encode gender for three low-web-data languages, Afan Oromo, Amharic, and Tigrinya, and how these representations shift after continued pretraining on NLLB data. Using a controlled cloze-style probing setup, we examine prediction patterns. Our findings show consistent gender asymmetries and predictions aligned with stereotypical adjectives and occupations. After continued pretraining, we find that male-gendered predictions reach up to 68% in Amharic, while neutral predictions exceed 60% in Afan Oromo. Our work shows that expanding training data does not guarantee balanced gender representations without careful consideration in data curation.
SAFARI: A Community-Engaged Approach and Dataset of Stereotype Resources in the Sub-Saharan African Context
Authors: Aishwarya Verma, Laud Ammah, Olivia Nercy Ndlovu Lucas, Andrew Zaldivar, Vinodkumar Prabhakaran, Sunipa Dev [paper]
Abstract: Stereotype repositories are critical to assess generative AI model safety, but currently lack adequate global coverage. It is imperative to prioritize targeted expansion, strategically addressing existing deficits, over merely increasing data volume. This work introduces a multilingual stereotype resource covering four sub-Saharan African countries that are severely underrepresented in NLP resources: Ghana, Kenya, Nigeria, and South Africa. By utilizing socioculturally-situated, community-engaged methods, including telephonic surveys moderated in native languages, we establish a reproducible methodology that is sensitive to the region's complex linguistic diversity and traditional orality. By deliberately balancing the sample across diverse ethnic and demographic backgrounds, we ensure broad coverage, resulting in a dataset of 3,534 stereotypes in English and 3,206 stereotypes across 15 native languages.
The Rise of AfricaNLP: Contributions, Contributors, and Community Impact (2005–2025)
Authors: Tadesse Destaw Belay, Kedir Yassin Hussen, Sukairaj Hafiz Imam, Ibrahim Said Ahmad, Isa Inuwa-Dutse, Aabr, Grigori Sidorov, Iqra Ameer, Idris Abdulmumin, Tajuddeen Gwadabe, Vukosi Marivate, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad [paper]
Abstract: Natural Language Processing (NLP) is undergoing constant transformation, as Large Language Models (LLMs) are driving daily breakthroughs in research and practice. In this regard, tracking the progress of NLP research and automatically analyzing the contributions of research papers provides key insights into the nature of the field and its researchers. This study explores the progress of African NLP (AfricaNLP) by asking (and answering) research questions about the progress of AfricaNLP (publications, NLP topics, and NLP tasks), contributions (data, method, and task), and contributors (authors, affiliated institutions, and funding bodies). We quantitatively examine two decades (2005–2025) of contributions to AfricaNLP research, using a dataset of 1.9K NLP papers, 4.9K contributing authors, and 7.8K human-annotated contribution sentences (AfricaNLPContributions), along with benchmark results. Our dataset and AfricaNLP research explorer tool will provide a powerful lens for tracing AfricaNLP research trends and hold potential for generating data-driven research approaches.
Towards Multimodal Cultural Context Modeling for African Languages in Large Language Models
Authors: Mahule Roy, Subhas Roy [paper]
Abstract: This preliminary work addresses the critical gap in multimodal Large Language Models (LLMs) for African languages, which remain underrepresented despite their rich multimodal communication traditions. We propose a framework that leverages simulated multimodal data and cross-lingual transfer learning to bootstrap multimodal capabilities. Our initial experiments with Swahili demonstrate that proxy multimodal embeddings can be effectively generated using pre-trained encoders, achieving an average cosine similarity of 0.72 for culturally relevant concepts. We further show that simple fusion methods can effectively combine these embeddings, and that transfer learning from high-resource languages yields a 28% improvement in multimodal alignment over zero-shot approaches. These results validate the feasibility of our approach and provide a foundation for culturally-aware multimodal LLMs in low-resource African language contexts.
Trust but Check: LLM-Assisted Review of Human Translations in African Languages
Authors: Tadesse Destaw Belay, Henok Biadglign Ademtew, Idris Abdulmumin, Sukairaj Hafiz Imam, Abubakar Juma Chilala, Godfred Agyapong, CHINEDU EMMANUEL MBONU, Basil Friday Ovu, Catherine Nana Nyaah Essuman, Alfred Malengo Kondoro, Sonia Adhiambo, Daud Abolade, Ponts'o Mpholle, Nicholaus Dismas Ladislaus, Saminu Mohammad Aliyu, Gali Ahmad Samuel, Fabrice Hakuzimana, Mike Nzirainengwe, Temitayo Olatoye, Sileshi Bogale Haile, Tewodros Achamaleh Bizuneh, Tolulope Olalekan Abiola, Kedir Yassin Hussen, Ibrahim Said Ahmad, Verrah Akinyi Otiende, Seid Muhie Yimam, Shamsuddeen Hassan Muhammad [paper]
Abstract: Large-scale translation projects for low-resource languages mostly rely on human translators to ensure cultural and linguistic fidelity. However, even professionally produced translations often contain subtle translation errors that are difficult to detect. Manual quality control at scale becomes prohibitively expensive, creating a major bottleneck in the development of high-quality Natural Language Processing (NLP) resources. Recent advances in multilingual large language models (LLMs) offer promising support for annotation workflows in human-in-the-loop settings. In this work, we investigate the use of LLMs to assist in auditing translation quality, enabling more efficient quality control pipelines for low-resource African languages. We audit translations in 11 African languages using the MAFAND-MT dataset, combining LLM-as-a-judge, native-speaker human review, and automated metrics. Our quality-audited version of the MAFAND-MT test set yields performance gains across all languages, with BLEU improvements ranging from 0.4 to 9.27 points and chrF improvements ranging from 0.3 to 8.69 points. Our findings further indicate that state-of-the-art LLMs, such as GPT-5.1, can assist in auditing translation quality and suggesting candidate corrections for low-resource languages. However, they remain far from being a stand-alone solution for the automatic correction of human translations in African languages.
What Do Prompts Reveal About Model Capabilities in Low-Resource Languages?
Authors: Oluwaseun A. Ajayi [paper]
Abstract: Large language models (LLMs) are highly sensitive to prompt design, yet benchmark evaluations typically rely on static, hand-crafted instructions that may underestimate true model capability. In this work, we study a reflective prompt evaluation algorithm, GEPA, as an inference-time optimization strategy across multiple multilingual benchmarks spanning diverse African languages. GEPA uses a more capable model as a reflection agent to iteratively optimize prompts under strict compute and latency budgets, without updating model parameters. Results show that reflective prompt optimization consistently improves performance across tasks, enabling smaller models to match or outperform larger models when evaluated using optimized instructions. We find that prompt evolution functions as a form of textual policy learning, improving not only task accuracy but also output structure and formatting, factors that are critical for reliable model evaluation. Qualitative analysis further demonstrates that optimized prompts elicit more stable model behavior. We characterize the trade-off between optimization cost and inference latency by measuring prompt token growth, and show that modest increases in prompt length can yield substantial gains in performance. Based on these findings, we argue that benchmark evaluations should report both baseline and prompt-optimized results to more faithfully reflect model capabilities, particularly in multilingual and low-resource settings.
Shamsuddeen Hassan Muhammad
Google DeepMind Fellow, Imperial College London
Simbiat Ajao, University of Lagos
Bunmi Akinremi, Obafemi Awolowo University Ile-Ife
Jesujoba Alabi, Universität des Saarlandes
Felermino D. M. A. Ali, Universidade do Porto
Victor Jotham Ashioya, Kabarak University
Tadesse Destaw Belay, Instituto Politécnico Nacional, Centro de Investigación en Computación
Happy Buzaaba, Princeton University
Emmanuel Kigen Chesire, Kabarak University
Emmanuel Dorley, University of Florida
Bonaventure F. P. Dossou, Mila & McGill University
Khalid Elmadani, New York University, Abu Dhabi
Naome A Etori, University of Minnesota - Twin Cities
Eric Le Ferrand, Boston College
Elodie Gauthier, Orange
Gideon George, Data Science Nigeria
Agam Goyal, University of Illinois at Urbana-Champaign
David Guzmán, University of Toronto
Tajuddeen Gwadabe, Masakhane Research Foundation
Cari Beth Head, University of Florida
Raphael Iyamu, University of Florida
Sandeep Kumar Jha, LinkedIn Core AI
Adejumobi Monjolaoluwa Joshua, University of Agriculture Abeokuta
Sulaiman Kagumire, Makerere University
Aditi Khandelwal, Mila & McGill University
Alfred Malengo Kondoro, Hanyang University
Sujay S Kumar, Tesla
Sven Lampe, Carl von Ossietzky Universität Oldenburg
Melaku Lake, Injibara
En-Shiun Annie Lee, Ontario Tech University
Senyu Li, Mila & McGill University
Weiran Lin, Carnegie Mellon University
Elie Mulamba, Université de Kinshasa
Francois Meyer, University of Cape Town
Anjishnu Mukherjee, George Mason University
Mulubrhan Abebe Nerea, University West
Gebregziabihier Nigusie, Mizan Tepi University
Chester Palen-Michel, Brandeis University
Perez Ogayo, Oracle
Kelechi Ogueji, ServiceNow
Odunayo Ogundepo, University of Waterloo
Tolúlopé Ògúnrèmí, Stanford University
Jessica Ojo, Mila & McGill University
Ifeoma Okoh, University of Ibadan
Akintunde Oladipo, University of Waterloo
Flora Oladipupo, Data Science Nigeria
Stephen D. Richardson, Brigham Young University
Nathaniel Romney Robinson, Whiting School of Engineering, JHU
Ted Pedersen, University of Minnesota, Duluth
Elizabeth Salesky, Google DeepMind
Fabian David Schmidt, Bayerische Julius-Maximilians-Universität Würzburg
Tajwaa Scott, California State University, Los Angeles
Walelign Tewabe Sewunetie, African Institute for Mathematical Sciences, AIMS Rwanda
Olamide Shogbamu, Data Science Nigeria
Rashidat Damilola Sikiru, Obafemi Awolowo University Ile-Ife
Yueqi Song, Carnegie Mellon University
Van-Thuy Phi, RIKEN
Jiayi Wang, University College London
Seid Muhie Yimam, Universität Hamburg
Hao Yu, Mila & McGill University
You are invited to join the Masakhane community Slack (channel #africanlp-acl2026-support), where you can meet other participants and find collaborators, mentors, and advice. Organizers will be available on Slack to answer questions about submissions, format, topics, etc. If you are in any doubt about whether you can contribute to this workshop (e.g., if you have never written a paper, are new to NLP, do not have any collaborators, or do not know LaTeX), please join Slack and contact us there as well.
To reach out to the workshop organizers, please email africanlp-eacl2026@googlegroups.com.