The Cultivating Sovereign African NLP workshop at the Deep Learning Indaba 2026 takes place on August 6, 2026. It is organised by the Masakhane African Languages Hub.
Important: This workshop is exclusively for registered attendees of Deep Learning Indaba 2026 in Lagos, Nigeria.
African language data is the foundation of sovereign AI, yet it remains one of the continent's most underleveraged resources. Although Africa is home to over 2,000 languages, the vast majority of the linguistic data that powers AI systems is scraped from the internet, and it is often legally restrictive, culturally decontextualised, and insufficient for building truly representative NLP models. Meanwhile, a wealth of untapped data sits in archives, radio recordings, oral histories, and physical repositories across the continent, waiting to be unlocked. Building on the momentum of last year's workshop and directly responding to Indaba 2026's theme of "Sovereign Intelligence," this full-day track led by the Masakhane African Languages Hub moves beyond problem identification to the deployment of solutions. Through three progressive sessions structured around Discovery, Transformation, and Protection, participants will gain the technical skills, ethical frameworks, and community connections needed to become genuine stewards of African language data. The day culminates in a landmark moment: the official launch of Masakhane's Community Pilot Grants, inviting IndabaX communities across the continent to build the datasets that will drive the next generation of African NLP.
Unlock African language data from unconventional sources: The workshop will expand participants' understanding of where African language data lives beyond the internet, exploring untapped sources such as radio archives, oral histories, and library repositories.
Expected outcomes: Participants will gain a broader perspective on non-traditional data sources, along with practical insights from South-South learning with other Global Majority communities.
Build practical capacity for data transformation: The workshop will equip participants with the technical skills to convert raw, unconventional sources into structured, usable datasets through hands-on OCR and digitisation demonstrations.
Expected outcomes: Participants will leave with replicable tools and methods to continue contributing to the African data corpus beyond the workshop.
Embed ethics, licensing and data sovereignty into African NLP practice: The workshop will ground data collection in a rigorous ethical and licensing framework, ensuring community consent, data provenance, and responsible governance of African language resources.
Expected outcomes: Participants will know how to derive and use African language data responsibly, and how to apply licensing frameworks that centre African ownership and community protection.
Launch a continent-wide movement for community-led data collection: The workshop will formally announce Masakhane's Community Pilot Grants, creating a direct pipeline from the Indaba to IndabaX communities to build datasets that serve African people.
Expected outcomes: Participants will leave with clear knowledge of how to apply for funding and how their local efforts connect to the broader vision of sovereign African AI.
08:30 – 08:45: Introduction: The data drought and the hidden wealth of offline sources
08:45 – 09:40: Panel Discussion: Beyond the Web: Where Does Your Language Live?
09:40 – 10:00: Discourse & Engagement: Open-floor Q&A with the panel
10:00 – 10:15: Session Break
10:15 – 10:45: The Technical Lens: OCR and digitisation for African language data
10:45 – 10:50: Transition: The case for diverse data sources
10:50 – 11:20: Hands-On Technical Demonstration: From Source to Sentence
11:20 – 11:30: Q&A and Contributions
11:30 – 12:00: Snack Break & Networking
12:00 – 12:15: Introduction to Data Stewardship
12:15 – 13:00: Expert Session: The Licensing Framework
13:00 – 13:30: Community Pilot Grants Announcement
Confirmed Speakers
Chijioke Okorie
Idris Abdulmumin
Organisers