On the previous page we discussed how these AI systems are built and saw that everything begins with the data. If most of that data is Western in language and culture, the results may not reflect the realities or needs of people in low- and middle-income countries (LMICs). The language may not match what they speak, and the cultural references may not fit their context.
At its core, the challenge is: how can we bring our own data into these models? There are several levels of complexity:
Prompting: The simplest approach is careful prompt design, giving the model specific instructions and context in your question.
Retrieval-Augmented Generation (RAG): A powerful method where you provide external documents or knowledge, and the model retrieves this information before answering. This allows you to add new, updated, or local context without changing the model itself.
Fine-tuning: A more advanced method where the model is retrained with a new dataset to adapt it for a specific domain or task.
Training from scratch: The most complex approach, where an entirely new foundational model is built using local or specialized data.
In prompt engineering you use the model as it is, but your prompt becomes the data that shapes its response. By giving the model clear instructions, you guide it closer to the results you need.
For example:
Generic Prompt: "Create a lesson plan for teaching photosynthesis to high school students."
Specific Prompt: "You are designing a lesson plan on photosynthesis for a rural secondary school in sub-Saharan Africa. The classroom has limited electricity, no digital projectors, and few textbooks. Suggest a lesson that uses everyday materials (like leaves, sunlight, or simple drawings on a chalkboard), includes interactive group activities, and explains the concept in simple English that could be translated into local languages."
By adding context such as the setting, constraints, and guidance, prompt engineering helps adapt AI outputs to the realities of LMICs.
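To make this concrete, here is a minimal sketch of what the specific prompt above could look like when sent through an API, using the OpenAI Python client purely as an illustration; the model name and the exact wording of the prompts are assumptions, and any chat-based provider offers a similar interface.

```python
# A minimal prompt-engineering sketch: the model is unchanged, only the prompt
# carries the context. Model name and wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# The "data" lives in the prompt itself: the setting, the constraints,
# and the guidance on language.
system_prompt = (
    "You are an instructional designer supporting teachers in rural "
    "secondary schools in sub-Saharan Africa. Classrooms have limited "
    "electricity, no digital projectors, and few textbooks."
)
user_prompt = (
    "Design a lesson plan on photosynthesis that uses everyday materials "
    "(leaves, sunlight, simple chalkboard drawings), includes interactive "
    "group activities, and explains the concept in simple English that "
    "could be translated into local languages."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(response.choices[0].message.content)
```

Notice that nothing about the model changed between the generic and specific versions; the only difference is the context packed into the prompt.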
RAG is a powerful way to make language models more useful without changing the model itself. The idea is simple: you add your own documents, and the model retrieves information from them before generating an answer. If you’ve ever uploaded files into ChatGPT and then asked questions about them, you’ve already used RAG.
This is especially valuable in LMIC contexts because it allows you to inject local content such as regional curricula, community health guides, or government policies so the model’s responses are grounded in your reality, not just Western data.
ChatGPT now takes this a step further with Custom GPTs, where you can upload multiple documents and even set a system prompt to define its purpose. For example, here’s a Custom GPT built for Ghanaian educators: 21st Century Teacher – Educator AI for Ghana.
And ChatGPT isn’t the only one. Tools like BoodleBox also let you upload documents and customize a bot to fit your specific context, showing how RAG can help communities make AI more relevant to their needs.
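If you want to see what happens behind those tools, the core retrieval step can be sketched in a few lines. The example below uses TF-IDF similarity from scikit-learn as a stand-in for a real vector database; the documents, the question, and the prompt template are illustrative assumptions.

```python
# A minimal RAG retrieval sketch: find the most relevant local document,
# then build a prompt that grounds the model in that content.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Local content you want answers grounded in (illustrative snippets).
local_docs = [
    "National science syllabus: photosynthesis is introduced in Primary 5, term 2 ...",
    "Community health guide: oral rehydration solution is made with clean water ...",
    "District agricultural manual: maize should be planted after the first rains ...",
]

question = "When is photosynthesis introduced in the national curriculum?"

# 1. Index the documents and the question in the same vector space.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(local_docs)
query_vector = vectorizer.transform([question])

# 2. Retrieve the most relevant document.
scores = cosine_similarity(query_vector, doc_vectors)[0]
best_doc = local_docs[scores.argmax()]

# 3. Augment the prompt: the retrieved text travels with the question,
#    so the model answers from your content rather than its defaults.
augmented_prompt = (
    f"Answer using only the context below.\n\n"
    f"Context:\n{best_doc}\n\n"
    f"Question: {question}"
)
print(augmented_prompt)
```

In a production setup the TF-IDF index would usually be replaced by an embedding model and a vector store, but the pattern is the same: retrieve your own content first, then let the model answer from it.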
Imagine you want an AI tool to generate lesson plans that match your country’s national curriculum. If you use a general model, it may default to Western content. With fine-tuning, you can train the model on your own textbooks and exam materials so students and teachers get answers that actually reflect what they are tested on.
Fine-tuning means taking an existing model and training it further on a new dataset to adapt it for a specific task. Unlike RAG, which just adds extra context at runtime, fine-tuning reshapes the model’s behavior, sometimes even overwriting its original patterns with your data.
To do this you need three things:
A local dataset – such as curricula, health guidelines, or agricultural manuals.
Compute resources – machines or cloud services powerful enough to run training.
An evaluation plan – a way to check if the model’s answers really fit your context.
The process is quite technical. Tools like finetunedb and Hugging Face AutoTrain can make it easier, but we strongly suggest trying prompt engineering or RAG first before jumping into fine-tuning.
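For a sense of what those tools do under the hood, here is a minimal fine-tuning sketch using Hugging Face’s transformers and datasets libraries. The base model, file name, data format (JSON Lines with prompt and response fields), and hyperparameters are all illustrative assumptions rather than a recipe to copy as-is.

```python
# A minimal supervised fine-tuning sketch with Hugging Face transformers.
# Model, file names, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base_model = "bigscience/bloom-560m"  # small multilingual model, for illustration
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Local training data: one JSON object per line, e.g.
# {"prompt": "Explain photosynthesis for Primary 5.", "response": "Plants use sunlight ..."}
dataset = load_dataset("json", data_files="curriculum_qa.jsonl", split="train")

def tokenize(example):
    # Join each prompt/response pair into a single training text.
    text = f"Question: {example['prompt']}\nAnswer: {example['response']}"
    return tokenizer(text, truncation=True, max_length=512)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="bloom-560m-curriculum",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    # mlm=False means standard next-token (causal) language modelling.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.save_model("bloom-560m-curriculum")
```

Even a small run like this needs the evaluation plan mentioned above: compare the fine-tuned model’s answers against your curriculum or guidelines before putting it in front of teachers or students.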
The final and most complex level is creating a foundational model from scratch. This means training a large model entirely on your own data rather than starting with an existing one. The advantage is that you can build a system that truly reflects your languages, cultures, and knowledge, instead of relying on Western-centered models.
However, this process requires massive datasets, very powerful computing infrastructure, and significant funding. For this reason, it is usually carried out by large collaborations, governments, or research labs rather than individual organizations.
In the LMIC context, building from scratch makes sense when there is no existing model that covers your language or culture at all. For example, Latam-GPT is being built to better represent Latin American contexts, and lelapa.ai is focused on African languages and contexts.
Collaborative initiatives like Masakhane and BLOOM show how this is possible:
Masakhane is a grassroots African NLP community where researchers across the continent contribute datasets, translations, and expertise. By pooling resources, they can create models for African languages that no single lab could support alone.
BLOOM is a global open science project where over 1,000 researchers worked together to train one of the largest multilingual models ever built. By sharing compute, data, and knowledge, BLOOM lowered the barrier for researchers outside of big tech companies to build state-of-the-art models.
Because of the cost and complexity, this is not the first step most communities should take. Prompt engineering and RAG are usually better starting points. Building a foundational model is a valuable long-term goal when local ownership and representation are essential.