Towards equitable cultural and linguistic representation and participation in today’s digital and AI-driven world
The benefits of artificial intelligence (AI) are only as inclusive as the data behind them. Yet most of the world’s languages, especially Indigenous, minority, and oral-tradition languages, remain almost invisible in digital systems.
CODI is working to change that. We are launching an initiative to define and build solutions that ensure access to AI and digital tools in every language, preserving cultural heritage and enabling all communities to participate fully in the digital future. This is a significant undertaking and the first effort of its kind. By enabling structured, inclusive data collection and engagement, we aim to empower people to shape their digital presence and strengthen their cultural identity online.
Our first step is to define what constitutes a culturally relevant dataset and to establish ethical standards for how data is collected, shared, and used, so that every action we take is grounded in community priorities, cultural representation, and responsible practices. That work breaks down into the following tasks:
Establish the concept, scope, and intended outcomes of the Minimum Viable Dataset.
Create tiered dataset models and specify criteria for each tier, tied to real-world use cases.
Identify the core elements needed for authentic and comprehensive representation at each tier.
Recommend metadata structures, tagging conventions, and open protocols to ensure interoperability.
Adapt relevant standards from existing frameworks to ensure responsible creation, ownership, and use of datasets.
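To make the tiering and metadata tasks above more concrete, here is a minimal sketch of what a single dataset record might look like. Everything in it is an assumption for illustration: the tier names, the field names, and the `DatasetRecord` class are hypothetical and are not defined by CODI. The only external convention used is the BCP 47 language tag (for example, "mi" for Māori).

```python
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import List
import json


class Tier(Enum):
    # Hypothetical tier names; the initiative has not published its tiers.
    CORE = 1
    EXTENDED = 2
    COMPREHENSIVE = 3


@dataclass
class DatasetRecord:
    # All field names are illustrative assumptions, not a published schema.
    language_tag: str            # BCP 47 language tag, e.g. "mi"
    tier: Tier                   # which tier this dataset is meant to satisfy
    modality: str                # e.g. "text" or "audio"
    steward: str                 # community or body that governs the data
    consent_documented: bool     # whether contributor consent is on record
    license: str                 # terms under which the data may be used
    tags: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of required text fields that were left empty."""
        required = ("language_tag", "modality", "steward", "license")
        return [name for name in required if not getattr(self, name).strip()]

    def to_json(self) -> str:
        """Serialize to JSON so the record can be exchanged between tools."""
        data = asdict(self)
        data["tier"] = self.tier.name  # store the tier by name, not by object
        return json.dumps(data, ensure_ascii=False, sort_keys=True)


# Example values are placeholders, not real data.
record = DatasetRecord(
    language_tag="mi",
    tier=Tier.CORE,
    modality="audio",
    steward="Example community council",
    consent_documented=True,
    license="example-community-license",
    tags=["oral-tradition"],
)
print(record.missing_fields())
print(record.to_json())
```

Keeping stewardship and consent next to the technical metadata is one possible way to keep the ethical standards attached to the data itself; an actual schema would need to be agreed with the communities concerned.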