Multimodal Alignment for a Pluralistic Society
(MAPS)
CVPR 2026 Workshop
Multimodal AI systems, including vision-language models (VLMs) and text-to-image (T2I) and text-to-video (T2V) generators, increasingly shape how billions of people create, search, and communicate across domains such as creativity, entertainment, education, and healthcare. As these systems are deployed worldwide, however, they often default to Western perspectives, failing to align with the diverse values and contexts of our pluralistic world. Without proper alignment, models may misrepresent cultural practices, reinforce stereotypes, or erase the richness of underrepresented regions. Such blind spots not only erode user trust but also risk amplifying cultural homogenization and inequity on a global scale. While multimodal AI continues to advance, the challenge of aligning these systems with pluralistic, multilingual, and culturally diverse societies remains critically underexplored and calls for a focused interdisciplinary effort.
The MAPS workshop aims to achieve two primary goals:
First, it seeks to unite researchers from computer vision, NLP, HCI, the social sciences, and the humanities, recognizing that building pluralistic multimodal AI systems demands both technical expertise and a nuanced understanding of human values and contexts. The workshop will bring together interdisciplinary experts to define a shared vocabulary around culture, geo-diversity, and pluralism in multimodal AI, while enabling discussion across the entire model development stack: from equitable data collection and optimization objectives that amplify underrepresented communities to post-training methods such as alignment. To facilitate this exchange of ideas, we will feature invited talks and panel discussions with leading researchers, alongside short paper submissions of up to four pages. All accepted papers will be presented in a poster session.
The second goal of the workshop is to benchmark progress in the cultural adaptation capabilities of multimodal AI systems by hosting the Machine Translation for Vision (MTV) challenge, whose underlying task was introduced in work that received a Best Paper Award at EMNLP 2024. The MTV challenge asks participants to build systems that "translate" images culturally, i.e., adapt source images so that they are appropriate for target cultures. Humans already do this in movie adaptations, advertising, and education, yet even the best AI systems have very low success rates, leaving substantial headroom for innovation. The results and winning entries will be presented at the workshop, advancing both technical capabilities and practical applications. By fostering healthy competition, we aim to catalyze the creation of multimodal AI systems with truly global cultural awareness.
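To make the challenge task concrete, the following is a minimal sketch of one possible caption-edit-regenerate baseline. It is an illustrative assumption, not an official starter kit: the function name transcreate and the three model wrappers (caption_model, edit_model, t2i_model) are hypothetical placeholders for whatever captioner, LLM, and T2I generator a participant might choose.

```python
# Minimal sketch of a caption-edit-regenerate baseline for cultural image
# "translation". All three model wrappers passed in below are hypothetical
# placeholders, not part of any official MTV starter kit.

from dataclasses import dataclass
from typing import Callable


@dataclass
class TranscreationResult:
    source_caption: str
    adapted_caption: str
    adapted_image: bytes  # raw image bytes returned by the generator


def transcreate(
    image: bytes,
    target_culture: str,
    caption_model: Callable[[bytes], str],
    edit_model: Callable[[str], str],
    t2i_model: Callable[[str], bytes],
) -> TranscreationResult:
    """Adapt a source image for a target culture in three steps:
    (1) describe the image, (2) rewrite the description so that
    culture-specific elements (food, attire, signage, settings) fit the
    target culture, (3) render the rewritten description with a T2I model.
    """
    # Step 1: ground the source image in text.
    caption = caption_model(image)

    # Step 2: ask an LLM to swap culture-specific elements while
    # preserving the scene's intent and composition.
    prompt = (
        f"Rewrite this image description so it is natural for {target_culture}, "
        f"replacing culture-specific objects but keeping the scene's meaning:\n"
        f"{caption}"
    )
    adapted_caption = edit_model(prompt)

    # Step 3: regenerate the image from the adapted description.
    adapted_image = t2i_model(adapted_caption)
    return TranscreationResult(caption, adapted_caption, adapted_image)
```

A pipelined baseline like this merely illustrates the input-output contract of the task; end-to-end image-editing approaches that adapt the source image directly are equally valid entries.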