If you participated in the challenge, please complete the relevant form below as soon as possible. This is necessary for important, time-sensitive communications, including prize distribution and certificate issuance.
🔗 CulturalVQA: https://forms.gle/mZA3nLi8u33hqJk57
🔗 GlobalRG: https://forms.gle/Szt28hDzNfWEP8Tr5
The CULTURALVQA challenge emerges from the pressing need to evaluate the cultural understanding of VLMs. While current models excel at general scene understanding, recognizing objects, actions, and attributes in typical contexts, they often fall short of capturing deeper cultural nuances. As global digital interactions expand, models must be capable of understanding not only visual cues but also the underlying cultural values they reflect, such as traditions, rituals, and beliefs.
We evaluate VLMs on the recently developed CulturalVQA benchmark – a visual question answering dataset where the questions are specifically designed to probe models’ understanding of various cultural concepts and values depicted in the images. These questions were crafted by annotators familiar with the cultures represented, ensuring that models are tested on culturally relevant information. The dataset comprises 2,378 image-question pairs from 11 culturally diverse countries across 5 continents. Figure 2 shows a few samples from this dataset. Systematic evaluations of state-of-the-art VLMs reveal a significant gap of 12.3% between the best VLM performance (GPT-4o) and human accuracy, with this gap even more pronounced for African and Islamic countries such as Turkey, Nigeria, Iran, Ethiopia, and Rwanda, where the difference reaches 17.5%. Open-source models lag even further behind, with a gap of 27% compared to human performance. We hypothesize that this disparity arises from (i) the lack of sufficient culturally relevant data in VLM training, (ii) the lack of representation of data from diverse cultures in VLM training, and (iii) models’ difficulty in combining cultural information across vision and language modalities, among other factors. These issues highlight an important barrier to making AI truly global. Hence, this challenge is vital to advancing AI towards better cultural alignment and inclusivity, ensuring that models are capable of a comprehensive global understanding.
Through the first CULTURALVQA challenge, we aim to encourage research in the field of cultural understanding in VLMs. Specifically, our goal is to improve performance and reduce the disparity in model accuracy across different countries and cultural contexts. We believe this is a promising direction with significant room for improvement, and we hope the challenge will stimulate further research and interest in this critical area.
The challenge is hosted on Hugging Face; you can find it here: CulturalVQABench
Start date: March 14, 2025
End date: April 23, 2025 (23:59:59 UTC) – extended from April 15 and April 22, 2025
Winner announcement: June 12, 2025
For any queries regarding the challenge, please email culturalvqa@gmail.com.
While the CULTURALVQA challenge focuses on evaluating the cultural understanding of VLMs, the GlobalRG challenge focuses on evaluating cultural diversity in VLMs’ outputs.
To evaluate cultural diversity in VLMs’ outputs, we evaluate VLMs on the recently developed GlobalRG benchmark. GlobalRG introduces two tasks: (i) retrieval across universals and (ii) cultural visual grounding, both aimed at testing models’ cultural inclusivity (as shown in Figure 3). The former task assesses the models’ ability to retrieve culturally diverse images for universal concepts like “breakfast” or “wedding” from 50 countries across 10 regions. The evaluation uses a novel metric, diversity@k, which measures the cultural diversity among the retrieved images. For the latter task, models are evaluated on their ability to ground culture-specific objects and concepts in images. It covers 15 countries across 8 regions, focusing on items that are uniquely tied to particular cultures (e.g., a Mexican whisk, the “molinillo”). This is critical for assessing how well models understand and process concepts that do not generalize universally but are important for cultural specificity. Evaluation on GlobalRG reveals significant cultural discrepancies in model performance, particularly when comparing Western and non-Western contexts (a gap of ≈ 35%). While models perform well on North American and European concepts, their accuracy drops significantly when handling cultural elements from East and Southeast Asia. Additionally, retrieval of seemingly diverse images often defaults to Western cultural symbols, such as eggs for breakfast or white dresses at weddings, underscoring the persistent biases in current VLMs.
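To make the retrieval metric more concrete, the sketch below shows one simple way a diversity@k-style score could be computed, assuming each retrieved image carries a culture (or region) label: the fraction of distinct cultures among the top-k results. This is an illustrative simplification, not the official GlobalRG implementation, which may define the metric differently (e.g., via entropy or region coverage); consult the benchmark for the exact formulation.

```python
def diversity_at_k(retrieved_cultures: list[str], k: int) -> float:
    """Toy diversity@k: fraction of distinct culture labels among the top-k retrieved images.

    `retrieved_cultures` lists the culture/region label of each retrieved image,
    ordered by retrieval rank. Illustrative sketch only, not the official metric.
    """
    top_k = retrieved_cultures[:k]
    if not top_k:
        return 0.0
    return len(set(top_k)) / len(top_k)

# Example: hypothetical top-5 retrievals for the query "breakfast"
ranked = ["USA", "USA", "Japan", "Mexico", "USA"]
print(diversity_at_k(ranked, k=5))  # 0.6 -> 3 distinct cultures among 5 images
```

Under this toy definition, a retrieval system that returns images from many different cultures for the same universal concept scores higher than one that defaults to a single (typically Western) culture.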
Through the first GLOBALRG challenge, we seek to drive research focused on improving cultural inclusivity in VLMs. We aim to boost model performance while minimizing accuracy gaps across diverse countries and cultural settings. We believe there is considerable opportunity for advancement in this area and hope the challenge will inspire more research and attention to bridging these cultural disparities in VLMs.
The challenge is hosted on Hugging Face; you can find it here: GlobalRG Bench
Start date: March 14, 2025
End date: April 23, 2025 (23:59:59 UTC) – extended from April 15 and April 22, 2025
Winner announcement: June 12, 2025
For any queries regarding the challenge, please email globalrg@gmail.com.