As multimedia communication becomes ubiquitous, translation must coherently adapt all modalities, yet machine translation systems remain confined to text and speech. In our EMNLP 2024 Best Paper award-winning work, we identified the critical need to transcreate images for cultural appropriateness, a need shared by several real-world applications as shown above. The task is to "translate" images culturally, i.e., to adapt source images so that they are appropriate for the target culture while preserving meaning.
The test set comprises two splits, following the original paper: (a) concept -- 585 images spread across 7 countries and 17 categories (e.g., food, beverages, agriculture, housing), with a single concept per image; (b) application -- 100 images spanning educational materials and storybooks. Success on each split is evaluated as described below:
You will be given: (a) the input image (shown on the left); (b) the category it belongs to (in this case, food); (c) the country the source image is highly relevant to (in this case, India); (d) the target country for which the image needs to be adapted (in this case, Japan).
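A single concept-split input can be sketched as a simple record holding the four pieces of information listed above. The field names and file path below are illustrative only, not the official dataset schema:

```python
from dataclasses import dataclass

@dataclass
class ConceptExample:
    """One concept-split input (field names are illustrative, not official)."""
    image_path: str       # path to the source image
    category: str         # one of the 17 categories, e.g. "food"
    source_country: str   # country the source image is most relevant to
    target_country: str   # country to adapt the image for

# The worked example from the task description: Indian food adapted for Japan.
example = ConceptExample(
    image_path="images/example.png",
    category="food",
    source_country="India",
    target_country="Japan",
)
print(example.category, "-", example.source_country, "->", example.target_country)
```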
You have to adapt or "transcreate" the image such that: (a) the model output belongs to the same category as the original image; (b) the model output is culturally relevant to the target country.
Your output will receive a score of 1--5 on the extent to which the model output (a) belongs to the same category as the original image (e.g., a house changes into a house); (b) is more relevant to the target country than the input image.
You will be given: (a) the input image (shown on the left); (b) the concept it teaches, for educational materials (e.g., counting, addition), OR the text associated with the image, for storybooks; (c) the target country for which the image needs to be adapted (in this case, India).
Note that these images carry no metadata on the cultures they are originally most relevant to. This mimics real-world use, where a model should be able to adapt to multiple cultures regardless of the input. The concept split is an easier toy dataset, curated to enable progress on the harder application split.
You have to adapt or "transcreate" the image such that: (a) the model output can teach the same concept as the original worksheet, for educational materials (in this case, counting objects), OR the model output is coherent with the text of the story, for storybooks; (b) the model output is culturally relevant to the target country. Your output will receive a score of 1--5 on each of these criteria to judge overall success.
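The per-output ratings above can be rolled up into split-level scores. The sketch below simply averages the 1--5 ratings on the two criteria across outputs; this aggregation (and the function name) is an assumption for illustration, not the official challenge metric:

```python
from statistics import mean

def split_success(ratings):
    """Average per-output ratings for one split.

    `ratings` is a list of (meaning, culture) pairs, each a 1-5 score:
    `meaning` rates category/concept preservation, `culture` rates
    relevance to the target country. The plain mean used here is an
    illustrative aggregation, not the official challenge metric.
    """
    meaning_scores = [m for m, _ in ratings]
    culture_scores = [c for _, c in ratings]
    return mean(meaning_scores), mean(culture_scores)

# Hypothetical ratings for three transcreated outputs:
ratings = [(5, 4), (4, 5), (3, 3)]
meaning, culture = split_success(ratings)
print(f"meaning preservation: {meaning:.2f}, cultural relevance: {culture:.2f}")
```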
The challenge is hosted on HuggingFace; you can find the link here: Coming soon!
Start date: TBD
End date: TBD
Winner announcement: TBD
For any queries regarding the challenge, please email maps.cvpr@gmail.com