Challenge 2: Dishcovery: VLM MetaFood Challenge
Congrats to the Challenge Winners:
First place: Team SunnyLX - Xin Luo, Haoyu Wen, Xusheng Liu, Jie Yang, Shien Song [Video]
Second Place: Team Desayuno - Ashwin Kumar Gururajan, Enrique Lopez-Cuena [Video]
Third Place: Team e0nia - Madhusudhana Naidu [Video]
Background
The goal of this challenge is to develop a Vision-Language Model (VLM) capable of accurately understanding and relating the visual cues of food dishes to their corresponding textual descriptions. This task pushes the boundaries of fine-grained visual recognition and multimodal alignment, especially in a domain as challenging and diverse as food. Recognizing the subtle visual features that differentiate similar dishes, ingredients, and preparation styles is crucial for a wide range of applications, from dietary tracking to restaurant automation.
To support participants in building strong models, we provide a training set consisting of 400,000 food image-caption pairs obtained with the Precision at Scale framework. The dataset is a combination of synthetic and noisy real captions. Note that some of the images may also be synthetic. The data will be provided in a format that includes each caption along with a link to download the associated image. We strongly encourage all participants to download the images as early as possible and to report any issues they encounter during the download process, so we can address them quickly. As some training-set images belong to LAION, it is possible that some of them will be removed by their authors during the course of the challenge.
To foster creativity, we allow participants to use any external data sources or pre-trained models (see "Limitations" and "Rules" for more information), as long as they adhere to the license terms of this challenge and any third-party data they incorporate. The focus is on achieving the most accurate and detailed alignment between food images and text, and we want to provide as much freedom as possible to explore different modeling strategies and data augmentation techniques.
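Since pre-trained models are allowed, a natural starting point is zero-shot image-caption scoring with an off-the-shelf CLIP checkpoint. The sketch below is purely illustrative: the checkpoint name, the local image path, and the example captions are placeholders chosen by us, not part of any official baseline.

```python
# Minimal zero-shot baseline sketch: score a food image against candidate
# captions with a pre-trained CLIP model from Hugging Face transformers.
# The checkpoint, image path, and captions are illustrative placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example_dish.jpg").convert("RGB")  # any local food image
captions = [
    "a bowl of ramen with soft-boiled egg and scallions",
    "a margherita pizza with fresh basil",
    "a plate of grilled salmon with asparagus",
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the image-to-caption similarities (higher = better match)
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for caption, score in zip(captions, scores.tolist()):
    print(f"{score:.3f}  {caption}")
```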
The initial training data can be downloaded from HuggingFace.
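As a rough sketch of the download step, the snippet below streams the caption/URL records and fetches each image to a local folder. The dataset identifier and the column names ("url", "caption") are assumptions for illustration only; check the actual HuggingFace repository for the real names.

```python
# Sketch of downloading the image associated with each caption/URL record.
# NOTE: the repo id "dishcovery/train" and the columns "url"/"caption" are
# assumptions; consult the released dataset for the actual identifiers.
import os
import requests
from datasets import load_dataset

OUT_DIR = "images"
os.makedirs(OUT_DIR, exist_ok=True)

ds = load_dataset("dishcovery/train", split="train", streaming=True)  # hypothetical repo id

for i, row in enumerate(ds):
    url = row["url"]
    out_path = os.path.join(OUT_DIR, f"{i:07d}.jpg")
    if os.path.exists(out_path):
        continue  # skip images that were already downloaded
    try:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        with open(out_path, "wb") as f:
            f.write(resp.content)
    except Exception as e:
        # Some LAION-sourced images may have been removed; log and move on.
        print(f"failed {url}: {e}")
```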
Description
The challenge is divided into two phases.
Phase One, which is common to all participants, involves two public test sets. For the first test set, participants are encouraged to link every image to N captions, where N can range from 0 to the total number of captions in the test set. This linkage should be provided in a sparse matrix index format (see the sketch below). The second test set uses exactly the same format, but each image has only a single correct caption.
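To make the sparse-index idea concrete, the sketch below encodes image-to-caption links as (image_index, caption_index) pairs in a SciPy sparse matrix. The test-set sizes, the example links, and the output filename are placeholders; the authoritative submission format is the one described on the Kaggle page.

```python
# Sketch of encoding image-to-caption links as a sparse 0/1 matrix.
# Row i corresponds to test image i, column j to test caption j; a 1 at
# (i, j) means image i is linked to caption j. Sizes, links, and the
# output filename below are illustrative, not the official submission spec.
import numpy as np
from scipy.sparse import coo_matrix, save_npz

num_images, num_captions = 1000, 5000  # placeholder test-set sizes

# Each entry is (image_index, caption_index); an image may appear in
# zero, one, or many pairs, since N can range from 0 to num_captions.
links = [(0, 12), (0, 873), (1, 44), (3, 2901)]

rows, cols = zip(*links)
data = np.ones(len(links), dtype=np.int8)
matrix = coo_matrix((data, (rows, cols)), shape=(num_images, num_captions))

save_npz("phase1_links.npz", matrix)  # illustrative output file
```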
The top four participants will advance to Phase Two and will be asked to provide information about the data used to train their models.
In this phase, participants are required to submit their model, all relevant scripts, and a txt file listing the sources of data used to train the model. The organizers will check for any violations of the limitations and rules and will reproduce the Phase One results using the provided model.
Note that results that cannot be reproduced will lead to disqualification.
Additionally, we will evaluate all the models on our own private test set to decide the winner and runner-up. The selected winner and runner-up will be asked to provide all the data used to train their models.
For detailed challenge instructions and regulations, please refer to the Kaggle page: https://www.kaggle.com/competitions/dishcovery-vlm-mtf-cvpr-2025/overview