News Update: The winners of the CVPR 2024 iteration of the challenge are announced here. Congratulations to all the teams!
Multimodal Foundation Models (MMFMs) have shown unprecedented performance on many computer vision tasks. However, on specialized tasks such as document understanding, their performance is still underwhelming. To evaluate and improve these strong multimodal models on document image understanding, we harness a large amount of publicly available and privately gathered data (listed in the image above) and propose a challenge. Below, we list all the important details. The challenge runs in two separate phases.
We will award $10K in prizes to the top teams.
For the first phase, we build a comprehensive data suite from publicly available datasets: DocVQA, FUNSD, IconQA, InfographicVQA, TabFact, TextbookVQA, WebSRC, WildReceipt, and WTQ. All of these datasets align with the challenge's goal of document image understanding in specific domains such as tables, receipts, infographics, and document figures. The collection consists of a train and a test set and can be downloaded from the MMFM Data Collection.
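As an illustration, here is a minimal sketch of how the combined collection might be iterated over once downloaded and extracted. The directory layout, file names, and annotation schema below (one sub-folder per source dataset, an `annotations.json` file, an `images/` directory) are assumptions for illustration only, not the official data format:

```python
import json
from pathlib import Path

# Assumed layout (not official): one sub-folder per source dataset
# (e.g. docvqa/, funsd/, ...), each containing an annotations.json
# file and an images/ directory.
DATA_ROOT = Path("mmfm_collection/train")

def iter_samples(root: Path):
    """Yield (image_path, question, answer) triples from every sub-dataset."""
    for dataset_dir in sorted(root.iterdir()):
        ann_file = dataset_dir / "annotations.json"
        if not ann_file.is_file():
            continue
        for rec in json.loads(ann_file.read_text()):
            image_path = dataset_dir / "images" / rec["image"]
            yield image_path, rec["question"], rec["answer"]

# Inspect the first sample of the (hypothetical) unpacked collection.
for image_path, question, answer in iter_samples(DATA_ROOT):
    print(image_path, question, answer)
    break
```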
For the second phase, an alien test set will be released. The intent of this set is to prevent participants from overfitting to the publicly available datasets. It will contain data drawn from a distribution similar to the phase 1 collection, but none of it will come from publicly available sources. The input data will be released on May 20, 2024.
Phase 1: Train and test sets are provided. We encourage participants to download the data from the MMFM Data Collection.
Phase 2: An alien test set will be released after phase 1. Participants will be required to submit their results on this alien test set.
Rules: Please read the full rules here: Challenge Phases and General Rules.
The challenge winner prizes are awarded in collaboration with: