The DataCV Challenge is held in conjunction with the ICCV 2025 DataCV workshop. This year, we focus on generating large-scale face recognition (FR) training sets to address the privacy problems caused by datasets that contain web-scraped images of real identities.
The competition is hosted on CodaLab and consists of two phases. Go to this website to participate in the challenge.
May 16th, 2025 11:59 PM HST *: Release the model framework to be trained and the validation set
May 20th, 2025 11:59 PM HST *: Online evaluation server available for submitting the results of the validation set
Jun 24th, 2025 11:59 PM HST *: Release test set and open submissions for the test phase
Jun 27th, 2025 11:59 PM HST *: Result submission closed
Jul 2nd, 2025 11:59 PM HST *: Deadline for Workshop Paper Submission (Extended from Jun 30th, 2025 11:59 PM HST)
Jul 10th, 2025 11:59 PM HST *: Notification of Acceptance
Aug 17th, 2025 11:59 PM HST *: Camera Ready
Motivation. Training effective face recognition (FR) models is challenging because many real-world datasets are unavailable or unusable due to privacy and ethical concerns. This competition addresses the problem by focusing on synthetic data. The goal is to generate high-quality synthetic training sets that achieve accuracy on real tasks comparable to, or even surpassing, that of models trained on traditional real datasets.
Task. The competition task is to generate training sets with synthetic identities that achieve accuracy comparable to, or higher than, datasets with real identities. Dataset quality is evaluated by the accuracy of the FR model trained on the generated dataset. Participants need to design a method that either generates new training sets or resolves the weaknesses of existing synthetic FR datasets.
Evaluation website. We use CodaLab as the competition platform. The URL is here. Please make sure to register using an official institutional or university email address. Registration requests submitted with personal or non-institutional email addresses will not be approved.
Test sets. We will provide a validation set and a test set for dataset quality evaluation; both sets contain image pairs focusing on age variation, pose variation, and the definition of synthetic identity (similar-looking tasks). Interesting definitions of identity beyond cosine similarity are especially welcome!
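To make the cosine-similarity criterion concrete, below is a minimal sketch of 1:1 pair verification and its accuracy over labeled pairs. The 512-dimensional embeddings, the 0.35 threshold, and the helper names are illustrative assumptions, not the official protocol.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two face embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verification_accuracy(pairs, threshold: float = 0.35) -> float:
    """Fraction of labeled pairs (emb_a, emb_b, same_identity) that the
    cosine-similarity rule classifies correctly. The threshold is a
    hypothetical value, not the one used by the challenge."""
    correct = sum(
        int((cosine_similarity(a, b) >= threshold) == same)
        for a, b, same in pairs
    )
    return correct / len(pairs)

# Illustrative usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=512), rng.normal(size=512), False) for _ in range(4)]
print(verification_accuracy(pairs))
```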
Training sets. Scalability is an important aspect of dataset generation. The generated dataset may contain up to 10K, 20K, or 100K identities, with at most 50 images per identity. Beyond this, there are no restrictions on dataset generation. However, FR model training and result file creation must follow the guidance on GitHub; a sanity check for the scale limits is sketched below.
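As a concrete illustration of the scale limits, here is a small sketch that checks a generated dataset against them. The one-folder-per-identity layout and the directory name are assumptions made for illustration; the required structure is specified in the official GitHub guidance.

```python
from pathlib import Path

MAX_IDS = 100_000        # per-track cap: 10K, 20K, or 100K identities
MAX_IMAGES_PER_ID = 50   # hard cap from the challenge rules

def check_dataset(root: str) -> None:
    """Validate a dataset laid out as root/<identity_dir>/<image_file>
    against the scale limits. This layout is an assumed convention."""
    id_dirs = [d for d in Path(root).iterdir() if d.is_dir()]
    if len(id_dirs) > MAX_IDS:
        raise ValueError(f"too many identities: {len(id_dirs)} > {MAX_IDS}")
    for d in id_dirs:
        n_images = sum(1 for f in d.iterdir() if f.is_file())
        if n_images > MAX_IMAGES_PER_ID:
            raise ValueError(f"{d.name}: {n_images} images > {MAX_IMAGES_PER_ID}")
    print(f"OK: {len(id_dirs)} identities within limits")

# Example (hypothetical path):
# check_dataset("generated_dataset")
```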
Note that we may require teams to provide the generated dataset and a GitHub repository containing the dataset-generation code so that we can check integrity. A team will forfeit its score if there is a large discrepancy between the reported and reproduced accuracy, or if real identities are detected in the dataset.
Ethical considerations. We will use publicly released human face datasets. Additionally, we will adhere to the practices outlined by Asano et al. (NeurIPS 2021) to ensure copyright compliance before distributing these datasets.
University of Notre Dame
University of Notre Dame
Zelin Wen
Shandong University
Yan Tong
Shandong University
For additional information, please contact us.