Datasets created by us as part of our research, now available openly for public use
InDeepFake: A Novel Multimodal Multilingual Indian Deepfake Video Dataset
InDeepFake is a multilingual and multimodal deepfake dataset of people of Indian origin. From a corpus of 389 real videos drawn from public and proprietary sources, we generated 4600+ deepfake videos using seven popular state-of-the-art synthetic video generators (FaceSwap, DeepFaceLab, FSGAN, SV2TTS, E2TTS, F5TTS, and Wav2Lip). Each model was used to generate 750+ deepfakes from the 389 real videos; we provide a precise mapping between each real video and its fake counterparts, for fair evaluation and ease of classification. We also group the samples by generator to achieve fine-grained labeling, anticipating future requirements of training and building generator-specific detectors.
We capture a wide range of dialects and accents that represent the linguistic and ethnic diversity of India, covering seven widely spoken Indian languages. The dataset also encompasses a significantly large set of videos in English spoken by native Indians, ensuring sufficient representation of English-speaking native Indians and making the dataset deployable in a global setting. The dataset is balanced in terms of age and gender.
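The real-to-fake mapping and per-generator labels described above lend themselves to simple programmatic filtering when training generator-specific detectors. The sketch below is a minimal illustration only: the CSV column names ("real_video", "fake_video", "generator", "language") and file names are hypothetical assumptions, not the dataset's actual metadata format.

```python
# Hypothetical sketch of grouping InDeepFake real/fake pairs by generator.
# Column names and file names below are assumptions for illustration.
import csv
import io

def group_by_generator(mapping_csv: str) -> dict:
    """Return {generator: [(real_video, fake_video), ...]} from a mapping CSV."""
    groups: dict = {}
    for row in csv.DictReader(io.StringIO(mapping_csv)):
        groups.setdefault(row["generator"], []).append(
            (row["real_video"], row["fake_video"])
        )
    return groups

# Toy mapping with made-up entries
example = """real_video,fake_video,generator,language
real_001.mp4,fake_101.mp4,FaceSwap,Hindi
real_001.mp4,fake_102.mp4,Wav2Lip,Hindi
real_002.mp4,fake_103.mp4,FaceSwap,Bengali
"""
pairs = group_by_generator(example)
print(pairs["FaceSwap"])  # [('real_001.mp4', 'fake_101.mp4'), ('real_002.mp4', 'fake_103.mp4')]
```

A detector for a single generator would then train on exactly one group, while a general-purpose detector can pool all groups.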
Access InDeepFake at IEEE DataPort, or fill in this InDeepFake_Access_Request_Form and email it to ruchira[AT]it.iiests.ac.in
Paper Link: https://doi.org/10.1016/j.patrec.2025.07.002
Baseline Code Implementation: https://github.com/arnabdasphd/InDeepFake
Cite: A. K. Das, A. Bose, P. Manohar, A. Dutta, R. Naskar, and R. S. Chakraborty, "InDeepFake: A novel multimodal multilingual Indian deepfake video dataset," Pattern Recognition Letters, vol. 197, pp. 16-23, 2025.
DiffSynFace: A Demographically Diverse Synthetic Face Dataset
DiffSynFace is a large-scale synthetic face dataset consisting of 40k+ hyper-realistic synthetic face images. It leverages seven state-of-the-art diffusion-based text-to-image models (Stable Diffusion 1.5, Stable Diffusion 2, Stable Diffusion XL, Stable Diffusion XL-Turbo, DALL-E 3, Adobe Firefly, and Midjourney) to generate images. The dataset provides balanced demographic representation across six ethnic groups (African, American, East Asian, European, Indian, and South Asian), two genders (female and male), and four age categories (20s, 30s, 40s, and 50s). Each age group is further split into early, mid, and late stages (40 images each), ensuring at least 120 images per age group per gender per ethnicity per generator, leading to a dataset of 40k+ demographically diverse synthetic human faces.
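The demographic grid above fully determines the dataset size, which can be checked with a quick back-of-the-envelope calculation (a sketch based only on the counts stated in the description):

```python
# Back-of-the-envelope check of the DiffSynFace size from the stated grid:
# 6 ethnicities x 2 genders x 4 age groups x 3 life stages x 40 images,
# replicated across 7 generators.
ethnicities, genders, age_groups = 6, 2, 4
stages, imgs_per_stage, generators = 3, 40, 7

per_cell = stages * imgs_per_stage  # images per (age group, gender, ethnicity, generator)
total = ethnicities * genders * age_groups * per_cell * generators

print(per_cell, total)  # 120 40320 -- consistent with the stated "at least 120" and "40k+"
```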
Access the dataset through IEEE DataPort, or fill in the DiffSynFace_Access_Request_Form and email it to ruchira[AT]it.iiests.ac.in
Paper Link: https://doi.org/10.1016/j.patrec.2025.08.013
Cite: T. Ghosh, B. Seth, S. Kar, and R. Naskar, "Evaluating the Substitutability of Generative AI-Generated Faces in Biometric Applications: From a Lens of Age, Gender, Ethnicity Detection," Pattern Recognition Letters, vol. 197, pp. 257-266, 2025.
GitHub repository: https://github.com/tanusreeg/DiffSynFace