Documentation

Collection and Generation

To build FakeAVCeleb, we gathered real videos from the VoxCeleb2 [1] dataset, which consists of real YouTube videos of 6,112 celebrities, mostly interviews accompanied by the speech audio spoken in the video. From VoxCeleb2, we chose 500 videos, one video per celebrity.

We selected videos based on gender, ethnicity, and age. The individuals in the real video set belong to five ethnic groups: African, Asian (East), Asian (South), Caucasian (American), and Caucasian (European) (see Figure 1). Each ethnic group contains 100 real videos of 100 celebrities, with an equal gender split, i.e., 50 videos of men and 50 videos of women per group.

These 500 unique real videos form the real baseline set of FakeAVCeleb, each belonging to a single individual, with an average duration of 6.4 seconds. Since we focus on a specific and practical use of deepfakes, each video had to meet two criteria: it contains a single person with a clear, centered face, and that person is not wearing a hat, glasses, mask, or anything else that might cover the face.
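The sketch below illustrates how such a balanced selection could be scripted. It assumes a hypothetical metadata table (voxceleb2_meta.csv with id, ethnicity, gender, and video_path columns) that is not part of the released dataset; the column and file names are purely illustrative.

```python
# A minimal sketch of the balanced selection step, assuming a hypothetical
# per-celebrity metadata file. Not the authors' actual selection script.
import pandas as pd

meta = pd.read_csv("voxceleb2_meta.csv")  # columns: id, ethnicity, gender, video_path

selected = (
    meta.groupby(["ethnicity", "gender"], group_keys=False)
        .apply(lambda g: g.sample(n=50, random_state=0))  # 50 men + 50 women per group
)

assert len(selected) == 500  # 5 ethnic groups x 100 videos each
selected.to_csv("real_baseline_videos.csv", index=False)
```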

Overall

We used two face-swapping methods, Faceswap [2] and FSGAN [4], to generate swapped deepfake videos. To generate cloned audio, we used SV2TTS [5], a transfer-learning-based real-time voice cloning (RTVC) tool (see Figure 2). After generating the fake videos and audio, we applied Wav2Lip [6] to the generated deepfake videos to reenact them so that they are lip-synced with the generated fake audio.
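The following sketch outlines this pipeline. The run_face_swap and clone_voice helpers are hypothetical placeholders standing in for the Faceswap/FSGAN and SV2TTS (RTVC) tooling; only the Wav2Lip call mirrors the command-line interface of the public Wav2Lip repository, and the checkpoint path and flags should still be verified against the installed version.

```python
# High-level sketch of the generation pipeline; not the authors' actual scripts.
import subprocess

def run_face_swap(source_video: str, target_video: str, out_path: str) -> str:
    """Placeholder: swap the source identity onto the target video (Faceswap or FSGAN)."""
    raise NotImplementedError("wrap your Faceswap/FSGAN setup here")

def clone_voice(reference_audio: str, text: str, out_path: str) -> str:
    """Placeholder: synthesize cloned speech with SV2TTS / Real-Time Voice Cloning."""
    raise NotImplementedError("wrap your RTVC setup here")

def lip_sync(fake_video: str, fake_audio: str, out_path: str) -> None:
    """Reenact the swapped video so its lips match the cloned audio (Wav2Lip CLI)."""
    subprocess.run(
        [
            "python", "Wav2Lip/inference.py",
            "--checkpoint_path", "checkpoints/wav2lip_gan.pth",
            "--face", fake_video,
            "--audio", fake_audio,
            "--outfile", out_path,
        ],
        check=True,
    )

# Example (once the placeholders are implemented):
#   swapped = run_face_swap("source.mp4", "target.mp4", "swapped.mp4")
#   cloned  = clone_voice("target_voice.wav", "transcript text", "cloned.wav")
#   lip_sync(swapped, cloned, "fake_video_fake_audio.mp4")
```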


This kind of audio-video deepfake is particularly threatening, since an attacker can generate both fake video and fake audio to impersonate any potential target person. The dataset can therefore be used to train detectors for both deepfake video and deepfake audio.


Synthesis methods

We use a facial recognition service, Face++ [7], which measures the similarity between two faces. The similarity score helps us find the most similar source and target pairs, resulting in more realistic deepfakes. For each source video, we selected the top five videos with the highest similarity scores and then synthesized the source with each of these five videos, applying the synthesis methods to produce high-quality, realistic deepfakes.
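A minimal sketch of this ranking step is shown below, assuming the publicly documented Face++ Compare endpoint. The API credentials and frame file names are placeholders, and the exact request fields should be checked against the current Face++ API reference.

```python
# Rank candidate target faces by Face++ similarity; credentials and file names
# are placeholders, and the endpoint should be verified against the Face++ docs.
import requests

COMPARE_URL = "https://api-us.faceplusplus.com/facepp/v3/compare"
API_KEY, API_SECRET = "your_api_key", "your_api_secret"

def similarity(face_a: str, face_b: str) -> float:
    """Return the Face++ confidence score (0-100) between two face images."""
    with open(face_a, "rb") as fa, open(face_b, "rb") as fb:
        resp = requests.post(
            COMPARE_URL,
            data={"api_key": API_KEY, "api_secret": API_SECRET},
            files={"image_file1": fa, "image_file2": fb},
        )
    resp.raise_for_status()
    return resp.json().get("confidence", 0.0)

# Compare one source frame against one frame per candidate video, keep the top 5.
source = "source_frame.jpg"
candidates = ["cand_01.jpg", "cand_02.jpg", "cand_03.jpg"]
scores = sorted(((similarity(source, c), c) for c in candidates), reverse=True)
top5 = [name for _, name in scores[:5]]
```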


Generation

While inspecting the generated videos, we filtered them based on the following criteria:

1) The resulting fake video must be of good quality and realistic, i.e., hard to detect with the naked eye.

2) The synthesized cloned audio should sound natural and free of obvious artifacts.

3) The video and corresponding audio should be lip-synced.


Even after applying this manual screening process to the synthesized videos, the final dataset contains more than 20,000 videos.

We found that some of the synthesis methods, FSGAN and Wav2Lip, produced many fake videos of excellent, realistic quality. Faceswap, in contrast, produced several defective videos, since it is sensitive to varying lighting conditions and requires excessive time and resources to train.

Figure 1. Samples from the dataset. We divide the dataset into five ethnic groups: African, Asian (East), Asian (South), Caucasian (American), and Caucasian (European).

Figure 2. Spectrograms of real audio (left) and fake audio (right).

Data Description

Since we generate cloned voices along with the fake videos, we can create four possible combinations of audio-video pairs (see Table): real video with real audio, real video with fake audio, fake video with real audio, and fake video with fake audio.
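As a small illustration, these four combinations can be enumerated as below; the category labels (e.g., RealVideo-FakeAudio) are illustrative and not a statement of the released directory layout.

```python
# Enumerate the four audio-video label combinations; names are illustrative.
from itertools import product

combinations = [f"{v}Video-{a}Audio" for v, a in product(["Real", "Fake"], repeat=2)]
print(combinations)
# ['RealVideo-RealAudio', 'RealVideo-FakeAudio', 'FakeVideo-RealAudio', 'FakeVideo-FakeAudio']
```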


References

[1] Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622, 2018.

[2] Iryna Korshunova, Wenzhe Shi, Joni Dambre, and Lucas Theis. Fast face-swap using convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Oct 2017.

[3] Ivan Petrov, Daiheng Gao, Nikolay Chervoniy, Kunlin Liu, Sugasa Marangonda, Chris Umé, Jian Jiang, Luis RP, Sheng Zhang, Pingyu Wu, et al. Deepfacelab: A simple, flexible and extensible face-swapping framework. arXiv preprint arXiv:2005.05535, 2020.

[4] Yuval Nirkin, Yosi Keller, and Tal Hassner. Fsgan: Subject agnostic face swapping and reenactment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7184–7193, 2019.

[5] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, and Yonghui Wu. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. arXiv preprint arXiv:1806.04558, 2019.

[6] KR Prajwal, Rudrabha Mukhopadhyay, Vinay P Namboodiri, and CV Jawahar. A lip sync expert is all you need for speech to lip generation in the wild. In Proceedings of the 28th ACM International Conference on Multimedia, pages 484–492, 2020.

[7] Face++. Face comparing – protection against fake face attacks – seasoned solution – accurate in real-world, 2021. URL: https://www.faceplusplus.com/face-comparing.