Chalearn CASIA-SURF Dataset
1.1 Dataset Introduction
Chalearn CASIA-SURF is a face anti-spoofing dataset consisting of 1,000 subjects and 21,000 video clips in 3 modalities (RGB, Depth, IR). As shown in Fig. 1 and Table 1, our dataset has two advantages: (1) it is the largest one in terms of the number of subjects and videos; (2) it covers three modalities (i.e., RGB, Depth and IR).
In the proposed dataset, each sample includes 1 live video clip and 6 fake video clips, each recorded under a different attack style. Across the attack styles, the eye, nose, and mouth areas, or their combinations, are cut from a printed flat or curved face image. In total, 6 attacks are generated in the CASIA-SURF dataset. Fake samples are shown in Fig. 2. Detailed information about the 6 attacks is given below.
• Attack 1: One person holds his/her flat face photo where the eye regions are cut from the printed face.
• Attack 2: One person holds his/her curved face photo where the eye regions are cut from the printed face.
• Attack 3: One person holds his/her flat face photo where the eye and nose regions are cut from the printed face.
• Attack 4: One person holds his/her curved face photo where the eye and nose regions are cut from the printed face.
• Attack 5: One person holds his/her flat face photo where the eye, nose and mouth regions are cut from the printed face.
• Attack 6: One person holds his/her curved face photo where the eye, nose and mouth regions are cut from the printed face.
We thank Surfing Technology (for more information, please see the link) for providing this high-quality dataset for this research and challenge.
1.2 Acquisition Details
We use the Intel RealSense SR300 camera to capture the RGB, Depth and Infrared (IR) videos simultaneously. To obtain the attack faces, we print color pictures of the collectors on A4 paper. During video recording, the collectors are required to perform some actions, such as turning left or right, moving up or down, and walking toward or away from the camera. Moreover, the performers are asked to keep their face angle below 30 degrees. The performers stand within a range of 0.3 to 1.0 meters from the camera. A diagram of the data acquisition procedure via the Intel RealSense SR300 camera is shown in Fig. 3.
1.3 Data Preprocessing
In order to make the dataset more challenging, we remove the complex background, keeping only the face areas from the original videos. Concretely, as shown in Fig. 4, the accurate face area is obtained through the following steps. Although face detectors are not available for Depth and IR images, we have an RGB-Depth-IR aligned video clip for each sample. Therefore, we first use Dlib to detect the face in every frame of the RGB and RGB-Depth-IR aligned videos, respectively. The detected RGB and aligned faces are shown in the second column of Fig. 4. After face detection, we apply the PRNet algorithm to perform 3D reconstruction and dense alignment on the detected faces. The accurate face area (namely, the face reconstruction area) is shown in the third column of Fig. 4. Then, we define a binary mask based on the non-active face reconstruction area from the previous step. The binary masks of the RGB and RGB-Depth-IR images are shown in the fourth column of Fig. 4. Finally, we obtain the face area of the RGB image via the pointwise product between the RGB image and the RGB binary mask. The Depth (or IR) face area is calculated via the pointwise product between the Depth (or IR) image and the RGB-Depth-IR binary mask. The face images of the three modalities (RGB, Depth, IR) are shown in the last column of Fig. 4. The resolution is 1280 × 720 for RGB images, and 640 × 480 for Depth, IR and aligned images.
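The final masking step above (pointwise product between each image and its binary mask) can be sketched in Python with NumPy. This is a minimal illustration, not the original pipeline: the array shapes follow the resolutions stated above, and the rectangular masks are dummy stand-ins for the masks derived from the PRNet face reconstruction area.

```python
import numpy as np

# Dummy frames at the dataset's stated resolutions:
# RGB is 1280x720; Depth/IR and the aligned stream are 640x480.
rgb = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
depth = np.random.randint(0, 256, (480, 640), dtype=np.uint16)

# Binary masks (1 = face area, 0 = background). Here a rectangle
# stands in for the actual face-reconstruction area.
rgb_mask = np.zeros((720, 1280), dtype=np.uint8)
rgb_mask[200:500, 400:800] = 1
aligned_mask = np.zeros((480, 640), dtype=np.uint8)
aligned_mask[100:300, 200:450] = 1

# Pointwise product keeps the face area and zeroes out the background.
rgb_face = rgb * rgb_mask[..., None]   # broadcast mask over 3 channels
depth_face = depth * aligned_mask      # same mask applies to the IR image
```

The same `aligned_mask` is reused for the IR image, since Depth and IR share the RGB-Depth-IR alignment.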
Table 2 presents the main statistics of the CASIA-SURF dataset:
(1) There are 1,000 subjects, and each one has one live video clip and six fake video clips. The data contains variability in terms of gender, age, glasses/no glasses, and indoor environments.
(2) The data is split into three sets: training, validation and test, with 300, 100 and 600 subjects, respectively. Therefore, there are 6,300 (2,100 per modality), 2,100 (700 per modality), and 12,600 (4,200 per modality) videos in the corresponding sets.
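The video counts above follow directly from the split sizes: each subject contributes 7 clips (1 live + 6 attacks), each recorded in 3 modalities. A quick sanity check:

```python
# Split sizes from the dataset description.
subjects = {"training": 300, "validation": 100, "test": 600}
clips_per_subject = 7  # 1 live + 6 attack clips
modalities = 3         # RGB, Depth, IR

# counts[split] = (total videos, videos per modality)
counts = {}
for split, n in subjects.items():
    per_modality = n * clips_per_subject
    counts[split] = (per_modality * modalities, per_modality)

print(counts)
# {'training': (6300, 2100), 'validation': (2100, 700), 'test': (12600, 4200)}
```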
(3) From the original videos, there are about 1.5 million, 0.5 million, and 3.1 million frames in total for the training, validation, and test sets, respectively. Owing to the huge amount of data, we select one frame out of every 10 frames and form the sampled set with about 151K, 49K, and 302K frames for the training, validation and test sets, respectively.
(4) After the data preprocessing in Sec. 1.3, which removes frames where the face is not detected due to extreme poses or lighting conditions, we finally obtain about 148K, 48K, and 295K frames for the training, validation and test sets of the CASIA-SURF dataset, respectively. All subjects are Chinese, and the gender statistics are shown on the left side of Fig. 5: the ratio of females is 56.8%, while the ratio of males is 43.2%. In addition, we show the age distribution of the CASIA-SURF dataset on the right side of Fig. 5. One can see a wide distribution of ages, ranging from 20 to more than 70 years old, with most subjects under 70. The [20, 30) age range is dominant, covering about 50% of all subjects.