VLM-HandOverFace Dataset

Description

One of the challenging problems in computer vision is hand segmentation, specifically distinguishing hands from the face when both appear in the same scene. Many applications, such as sign language recognition, rely on results in this area. VLM-HandOverFace was created to enrich the field with a challenging dataset that covers the essential details of hand segmentation in a complex environment, where the complexity comes from the similarity between hand and face skin color. The dataset contains a collection of videos of several people performing hand shapes and movements in front of their faces. It includes 42 participants from a variety of ethnicities, ages, genders, and skin complexions. Each volunteer performs different hand shapes and movements while their hands are in front of their face. Two devices were used for recording: (1) a color and depth camera (Kinect camera), and (2) a Leap Motion sensor, a small sensor that can recognize hands.

Data Collection

All recording sessions took place in a lab setting. Each volunteer sits in front of a Kinect camera and a Leap Motion sensor placed on a movable table adjusted to the performer's comfort. No devices were attached to the subject. The volunteer mimics a sample video shown in front of them. The duration of each recording varies between 3 and 7 minutes depending on the speed of the performer. For each subject, two videos were recorded: one with a wall background and the other with a lab background. The dataset includes the following files for each volunteer:

  • Color stream video from the Kinect camera, which records only the upper body of the subject. The resolution of the video is 1920x1080.
  • Depth stream video from the Kinect camera, which represents the distance between each point in the color view and the depth sensor. The resolution of the video is 512x424.
  • Infrared stream video from the Kinect camera. The resolution of the video is 512x424.
  • Right-hand Leap Motion sensor video, which shows the right-hand shape.
  • Left-hand Leap Motion sensor video, which shows the left-hand shape.
  • Compiled text files for the collected video frames containing upper-body skeleton data, including palm, neck, head, shoulder, arm, and hip points.

Figure 1 shows an example of the recorded streams in the dataset.

Fig 1: Recorded Streams

Content

This dataset consists of 4384 frames, randomly selected from the original videos. Each frame is manually labeled with a single-class mask: hand (1) / background (0). A two-class mask is also provided: right hand (2), left hand (1), and background (0). We also provide bounding-box information and centroid points for each hand. In addition, all recorded streams are available for download. The following describes each dataset attribute.

Note:

Frame Number Code: all image names follow this pattern: two digits representing the subject number _ a special id for each subject _ one digit for the session number (either 1 or 2) _ the word "frame" _ the frame number. Example: 10_LS714_2_frame_3100

Subject Session Code: all text files follow this naming pattern: two digits representing the subject number _ a special id for each subject _ one digit for the session number (either 1 or 2) _ the type of information stored inside the file. For instance: 10_LS714_2_kinectBodyInfo
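As a quick illustration, the following Python sketch splits both naming patterns into their components; the regular expressions and field names are our own and are not part of the dataset specification.

    import re

    # Hypothetical helpers for the two naming conventions above.
    # Field names (subject, subject_id, session, frame, info) are our own labels.
    FRAME_RE = re.compile(r"^(\d{2})_([A-Za-z0-9]+)_([12])_frame_(\d+)$")
    SESSION_RE = re.compile(r"^(\d{2})_([A-Za-z0-9]+)_([12])_(\w+)$")

    def parse_frame_code(name):
        """Split a frame name such as '10_LS714_2_frame_3100' into its parts."""
        m = FRAME_RE.match(name)
        if m is None:
            raise ValueError(f"not a valid frame number code: {name}")
        subject, subject_id, session, frame = m.groups()
        return {"subject": int(subject), "subject_id": subject_id,
                "session": int(session), "frame": int(frame)}

    def parse_session_code(name):
        """Split a file name such as '10_LS714_2_kinectBodyInfo' into its parts."""
        m = SESSION_RE.match(name)
        if m is None:
            raise ValueError(f"not a valid subject session code: {name}")
        subject, subject_id, session, info = m.groups()
        return {"subject": int(subject), "subject_id": subject_id,
                "session": int(session), "info": info}

    print(parse_frame_code("10_LS714_2_frame_3100"))
    print(parse_session_code("10_LS714_2_kinectBodyInfo"))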

1- RGB: the RGB frame captured by the Kinect with size 1920x1080 in jpg format.

2- Depth: the depth frame captured by the Kinect with size 512x424 in jpg format.

3- Infrared: the infrared frame captured by the Kinect with size 512x424 in jpg format.

4- Leap Left: the left-side frame captured by the Leap Motion sensor with size 640x240 in jpg format.

5- Leap Right: the right-side frame captured by the Leap Motion sensor with size 640x240 in jpg format.

6- Mask: the ground-truth frame with a single-class label, where the value 0 is assigned to background pixels and 1 to hand pixels. The size is 1920x1080 with a single uint8 channel in png format.

7- Mask__Left_Right: the ground-truth frame with a two-class label, where the value 0 is for background, 1 is for the left hand, and 2 is for the right hand. The size is 1920x1080 with a single uint8 channel in png format.
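For clarity, here is a minimal sketch of reading the two mask encodings with OpenCV; the file names are illustrative only, since the exact mask file naming is not specified here.

    import cv2
    import numpy as np

    # Illustrative file names; actual names follow the frame number code above.
    mask = cv2.imread("10_LS714_2_frame_3100_mask.png", cv2.IMREAD_GRAYSCALE)
    mask_lr = cv2.imread("10_LS714_2_frame_3100_mask_left_right.png", cv2.IMREAD_GRAYSCALE)

    hand_pixels = mask == 1    # single-class mask: hands vs. background
    left_hand = mask_lr == 1   # two-class mask: left hand
    right_hand = mask_lr == 2  # two-class mask: right hand

    print("hand pixels:", int(np.count_nonzero(hand_pixels)))
    print("left/right pixels:", int(np.count_nonzero(left_hand)),
          int(np.count_nonzero(right_hand)))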

8- All_Hands_Info: one csv file containing the bounding box and centroid points for each frame. Each line in the file includes: (1) the frame number code, and (2) 12 values representing the left-hand bounding box (left, top, width, height), the left-hand centroid (x, y), the right-hand bounding box (left, top, width, height), and the right-hand centroid (x, y).

Example: 10_LS714_2_frame_3100,839,557,151,149,901,638,887,460,142,106,966,518
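As a sketch, the 13 comma-separated fields of each line could be unpacked as follows; the dictionary keys are our own labels, not names used by the dataset.

    import csv

    def read_hands_info(path):
        """Yield one record per line of the All_Hands_Info csv file."""
        with open(path, newline="") as f:
            for row in csv.reader(f):
                frame_code = row[0]
                v = list(map(int, row[1:13]))
                yield {
                    "frame": frame_code,
                    "left_bbox": tuple(v[0:4]),         # (left, top, width, height)
                    "left_centroid": tuple(v[4:6]),     # (x, y)
                    "right_bbox": tuple(v[6:10]),       # (left, top, width, height)
                    "right_centroid": tuple(v[10:12]),  # (x, y)
                }

    # Using the example line above, 'left_bbox' would be (839, 557, 151, 149)
    # and 'right_centroid' would be (966, 518).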

9- Kinect Info Files: for each recorded session (2 per subject) there are three files. subjectSessionCode_kinectBodyInfo.txt, subjectSessionCode_kinectColorBodyInfo.txt, and subjectSessionCode_kinectDepthBodyInfo.txt store the Kinect body skeletons in world coordinates, color camera coordinates, and depth camera coordinates, respectively. World coordinates are the regular 3D coordinates, while color and depth image coordinates express the joints in terms of pixel locations within the color and depth images (plus z-distance). Each line of each file has 202 values, namely 1: timestamp, 2: skeleton id, 3-102: (x, y, z) position + tracking state for all 25 joints, 103-202: 4D quaternion for all 25 joints. The skeletal joints are numbered in the same way as in the official Kinect 2 API; see the official Kinect 2 API documentation for more information.
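Below is a minimal sketch of unpacking one line of these files, assuming whitespace-separated values; the delimiter and the quaternion component order are not specified above, so both are assumptions.

    def parse_kinect_line(line):
        """Unpack one 202-value line of a Kinect info file."""
        values = line.split()  # assumed whitespace-separated
        assert len(values) == 202, "expected 202 values per line"
        timestamp = float(values[0])
        skeleton_id = values[1]
        joints = []
        for j in range(25):  # joints follow the Kinect 2 API numbering
            x, y, z = map(float, values[2 + 4 * j : 5 + 4 * j])
            tracking_state = values[5 + 4 * j]
            # Quaternion component order (w, x, y, z) is an assumption.
            qw, qx, qy, qz = map(float, values[102 + 4 * j : 106 + 4 * j])
            joints.append({"position": (x, y, z),
                           "tracking_state": tracking_state,
                           "orientation": (qw, qx, qy, qz)})
        return timestamp, skeleton_id, joints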

Publishable Images

36 out of 42 participants allowed us to publish their faces in the paper or in any future publication resulting from this dataset. While these images can be shown, the identity of the subjects is not provided and cannot be shared in any way.


Dataset download links

Please respect the volunteers' wishes described in the "Publishable Images" section. Here are the download links for the dataset:

Contact

For any questions, please contact Sakher Ghanem (Email: sakher (dot) ghanem (at) mavs (dot) uta (dot) edu).

Acknowledgment

We want to thank all individuals who volunteered for this dataset. We would also like to thank Sami Alesawi, Bander Badrieg, Sarah Jamaluddin, and Shajahn Parakottu for their support and help in annotating the dataset.