Superpixel Earth Mover's Distance for Hand Gesture Recognition
Abstract
We present a new superpixel-based hand gesture recognition system built on a novel superpixel earth mover's distance metric and the Kinect depth camera. The depth and skeleton information from the Kinect are used to extract the hand without markers. The hand shape, together with the corresponding texture and depth, is represented as a set of superpixels, which compactly retain the overall shape and color of the gesture to be recognized. Based on this representation, a novel distance metric, the Superpixel Earth Mover's Distance (SP-EMD), is proposed to measure the dissimilarity between hand gestures. This measurement is not only robust to distortion and articulation, but also invariant to scaling, translation and rotation with proper preprocessing.
The effectiveness of the proposed distance metric and recognition algorithm is illustrated by extensive experiments on our own gesture dataset as well as two other public datasets. Experimental results show that the proposed system achieves high mean accuracy and fast recognition speed. Its superiority is further demonstrated by comparisons with conventional techniques and by two real-life applications.
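As a concrete illustration of the core computation, the sketch below (Python) estimates an earth mover's distance between two superpixel signatures, assuming each hand is summarized by per-superpixel feature vectors (e.g., centroid position, mean depth and mean color) weighted by superpixel area. The plain Euclidean ground distance and the function name emd_signature are our own placeholders; the exact ground distance and weighting used by SP-EMD are defined in the paper.

    # Minimal sketch: earth mover's distance between two superpixel signatures.
    # Assumptions (not from the paper): Euclidean ground distance between
    # per-superpixel feature vectors; superpixel areas as weights.
    import numpy as np
    from scipy.optimize import linprog

    def emd_signature(feat_a, w_a, feat_b, w_b):
        """EMD between signatures (features, weights) via a transportation LP."""
        n, m = len(w_a), len(w_b)
        cost = np.linalg.norm(feat_a[:, None, :] - feat_b[None, :, :], axis=2)
        c = cost.ravel()                                  # objective coefficients
        total_flow = min(w_a.sum(), w_b.sum())

        a_ub = np.zeros((n + m, n * m))
        for i in range(n):                                # flow out of superpixel i of A
            a_ub[i, i * m:(i + 1) * m] = 1.0
        for j in range(m):                                # flow into superpixel j of B
            a_ub[n + j, j::m] = 1.0
        b_ub = np.concatenate([w_a, w_b])

        a_eq = np.ones((1, n * m))                        # ship all of the smaller mass
        b_eq = np.array([total_flow])

        res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq,
                      bounds=(0, None), method="highs")
        return res.fun / total_flow                       # normalized transport cost

In such a setup, feat_a could be an n x 6 array of [x, y, depth, r, g, b] values per superpixel and w_a the corresponding pixel counts; a query gesture would then be assigned the label of its nearest template under this distance.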
Demo Video of Real-Life Applications
(Rock-Paper-Scissors-Lizard-Spock Game and 3D Content Browser)
(Robotic hand manipulation and 3D scene navigation)
Experimental Results
Our system is evaluated on three real-world datasets: our joint color-depth hand gesture dataset, the NTU hand digit dataset and the American Sign Language (ASL) finger spelling dataset.
Our Joint Color-Depth Hand Gesture Dataset
It contains 10 gestures, each performed in 20 different poses by 5 subjects, giving a total of 1,000 test cases. Each case consists of a color image and a depth map with the corresponding skeleton information. Gesture samples, labeled from 0 to 9, are shown below. Note that this is a challenging real-life dataset: it was collected in two different rooms, under different illumination conditions, using different Kinects. Moreover, the hand motion is not heavily constrained and includes large in-plane rotations and moderate out-of-plane rotations.
(including the samples for view angle sensitivity test)
The confusion matrix of hand gesture recognition using SP-EMD (unit: %). Left: leave-one-out (LOO) CV. Right: leave-four-out (L4O) CV.
The mean accuracy and mean running time of FEMD, Shape Context, Skeleton Matching and our proposed SP-EMD
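The LOO and L4O protocols above are subject-wise splits over the 5 subjects. The snippet below is a minimal sketch of such a split, under the assumption that LOO CV holds out one subject for testing (training on the other four) while L4O CV holds out four (training on a single subject); the helper itself is ours, not from the paper.

    # Minimal sketch of subject-wise cross-validation splits (assumed protocol).
    from itertools import combinations

    def leave_k_subjects_out(subjects, k):
        """Yield (train_subjects, test_subjects) pairs for leave-k-subjects-out CV."""
        for held_out in combinations(subjects, k):
            test = set(held_out)
            train = [s for s in subjects if s not in test]
            yield train, sorted(test)

    subjects = list(range(5))                              # 10 gestures x 20 poses each
    loo_splits = list(leave_k_subjects_out(subjects, 1))   # 5 folds, train on 4 subjects
    l4o_splits = list(leave_k_subjects_out(subjects, 4))   # 5 folds, train on 1 subject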
NTU Hand Digit Dataset
Their Homepage and Download the Dataset
The confusion matrix of hand gesture recognition using SP-EMD (unit: %) for LOO CV.
Comparison with other state-of-the-art recognition algorithms.
ASL Finger Spelling Dataset
Their Homepage and Download the Dataset
The confusion matrix of hand gesture recognition using SP-EMD (unit: %) for LOO CV.
Comparison with other state-of-the-art recognition algorithms.
Sensitivity Analysis
Our algorithm is robust to parameter selection, rotation, scaling and view angle changes.
Parameter Sensitivity Test
Superpixel size and weights
Orientation and Scale Sensitivity Test
Synthetic mismatches are added to corrupt the preprocessed data of our hand gesture dataset before the ICP alignment is applied. More specifically, after the hand shapes are preprocessed with scale normalization, skeleton-based in-plane rotation correction and depth-based out-of-plane rotation correction, they are randomly rotated by an angle theta or scaled by a factor of (1 + delta). In our experiments, theta and delta are drawn from a Gaussian distribution with zero mean and standard deviation sigma. Five values of sigma are tested and each test is repeated 50 times. The following tables summarize the average accuracies under orientation and scale noise, respectively; a sketch of the perturbation is given after the tables.
Mean accuracy of SP-EMD with orientation noise
Mean accuracy of SP-EMD with scale noise
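The following is a minimal sketch of the perturbation described above, assuming the preprocessed hand is an N x 2 array of image-plane points; the function name and argument defaults are ours. In the actual tests, orientation and scale noise are evaluated separately (set the other sigma to zero), and the ICP realignment that follows is not shown.

    # Minimal sketch: random rotation/scale corruption of a preprocessed hand shape.
    import numpy as np

    def perturb(points, sigma_theta_deg=0.0, sigma_delta=0.0, rng=None):
        """Rotate by theta ~ N(0, sigma_theta) degrees and scale by (1 + delta),
        with delta ~ N(0, sigma_delta), about the hand centroid."""
        rng = np.random.default_rng() if rng is None else rng
        theta = np.deg2rad(rng.normal(0.0, sigma_theta_deg))    # orientation noise
        delta = rng.normal(0.0, sigma_delta)                    # scale noise
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        centroid = points.mean(axis=0)
        return (points - centroid) @ rot.T * (1.0 + delta) + centroid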
View Angle Sensitivity Test
The test samples are captured from 5 different view angles (-20, -10, 0, +10 and +20 degrees) with 5 subjects.
The confusion matrix of hand gesture recognition (5 view angles) using SP-EMD (unit: %). Left: LOO CV. Right: L4O CV.
The confusion matrix of hand gesture recognition (5 view angles) using SP-EMD (unit: %) without the preprocessing step of out-of-plane (OOP) rotation correction. Without this correction, the recognition accuracy degrades noticeably (a 2.33% drop in L4O CV and a 0.67% drop in LOO CV). Left: LOO CV. Right: L4O CV.
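For reference, the sketch below shows one generic way to implement a depth-based OOP rotation correction, assuming the segmented hand is an N x 3 point cloud in camera coordinates: fit the palm plane by SVD and rotate its normal onto the camera's optical axis. This is an illustrative stand-in, not necessarily the exact correction used in our system.

    # Generic sketch (assumption): align the palm-plane normal with the camera axis.
    import numpy as np

    def correct_oop_rotation(points):
        centered = points - points.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        normal = vt[-1]                                  # smallest-variance direction
        target = np.array([0.0, 0.0, 1.0])               # camera optical axis
        if normal @ target < 0.0:                        # keep the normal facing the camera
            normal = -normal
        v = np.cross(normal, target)
        s, c = np.linalg.norm(v), normal @ target
        if s < 1e-8:                                     # already aligned
            return centered
        vx = np.array([[0.0, -v[2], v[1]],
                       [v[2], 0.0, -v[0]],
                       [-v[1], v[0], 0.0]])
        rot = np.eye(3) + vx + vx @ vx * ((1.0 - c) / s ** 2)   # Rodrigues' rotation
        return centered @ rot.T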
Publications
C. Wang, Z. Liu and S.-C. Chan, "Superpixel-based Hand Gesture Recognition with Kinect Depth Camera," IEEE Trans. Multimedia, vol. 17, no. 1, pp. 29-39, Jan. 2015. (pdf & link)
C. Wang, Z. Liu, M. Zhu, J. Zhao and S.-C. Chan, "A Hand Gesture Recognition System based on Canonical Superpixel-Graph," Signal Processing: Image Communication, vol. 58, pp. 87-98, Oct. 2017. (pdf & link)
C. Wang, Z. Liu and J. Zhao, "Hand Gesture Recognition based on Canonical Formed Superpixel Earth Mover's Distance," in Proc. IEEE Int. Conf. Multimedia and Expo (ICME), Seattle, Jul. 2016. (link)
C. Wang and S.-C. Chan, "A New Hand Gesture Recognition Algorithm based on Joint Color-Depth Superpixel Earth Mover's Distance," in Proc. Int. Workshop on Cognitive Information Processing (CIP), Copenhagen, 2014, pp. 1-6. (pdf)
Patent Pending ...