The first deliverable for our project was the classification of static images. We applied several feature extraction methods, each with varying levels of predictive success, as outlined below. We used the ASL Alphabet dataset from Kaggle, found here.
The table on the left shows cross-validation and test results for each feature extraction method. All methods used the AdaBoost classifier with the same hyperparameters (a maximum tree depth of 3 and 1000 estimators). The cross-validation result shown is the mean score across all folds of 10-fold cross-validation with 87,000 training examples (3,000 per class); the score for each fold is the subset accuracy, a strict metric that only counts exact matches. We then used a small dataset of 29 images, one per class, to test each trained classifier; the test results shown are the percentage of correctly identified examples.
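For reference, here is a minimal sketch of that training setup, assuming scikit-learn (in releases before 1.2 the first argument to AdaBoostClassifier is named base_estimator rather than estimator). X and y stand for the feature matrix and labels produced by whichever extraction method is being evaluated.

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def evaluate_features(X, y):
        """Mean 10-fold cross-validation accuracy for one featurization.

        X: (87000, n_features) feature matrix, y: (87000,) class labels.
        """
        clf = AdaBoostClassifier(
            estimator=DecisionTreeClassifier(max_depth=3),  # max tree depth of 3
            n_estimators=1000,                              # 1000 estimators
        )
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        return np.mean(scores)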
The downsampling method performed surprisingly well, while ORB performed much worse than we expected. SIFT, DWT, and especially SURF performed the best. Since the dataset has 29 classes, simply guessing would yield an accuracy of about 3.4% (1/29); all of the featurization methods performed far better than that.
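To illustrate the simplest of these featurizers, here is a sketch of the downsampling method, assuming OpenCV; the 32x32 grayscale target resolution here is illustrative rather than the exact value we used.

    import cv2
    import numpy as np

    def downsample_features(image_path, size=(32, 32)):
        """Resize the image to a small fixed resolution and flatten it
        into a feature vector for the classifier."""
        img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
        small = cv2.resize(img, size, interpolation=cv2.INTER_AREA)
        return small.flatten().astype(np.float32) / 255.0  # scale pixels to [0, 1]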
After finishing feature extraction and training our machine learning algorithm on each set of features, we built a simple app that is capable of classifying hand gestures in real time. The app takes a live video feed from a camera, such as a laptop webcam, and tries to identify which hand gesture appears within a 200x200 window of each frame. The video is displayed to the user with the box in which the app looks for gestures highlighted and the currently predicted gesture drawn on top.
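A minimal sketch of that loop, assuming OpenCV; the trained classifier and the feature extractor are passed in as parameters, and the position of the 200x200 window is an arbitrary choice for illustration.

    import cv2

    def run_demo(classifier, extract_features, box=200, x0=100, y0=100):
        """Classify the gesture inside a fixed window of the live webcam feed.

        classifier: trained model with a predict() method.
        extract_features: maps a BGR image patch to a feature vector.
        """
        cap = cv2.VideoCapture(0)  # laptop webcam
        while True:
            ok, frame = cap.read()
            if not ok:
                break

            roi = frame[y0:y0 + box, x0:x0 + box]       # region the classifier sees
            label = classifier.predict([extract_features(roi)])[0]

            # Highlight the detection window and draw the predicted gesture
            cv2.rectangle(frame, (x0, y0), (x0 + box, y0 + box), (0, 0, 255), 2)
            cv2.putText(frame, str(label), (x0, y0 - 10),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)

            cv2.imshow("ASL demo", frame)
            if cv2.waitKey(1) & 0xFF == ord("q"):       # press q to quit
                break

        cap.release()
        cv2.destroyAllWindows()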
The demo seen on the left is not as accurate as the validation and test results led us to expect. The video uses the classifier trained on SURF features, our best-performing feature extraction method. Feature extraction and classification are performed only on the region inside the red box in the video. This method consistently did a great job of identifying when no symbol was shown, but it had trouble identifying the other symbols, and depending on the feature extraction method, some symbols were easier to identify than others. As the video shows, the SURF method could occasionally identify S, N, and M, and before we recorded this demo it also did a good job with other symbols such as W, R, and C.
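For completeness, here is a sketch of SURF extraction on the boxed region, assuming opencv-contrib built with the nonfree modules enabled. Pooling the descriptors by their mean is only one way to obtain a fixed-length vector and is shown as an assumption; the Hessian threshold value is likewise illustrative.

    import cv2
    import numpy as np

    # SURF lives in opencv-contrib and requires the nonfree modules.
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)

    def surf_features(roi, descriptor_size=64):
        """Detect SURF keypoints in the red-box region and pool their
        descriptors into a single fixed-length feature vector."""
        gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
        keypoints, descriptors = surf.detectAndCompute(gray, None)
        if descriptors is None:                 # no keypoints, e.g. empty background
            return np.zeros(descriptor_size, dtype=np.float32)
        return descriptors.mean(axis=0)         # mean pooling over keypoints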
Live symbol recognition presents many more challenges than the static image classification our algorithm was trained on. There are also issues with the dataset we used. As the example images from the Kaggle dataset shown throughout make clear, every image has the same background, the same lighting conditions, and the same person performing the hand gesture. This proves detrimental to the learning algorithm, as it never learns to ignore the distracting characteristics present in the live video shown to the left.