Inspired by image processing and image recognition projects from previous EECS 351 semesters, our team sought to develop a model that analyzes static images and real-time video footage of American Sign Language (ASL) gestures. We were particularly interested in comparing different feature extraction methods that we could use to train a supervised machine learning algorithm. Our classification problem has 29 different classes (the 26 letters of the English alphabet, 2 other symbols, and the lack of a symbol). We implemented six different feature extraction methods, ranging from simple downsampling to feature extraction using unsupervised learning. We were pleasantly surprised at how accurate most of our methods were, even though modern neural networks can do much better. We developed a simple app that takes live footage from a camera such as a webcam and predicts the symbol it sees. Our feature extraction and classification methods are fast enough to do this without noticeable delays.
This method simply downsamples the input image, allowing for more practical processing times in the classification step.
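A minimal sketch of this kind of downsampling feature extractor is shown below. The 32x32 target size and the grayscale conversion are illustrative assumptions, not the exact parameters used in the project.

```python
import cv2
import numpy as np

def downsample_features(image_bgr, size=(32, 32)):
    """Convert a frame to grayscale, shrink it, and flatten it into a feature vector."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, size, interpolation=cv2.INTER_AREA)  # area interpolation for shrinking
    return small.astype(np.float32).flatten()
```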
We use an N-point discrete Fourier transform computed with the Fast Fourier Transform (FFT), where N is much smaller than the size of the original image.
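One way such a transform can be used as a feature vector is sketched below: compute the 2-D FFT and keep only the magnitudes of the lowest-frequency coefficients. Keeping an N x N low-frequency block (with N = 16 here) is an assumption for illustration, not necessarily the project's exact procedure.

```python
import numpy as np

def fft_features(gray_image, n=16):
    """Return the magnitudes of the N x N lowest-frequency DFT coefficients."""
    spectrum = np.fft.fft2(gray_image)
    shifted = np.fft.fftshift(spectrum)          # move the DC component to the center
    rows, cols = shifted.shape
    r0, c0 = rows // 2 - n // 2, cols // 2 - n // 2
    low_freq = shifted[r0:r0 + n, c0:c0 + n]     # central (low-frequency) block
    return np.abs(low_freq).astype(np.float32).flatten()
```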
The discrete wavelet transform can be applied recursively to an image to produce an approximation image that captures important features of the original.
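A minimal sketch of this recursive approximation using PyWavelets is below. The Haar wavelet and three decomposition levels are illustrative assumptions; the project's actual wavelet and depth may differ.

```python
import numpy as np
import pywt

def dwt_features(gray_image, wavelet="haar", levels=3):
    """Repeatedly apply the 2-D DWT, keeping only the approximation sub-band."""
    approx = gray_image.astype(np.float32)
    for _ in range(levels):
        approx, _ = pywt.dwt2(approx, wavelet)   # discard the detail sub-bands
    return approx.flatten()
```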
SIFT detects keypoints in an image and provides a descriptor for each keypoint. We cluster these descriptors into a vocabulary of visual words and have each keypoint vote for its nearest word, producing a histogram that serves as the image's feature vector.
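A minimal bag-of-visual-words sketch of this pipeline, assuming OpenCV's SIFT and scikit-learn's KMeans, is shown below. The 50-word vocabulary is an illustrative choice, not the project's actual cluster count.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(training_images, n_words=50):
    """Cluster SIFT descriptors from the training set into visual words."""
    sift = cv2.SIFT_create()
    descriptors = []
    for img in training_images:
        _, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            descriptors.append(desc)
    return KMeans(n_clusters=n_words, n_init=10).fit(np.vstack(descriptors))

def sift_features(gray_image, vocabulary):
    """Each keypoint votes for its nearest visual word; the histogram is the feature vector."""
    sift = cv2.SIFT_create()
    _, desc = sift.detectAndCompute(gray_image, None)
    hist = np.zeros(vocabulary.n_clusters, dtype=np.float32)
    if desc is not None:
        for word in vocabulary.predict(desc.astype(np.float32)):
            hist[word] += 1
    return hist
```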
SURF also detects keypoints in an image; however, it is designed to be more robust and faster than SIFT. Once again, the keypoints are clustered and vote on the features of the image.
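SURF descriptors can feed the same vocabulary pipeline sketched above for SIFT. Note that SURF lives in the opencv-contrib package and, in recent builds, requires the non-free modules to be enabled; the Hessian threshold below is an illustrative default, not the project's tuned value.

```python
import cv2

def surf_descriptors(gray_image, hessian_threshold=400):
    """Detect SURF keypoints and return their descriptors for K-means clustering."""
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=hessian_threshold)
    _, desc = surf.detectAndCompute(gray_image, None)
    return desc
```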
ORB is a very fast and memory-efficient way to detect keypoints in an image, though its features are not as robust as those of SIFT and SURF. A similar K-means clustering is performed.
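The same clustering pipeline can be reused with ORB by swapping the detector. ORB produces binary descriptors, so casting them to floats before K-means, as in this sketch, is an assumption that mirrors a common workaround rather than the project's exact handling.

```python
import cv2
import numpy as np

def orb_descriptors(gray_image, n_features=500):
    """Detect ORB keypoints and return descriptors as floats for the K-means vocabulary."""
    orb = cv2.ORB_create(nfeatures=n_features)
    _, desc = orb.detectAndCompute(gray_image, None)
    return None if desc is None else desc.astype(np.float32)
```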
Our classifier relies on the AdaBoost ensemble meta-algorithm. AdaBoost combines several weak decision tree learners to derive a strong learner. We trained our classifier on 87,000 examples of hand signs, 3,000 examples for each symbol. The hyperparameters of the classifier were kept constant for all feature extraction methods. Specifically, we used 1000 base estimators with a maximum depth of 4 and a learning rate of 0.3. These numbers were determined experimentally; however, due to the extremely long time necessary to train and evaluate this method, we were not able to fully optimize the hyperparameter selection. It is therefore very likely that this method could achieve higher accuracy than we obtained through further tuning of the parameters.
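A minimal sketch of such a classifier, assuming scikit-learn, is shown below. The hyperparameters are the values quoted above; the training-data variable names are placeholders.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

def build_classifier():
    """AdaBoost over shallow decision trees: many weak learners combined into a strong one."""
    return AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=4),  # weak learner; named base_estimator in older scikit-learn
        n_estimators=1000,
        learning_rate=0.3,
    )

# clf = build_classifier().fit(X_train, y_train)  # X_train: extracted feature vectors, y_train: symbol labels
# y_pred = clf.predict(X_test)
```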
It is easier for the classifier to identify distinctive symbols such as the R shown above. Most of the feature extraction methods are relatively robust to background variation as well. The image above is from a live test with a background far different from that of our training dataset. Feature extraction using the SURF method was by far the best performer in the real-life application.
There are several situations in which our classifier does not work well. It especially struggles with symbols that have a similar shape, such as the A above, which is confused with S. Additionally, the limitations of the dataset we used for training make the classifier considerably less accurate in real applications than our cross-validation results suggest.