Sign language is a form of communication used by people with hearing and/or speech impairments. Their ability to communicate in everyday life often depends on the availability of an interpreter to translate signs. With the evolution of Artificial Intelligence and Machine Learning, it is now possible to train a computer to perform tasks that are usually done by humans.
Combinations of hand and finger poses form a way to communicate with one another. Different countries use different sign languages, and in the absence of a global sign language, the scope of this project is restricted to American Sign Language (ASL) for the time being.
Many deep learning strategies have been developed to address problems related to gesture and pose recognition. However, data quality severely restricts the performance of even state-of-the-art techniques. In this project, we train a simple Convolutional Neural Network (CNN) on differently preprocessed samples of image data and observe the resulting model performance in each case. The primary goal is to analyze the impact of different preprocessing choices on a simple deep-learning model and to discuss the implications of the insights derived from the results.
Robust real-time hand perception is a challenging computer vision task, even though it comes naturally to human perception. In this project, we evaluate preprocessing techniques that can pinpoint hand position and orientation, and use a deep learning model to recognize the characteristic pattern of a hand gesture. The resulting model is capable of identifying hand gestures in ASL.
Feeding low-resolution, grayscale images to a deep-learning model is a common approach in the literature. Many papers describe filtering and scaling the images and iteratively applying smoothing and max-pooling to enhance the region of interest. The main drawback of this approach, however, is that in cases of occlusion, shadows, and/or bad lighting, the model is unable to detect anything. Moreover, such a model does not generalize well to real-time input under a variety of conditions, i.e., it is not robust.
To overcome this, pinpointing exact key points in an image regardless of background or scale, also known as 'landmarking', turned out to be the best fix for our problem. In our case, we extract multiple key points corresponding to the palm position in the image. The landmarks are then normalized relative to the palm to avoid scaling problems. A dataset was compiled by generating the landmarks for the ASL letters A to F, with approximately 300-355 samples per letter. The dataset consists of 2130 rows, each with a 'label' field followed by 42 feature fields containing the x and y coordinates of the 21 landmarks. The images used to generate the dataset are taken from the ASL Alphabet dataset on Kaggle [9].
Figure 1
Figure 1 describes the high-level approach of this project. A set of images selected from the ASL Alphabet dataset [9] is fed to the MediaPipe pipeline. MediaPipe [4] uses a Single-Shot Detector (SSD) model that can be used for real-time as well as static key point detection. MediaPipe's workflow is described in detail in Figure 2 (a).
One of the main challenges of this project was normalizing the raw landmark coordinates. The raw coordinates produced by MediaPipe are relative to the image size and/or palm length. For the model to generalize well to scaled images, the key points must be normalized. To address this, we selected a base key point, subtracted each key point from the base point, and normalized the obtained values to the range [-1, 0]. The data was then compiled into a dataset and used to train the CNN.
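As a concrete illustration of this step, a minimal Python sketch is given below. The choice of the wrist (landmark 0) as the base key point and the final affine mapping into [-1, 0] are assumptions made for illustration, since the text does not spell them out.

import numpy as np

def normalize_landmarks(raw_xy, base_index=0):
    # raw_xy: (21, 2) array of x, y coordinates from MediaPipe.
    # base_index: the base key point (assumed here to be landmark 0, the wrist).
    raw_xy = np.asarray(raw_xy, dtype=np.float32)
    # Subtract each key point from the base point, as described above.
    offsets = raw_xy[base_index] - raw_xy
    # Scale by the largest absolute offset so values fall in [-1, 1] ...
    scale = np.abs(offsets).max()
    scaled = offsets / scale if scale > 0 else offsets
    # ... then map [-1, 1] linearly onto [-1, 0] (one possible way to obtain
    # the [-1, 0] range used in this project).
    return ((scaled - 1.0) / 2.0).flatten()  # 42 features per sample

Each processed image then contributes one row of 42 normalized features, plus its class label, to the dataset.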
Figure 2 (a) (source: https://google.github.io/mediapipe/solutions/hands)
Figure 2 (b)
Reference [10] provides a detailed description of the MediaPipe architecture. As seen in Figure 2 (a), the 'HandDetection' module is implemented by training a Single-Shot Detector (SSD) model optimized to detect palms in both real-time and static images. Note that a 'palm' detector rather than a 'hand' detector is used to train the SSD; the primary reason is that the palm is a relatively symmetric, rigid object compared to the whole hand.
The 21 landmarks shown in Figure 2 (b) are extracted and normalized before being fed to the CNN for training.
The images selected for training and testing exhibit different levels of shadow and lighting as well as varying hand positions and orientations, as shown in Figure 3. The palm detector recognizes the palm in each image and outputs 21 key points with x, y, and depth coordinates.
Figure 4
Figure 4 shows the detailed pipeline for palm landmark extraction. MediaPipe extracts x, y, and z coordinates for each detected key point. Here, we discard the depth (z) coordinate, as we only need the 2-D hand pose.
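A minimal sketch of this extraction step using the MediaPipe Hands Python API is shown below; reading images with OpenCV and restricting detection to a single hand are assumptions made for illustration.

import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def extract_xy_landmarks(image_path):
    # Run MediaPipe Hands on a static image and keep only the x, y coordinates
    # of the 21 landmarks; the z (depth) value is discarded, as described above.
    image = cv2.imread(image_path)  # hypothetical input file
    with mp_hands.Hands(static_image_mode=True, max_num_hands=1) as hands:
        results = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    if not results.multi_hand_landmarks:
        return None  # no palm detected in this image
    hand = results.multi_hand_landmarks[0]
    # Each landmark's x and y are normalized to [0, 1] by the image width/height.
    return [(lm.x, lm.y) for lm in hand.landmark]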
In this experiment, we train a CNN (Convolutional Neural Network) that classifies six ASL hand signs: "A", "B", "C", "D", "E", and "F".
Dataset Selection:
In this experiment, the ASL Alphabet dataset available on Kaggle [9] is used. It contains 3000 images per hand sign, of which we use only 355 images per letter.
Data Preprocessing:
After extracting the hand landmarks using the MediaPipe model [4], we obtain a list of 21 landmarks per image, with x and y coordinates normalized to the range [0.0, 1.0] by the image width and height. In this experiment, we train two models on two differently preprocessed versions of this data.
The first is the naïve approach: we train on the raw output of the landmark model without any further processing.
In the second, we calculate each landmark's offset from a base landmark and normalize it (see the sketch of this normalization above). This keeps the feature values from changing under translation, rotation, and scaling of the hand while the sign itself remains the same.
CNN Model Definition:
We define a 4-layer model with two dropout layers to avoid overfitting. The first layers use the ReLU activation function and the last layer uses the softmax activation function (see Figure 5).
Figure 5: CNN model definition
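Since the figure itself is not reproduced here, the following is a minimal Keras sketch of such a network. The layer sizes, the use of fully connected layers on the 42 landmark features, the Adam optimizer, and the integer-label loss are assumptions made for illustration rather than the exact definition from the figure.

import tensorflow as tf

NUM_FEATURES = 42   # 21 landmarks x (x, y)
NUM_CLASSES = 6     # letters A-F

# Hypothetical layer sizes; the exact architecture is given in Figure 5.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),  # learning rate from the text
    loss="sparse_categorical_crossentropy",                   # integer labels assumed
    metrics=["accuracy"],
)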
Model Training:
Given the small dataset, we train for at most 500 epochs with a batch size of 64. The learning rate is 0.001, which is fairly small, but for a dataset of this size training is still not computationally expensive. In addition, we stop training early if the training loss does not decrease within 20 epochs.
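Continuing the sketch above, this training setup (at most 500 epochs, batch size 64, early stopping when the training loss has not improved for 20 epochs) could look as follows; the array names X_train, y_train, X_val, and y_val are placeholders.

from tensorflow.keras.callbacks import EarlyStopping

# Stop when the training loss has not decreased for 20 consecutive epochs.
early_stop = EarlyStopping(monitor="loss", patience=20, restore_best_weights=True)

history = model.fit(
    X_train, y_train,                 # placeholder arrays of features and labels
    validation_data=(X_val, y_val),   # used for the curves in Figure 6
    epochs=500,
    batch_size=64,
    callbacks=[early_stop],
)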
Model Evaluation:
For the naively preprocessed data, the test accuracy and test loss are 0.7112 and 0.856, respectively, which indicates the inconsistency of the data.
For the carefully preprocessed data, however, the test accuracy and test loss are 0.997 and 0.02, respectively, which is a substantially better result.
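These numbers correspond to a standard held-out evaluation, which in the Keras sketch above would be computed as follows (X_test and y_test are placeholder arrays):

# Evaluate the trained model on the held-out test split.
test_loss, test_acc = model.evaluate(X_test, y_test, verbose=0)
print(f"test accuracy: {test_acc:.4f}, test loss: {test_loss:.4f}")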
For a better understanding of the results, the plots in Figure 6 (a) and Figure 6 (b) show the accuracy and loss over the training epochs for the two models.
Figure 6 (a): Accuracy and loss over the training epochs of the naive model
As Figure 6 (a) shows, the validation loss starts to increase after around 100 epochs even as the training loss keeps decreasing. The reason is that the coordinates of each hand landmark are relative to the full size of the image, so the position and orientation of the hand greatly affect the feature values even though exactly the same sign is performed in each case. This poor choice of input representation leads to poor generalization: the model performs increasingly well on the training data but not on the validation data.
Figure 6 (b): Accuracy and loss over the training epochs of the carefully preprocessed model
By normalizing the landmark coordinates, we can avoid this. As Figure 6 (b) shows, the training and validation curves for both accuracy and loss now move together. This indicates that the model generalizes well and does not overfit like the previous model.
We tested the final model on real-time images and observed that it predicts four of the six letters with decent accuracy. As stated earlier in the project proposal, a model is only as good as its dataset. From the results, we can infer that the data for the correctly predicted classes was robust enough to work even in a real-time setting; the data for the incorrectly predicted classes was adequate for the test images of the ASL Alphabet dataset [9] but did not transfer as well to a real-time setting. The figures below show visual examples of the success and failure cases of our model.
Success:
Class A detected accurately.
Class B detected accurately.
Class D detected accurately.
Class E detected accurately.
Failures:
Class C not predicted correctly.
Class F not predicted correctly.
In this project, we thus demonstrated the importance of appropriate input preprocessing for sign language recognition. We saw that preprocessing alone can raise the accuracy of even the simplest models: feeding raw, naively extracted landmarks to the CNN was detrimental to overall accuracy, while normalized landmarks worked well on both static and real-time images.
In the end, the model is only as good as the data it is trained with. Image preprocessing techniques must be applied with the overall goal of the project in mind.
[1] Nikolas Adaloglou, Theocharis Chatzis, Ilias Papastratis, Andreas Stergioulas, Georgios Th. Papadopoulos, Vassia Zacharopoulou, George J. Xydopoulos, Klimnis Atzakas, Dimitris Papazachariou, and Petros Daras, "A Comprehensive Study on Deep Learning-Based Methods for Sign Language Recognition."
[2] Abdul Mannan, Ahmed Abbasi, Abdul Rehman Javed, Anam Ahsan, Thippa Reddy Gadekallu, Qin Xin, "Hypertuned Deep Convolutional Neural Network for Sign Language Recognition", Computational Intelligence and Neuroscience, vol. 2022, Article ID 1450822, 10 pages, 2022. https://doi.org/10.1155/2022/1450822
[3] Zimmermann C, Brox T (2017) Learning to estimate 3D hand pose from single RGB images. ICCV, Venice, Italy, Oct 2017, pp 4903–4911. http://openaccess.thecvf.com/content_ICCV_2017/papers/Zimmermann_Learning_to_Estimate_ICCV_2017_paper.pdf
[4] MediaPipe Hands. mediapipe. (n.d.). Retrieved October 14, 2022, from https://google.github.io/mediapipe/solutions/hands
[5] https://learn.ml5js.org/#/reference/handpose?id=description
[6] A. Pardasani, A. K. Sharma, S. Banerjee, V. Garg and D. S. Roy, "Enhancing the Ability to Communicate by Synthesizing American Sign Language using Image Recognition in A Chatbot for Differently Abled," 2018 7th International Conference on Reliability, Infocom Technologies and Optimization (Trends and Future Directions) (ICRITO), 2018, pp. 529-532, doi: 10.1109/ICRITO.2018.8748590.
[7] D S, L.; Raj, N. Sign Language Recognition Using Hand Gestures. In 2021 Fifth International Conference on I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), Palladam, India, Nov. 11-13, 2021; IEEE, 2021; pp 968–971.
[8] Rastgoo, R., Kiani, K. & Escalera, S. Real-time isolated hand sign language recognition using deep networks and SVD. J Ambient Intell Human Comput 13, 591–611 (2022). https://doi.org/10.1007/s12652-021-02920-8
[9] Akash Nagaraj. (2018). ASL Alphabet. Retrieved November 18, 2022 from https://www.kaggle.com/grassknoted/aslalphabet
[10] Zhang, F., Bazarevsky, V., Vakunov, A., Tkachenka, A., Sung, G., Chang, C., & Grundmann, M. (2020). MediaPipe Hands: On-device Real-time Hand Tracking. ArXiv, abs/2006.10214.