Introduction:
American Sign Language (ASL) is a natural language that serves as the primary sign language of deaf communities in the United States. ASL is expressed through movements of the hands and face. It is completely separate and distinct from English and has its own linguistic features, such as word order, pronunciation, and word formation.
[1] Parents are often the source of a child's early language acquisition, but for children who are deaf, additional people may serve as models for language acquisition. [2] Unfortunately, ASL is currently a declining form of communication, owing to a lack of educational and technological resources.
[3] Sign language recognition is a problem that has been studied in research for years, yet a complete solution available to society remains out of reach. Most of the work developed to address this problem follows one of two approaches: contact-based systems, such as sensor gloves, or vision-based systems, which use only cameras.
Aim:
To support communication with the deaf community by recognizing the letters of the American Sign Language alphabet using deep learning techniques.
Data:
[4]The original MNIST image dataset is made from handwritten digits and is commonly used for image-based machine learning methods.
Sign Language MNIST is patterned to closely match the classic MNIST.
Each training and test case carries a label (0-25) that maps one-to-one to an alphabetic letter A-Z (with no cases for 9=J or 25=Z, since those letters involve motion).
[5] The training data (27,455 cases) and test data (7,172 cases) are approximately half the size of the standard MNIST splits. The dataset occupies about 100 MB.
The data is stored as CSV files in which each row contains a label followed by pixel1, pixel2, ..., pixel784, together representing a single 28x28 grayscale image with pixel values between 0 and 255.
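As a rough illustration, the CSVs can be loaded and reshaped with pandas; the file names sign_mnist_train.csv and sign_mnist_test.csv follow the Kaggle download and are assumptions here, not paths from the report.

    import pandas as pd

    # File names follow the Kaggle download (adjust paths to your copy).
    train = pd.read_csv("sign_mnist_train.csv")
    test = pd.read_csv("sign_mnist_test.csv")

    # The first column is the label; the remaining 784 are pixel values.
    y_train = train["label"].values
    X_train = train.drop(columns=["label"]).values.reshape(-1, 28, 28, 1)
    y_test = test["label"].values
    X_test = test.drop(columns=["label"]).values.reshape(-1, 28, 28, 1)

    print(X_train.shape)  # (27455, 28, 28, 1)
    print(X_test.shape)   # (7172, 28, 28, 1)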
Methodology:
This is an image classification problem: the input is an image containing a hand gesture, and the output is the corresponding letter as text.
The algorithm used here is a Convolutional Neural Network (CNN), which has two parts: the first consists of convolutional layers with max-pooling layers, and the second consists of dense layers.
Evaluation Metrics: confusion matrix, precision, recall, F1 score
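A minimal sketch of computing these metrics with scikit-learn; the y_true and y_pred arrays below are stand-ins for illustration, not the report's actual labels or predictions.

    from sklearn.metrics import classification_report, confusion_matrix

    # Stand-in labels and predictions, purely for illustration.
    y_true = [0, 1, 2, 2, 1, 0]
    y_pred = [0, 2, 2, 2, 1, 0]

    print(confusion_matrix(y_true, y_pred))       # rows: true class, cols: predicted
    print(classification_report(y_true, y_pred))  # precision, recall, F1 per class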
Data Preparation:
The Sign Language MNIST dataset provides separate train and test sets. Each is divided into features (pixel values) and target values (labels).
The values in the target/label column are converted to categorical (one-hot) variables, with each unique value treated as a class; the labels run from 0 to 24 (with 9 unused), so the encoded target has 25 classes.
Finally, the pixel values are normalized for faster training and testing of the model (see the sketch below).
The resulting dataset is clean, with no missing values.
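A short sketch of these preparation steps, assuming the X_train, X_test, y_train, and y_test arrays from the loading step above.

    from tensorflow.keras.utils import to_categorical

    # One-hot encode the labels; labels run 0-24 (9 unused), so 25 columns.
    num_classes = 25
    y_train_cat = to_categorical(y_train, num_classes)
    y_test_cat = to_categorical(y_test, num_classes)

    # Scale grayscale values from 0-255 down to 0-1 for faster convergence.
    X_train = X_train.astype("float32") / 255.0
    X_test = X_test.astype("float32") / 255.0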
Data Distribution:
Distribution of the classes in the target column; each label maps one-to-one to an alphabetic letter.
Visual representation of a label:
The pixel values are reshaped into a 28x28 array and rendered in grayscale to show what the image looks like. The adjacent image shows the visual representation of the letter 'D'.
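A minimal sketch of this visualization with matplotlib, assuming the X_train and y_train arrays prepared above.

    import matplotlib.pyplot as plt

    # Reshape one row of pixel values into a 28x28 grid and render it.
    sample = X_train[0].reshape(28, 28)
    plt.imshow(sample, cmap="gray")
    plt.title(f"label = {y_train[0]}")
    plt.axis("off")
    plt.show()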
Understanding the CNN Model:
The problem uses a Convolutional Neural Network (CNN), a deep learning architecture commonly used for image classification problems.
The model is a Keras CNN with 4 layers: the first 2 are convolutional layers and the following 2 are fully connected (dense) layers.
This model uses the ReLU and softmax activation functions. [6] ReLU activations are used in CNNs to mitigate the vanishing gradient problem, in which gradients become so small that they have little effect on learning.
[7] Softmax is used in the output layer so that the outputs satisfy the constraints of a probability distribution, with all probabilities summing to 1.
Additionally, this model uses categorical cross-entropy as its loss function, which is commonly used for multi-class classification tasks. [8] It applies when an example can belong to only one of many possible categories and the model must decide which one.
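A minimal sketch of this architecture in Keras; the filter counts and dense-layer width are illustrative assumptions, not the report's exact values.

    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import Conv2D, Dense, Flatten, MaxPooling2D

    model = Sequential([
        # Part 1: convolutional layers with max pooling.
        Conv2D(32, (3, 3), activation="relu", input_shape=(28, 28, 1)),
        MaxPooling2D((2, 2)),
        Conv2D(64, (3, 3), activation="relu"),
        MaxPooling2D((2, 2)),
        Flatten(),
        # Part 2: fully connected (dense) layers.
        Dense(128, activation="relu"),
        Dense(25, activation="softmax"),  # one output unit per class
    ])

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()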
Results:
The model achieved about 94 percent accuracy.
R is the most misclassified letter, as determined by its F1 score.
There is a risk of overfitting.
[Figures: sample gestures for the letters R and V.]
Comparison of MNIST Sign Language Dataset and Significant ASL Dataset:
Data:
[9]The dataset contains 77,554 images of all letters and space.
The directory contains one folder per letter, plus an additional folder for space; each folder is treated as a class.
The dataset is divided into training and validation sets using an 80-20 validation split.
The dataset is loaded and preprocessed using Keras's dataset preprocessing utilities (see the sketch below).
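A rough sketch of this split using Keras's image_dataset_from_directory (TensorFlow 2.x); the directory name asl_dataset and the image size are assumptions for illustration.

    import tensorflow as tf

    # "asl_dataset" stands in for the image directory, one folder per class.
    def make_split(subset):
        return tf.keras.utils.image_dataset_from_directory(
            "asl_dataset",
            validation_split=0.2,   # 80-20 split
            subset=subset,
            seed=42,                # same seed so the two subsets don't overlap
            image_size=(64, 64),    # illustrative target size
            batch_size=32,
        )

    train_ds = make_split("training")
    val_ds = make_split("validation")
    print(train_ds.class_names)  # folder names become the class labels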
Significant ASL Dataset:
The dataset is large, containing about 77,000 images.
It covers the gestures from many angles and against various backgrounds.
Additionally, the dynamic signs (J and Z) are also covered.
SIGN MNIST Dataset:
The dataset is smaller, with roughly 34,600 cases of raw pixel values (27,455 train + 7,172 test).
The dataset lacks the dynamic signs (J and Z).
None of the images are tilted, and each shows the gesture from only one angle.
Understanding the CNN Model:
As in the previous experiment, the problem is addressed with a Convolutional Neural Network (CNN), a deep learning architecture commonly used for image classification problems.
Pooling layers such as max pooling are used to reduce the dimensions of the feature maps; a max-pooling layer selects the maximum value within each region of the feature map that it covers.
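A tiny numeric demo of this operation: each 2x2 region of a 4x4 feature map is replaced by its maximum, halving both spatial dimensions (the values are made up for illustration).

    import numpy as np
    import tensorflow as tf

    # A 4x4 single-channel "feature map" (batch of 1).
    fmap = np.array([[1, 3, 2, 4],
                     [5, 6, 1, 2],
                     [7, 2, 9, 0],
                     [1, 4, 3, 8]], dtype="float32").reshape(1, 4, 4, 1)

    pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(fmap)
    print(pooled.numpy().reshape(2, 2))
    # [[6. 4.]
    #  [7. 9.]]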
Like the earlier model, this CNN uses the ReLU and softmax activation functions and categorical cross-entropy as the loss function.
Results:
This model gave an accuracy of 70 percent.
The letters J and Z, which are dynamic signs, are still predicted incorrectly.
Sometimes these dynamic signs are predicted with low confidence.
[Figures: Sign Language MNIST dataset and Significant ASL dataset.]
References:
https://www.wnycstudios.org/podcasts/takeaway/segments/148707-american-sign-language-threatened
https://towardsdatascience.com/sign-language-recognition-using-deep-learning-6549268c60bd
https://www.kaggle.com/datamunge/sign-language-mnist?select=sign_mnist_test
https://github.com/mon95/Sign-Language-and-Static-gesture-recognition-using-sklearn
https://machinelearningmastery.com/softmax-activation-function-with-python/
https://www.kaggle.com/kuzivakwashe/significant-asl-sign-language-alphabet-dataset