Using YOLOv5 and YOLOv4
Group Members: Daksh Sinha, Aadhar Chauhan
American Sign Language (ASL) is a natural language that serves as the primary sign language of deaf communities in the United States. The goal of our project is to recognize ASL hand gestures with computer vision techniques, helping bridge the communication gap between the deaf community and hearing people. We used transfer learning from two model architectures, YOLOv5 and YOLOv4; their performance, as well as the challenges we faced while training, is documented below.
Since the YOLO models need annotated images for training, we used this annotated dataset from Roboflow, which contains 1728 images of ASL letters. Every image has a bounding box drawn around the hand gesture, and the dataset incorporates several data augmentation techniques: flipped images, different backgrounds, lighting changes, grayscale versions, etc. The label for each image contains 5 values: the class index and the 4 bounding box coordinates.
Multiple representations of the ASL letter 'A'
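For reference, each label file is plain text in the standard YOLO format, where the 4 box coordinates are the center x, center y, width, and height normalized to [0, 1]. Below is a minimal sketch of reading one such file; the file name and the assumption that classes 0-25 map to A-Z are illustrative, not taken from our exact export.

```python
# Minimal sketch of reading a YOLO-format label file (file name is a placeholder).
# Each line holds five space-separated values: class index, then the box's
# center x, center y, width, and height, all normalized to [0, 1].
from pathlib import Path

CLASS_NAMES = [chr(ord("A") + i) for i in range(26)]  # assuming classes 0-25 map to A-Z

for line in Path("A1_sample_label.txt").read_text().splitlines():
    cls, cx, cy, w, h = line.split()
    print(f"letter={CLASS_NAMES[int(cls)]}  "
          f"center=({float(cx):.2f}, {float(cy):.2f})  "
          f"size=({float(w):.2f}, {float(h):.2f})")
```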
We used transfer learning from YOLOv5 and YOLOv4 models. These models are widely regarded as state of the art in object detection, and their publicly released weights were obtained by training on the COCO dataset.
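As a quick sanity check of the pretrained starting point, YOLOv5 can be loaded directly through torch.hub; this is only a sketch, and the image path is a placeholder.

```python
# Load a COCO-pretrained YOLOv5s model via torch.hub and run it on one image.
import torch

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)
results = model("path/to/sample.jpg")   # placeholder image path
results.print()                         # prints detected classes, confidences, counts
```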
The training process involved changing the models' configuration files to reflect the number of classes in our dataset. We also experimented with hyperparameters such as learning rate, batch size, and weight decay to speed up convergence. In the end, the default hyperparameters turned out to work best.
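For concreteness, here is a rough sketch of the YOLOv5 setup: a dataset YAML points at the Roboflow export with nc set to 26, and training is launched from COCO-pretrained weights. The paths, image size, and batch size below are illustrative placeholders, not necessarily our exact values, and the sketch assumes the ultralytics/yolov5 repo has been cloned.

```python
# Sketch of pointing YOLOv5 at the 26-class ASL dataset and launching training.
# Assumes the ultralytics/yolov5 repo is cloned into ./yolov5; paths are placeholders.
import string
import subprocess
from pathlib import Path

data_yaml = f"""\
train: ../asl_dataset/train/images
val: ../asl_dataset/valid/images
nc: 26
names: {list(string.ascii_uppercase)}
"""
Path("yolov5/data/asl.yaml").write_text(data_yaml)

# With no --hyp override, YOLOv5 uses its default hyperparameters
# (lr0=0.01, momentum=0.937, weight_decay=0.0005), which we ended up keeping.
subprocess.run(
    ["python", "train.py",
     "--img", "416",              # image size (placeholder)
     "--batch", "16",             # batch size (placeholder)
     "--epochs", "100",
     "--data", "data/asl.yaml",
     "--weights", "yolov5s.pt"],  # start from COCO-pretrained weights
    cwd="yolov5", check=True)
```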
Testing is where we spent most of our time. It was hard to integrate webcam capture with Colab, and our personal machines didn't have GPUs powerful enough to train for a reasonable number of epochs. We ended up training on Colab (exhausting the GPU limits fairly quickly), downloading the weights, and using them in a detection script that ran on a local machine with webcam access. The results from training and a webcam demo are shown below.
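For reference, a stripped-down version of such a local detection loop might look like the following; this is a sketch that assumes the trained weights were downloaded from Colab as best.pt, and our actual script may differ.

```python
# Run the Colab-trained weights against a local webcam feed with OpenCV.
import cv2
import torch

# 'custom' loads user-trained weights; best.pt is the checkpoint downloaded from Colab.
model = torch.hub.load("ultralytics/yolov5", "custom", path="best.pt")

cap = cv2.VideoCapture(0)                      # default webcam
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = model(rgb)                       # run detection on the frame
    annotated = results.render()[0]            # frame with boxes and letter labels drawn
    cv2.imshow("ASL detection", cv2.cvtColor(annotated, cv2.COLOR_RGB2BGR))
    if cv2.waitKey(1) & 0xFF == ord("q"):      # press q to quit
        break

cap.release()
cv2.destroyAllWindows()
```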
For YOLOv5
Hyperparameters: learning rate = 0.01, momentum = 0.937, weight decay = 0.0005
Besides the implementation challenges around GPU limits and webcam integration, these are some of the issues we ran into:
The model struggles when the background is noisy. To fix this, we planned to capture images of the room we were recording in and annotate them manually, but could not do so due to time constraints.
There are some faulty detections, as seen in the video. For example, 'B' often gets detected as 'F' because the two letters look similar in the ASL alphabet. Also, 'J' and 'Z' involve motion of the hand, so we would probably need video input and some form of optical flow to detect these letters reliably (a rough sketch of that idea follows below).
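We did not implement this, but one possible starting point for the motion-based letters would be dense optical flow between consecutive webcam frames, e.g. with OpenCV's Farnebäck method; the flow field around the detected hand could then feed a small classifier for 'J' and 'Z'.

```python
# Sketch (not implemented in our project): dense optical flow between consecutive
# webcam frames; the per-pixel motion field could help distinguish 'J' and 'Z'.
import cv2

cap = cv2.VideoCapture(0)
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Farnebäck dense optical flow: per-pixel (dx, dy) motion between frames.
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    print(f"mean motion magnitude: {magnitude.mean():.3f}")
    prev_gray = gray

cap.release()
```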
As of this moment, YOLOv4 is still training on the Colab CPU (we've exhausted the GPU quota). We'll upload its metrics soon, although its mAP@0.5 IoU after 100 epochs is already worse than YOLOv5's.
This work could be extended to cover more ASL words and expressions, making it a more comprehensive and viable mode of communication between the deaf community and hearing people. Similar work could be done for other sign languages.
We also hope to develop a speech-to-text-to-ASL model in the future and plan to take an NLP course to support that work.
The Colab notebook for our project can be found here!
YOLOv5 repository: https://github.com/ultralytics/yolov5
YOLOv4 Darknet: https://github.com/AlexeyAB/darknet