GSoC21-Ellen-Gestures

Benchmark on the Ellen DeGeneres Show Dataset

Introduction

The Ellen DeGeneres Show dataset consists of 30 videos available on various online streaming platforms, collected by Red Hen Lab. The videos have been independently annotated by professional gesture annotator Yao Tong using the Elan software. Each episode typically features Ellen DeGeneres and US celebrities, including musicians, politicians, and actors/actresses. While speaking, the participants often move their hands in ways that correlate with their speech. This hand movement is a point of interest for our research, both because it relates to speech in many ways and because it can support automated annotation based on hand gestures. This page publishes the dataset and the benchmark obtained by training deep learning models on it. For further documentation, ideas, and resources, visit the project blog at gestures-dataset-blog. The code for hand movement gesture recognition is available at redhen-gesture. This project was completed as part of the Google Summer of Code program at Red Hen Lab.

Dataset

There are a total of 30 videos, shot between 11 November and 23 December 2014. As of August 2020, 18 of the 30 are publicly available on YouTube and 1 on DailyMotion, 2 are private videos on YouTube and 1 on Vimeo, and the remaining 8 are not available on the web. The videos range in length from 138 to 788 seconds, with an average of 346 seconds. All of the complete (full-length original) videos are also available on the UCLA Library Broadcast NewsScape.

To download the videos from the public links using youtube-dl, follow these steps:

  1. Install youtube-dl. Installation instructions are at https://ytdl-org.github.io/youtube-dl/download.html. For example, using pip3:

$ pip3 install --upgrade youtube-dl

  2. Use youtube-dl on the command line:

$ youtube-dl <video-link> -o <output-filename>

This will download the video at the given link to the specified output file. It is recommended to use the same file format for all videos, since in some cases multiple videos need to be concatenated. Many other download options are available; check the documentation.
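One way to keep every download in the same container format (shown here only as an illustration) is to ask youtube-dl for an mp4 explicitly via its format selector:

$ youtube-dl <video-link> -f mp4 -o <output-filename>.mp4

Here -f mp4 tells youtube-dl to prefer a format with the mp4 extension; if no mp4 format is available for a particular video, a different format selector may be needed.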

In some cases there are multiple links available for a given video, or a portion needs to be cut out of a longer video. For both tasks, use FFmpeg. Download it from here, or, if you are on Linux, install it with your package manager (for example, apt on Ubuntu). There are multiple concatenation options available. Check here.
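On Ubuntu, for example, FFmpeg can typically be installed with apt:

$ sudo apt install ffmpeg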

For concatenation:

First, rescale all video files to the same resolution (say, 1280x720):

$ ffmpeg -i <video-file-path> -vf "scale=1280:720" <output-path>

Then concatenate them. The example below is for 2 video files, but more can be added by extending the filter:

$ ffmpeg -i <input-1-path> -i <input-2-path> -filter_complex "[0:v] [0:a] [1:v] [1:a] concat=n=2:v=1:a=1 [v] [a]" -map "[v]" -map "[a]" <output-path>
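For example, with two rescaled clips (hypothetical filenames part1.mp4 and part2.mp4) joined into a single output:

$ ffmpeg -i part1.mp4 -i part2.mp4 -filter_complex "[0:v] [0:a] [1:v] [1:a] concat=n=2:v=1:a=1 [v] [a]" -map "[v]" -map "[a]" joined.mp4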

For cutting out a given time portion (start time and duration are in hh:mm:ss):

$ ffmpeg -i <input-video-path> -ss <start-time> -t <duration> -async 1 <output-video-path>

For example:

$ ffmpeg -i video.mp4 -ss 00:00:05 -t 00:01:05 -async 1 cut.mp4

Benchmark

A benchmark deep learning model was developed on this dataset to identify hand gestures. The model works in two steps. First, OpenPose keypoints are generated for the persons in each video. These keypoints are then parsed into 3D time-series data of shape (number of keypoints x number of persons x window size), which is fed into a model consisting of 2D convolutional LSTM layers (ConvLSTM2D in Keras) and 3D convolutional layers (Conv3D in Keras). The output of the OpenPose step is available in zipped format; to see how OpenPose works, visit its GitHub page. The code is deployed in a Singularity container available on the Case HPC Cluster. The best performance obtained was Accuracy: 0.7207, Precision: 0.7581, and Recall: 0.6684.
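As a rough sketch of the keypoint-extraction step (the binary path and options below assume a standard OpenPose build and are not taken from the project code), the OpenPose demo can write per-frame JSON keypoints for a video without rendering any output:

$ ./build/examples/openpose/openpose.bin --video <input-video-path> --write_json <output-keypoints-dir> --display 0 --render_pose 0

Each processed frame yields one JSON file listing the body keypoints of every detected person; the parsing step then windows these per-frame keypoints into the (keypoints x persons x window-size) tensors described above.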