Spring 2020

Title: Automatic Detection of Violence in Video Scenes

Team members:

Institute:

Egypt-Japan University of Science and Technology

Abstract

Violence detection from video streams using machine learning is a rapidly growing field of research, owing to its contribution to public safety: automatically detecting violent acts and alerting the responsible authorities to intervene can save lives. Given its significance, we designed and implemented Convolutional Neural Networks (CNNs) to tackle this problem. The input to our network is a packet of 15 sampled frames spanning one second of video, so the CNN operates on a 3D volume of frames, and the problem is posed as binary classification. Using supervised learning, we trained the model on four datasets of normal and violent videos: one that we assembled ourselves by collecting and annotating YouTube videos and mixing them with a dataset found on Kaggle (which we also filtered ourselves), plus three other ready-made datasets. The model was then tested using three separate classifiers, whose results are compared with one another and with other approaches. Finally, we applied transfer learning by cross-validating models that were trained on one dataset and tested on another.
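To make the packet setup concrete, the following is a minimal sketch (in Keras, chosen here purely for illustration) of a 3D CNN that consumes one 15-frame packet and outputs a violence probability. The 112x112 frame size and the layer widths are assumptions for the example, not the exact architecture used in this project.

import tensorflow as tf
from tensorflow.keras import layers, models

def build_packet_classifier(frames=15, height=112, width=112, channels=3):
    # One "packet" = 15 sampled frames covering one second of video,
    # stacked into a 3D volume of shape (frames, height, width, channels).
    model = models.Sequential([
        layers.Input(shape=(frames, height, width, channels)),
        layers.Conv3D(32, kernel_size=3, activation="relu"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),   # pool spatially, keep time
        layers.Conv3D(64, kernel_size=3, activation="relu"),
        layers.MaxPooling3D(pool_size=(1, 2, 2)),
        layers.GlobalAveragePooling3D(),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),      # binary: violent vs. normal
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

Transfer learning as described in the abstract then amounts to building this model, loading the weights trained on one dataset, and continuing training on another.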

Data Sample before Preprocessing.

Data Sample after Preprocessing.


Architecture of the CNN Classifier.

Accuracy on combined data.

Recognition accuracy on each dataset.

ROC curve for Hockey data.

ROC curve for Movies data.

ROC curve for Combined data.

ROC curve for Crowd data.


Comparisons of different approaches.

Training on Hockey dataset with weights from Movies Fight model.

Training on Movies dataset with randomly initialized weights.

Training on Movies dataset with weights from Hockey Fight model.

Conclusion

In this project, we explored CNNs for automatic detection of violence in video scenes using different datasets. We concluded that, even though the amount of data in each dataset is not huge, an end-to-end approach using CNNs for feature extraction can achieve results as good as those of complicated hand-engineered features. Transfer learning also proved to have a significant effect on training speed for the violence-detection problem. In future work, we plan to investigate further methods to achieve better results. One idea is to turn the task into a multi-class classification problem by introducing an intermediate class, besides normal and violent, covering other behaviors that involve a lot of movement, since these are unusual on a typical surveillance camera. Using sequence models is another approach likely to yield good results, as videos are sequential data. The transformer model, which made a breakthrough in natural language processing (NLP), is now being introduced to computer vision problems, and it would be interesting to test its performance on violence detection.

Title: Activity With Gender Recognition Using Accelerometer and Gyroscope

Team members:


Institute:

Egypt-Japan University of Science and Technology


Abstract

Recently, the use of inertial measurement units (IMUs), especially gyroscope and accelerometer sensors, has increased in human activity recognition (HAR) due to the growing use of smartwatches and smartphones. Beyond the high quality and efficiency these sensors offer, they capture the body's dynamic motion as a function of time; the resulting data stream is then analyzed and processed to classify and predict the action being performed, as well as the gender, health status, and many other characteristics of the subject. Gender and activity recognition have been studied deeply in recent years through many interfaces, such as voice, images, or inertial motion data. These classifications are crucial in many applications, including recommendation systems, speech recognition, sports tracking, security, and most importantly healthcare. In this research, we present a model based on the MoVi dataset that predicts gender together with activity and examines how each activity reflects on gender, using only two IMU sensors on the right and left hands, and we explore the efficiency of using the autocorrelation function as a feature extractor along with a random forest classifier.
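As a minimal sketch of the described pipeline, the code below computes per-channel autocorrelation features from an IMU window and feeds them to a random forest. The window shape, number of lags, and label format are illustrative assumptions, not the exact settings used in this work.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def autocorr_features(window, n_lags=50):
    # window: (timesteps x channels) IMU segment.
    # Returns per-channel autocorrelation values for lags 1..n_lags,
    # flattened into a single feature vector.
    feats = []
    for ch in window.T:                       # iterate over sensor channels
        ch = ch - ch.mean()
        denom = np.dot(ch, ch) + 1e-12        # zero-lag energy as normalizer
        feats.extend(np.dot(ch[:-lag], ch[lag:]) / denom
                     for lag in range(1, n_lags + 1))
    return np.array(feats)

# Hypothetical usage: X_windows is a list of (timesteps x channels) arrays,
# and y holds joint activity-with-gender labels, e.g. "walking_female".
# X = np.stack([autocorr_features(w) for w in X_windows])
# clf = RandomForestClassifier(n_estimators=100).fit(X, y)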

20 samples representing data as genders only.

5 samples representing data as activities only.

Summary of the results.

Normalized confusion matrix for activities only.

Normalized confusion matrix for gender only.

Normalized confusion matrix for all combinations.

Conclusion

In this work, we introduced a new classification method to predict both activity and gender for humans. Such a method can be crucial for recognition systems and healthcare services. We used the MoVi dataset to extract four activities (walking, running, clapping, and waving) for both genders from 80 different subjects. The data come from only two IMU sensors, placed on the right and left hands, each providing 3-axis accelerometer, gyroscope, and magnetometer readings. We used the autocorrelation function as a feature extractor and a random forest as the classifier. We ran three experiments: one to predict activity, one to predict gender, and one to predict both together. All three showed good results with high computational performance, as the method proved not to demand substantial resources. The results are therefore promising and encourage implementation on today's consumer devices, such as smartwatches.

In the future, we seek to explore other classifiers with high potential for better results, targeting Convolutional Neural Networks (CNNs) and Support Vector Machines (SVMs) alongside the random forest. We also plan to increase the number of activities and the number of sensors used, to measure the effect of each on performance.

Title: Sensor Position Detection in Human Activity Recognition (HAR)

Team members:

Institute:

Egypt-Japan University of Science and Technology

Abstract

Human activity recognition has gained tremendous momentum in recent years. This is due to the increasing ubiquity of all types of sensors in commodity devices such as smartphones, smartwatches, and tablets, which has made available to the ordinary user a continuous stream of data including visual data, inertial motion data, audio, and more. In this work we focus on data streamed from inertial measurement units (IMUs), which are currently embedded in almost all wearable devices, including smartwatches and wrist bands. In many research works, as well as in many real applications, specialized IMU units are mounted on different body parts. Here we try to answer the following question: given the streamed inertial signals of a gait pattern, as well as some other activities, determine which sensor location on the subject's body generated the signal. We validate our work on several datasets that contain multi-dimensional measurements from a multitude of sensors mounted on different body parts; the main sensors used are the accelerometer and gyroscope. We use a random forest classifier over the raw data without any prior feature extraction, which has nevertheless proven very effective, as evidenced by results across different metrics including accuracy, precision, recall, and F1-score. An important application of this research is data augmentation of time-series inertial data. It can also serve healthcare applications, for example treatment assessment for people with motion disabilities.
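The following is a minimal sketch of the described setup: a random forest trained directly on raw, flattened IMU windows to predict the sensor's body location. The window shape, label names, and split parameters are illustrative assumptions, not the exact configuration of this work.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def train_position_classifier(windows, positions):
    # windows: (n_samples, timesteps, 6) raw accelerometer + gyroscope segments
    # positions: sensor-location label per window, e.g. "wrist", "ankle", "chest"
    X = windows.reshape(len(windows), -1)     # raw samples, no feature extraction
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, positions, test_size=0.2, stratify=positions, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    # Per-class precision, recall, and F1-score, as reported in the abstract.
    print(classification_report(y_te, clf.predict(X_te)))
    return clf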

The contributions of this work are as follows:

Outline of the proposed framework.

Sensor locations in the four datasets considered for experimentation.

Confusion matrices.

Classification metrics.