Rammohan Ch, Shruti V Phadke, Subhashree R
Fall 2016: ECE 5554/4554 Computer Vision: Class Project
Virginia Tech
Fight detection in multimedia sources and videos can be extremely useful in several applications such as video surveillance, parental control, etc. In this work, different approaches for feature extraction, training, and classification are used to detect violence, particularly physical fights, in videos. The performance of these approaches is then compared.
Results show that extracting Convolutional Neural Network (CNN) features from the training set and using them to train a Support Vector Machine (SVM) classifier gave 96.3% classification accuracy on the testing set. In addition, using those CNN features to train a K-Nearest Neighbor (KNN) classifier with 5 neighbors and a Decision Trees classifier gave 91.55% and 67.6% accuracy, respectively. On the other hand, classifying the entire dataset using a pretrained CNN classifier gave 71.66% classification accuracy. Moreover, extracting Spatio-Temporal Interest Point (STIP) features and using them to train an SVM classifier gave 55.94% classification accuracy, and the Augmented Affine Trajecton with an SVM classifier gave 61.88% accuracy.
Human action recognition has been one of the prime areas of research in computer vision. However, most research has focused on simple actions such as running, walking, and jogging, while the detection of fights or aggressive behaviour has received less attention. Detecting fights from multimedia sources or video streams plays an important role in ensuring safety and can be extremely useful in several applications such as video surveillance and parental control.
Various methods have been proposed in the literature for detecting space-time interest points and describing local video patches in human action recognition, such as Convolutional Neural Network (CNN) features, Spatio-Temporal Interest Points (STIP), gradient descriptors, and HOG features.
Space-time features capture local events in a video and can be adapted to represent the frequency and velocity of moving patterns, which makes the representation robust to dynamic transformations in the video. A bag-of-words model is built over these features, and an SVM is finally used to classify fights.
In this method, a trajectory-based human action recognition approach is implemented that captures the relevant spatial and temporal relationships. Trajectories are obtained by extracting SURF features and tracking them across consecutive frames, and they are combined with a bag-of-words model. A method for extracting the affine transformation between frames is implemented, which serves as a representation of the change in temporal features. Finally, a support vector machine is used to classify fights.
In this work, CNN features and CNN classifiers are used. Convolutional Neural Networks (CNNs) are biologically inspired variants of Multilayer Perceptron (MLP) models. Their building blocks are neurons with learnable weights and biases. Each neuron receives some inputs, performs a dot product, and optionally follows it with a nonlinearity. The whole network still expresses a single differentiable score function from the raw image pixels on one end to class scores at the other [1]. A CNN arranges its neurons in three dimensions: width, height, and depth. Every layer of a CNN transforms a 3D input volume into a 3D output volume of neuron activations [2].
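As a concrete illustration of the neuron-level computation described above, the following minimal NumPy sketch shows a single convolutional neuron computing a dot product over a local 3D input patch followed by a ReLU nonlinearity; the array sizes are illustrative assumptions, not the dimensions used in this project.

import numpy as np

# Illustrative 3D input volume: height x width x depth (e.g., an RGB patch)
input_volume = np.random.rand(32, 32, 3)

# One convolutional neuron: a 5x5x3 filter with learnable weights and a bias
weights = np.random.randn(5, 5, 3)
bias = 0.1

# Response at the top-left spatial location: dot product + ReLU nonlinearity
patch = input_volume[0:5, 0:5, :]
activation = np.maximum(0.0, np.sum(patch * weights) + bias)

# Sliding the same filter across all spatial positions produces one 2D
# activation map; stacking the maps of many filters yields the 3D output
# volume (width x height x depth) mentioned above.
print(activation)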
In this work, given a video describing an action, the action is classified as fight or non-fight and the performance of different feature extraction, training, and classification techniques is compared.
In this method, STIP points, which capture information about motion in 2D image sequences, are extracted. A second-moment matrix is constructed using spatio-temporal image gradients within a Gaussian neighbourhood of each point. The positions of the features are then found as the local maxima of the Harris function [2].
The spatio-temporal descriptors are calculated by computing the spatio-temporal jets using the normalized vector at every feature point. A bag of visual words is formed using K-means clustering, and hard assignment is used for encoding to produce the training and test data. The training and test data are fed to an SVM classifier, which gave an accuracy of 55.94%.
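The bag-of-visual-words encoding and SVM training steps can be sketched with scikit-learn as follows; the descriptor arrays, vocabulary size, and kernel choice are placeholders rather than the exact settings used in this work.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(all_descriptors, k=50):
    """Cluster local descriptors (e.g., STIP jets) into k visual words."""
    return KMeans(n_clusters=k, random_state=0).fit(all_descriptors)

def encode_video(descriptors, vocab):
    """Hard-assign each descriptor to its nearest word and histogram the counts."""
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # L1-normalized histogram

# descriptors_per_video: one (num_points x descriptor_dim) array per video (placeholder data)
descriptors_per_video = [np.random.rand(50, 162) for _ in range(20)]
labels = np.array([1] * 10 + [0] * 10)   # 1 = fight, 0 = non-fight

vocab = build_vocabulary(np.vstack(descriptors_per_video), k=50)
X = np.array([encode_video(d, vocab) for d in descriptors_per_video])

clf = SVC(kernel="linear").fit(X, labels)   # SVM on bag-of-words histograms
print(clf.predict(X[:2]))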
In this method, each video is sampled at a fixed frame rate, and the initial feature points are obtained using SURF. These features are then tracked using KLT tracking. Inliers are found in consecutive frames, and the affine transformation between the two sets of inliers is estimated. These affine transformations constitute the augmented trajectons, which serve as the extracted features. A bag of words is formed from these features using K-means clustering, and the training and test data are generated. These are fed to an SVM classifier, which gave a classification accuracy of 61.88%.
The advantage of this method over cubical optical flow features is that each trajectory is attached to a particular moving feature. That is, in video derived from the movement of physical bodies through space, a properly tracked feature (and hence its trajectory) automatically gains foreground-background separation. In contrast, histogramming over a cube of optical flow vectors blends the various sources of motion within that cube. This method is also computationally less intensive [3].
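A rough OpenCV sketch of the pipeline described above (SURF initialization, KLT tracking between consecutive frames, and affine estimation from the tracked inliers) is given below; the video path, SURF threshold, and other parameters are placeholders, and the SURF detector requires the opencv-contrib build.

import cv2
import numpy as np

cap = cv2.VideoCapture("video.avi")            # placeholder path
ok, prev = cap.read()
if not ok:
    raise SystemExit("could not open video")
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

# Initialize feature points with SURF (needs opencv-contrib-python)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kps = surf.detect(prev_gray, None)
pts = np.float32([kp.pt for kp in kps]).reshape(-1, 1, 2)

trajectons = []
while True:
    ok, frame = cap.read()
    if not ok or len(pts) == 0:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # KLT tracking of the SURF points into the next frame
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, pts, None)
    good_old = pts[status.ravel() == 1]
    good_new = new_pts[status.ravel() == 1]

    # Robust affine transformation between the two sets of tracked inliers
    if len(good_new) >= 3:
        A, inliers = cv2.estimateAffine2D(good_old, good_new, method=cv2.RANSAC)
        if A is not None:
            trajectons.append(A.flatten())     # 6 affine parameters as the feature

    prev_gray, pts = gray, good_new.reshape(-1, 1, 2)

cap.release()
# The collected trajectons would then be quantized with K-means into a bag of words and fed to an SVM.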
After experimenting with the aforementioned STIP feature extraction and Augmented Affine Trajecton, a CNN-based approach is used to improve the accuracy. Since training a network from scratch on the entire dataset is computationally expensive, 'resnet50_1by2', a pre-trained CNN model, is used. This model was initially trained for binary scene classification [4].
For this method, the dataset is converted into a network-readable lmdb data file, which is then used to create the required binaryproto file containing the mean of all the images. A network architecture is created, along with a custom solver for the proposed model. To reduce the computational cost, a model pretrained on the ImageNet dataset is used and only the final layers are retrained with our data. This trained model is then used to classify the test data.
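The fine-tuning step can be sketched with Caffe's Python interface roughly as follows; the prototxt and weight file names are placeholders standing in for the network, solver, and pre-trained model described above, not the exact files used in this project.

import caffe

caffe.set_mode_cpu()                         # all computations in this work were CPU-only

# The solver prototxt defines the learning rate, iterations, and the train/test
# networks that read the lmdb files and the image-mean binaryproto.
solver = caffe.SGDSolver("solver.prototxt")                   # placeholder file name

# Initialize from the pre-trained weights; layers whose names match keep their
# weights, while the renamed final layers start fresh and are effectively
# retrained on the fight/non-fight data.
solver.net.copy_from("resnet50_1by2_pretrained.caffemodel")   # placeholder file name

solver.step(1000)                            # placeholder number of iterations

# After training, the deploy network can score held-out frames.
net = caffe.Net("deploy.prototxt", "snapshot_iter_1000.caffemodel", caffe.TEST)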
To further improve the classification accuracy, a pre-trained CNN that has been trained on the ImageNet dataset [4] is used to extract CNN features, which are then used to train a classifier that detects fights in videos. Using the Peliculas Fight Action dataset [5], which has 100 fight videos and 100 non-fight videos, 25 frames are extracted from each video, resulting in 5000 fight and non-fight images. These images are labeled and divided into a training dataset of 3000 images and a testing dataset of 2000 images. The CNN features of both datasets are extracted using the aforementioned pre-trained CNN. The CNN features of the training dataset are used to train SVM, KNN, and Decision Tree classifiers, which are then used to classify the 2000 remaining testing images as fight or non-fight. Since each frame is labeled individually, the video-level label is obtained by majority voting: if more than 12 of the 25 frames of a video are labeled as fight, the video is classified as a fight action; if 12 frames or fewer are labeled as fight, the video is classified as a non-fight action. To accelerate both training and prediction, a CNN MATLAB toolbox called MatConvNet [4] is used, and the function cnnPredict() [6] is used to extract the CNN features with the pre-trained CNN model.
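The original pipeline was implemented in MATLAB with MatConvNet, but the classifier training and per-video majority vote can be illustrated in Python with scikit-learn as in the sketch below; the feature arrays and labels are random placeholders standing in for the CNN features returned by cnnPredict().

import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

FRAMES_PER_VIDEO = 25

# Placeholder CNN features: one feature vector per frame (stand-ins for the
# features extracted by the pre-trained model).
train_features = np.random.rand(3000, 4096)
train_labels = np.random.randint(0, 2, 3000)       # 1 = fight frame, 0 = non-fight frame
test_features = np.random.rand(2000, 4096)

classifiers = {
    "svm": SVC(kernel="linear"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "tree": DecisionTreeClassifier(),
}

for name, clf in classifiers.items():
    clf.fit(train_features, train_labels)
    frame_preds = clf.predict(test_features)

    # Video-level decision: a video is a fight if more than 12 of its 25 frames
    # are classified as fight frames.
    video_preds = [
        int(frame_preds[i:i + FRAMES_PER_VIDEO].sum() > 12)
        for i in range(0, len(frame_preds), FRAMES_PER_VIDEO)
    ]
    print(name, video_preds[:5])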
In this work, the Peliculas Fight Action dataset is used [6]. This dataset contains 100 videos of fight sequences and 100 videos of non-fight sequences. Several platforms were used for this work, such as Caffe, Theano, and MATLAB. Accuracy and confusion matrices are used to evaluate the performance of the proposed feature extraction and classification techniques.
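Both evaluation metrics can be computed directly, for example with scikit-learn as in the short sketch below; the label arrays are placeholders.

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Placeholder ground-truth and predicted video labels (1 = fight, 0 = non-fight)
y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])

print("Accuracy:", accuracy_score(y_true, y_pred))
# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_true, y_pred))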
The results obtained are as follows.

Accuracy = 48.5%
[Confusion matrix figure]

STIP Features with SVM Classifier:
Accuracy = 55.94%
[Confusion matrix figure]

Augmented Affine Trajecton with SVM Classifier:
Accuracy = 61.88%
[Confusion matrix figure]
Pre-trained CNN Classifier:
The classification result is the probability of an image being a fight or non-fight scene. For the final classification of an image as fight or non-fight, a threshold of 0.1 was obtained empirically. [Figure of the network model used for classification]
Accuracy = 71.66%
[Confusion matrix figure]
CNN Feature Extraction with SVM, KNN, and Decision Trees Classifiers:

SVM: Accuracy = 96.3%
[Confusion matrix figure]

KNN: Accuracy = 91.55%
[Confusion matrix figure]

Decision Trees: Accuracy = 67.6%
[Confusion matrix figure]
Results show that extracting CNN features and training an SVM classifier gives the best accuracy, owing to the CNN's ability to learn discriminative features automatically. Moreover, the obtained results generally performed better than a random classifier.
In this work, different techniques for feature extraction, training, and classification are used to detect fight actions in videos, and their performance is compared. Using the various feature descriptors and extractors alone, the detection of fight actions is not very accurate because complex features are combined with simple classifiers. Results show that using CNN features with SVM, KNN, and Decision Trees gives 96.3%, 91.55%, and 67.6% accuracy, respectively. On the other hand, using a pretrained CNN classifier gives 71.66% accuracy. In addition, using STIP features with SVM gives 55.94% accuracy, and using the Augmented Affine Trajecton with SVM gives 61.88% accuracy. The best results were obtained using CNN feature extraction, owing to the CNN's ability to learn discriminative features automatically. Proper fine-tuning of the CNN over more iterations should give higher accuracy and lower training loss. In addition, since all computations are done on CPU only, detecting fight actions in real time is challenging due to computational power and capacity limitations. For future work, the CNN could be trained from scratch on a significantly larger dataset, and audio from the videos could be used to further improve the accuracy.
References: