Automatic Video Surveillance

Violence Detection and Recognition

The world is witnessing a vast increase in crime rates, accompanied by severe losses of lives and property. Surveillance cameras are present everywhere. However, human supervision is becoming inefficient and incapable of detecting violent acts early, due to the diversity and unpredictability of violent scenarios. Accordingly, contributing to automatic violence prevention is becoming exceedingly important. To meet this urgent demand, we turn to Machine Learning to help detect and classify violent events in video streams. In this work, we propose a framework for detecting violence in video captures and streams, followed by categorizing the violent act when the video clip is classified as violent. Supervised Learning is applied to both problems: the binary detection problem and the multi-class violence classification problem. The detection model relies on 3D Convolutional Neural Networks. The classification model uses the pre-trained Inception-v3 model for feature extraction, followed by Gated Recurrent Units (GRUs) for temporal processing. The models are trained on multiple datasets for which frame-level annotations are available. We deliberately use videos from various sources, such as surveillance cameras, human recordings, movies, and public websites such as YouTube, to demonstrate the effectiveness of our models on different data sources. Furthermore, Transfer Learning from pre-trained models is applied: each model trained on one of the datasets is lightly re-trained on a different one, and in most cases it outperforms the original model in terms of both computational resource demands and accuracy.
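The two-stage flow described above (detect first, classify only when the clip is flagged as violent) can be sketched as follows. This is a minimal illustration, not the actual implementation: `analyze_clip`, the `threshold` value of 0.5, and the toy stand-in models are all assumptions introduced here, and the real detector and classifier would be the trained 3D CNN and Inception-v3 + GRU models.

```python
from typing import Callable, Optional, Sequence

# Hypothetical alias: a clip is any sequence of frames.
Clip = Sequence[object]


def analyze_clip(
    clip: Clip,
    detect: Callable[[Clip], float],
    classify: Callable[[Clip], str],
    threshold: float = 0.5,  # assumed decision threshold, not from the source
) -> Optional[str]:
    """Return the violence category of a clip, or None if it is non-violent.

    `detect` stands in for the binary detector (probability of violence) and
    `classify` for the multi-class model; both are placeholders.
    """
    if detect(clip) < threshold:
        return None        # non-violent: the classifier is never run
    return classify(clip)  # violent: ask the second model for the category


# Toy stand-ins so the sketch runs without any trained model.
def fake_detector(clip: Clip) -> float:
    return 0.9 if "punch" in clip else 0.1


def fake_classifier(clip: Clip) -> str:
    return "Fighting"


print(analyze_clip(["walk", "punch"], fake_detector, fake_classifier))
print(analyze_clip(["walk", "talk"], fake_detector, fake_classifier))
```

Running the classifier only on flagged clips keeps the more expensive multi-class model off the (typically much larger) non-violent portion of a stream.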

The task of violence detection and classification is not a trivial one as it incorporates a myriad of arbitrary factors due to its psychological and sociological nature. Often the source of the video being analyzed dictates different characteristics and parameters to be considered in the violence detection and classification models implemented. For instance:

  • In the case of detecting violence through surveillance cameras, the abundance of cameras nowadays makes this a perfectly valid method for detecting violence in streets, shopping malls, educational institutes, companies, etc. However, the key problem is the need for workers to continuously monitor the footage in real time in order to instantly detect any violent act taking place, or even predict that such an act is about to occur. In this situation, it is much more efficient to provide an automatic violence detector capable of processing the video data in real time while recording. Moreover, it should be capable of classifying violent acts in order to help determine their level of ferocity and consequently aid in swiftly carrying out the appropriate response.

  • In the case of movies, violence detection and classification can be considered a slightly more well-defined task, since films use known and easy-to-identify cinematic tropes for action-packed scenes. These can take the form of intense camera shaking, constantly switching camera angles, suspenseful background music, etc. Also, the scenes naturally focus only on the actions of interest, and high-quality sound effects are readily apparent, as opposed to surveillance footage, for example.

  • In the case of streaming services, violence must be detected to ensure that videos adhere to the terms of use and service, which require removing any brutal videos that break the rules or, at the very least, restricting access to them by age to protect children. Also, multi-class violence classification is necessary for determining the genres of videos, which can be further used in video recommendation systems or simply for documentation purposes.

Flow diagram of our system accompanied with the pre-processing done on videos using frame-level annotations.

Categories across all of the used datasets:

  • UCF: UCF-Crime --> Surveillance footage that has 13 classes of violence.

  • XD: XD-Violence --> A dataset collected from both movies and YouTube videos (in-the-wild scenes).

  • LAD: LAD2000 --> An anomaly detection dataset with 14 classes of anomalies.


Violence detection model architecture.

Multi-class violence classification model architecture.

Violence detection results: the diagonal numbers show the results of self-testing, whereas the off-diagonal numbers show the results of cross-testing using transfer learning.

Violence classification results: the diagonal numbers show the results of self-testing, whereas the off-diagonal numbers show the results of cross-testing using transfer learning. We notice that in some cases transfer learning from other datasets improves the results over the baseline (the diagonal numbers).


Violence classification results with half of the fine-tuning used for the preceding table. A slight decrease in cross-testing classification performance is therefore natural in some cases; in other cases, no performance degradation is observed.

In the following, we give the confusion matrices for violence classification on each of the three datasets used in the empirical study in this work.

Confusion matrix for the UCF-Crime dataset.

Confusion matrix for the XD-Violence dataset.

Confusion matrix for the LAD2000 dataset.

This work aimed to contribute to preventing violence in our daily life by proposing two Machine Learning models for Violence Detection and Classification. These models are trained on three benchmark datasets: UCF-Crime, LAD2000, and XD-Violence. Our results show that our models can be used on video data from various sources and achieve accurate performance. We then applied Transfer Learning by taking a model pre-trained on one dataset and re-training it on a different dataset for fewer epochs. Our work benefits from Transfer Learning through reduced training time, with better accuracy in some cases, as for the XD-Violence dataset, and slightly lower accuracy in others, as for the UCF-Crime and LAD2000 datasets.

We plan to extend our framework to include spatial and temporal localization of violent acts in videos, after categorizing them. We can then merge the three models, providing a real-time tool for use in surveillance systems and other settings. Regarding the classification task, we plan to consider the multi-label violence classification problem, in which a clip can belong to several different categories rather than a single type. One advantage to exploit is that some of the datasets used, such as the XD-Violence dataset, contain videos that belong to several categories.

Merging all datasets into one and applying our models to the result is one direction to try. We also investigated two other datasets that we plan to include in this combined dataset. Similar to UCF-Crime, the ‘CCTV-Fights’ dataset is collected from surveillance footage, but it has only one class of violence, which is fighting. The other is the ‘VSD2014’ dataset, which consists of 32 movies of different genres and 32 short web videos collected from YouTube. Combining all datasets also has the advantage of mitigating the category imbalance problem, which can help achieve better classification results.

In our experiments for violence detection, we extracted the non-violent parts from the videos that contained violent scenes and labelled them as normal clips. To enhance the generalization of our detection model, we plan to add more normal clips by using the completely normal videos provided in some datasets. Lastly, we intend to try other feature extractors and classical classifiers such as Support Vector Machines and Random Forests, in addition to state-of-the-art architectures and techniques such as transformers, which have recently shown huge advancements, particularly in the field of natural language processing.
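The extraction of normal clips from partially violent videos can be illustrated with a small sketch. It assumes that frame-level annotations are available as inclusive, non-overlapping `(start_frame, end_frame)` intervals marking the violent parts; the function name `split_by_annotation` and this annotation format are assumptions made for illustration, not the format of any particular dataset.

```python
from typing import List, Tuple

Interval = Tuple[int, int]       # inclusive (start_frame, end_frame)
Segment = Tuple[int, int, str]   # (start_frame, end_frame, label)


def split_by_annotation(num_frames: int, violent: List[Interval]) -> List[Segment]:
    """Split frames 0..num_frames-1 into labelled segments.

    `violent` holds inclusive, non-overlapping intervals annotated as violent;
    every gap before, between, or after them becomes a "normal" segment.
    """
    segments: List[Segment] = []
    cursor = 0  # first frame not yet assigned to a segment
    for start, end in sorted(violent):
        if start > cursor:
            segments.append((cursor, start - 1, "normal"))
        segments.append((start, end, "violent"))
        cursor = end + 1
    if cursor < num_frames:
        segments.append((cursor, num_frames - 1, "normal"))
    return segments


print(split_by_annotation(100, [(20, 39), (70, 99)]))
```

Each returned segment can then be cut into fixed-length clips and fed to the detector with its label.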

References

  • Maria Gadelkarim, Mazen Khodier, and Walid Gomaa. Violence detection and recognition from diverse video sources. Accepted in IJCNN 2022.

  • Ahmed Abo Eitta, Toka Barabash, Yousef Nafea, and Walid Gomaa. Automatic detection of violence in video scenes. In 2021 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2021.

Identifying Motion Pathways and Anomalies in Crowd Scenes

I. Classical Approach: Non-Parametric Clustering

Many approaches that address the analysis of crowded scenes rely on using short trajectory fragments, also known as tracklets, of moving objects to identify motion pathways. Typically, such approaches aim at defining meaningful relationships among tracklets. However, defining these relationships and incorporating them in a crowded scene analysis framework is a challenge. In this work, we introduce a robust approach to identifying motion pathways based on tracklet clustering. We formulate a novel measure, inspired by line geometry, to capture the pairwise similarities between tracklets. For tracklet clustering, the recent distance-dependent Chinese restaurant process (DD-CRP) model is adapted to use the estimated pairwise tracklet similarities. The motion pathways are identified based on two hierarchical levels of DD-CRP clustering, such that the output clusters correspond to the pathways of moving objects in the crowded scene. Moreover, we extend our DD-CRP clustering adaptation to incorporate the source and sink gate probabilities for each tracklet as a high-level semantic prior for improving clustering performance. For qualitative evaluation, we propose a robust pathway matching metric, based on the chi-square distance, that accounts for both spatial coverage and motion orientation in the matched pathways. Our experimental evaluation on multiple crowded scene datasets, principally the challenging Grand Central Station dataset, demonstrates the state-of-the-art performance of our approach. Finally, we demonstrate the task of motion abnormality detection, at both the tracklet and frame levels, against the normal motion patterns encountered in the motion pathways identified by our method, with competent quantitative performance on multiple datasets.
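For readers unfamiliar with the chi-square distance mentioned above, the following sketch shows the standard form of that distance between two histograms, for example two motion orientation histograms. How the actual pathway matching metric combines spatial coverage and orientation is not specified here, so this is only the generic building block; the function name and the choice to normalise both histograms first are assumptions.

```python
from typing import Sequence


def chi_square_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Chi-square distance 0.5 * sum((p_k - q_k)^2 / (p_k + q_k)).

    Both histograms are normalised to sum to 1 first, so the result lies in
    [0, 1]; bins that are empty in both histograms are skipped.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive total mass")
    total = 0.0
    for a, b in zip(p, q):
        a, b = a / sum_p, b / sum_q
        if a + b > 0:
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total


# Two hypothetical 4-bin orientation histograms (counts per direction bin).
print(chi_square_distance([10, 0, 0, 0], [8, 2, 0, 0]))
```

A distance of 0 means identical normalised histograms, and 1 means the histograms share no occupied bin.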

The main contributions of this work are as follows: (1) a flexible tracklet similarity measure based on line geometry, (2) an adaptation of the DD-CRP model to the clustering of tracklets, (3) a new tracklet cluster likelihood (TCL) based on adopting the source/sink gate distributions of tracklets as a high-level semantic prior, (4) new comprehensive ground truth (GT) pathways for the Grand Central Station scene based on different gate annotations, provided for practical evaluation, (5) a robust quantitative/qualitative evaluation for the semantic analysis of crowded scenes based on a reliable pathway matching metric, and (6) a simple tracklet representation used for anomaly detection through the identified pathways.

(a) Hypothetical scene with one source (A), two sinks (B and C), and four tracklets (t_1, ..., t_4). The directed line segment associated with each tracklet is indicated by a dashed line. θ_{34} is the estimated angle between t_3 and t_4. (b) Computation of the overlap ratio between two tracklets, t_i and t_j, as O_{ij} = I_{ij}/U_{ij}.
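The two quantities in this caption, the angle between two tracklets and the overlap ratio O_{ij} = I_{ij}/U_{ij}, can be sketched as below. Each tracklet is reduced to a directed line segment from its first to its last point. The caption does not spell out how I_{ij} and U_{ij} are measured, so the sketch assumes they are the intersection and union lengths of the two segments' projections onto the supporting line of t_i; that interpretation, and the function names, are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]  # directed segment: (start, end)


def angle_between(a: Segment, b: Segment) -> float:
    """Angle in radians, in [0, pi], between the directions of two segments."""
    ax, ay = a[1][0] - a[0][0], a[1][1] - a[0][1]
    bx, by = b[1][0] - b[0][0], b[1][1] - b[0][1]
    # atan2(cross, dot) is numerically safer than acos of a normalised dot.
    return abs(math.atan2(ax * by - ay * bx, ax * bx + ay * by))


def overlap_ratio(a: Segment, b: Segment) -> float:
    """Intersection over union of both projections onto the line through `a`."""
    ax, ay = a[1][0] - a[0][0], a[1][1] - a[0][1]
    length = math.hypot(ax, ay)
    if length == 0:
        raise ValueError("segment `a` must have non-zero length")
    ux, uy = ax / length, ay / length  # unit direction of `a`

    def project(p: Point) -> float:
        # Signed position of p along `a`, measured from the start of `a`.
        return (p[0] - a[0][0]) * ux + (p[1] - a[0][1]) * uy

    b_lo, b_hi = sorted((project(b[0]), project(b[1])))
    inter = max(0.0, min(length, b_hi) - max(0.0, b_lo))
    union = length + (b_hi - b_lo) - inter
    return inter / union


t_i = ((0.0, 0.0), (4.0, 0.0))
t_j = ((2.0, 1.0), (6.0, 1.0))
print(angle_between(t_i, t_j), overlap_ratio(t_i, t_j))
```

Note that this overlap measure is not symmetric in its arguments, because the projection line is taken from the first segment.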

(a) Parallel tracklets appear in separate clusters that resulted from the first level of clustering, (b) corresponding collinear representatives for each cluster, and (c) global collinear representative obtained from the second level of clustering based on curve fitting.

(a) Hypothetical scene that demonstrates the directed lines that extend from the start terminal of a tracklet to the gate centroids, (b) our 14-gate annotation, and (c) the floor plan of the Grand Central Station.

(image credit: https://www.cultofmac.com/82433/confirmed-apple-to-open-biggest-store-yet-in-grand-central-terminal )

(a) Motion pathway obtained from our clustering model without providing the semantic prior (green ellipses indicate the most probable gates). (b) Corresponding pathway after introducing the source/sink priors.

Computation of both source and sink gate probability distributions for hypothetical tracklet t_i in cluster C based on its frontal and rearward fields of view. Note that the most likely source and sink gates are highlighted in black and blue, respectively (11 and 5).

(a) GT (ground truth) pathway, (b) heat map of the pathway’s spatial extent, and (c) pathway motion orientation histogram.

(a,c) GT pathways based on seven and eight-gate annotations, respectively, and (b, d) our identified pathways’ correspondences.

(a) Ground truth pathways, (b) pathways identified by our proposed method, and (c) corresponding semantic paths from Zhou et al. (2011). Both approaches incorporate source/sink priors in their clustering model.

(a) Ground truth motion pathways, (b) pathways obtained using our no-prior approach, and (c) pathways that resulted from the MT method (Jodoin et al., 2013).

Sample clustering results from different crowded scenes: the first column shows the ground truth pathways. The remaining columns show the identified pathways obtained using our approach without providing a semantic prior.

Anomaly Detection

Detecting unusual actions and activities in crowded scenes has attracted a great deal of research interest recently. The problem is not only to detect whether an abnormal action occurs, but also to localize where and when the event occurred, or even to identify how long it lasted. Visual scenes almost always contain normal behavior over time, and abnormal actions are rare. Thus, most proposed anomaly detection approaches depend on learning behaviors from labeled data or from a corpus of unlabeled data in which most parts are normal. We introduce a simple tracklet representation that incorporates spatial, orientation, and speed information. We represent each tracklet by a six-dimensional (6D) feature vector. The pair of parameters (r, θ) represents each point on the tracklet in polar coordinates. These parameters express, respectively, the distance of the point from the frame origin and the amount of rotation required from the positive x-axis. Let s, m, and d denote the start, middle, and end points of the tracklet, respectively. The proposed feature vector is the 6D vector [r_s, r_m, r_d, θ_s, θ_m, θ_d]. Note that the θ angles are the rotation differences. This representation has proved reliable for the anomaly detection task and for discriminating between different types of tracklet abnormalities.
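A minimal sketch of this 6D representation is given below. It converts the start, middle, and end points of a tracklet to polar coordinates relative to the frame origin. Two assumptions are made for illustration: the middle point is taken as the point at index `len(points) // 2`, and the θ entries are plain polar angles, since the exact definition of the "rotation differences" is not given here.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_polar(p: Point) -> Tuple[float, float]:
    """(r, theta): distance from the origin and angle from the positive x-axis."""
    return math.hypot(p[0], p[1]), math.atan2(p[1], p[0])


def tracklet_feature(points: Sequence[Point]) -> List[float]:
    """[r_s, r_m, r_d, theta_s, theta_m, theta_d] for start, middle, end points."""
    if not points:
        raise ValueError("a tracklet needs at least one point")
    start, middle, end = points[0], points[len(points) // 2], points[-1]
    (r_s, th_s), (r_m, th_m), (r_d, th_d) = map(to_polar, (start, middle, end))
    return [r_s, r_m, r_d, th_s, th_m, th_d]


# Hypothetical three-point tracklet in frame coordinates.
print(tracklet_feature([(3.0, 4.0), (0.0, 2.0), (1.0, 0.0)]))
```

Because the three sampled points are ordered in time, the vector implicitly carries direction of travel, and the spacing between the radii and angles reflects how far the object moved over the tracklet.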

Tracklet representation for anomaly detection.

Three panic scenarios of the UMN dataset (UMN, 2006): the first row shows the normal behaviors and the second row shows the abnormal behaviors.

II. Deep Learning: LSTM

As an alternative, we propose two approaches to analyzing crowd scenes. The first is based on motion units and meta-tracking (the MUDAM approach). In this approach, the scene is divided into a number of dynamic divisions with coherent motion dynamics, called motion units (MUs). By analyzing the relationships between these MUs using a proposed continuation likelihood, the scene entrance and exit gates are retrieved. A meta-tracking procedure is then applied and the dominant motion pathways of the scene are retrieved. To overcome the limitations of the MUDAM approach, and to detect some of the anomalies that may occur in these scenes, we propose a second, LSTM-based approach. In this approach, the scene is divided into a number of static, overlapping spatial regions named super regions (SRs), which together cover the whole scene. A Long Short-Term Memory (LSTM) network is used to define a predictive model for each SR. Each LSTM predictive model is trained on the tracklets of its SR, so that it can capture the full motion dynamics of that SR. Using a priori known scene entrance segments, the LSTM predictive models are applied and the dominant motion pathways of the scene are retrieved. An anomaly metric is formulated for use with the LSTM predictive models to detect scene anomalies. Prototypes of our proposed approaches were developed and evaluated on the challenging New York Grand Central station scene, in addition to four other crowded scenes. Four types of anomalies that may occur in crowded scenes were defined, and our LSTM-based approach was used to detect them. Anomaly detection experiments were also carried out on a number of datasets. Overall, the proposed approaches outperformed the state-of-the-art methods in retrieving the scene gates and common pathways, in addition to detecting motion anomalies.
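The input and output format of the per-SR predictive models is not detailed here. One common way to train a next-point predictor on tracklets, shown below as a sketch, is to slide a fixed-length window over each tracklet and pair every window with the point that follows it. The function name `make_training_pairs` and the `window` parameter are assumptions introduced for illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Pair = Tuple[List[Point], Point]  # (input window, point to predict)


def make_training_pairs(tracklet: Sequence[Point], window: int) -> List[Pair]:
    """Build (previous `window` points, next point) pairs from one tracklet.

    Tracklets with `window` points or fewer yield no pairs, because there is
    no following point to use as the prediction target.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pairs: List[Pair] = []
    for i in range(len(tracklet) - window):
        pairs.append((list(tracklet[i:i + window]), tracklet[i + window]))
    return pairs


# Hypothetical tracklet moving right along y = 1.
track = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0), (3.0, 1.0), (4.0, 1.0)]
for inputs, target in make_training_pairs(track, window=3):
    print(inputs, "->", target)
```

At inference time the same windowing allows a trajectory to be generated step by step: the predicted point is appended to the window and the oldest point is dropped.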

So, in brief, this work aims at (1) analyzing the crowd scenes to discover the common motion pathways of the scene's typical moving objects, (2) discovering the scene entrance/exit regions, and (3) building normalcy models for the scene motion dynamics and accordingly using these models in anomaly detection.

a) The Marathon scene divided into four overlapped super regions (SRs). b) A group of MUs (motion units) of the Grand Central station scene and their mean tracklets (yellow) and their orientations (red).

a) The field of view of a hypothetical tracklet T_i, and a group of acceptable continuation tracklets (green tracklets). δ and θ_FOV are the distance and angle parameters of the field of view (FOV), respectively. b) A hypothetical figure showing the type of each MU (represented by a circular node) according to the connectivity relationship (represented by the black arrow) between it and the surrounding MUs.
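The field-of-view test in panel (a) can be sketched as a simple geometric check: a candidate tracklet is an acceptable continuation if its start point lies within distance δ of the end of T_i and within the angular opening θ_FOV around the direction of T_i. The sketch assumes θ_FOV is the full opening angle, centred on the direction of travel, and approximates T_i by the segment from its tail to its head; both choices, and the function name, are assumptions.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def in_field_of_view(
    tail: Point,
    head: Point,
    candidate_start: Point,
    delta: float,
    theta_fov: float,
) -> bool:
    """True if `candidate_start` falls inside the field of view of a tracklet.

    The tracklet runs from `tail` to `head`; the field of view is the sector
    of radius `delta` and full opening angle `theta_fov` (radians) that starts
    at `head` and is centred on the tail-to-head direction.
    """
    dx, dy = head[0] - tail[0], head[1] - tail[1]
    if dx == 0 and dy == 0:
        raise ValueError("tracklet direction is undefined")
    cx, cy = candidate_start[0] - head[0], candidate_start[1] - head[1]
    distance = math.hypot(cx, cy)
    if distance > delta:
        return False
    if distance == 0:
        return True  # candidate starts exactly where the tracklet ends
    # Unsigned angle between the tracklet direction and the offset vector.
    offset = abs(math.atan2(dx * cy - dy * cx, dx * cx + dy * cy))
    return offset <= theta_fov / 2


print(in_field_of_view((0, 0), (1, 0), (3, 1), delta=5.0, theta_fov=math.pi / 2))
print(in_field_of_view((0, 0), (1, 0), (1, 3), delta=5.0, theta_fov=math.pi / 2))
```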

Samples of the trajectories retrieved using our proposed MUDAM approach, showing the loop-back failure case. The trajectories start at the yellow points and end at the red points. This is one of the limitations of the MUDAM approach, which motivates the subsequent use of LSTM recurrent networks to properly model the dynamics of the scene.

a) The architecture of the proposed LSTM predictive model. b) A hypothetical figure showing how the trajectory generation process is applied.

a) The manually annotated ground truth (GT) gates of New York's Grand Central station dataset. b) The entrance/exit gates obtained after applying mean shift clustering. c) Our obtained gates after matching to the GT gates (bipartite matching).
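The bipartite matching step in panel (c) can be illustrated with a small sketch that assigns each detected gate to a distinct ground truth gate so that the total distance is minimal. The cost function (Euclidean distance between gate centroids) and the exhaustive search are assumptions chosen for clarity; the matching cost and solver actually used are not stated here, and for more than a handful of gates a Hungarian-algorithm solver would be the practical choice.

```python
import math
from itertools import permutations
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def match_gates(
    detected: Sequence[Point], ground_truth: Sequence[Point]
) -> List[Tuple[int, int]]:
    """Return (detected_index, gt_index) pairs minimising total centroid distance.

    Every detected gate gets a distinct GT gate, so there must be at least as
    many GT gates as detected ones. The search is exhaustive and therefore
    only suitable for small numbers of gates.
    """
    if len(detected) > len(ground_truth):
        raise ValueError("more detected gates than ground truth gates")
    best_cost, best_assignment = math.inf, ()
    for assignment in permutations(range(len(ground_truth)), len(detected)):
        cost = sum(
            math.dist(detected[i], ground_truth[j])
            for i, j in enumerate(assignment)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return list(enumerate(best_assignment))


# Hypothetical gate centroids in pixel coordinates.
detected_gates = [(0.0, 1.0), (10.0, 9.0)]
gt_gates = [(10.0, 10.0), (0.0, 0.0), (5.0, 5.0)]
print(match_gates(detected_gates, gt_gates))
```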

Grand Central (GC) gates results for our proposed approach (MUs plus Meta-tracking) vs. Jodoin et al. (JA) [15] and Hassanein et al. (HA) [13] approaches.

Qualitative results of applying our MU and meta-tracking based approach on the Marathon, Rush Hour, Street Light, and China Street datasets (from top to bottom, respectively). a) Ground truth entrance/exit gates and common pathways (green arrow). b) Detected entrance (yellow)/exit (red) gate points.

The Marathon scene is divided into different numbers of overlapping super regions (SRs), as shown in Column (a), with the corresponding common pathways obtained in each case shown in Column (b).

The following table shows the pathways discovered for the Marathon dataset using the proposed LSTM approach after dividing the scene into two, three, four, and six super regions (SRs).

The 7 richest pathways of the ground truth, JA [15], HA [13], MUDAM [23], and our LSTM based approaches from left to right respectively.

Pathway qualitative results of applying our LSTM based approach on the Rush Hour, Street Light, and China Street datasets (from left to right)

Conclusion

In this work, we proposed two approaches for crowd scene analysis. The first is based on motion units and meta-tracking. In this approach, the scene is divided into dynamic divisions called Motion Units (MUs) based on the local motion characteristics of the scene. A connectivity relation is defined to analyze the relationships between these MUs and retrieve the entrance/exit gates of the crowd scene. A transition relation and a Motion Unit Dynamics Acquisition Model (MUDAM) are derived to apply a meta-tracking procedure that retrieves the common pathways of the scene. Due to some limitations of this approach, and to the need to detect anomalous behaviors that may occur in the crowd scene, a new LSTM-based approach for analyzing crowded scenes is proposed. In this approach, we divide the scene into a number of spatially overlapping parts called Super Regions (SRs), and then train an LSTM predictive model for each SR using the tracklets extracted inside that SR. The proposed approach is then used to discover the scene's common pathways, given the scene gates (entrances and exits). In our experiments, we used the gates obtained by the MUs and meta-tracking technique as input to the LSTM-based approach. An anomaly metric is also proposed to detect four abnormal situations that may occur inside the crowd scene. The two proposed approaches have been assessed against two state-of-the-art approaches in video content analytics, in addition to the ground truth pathways of the challenging New York Grand Central dataset. For further evaluation, four other datasets were also used. The proposed LSTM-based approach was tested on identifying anomalous activities in several scenes. The experimental results show that our proposed approaches outperform the other state-of-the-art approaches in detecting the scene gates and pathways, and also in detecting anomalous scenarios.
In future work, we will consider fixing the scene defects that affect the performance of our proposed algorithms. We will also consider an adaptive SR division process, which adapts both the number of SRs to the motion dynamics in the scene (more SRs in highly dynamic areas and vice versa) and the shape of these SRs. The proper number of sequential input points to the LSTM predictive model is another factor to be investigated. One important issue is that the dominant pathways change over different times of the day or different days (non-stationarity of the motion dynamics); we intend to analyze this situation and give our approach the ability to handle it. We also intend to study more complex anomalous behaviors, such as people grouping or splitting in the crowd scene, and the effect of these actions on the scene motion flow. In addition, we plan to identify specific events in crowd scenes, such as leaving an object unattended on the floor, which is very important from a security standpoint. To handle these new anomalous scenarios, we plan to employ a neural network-based approach for more accurate and adaptive detection of these anomalies.

References

  • Abdullah N. Moustafa and Walid Gomaa. Gate and common pathway detection in crowd scenes and anomaly detection using motion units and LSTM predictive models. Multimedia Tools and Applications, April 2020.

  • Abdullah Moustafa, Mohamed Hussein, and Walid Gomaa. Gate and Common Pathway Detection in Crowd Scenes Using Motion Units and Meta-Tracking. In Proc. of the International Conference on Digital Image Computing: Techniques and Applications (DICTA 2017), Sydney, Australia, 29 Nov - 01 Dec 2017.