This website was made by Rajeev Ranjan and Sayantan Sarkar to document the progress of their course project for ENEE739E.
Objective
Our objective is to decompose a video sequence into foreground and background components and to facilitate activity/anomaly detection. We consider a special class of videos whose background is fixed and only a small portion (the foreground) is dynamic.
Surveillance videos are a good example of this class of videos: the cameras are usually fixed, so they capture the same background in every frame. The background can therefore be thought of as a low rank structure (since it is repetitive), with the objects moving in the video adding a sparse component.
This problem is important because it forms the backbone of many related tasks, such as anomaly detection, object tracking, and video summarization, which we touch upon below.
Sparse Formulation
Let us now explore how we can express a video as the sum of two components: a low rank one and a sparse one.
Consider a video of resolution R x C (that is, R rows and C columns) and extract a contiguous set of N frames from it. We vectorize each frame by stacking up its columns, converting it into a vector of length RC. Finally, we place these vectorized frames side by side as the columns of a new matrix called the 'video volume'. Thus the video volume is a matrix with RC rows and N columns, where each column is the vectorized version of one frame.
Fig: Image showing construction of video volume
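For concreteness, here is a minimal NumPy sketch of this construction. The function name and the assumption that the frames are already available as grayscale arrays are illustrative, not part of our actual code.

```python
import numpy as np

def build_video_volume(frames):
    """Stack N grayscale frames of shape (R, C) into an RC x N video volume.

    Each frame is vectorized column-wise (Fortran order), so the k-th
    column of the volume is the vectorized k-th frame, as described above.
    """
    return np.stack([f.flatten(order="F") for f in frames], axis=1)

# Example: 30 frames of a 120 x 160 video give a 19200 x 30 volume.
frames = [np.random.rand(120, 160) for _ in range(30)]
M = build_video_volume(frames)
print(M.shape)  # (19200, 30)
```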
If the video contains nothing except a static background, then all the frames are identical, so all the columns of the video volume are identical and the video volume has rank 1. Thus the background forms a low rank structure.
Now if objects move in the video, they tend to occupy only a small portion of each frame, making them spatially sparse. Given this formulation, we seek to decompose the video volume into a low rank (ideally rank 1) structure, which defines the background, and a sparse matrix, which defines the foreground objects.
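One standard way to write this decomposition mathematically is the Robust PCA (Principal Component Pursuit) program below. We state it only as a reference formulation; the weight $\lambda$ is a tuning parameter, and our own decomposition techniques are described in the implementation section.

```latex
\min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1}
\quad \text{subject to} \quad M = L + S
```

Here $M \in \mathbb{R}^{RC \times N}$ is the video volume, $\|L\|_{*}$ is the nuclear norm (the sum of singular values, which encourages low rank) and $\|S\|_{1}$ is the entrywise $\ell_1$ norm (which encourages sparsity).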
Implementation details
We have implemented two separate pathways to study this problem. In the first, we decompose the video volume (using 3 decomposition techniques), extract features from the sparse component, and then classify the video volume as anomalous or normal by measuring its distance from a known set of non-anomalous video volumes. In the second pathway, we take an unsupervised approach and attempt to cluster objects, hopefully separating all anomalous objects into a single cluster. The image below summarizes our roadmap. The green blocks represent the supervised pathway, while the cream coloured blocks represent the unsupervised one.
Fig: Roadmap summary
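The roadmap above does not spell out the decomposition step itself, so as a point of reference, here is a minimal sketch of one standard decomposition technique: Robust PCA solved by a basic augmented Lagrangian (ADMM) loop. The parameter choices and stopping rule are illustrative assumptions and not necessarily those used in our three methods.

```python
import numpy as np

def shrink(X, tau):
    """Soft-threshold the entries of X by tau."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_shrink(X, tau):
    """Soft-threshold the singular values of X by tau (nuclear-norm step)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def rpca(M, n_iter=100, tol=1e-7):
    """Split the video volume M into a low rank L and a sparse S."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n))            # standard sparsity weight
    mu = m * n / (4.0 * np.sum(np.abs(M)))    # penalty parameter
    L = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                      # dual variable
    for _ in range(n_iter):
        L = svd_shrink(M - S + Y / mu, 1.0 / mu)
        S = shrink(M - L + Y / mu, lam / mu)
        Y = Y + mu * (M - L - S)
        if np.linalg.norm(M - L - S) <= tol * np.linalg.norm(M):
            break
    return L, S
```

Applied to the video volume, the columns of L reconstruct the (near-constant) background frames, while the nonzero entries of S mark the pixels occupied by moving objects.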
The next image shows a further expansion of the supervised pathway. We consider two cases: one where the full video volume is available, and one where the video volume is sampled and only a compressed version is available. Based on this, we use different techniques to separate the video into L (low rank) and S (sparse) components. Once the background has been separated, segmenting the sparse component is straightforward. We then track the segmented objects across the video volume, compute movement-based features, and finally use the Mahalanobis distance to measure how far the given video volume lies from a reference data set of known non-anomalous videos. The operations we have implemented are shown in the green boxes.

The orange boxes show a few straightforward extensions of this framework. We can threshold on the level of sparsity to infer whether any object is present in the video at all; if nothing is present, that segment can be removed, creating a shorter summary video. We can also use shape-based features for classification and perhaps fuse the shape and movement based classifiers in a cascade. A movement-based classifier can differentiate between, say, a fast moving roller skater and a person walking normally, while shape-based features may help differentiate between a car and a person.
Fig: Supervised pathway
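As a sketch of the final classification step, the snippet below scores a feature vector by its Mahalanobis distance from a reference set of non-anomalous volumes. The function names, the feature extraction step, and the threshold are illustrative assumptions, not the exact interface of our implementation.

```python
import numpy as np

def mahalanobis_score(x, reference):
    """Distance of feature vector x from a reference (non-anomalous) set.

    reference: array of shape (num_videos, num_features) holding the
    movement-based features of known normal video volumes.
    """
    mu = reference.mean(axis=0)
    cov = np.cov(reference, rowvar=False)
    cov_inv = np.linalg.pinv(cov)   # pseudo-inverse for numerical robustness
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# A volume is flagged as anomalous if its score exceeds a chosen threshold:
#   ref = np.vstack([extract_features(v) for v in normal_volumes])
#   is_anomalous = mahalanobis_score(extract_features(test_volume), ref) > threshold
```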
Site Map
Now that we have given a brief overview of the techniques we have used, let us look at them in more detail: