Crowd Counting

Srinath Srinivasan

Overview

Crowd counting is a technique used to estimate the number of people in a particular scene (an image or a video). This technique plays a crucial role in urban planning, public safety, and intelligent transportation, and even helps optimize services in the commercial sector.

Furthermore, crowd counting can be used to estimate the population density of an area in real time. This density can feed into a plethora of urban analyses, such as monitoring the usage of streets, malls, and parks. Such data could be especially useful in analyzing the spread of COVID-19, enforcing regulations such as social distancing, and allotting resources to the areas that need them most.


Dataset

The VisDrone Crowd Counting dataset consists of 3,360 images taken by drone-mounted cameras in 70 different scenarios across 4 cities in China. Every person in each video frame is manually annotated with a point. The dataset is divided into training and testing subsets of 2,460 and 900 images respectively.

Available here: https://github.com/VisDrone/VisDrone-Dataset

State of the Art

Crowd counting results of FPNCC on the VisDrone-CC2020 dataset:

  FPNCC    MAE      MSE
  Large    13.74    18.37
  Small    10.27    13.15

FPNCC is based on AutoScale (https://arxiv.org/abs/1912.09632) with a VGG16-based Feature Pyramid Network (FPN) backbone (https://openaccess.thecvf.com/content_cvpr_2017/papers/Lin_Feature_Pyramid_Networks_CVPR_2017_paper.pdf). The framework rescales dense regions into similar density levels, which mitigates the imbalance of density values in the dataset.

One fundamental challenge in computer vision is being able to identify objects at vastly different scales. Feature pyramids built upon image pyramids form the basis of a standard solution. These pyramids are scale-invariant in the sense that an object’s scale change is offset by shifting its level in the pyramid. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and pyramid levels.
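
To make this concrete, below is a minimal sketch of an FPN top-down pathway in Keras. The channel count, layer sizes, and input levels are illustrative assumptions, not the exact FPNCC configuration.

```python
from tensorflow.keras import layers, Input, Model

def fpn_top_down(c3, c4, c5, channels=256):
    """Merge backbone features (c3: highest resolution, c5: lowest) into a pyramid.
    Assumes each level has twice the spatial size of the level above it."""
    # 1x1 lateral convolutions project every level to the same channel depth.
    p5 = layers.Conv2D(channels, 1)(c5)
    p4 = layers.Add()([layers.UpSampling2D(2)(p5), layers.Conv2D(channels, 1)(c4)])
    p3 = layers.Add()([layers.UpSampling2D(2)(p4), layers.Conv2D(channels, 1)(c3)])
    # 3x3 convolutions smooth each merged map before it is used for prediction.
    return [layers.Conv2D(channels, 3, padding="same")(p) for p in (p3, p4, p5)]

# Hypothetical feature-map sizes for a VGG16-like backbone on a 512x512 input.
c3, c4, c5 = Input((128, 128, 256)), Input((64, 64, 512)), Input((32, 32, 512))
pyramid = Model([c3, c4, c5], fpn_top_down(c3, c4, c5))
```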

Challenges

A majority of early crowd counting methods depend on sliding-window detectors that scan video frames for pedestrians using hand-crafted features. However, these methods are often degraded by heavy occlusion and by scale and viewpoint variations in crowded scenes. With the advancement of deep learning, many modern solutions instead approach the problem as a regression of density maps by the network.

These were some of the major difficulties with the dataset that I faced when implementing a solution to this task:

  • Cross-scenarios - The training and validation sets are divided according to scenario, which makes this a cross-scene crowd counting task. Some test scenarios are virtually unknown to the model, thereby increasing the difficulty of the task.

  • Scale variation - The flying height of the drones that produced the footage varies often, which significantly affects the scale of objects. This variation in scale increases the difficulty of fitting the model to the data.

  • Illumination variation - The test set may contain dark, dimly lit footage that does not appear in the training set. Such night scenarios have a significant impact on the test results.

Initially, my plan was to implement the state-of-the-art algorithm, FPNCC, mentioned above. However, I soon realized that it was far too complicated, mainly because I did not have the knowledge required to implement feature pyramids with this dataset.


Approach

My approach was as follows:

  • Data Preparation - The first, and probably the most important, step was to pre-process the data. This involved splitting the data into train, validation, and test sets and making sure the directory structure was maintainable.

  • Density Calculations - I computed the Gaussian density map of each image in the training set from the provided point annotations; these density maps serve as the ground truths for training (a sketch of this step follows the figure below).

Density estimation of a sample image from train set
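
Here is a minimal sketch of that density calculation, assuming each image's annotations are a list of (x, y) head points and using a fixed Gaussian spread (the CSRNet paper also describes geometry-adaptive kernels; a fixed sigma is a simplification):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(points, height, width, sigma=15):
    """Place a unit impulse at each annotated head and blur it with a Gaussian,
    so the resulting map sums to the number of people in the image."""
    dmap = np.zeros((height, width), dtype=np.float32)
    for x, y in points:
        if 0 <= int(y) < height and 0 <= int(x) < width:
            dmap[int(y), int(x)] += 1.0
    # The default 'reflect' boundary keeps all mass inside the map, so
    # dmap.sum() stays (approximately) equal to len(points).
    return gaussian_filter(dmap, sigma)
```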

  • Algorithm - I chose to implement CSRNet (https://arxiv.org/abs/1802.10062), a data-driven deep learning method that can understand highly congested scenes and provide accurate count estimates as well as high-quality density maps. The algorithm consists of two components: 1) a convolutional neural network (CNN) front-end for 2D feature extraction, and 2) a dilated CNN back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations.

  • Model Architecture - The model is divided into two parts, a front-end and a back-end. The front-end consists of the first 13 layers of the pre-trained VGG16 model (10 convolution layers and 3 max-pooling layers); VGG16's fully connected layers are not used. The back-end comprises dilated convolution layers. Here, I built a custom VGG16 model and loaded pre-trained weights that I found online (a sketch of this architecture follows the figures below).

CSRNet model in my code

plot_model of model
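
For reference, here is a minimal Keras sketch of a CSRNet-style model, assuming configuration B of the paper for the back-end (dilation rate 2); the ReLU on the output layer is a common implementation choice, and the exact setup in my code may differ slightly.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import VGG16

def build_csrnet():
    # Front-end: the first 13 layers of ImageNet-pretrained VGG16
    # (10 convolution layers + 3 max-pooling layers, ending at block4_conv3).
    vgg = VGG16(include_top=False, weights="imagenet", input_shape=(None, None, 3))
    x = vgg.get_layer("block4_conv3").output
    # Back-end: dilated 3x3 convolutions (dilation rate 2) replace further
    # pooling while enlarging the receptive field.
    for filters in (512, 512, 512, 256, 128, 64):
        x = layers.Conv2D(filters, 3, padding="same",
                          dilation_rate=2, activation="relu")(x)
    # A 1x1 convolution regresses the single-channel density map.
    density = layers.Conv2D(1, 1, activation="relu")(x)
    return Model(vgg.input, density)
```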

Trained model - After training the CSRNet model on the training set and estimating the Gaussian density of the images again, I observed that the model was doing a much better job of estimating the densities. A minimal sketch of the training step follows.
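
This sketch continues from the build_csrnet() sketch above. The pixel-wise MSE (Euclidean) loss matches the CSRNet paper; the arrays here are random placeholders purely to show the expected shapes.

```python
import numpy as np

model = build_csrnet()
model.compile(optimizer="adam", loss="mse")  # pixel-wise Euclidean loss

# Placeholder data with the right shapes; real training uses the VisDrone
# frames and their Gaussian density maps. Targets are 1/8 of the input
# resolution because of the three pooling layers in the front-end (when
# downsampling the ground-truth maps, multiply by 64 to preserve the count).
x = np.random.rand(2, 256, 256, 3).astype("float32")
y = np.random.rand(2, 32, 32, 1).astype("float32")
model.fit(x, y, epochs=1, batch_size=1)
```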

Annotation - Rather than just estimating densities on the images in the test set, I chose to annotate the persons in every image with red dots based on the density maps. I chose this route as it was simpler to implement; a rough sketch is below.
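
This sketch assumes each person corresponds to a local maximum of the predicted density map; the threshold and neighborhood size are hypothetical values that would need tuning.

```python
import numpy as np
import cv2
from scipy.ndimage import maximum_filter

def annotate_peaks(image_bgr, dmap, threshold=0.05, size=9):
    """Draw a red dot at each local maximum of the predicted density map."""
    # CSRNet predicts at 1/8 of the input resolution, so resize the map
    # to the image size before finding peaks.
    h, w = image_bgr.shape[:2]
    dmap = cv2.resize(dmap, (w, h))
    peaks = (dmap == maximum_filter(dmap, size=size)) & (dmap > threshold)
    for y, x in zip(*np.nonzero(peaks)):
        cv2.circle(image_bgr, (int(x), int(y)), 3, (0, 0, 255), -1)  # red in BGR
    return image_bgr
```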

Similar to how models were evaluated by Mean Absolute Error (MAE) and Mean Squared Error (MSE) in the actual VisDrone 2020 competition, I chose to use the same evaluation metrics. My model achieved an MAE of 8.108 on the validation set and 10.189 on the test set.
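
For reference, a short sketch of these metrics, where the predicted count of each image is taken to be the sum of its predicted density map (crowd counting papers, CSRNet included, conventionally define "MSE" with a square root, i.e. as a root mean squared error):

```python
import numpy as np

def mae_mse(pred_counts, true_counts):
    pred = np.asarray(pred_counts, dtype=float)
    true = np.asarray(true_counts, dtype=float)
    mae = np.mean(np.abs(pred - true))          # Mean Absolute Error
    mse = np.sqrt(np.mean((pred - true) ** 2))  # "MSE" as used in crowd counting
    return mae, mse
```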

MAE and MSE of train and validation sets

MAE of test set

As we can observe from this graph, the model does a decent job of predicting where persons are in a given image.

Final Thoughts

There are many aspects of my model that could have been optimized to work more efficiently.

  • For one, my model could potentially be more accurate if I had trained the VGG16 weights myself rather than downloading a pre-trained model off the Internet.

  • Furthermore, I would like my model to be able to estimate densities on video footage rather than still images alone. I believe the reason my MAE is lower (better) than that of the state-of-the-art FPNCC architecture is that I am only working with still images.

  • My model works, but it does not output the number of people in the image; rather, it annotates them. The next step for my project would be to output a value for the number of people in the footage at a given time (see the sketch after this list).

  • Lastly, I would like to implement FPNCC on my own and compare results in order to get a true picture of the relative accuracy of the two models.
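
Since the density map integrates to the estimated head count, outputting a number is a small extension. A hypothetical sketch, assuming the build_csrnet() model from earlier:

```python
import numpy as np

def estimate_count(model, image):
    """Estimated number of people = sum over the predicted density map."""
    dmap = model.predict(image[np.newaxis, ...])[0]
    return float(dmap.sum())
```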



I would like to achieve a similar output with my model in the future.