CS766 PROJECT - STATE OF ART

STATE OF ART OF OBJECT DETECTION

Convolutional Neural Network (CNN)

Convolutional Neural Networks (CNNs, or ConvNets) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, and they power vision in robots and self-driving cars. There are four main operations in the ConvNet shown in Figure 1:

Figure 1. Simple CNN model

Convolution
Non Linearity (ReLU)
Pooling or Sub Sampling
Classification (Fully Connected Layer)

These operations are the basic building blocks of every Convolutional Neural Network.
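As a rough illustration, the four operations can be chained in a few lines of NumPy. This is only a sketch under simplifying assumptions: a single-channel image, one 3x3 filter, random weights, and a softmax on the fully connected output to produce probabilities (the softmax and all function names are choices made for this example).

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' sliding-window convolution (no kernel flip, as in most CNN libraries)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Non-linearity: replace negative values with zero."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size]
    return x.reshape(h, size, w, size).max(axis=(1, 3))

def softmax(z):
    """Turn raw class scores into probabilities that sum to 1."""
    e = np.exp(z - z.max())
    return e / e.sum()

def forward(image, kernel, fc_weights):
    """Convolution -> ReLU -> pooling -> fully connected classification."""
    features = max_pool(relu(conv2d(image, kernel)))
    return softmax(fc_weights @ features.ravel())

rng = np.random.default_rng(0)
image = rng.random((8, 8))                # stand-in for a grayscale input image
kernel = rng.standard_normal((3, 3))      # one randomly initialized filter
fc_weights = rng.standard_normal((4, 9))  # 8x8 -> conv 6x6 -> pool 3x3 = 9 features, 4 classes
probs = forward(image, kernel, fc_weights)
```

Here `probs` holds one probability per class. Because the weights are random, the values carry no meaning until the network is trained.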

The overall training process of the Convolution Network can be summarized as below:

Step 1: We initialize all filters and parameters / weights with random values.
Step 2: The network takes a training image as input, goes through the forward propagation step (convolution, ReLU and pooling operations along with forward propagation in the Fully Connected layer) and finds the output probabilities for each class.
- Let’s say the output probabilities for the boat image above are [0.2, 0.4, 0.1, 0.3]
- Since weights are randomly assigned for the first training example, output probabilities are also random.
Step 3: Calculate the total error at the output layer (summation over all 4 classes)

Total Error = ∑ ½ (target probability – output probability) ²

Step 4: Use Back propagation to calculate the gradients of the error with respect to all weights in the network and use gradient descent to update all filter values / weights and parameter values to minimize the output error.
- The weights are adjusted in proportion to their contribution to the total error.
- When the same image is input again, output probabilities might now be [0.1, 0.1, 0.7, 0.1], which is closer to the target vector [0, 0, 1, 0].
- This means that the network has learnt to classify this particular image correctly by adjusting its weights / filters such that the output error is reduced.
- Parameters like number of filters, filter sizes, architecture of the network etc. have all been fixed before Step 1 and do not change during training process – only the values of the filter matrix and connection weights get updated.
Step 5: Repeat steps 2-4 with all images in the training set.
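
The error formula and the weight update from the steps above can be checked with a small NumPy sketch. The two probability vectors and the target vector come from the text. The single linear layer, the random input `x` and the learning rate are assumptions made for illustration only, standing in for the full network.

```python
import numpy as np

def total_error(target, output):
    """Sum over classes of 0.5 * (target - output)^2."""
    return 0.5 * np.sum((target - output) ** 2)

target = np.array([0.0, 0.0, 1.0, 0.0])
error_before = total_error(target, np.array([0.2, 0.4, 0.1, 0.3]))  # about 0.55
error_after = total_error(target, np.array([0.1, 0.1, 0.7, 0.1]))   # about 0.06

# One gradient descent step on a hypothetical linear layer: output = W @ x.
rng = np.random.default_rng(0)
x = rng.random(5)                      # stand-in for the flattened features
W = rng.standard_normal((4, 5))        # random initial weights (Step 1)
learning_rate = 0.1                    # assumed value

output = W @ x                         # forward pass (Step 2)
grad_W = np.outer(output - target, x)  # dE/dW for the squared error (Step 4)
W_new = W - learning_rate * grad_W     # gradient descent update

step_before = total_error(target, output)
step_after = total_error(target, W_new @ x)
```

The error drops from about 0.55 to about 0.06 between the two probability vectors in the text, and `step_after` is smaller than `step_before`, which is the same effect produced by a single update.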

In conclusion, a Convolutional Neural Network (CNN) is good for image classification, including binary classification, where each image receives a single label. Object detection is a harder problem: one image may contain multiple objects, and each of them has to be located as well as classified. A plain CNN produces only one set of class probabilities for the whole image, much like a logistic regression, so its accuracy on detection is low. Other, more robust architectures are therefore considered for object detection.

Region-based Convolutional Neural Network (R-CNN) and Fast Region-based Convolutional Neural Network (Fast R-CNN)

Region-based Convolutional Neural Network (R-CNN) is a modified version of CNN for object detection. It introduces region proposals, each corresponding to a Region of Interest (RoI): candidate boundaries around the objects in the image that help with object identification. Standard R-CNN requires a forward pass of the CNN (e.g. AlexNet) for every region proposal in every image. Three models are used for object identification: a CNN generates the image features, a Support Vector Machine (SVM) classifies the labels, and a regressor tightens the bounding box around each object.

Fast R-CNN is an improved version of R-CNN in which the region proposals are processed faster by sharing computation. It computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features extracted in the proposal region are pooled together and concatenated as in spatial pyramid pooling.

However, the spatial pyramid pooling approach (SPPnet) still has some drawbacks, which Fast R-CNN addresses:

Multistage classification;
Features are written to disk;
The fine-tuning step cannot update the convolutional layers that precede the spatial pyramid pooling.

An example overview of the R-CNN procedure

When an image is input to the R-CNN model, the model finds potential objects in the image and generates around 2000 cropped images, one per region proposal, each of which is passed through the CNN separately. Fast R-CNN instead handles all 2000 region proposals on one single image, whose features are computed once. This difference between Fast R-CNN and R-CNN reduces the computation time.
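
The idea of pooling each proposal from one shared feature map can be sketched as follows. This is a simplified illustration, not the Fast R-CNN implementation: the "feature map" is a small stand-in array rather than a CNN output, the RoI coordinates are made up, and `roi_max_pool` is a hypothetical helper name.

```python
import numpy as np

def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool the region roi = (y0, x0, y1, x1) of a 2-D feature map
    into a fixed out_size x out_size grid."""
    y0, x0, y1, x1 = roi
    region = feature_map[y0:y1, x0:x1]
    # Bin edges that split the region into out_size roughly equal parts.
    ys = np.linspace(0, region.shape[0], out_size + 1).astype(int)
    xs = np.linspace(0, region.shape[1], out_size + 1).astype(int)
    pooled = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            # Each bin covers at least one cell, even for very small regions.
            cell = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[i, j] = cell.max()
    return pooled

# The feature map is computed once per image (here: a stand-in 6x6 array)...
feature_map = np.arange(36, dtype=float).reshape(6, 6)

# ...and every proposal reuses it instead of triggering its own CNN pass.
rois = [(0, 0, 4, 4), (2, 2, 6, 6)]
pooled = [roi_max_pool(feature_map, roi) for roi in rois]
```

Every proposal, whatever its size, ends up as a fixed-size grid that can be fed to the same classifier, while the expensive convolutional pass runs only once per image.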

R-CNN and Fast R-CNN use an external region proposal module as a pre-processing step before running the CNN. The proposal algorithms are typically techniques such as EdgeBoxes or Selective Search, which are independent of the CNN layers. In Fast R-CNN, these search techniques become the processing bottleneck, taking longer than running the CNN itself.

Faster R-CNN and Mask R-CNN

The major improvement of Faster R-CNN is the use of the Region Proposal Network (RPN), a fully convolutional network, on top of the CNN. With the region proposal network, the search for RoIs is based on the CNN feature map, which is already calculated for classification.

The RPN takes the CNN feature map as input and outputs potential bounding boxes, each with a score indicating the likelihood that an object is present within the box area. It then sends the information within each bounding box to the Fast R-CNN network for RoI pooling and classification.
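
The hand-off from the RPN to the Fast R-CNN stage can be pictured as choosing which scored boxes to pass on. The sketch below is a heavy simplification: the boxes and scores are invented, `select_proposals`, `top_n` and `min_score` are hypothetical names and values, and a real pipeline filters the proposals further.

```python
import numpy as np

def select_proposals(boxes, scores, top_n=3, min_score=0.5):
    """Keep the top_n highest-scoring boxes whose score is at least min_score."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    keep = np.flatnonzero(scores >= min_score)       # drop unlikely boxes
    order = keep[np.argsort(-scores[keep])][:top_n]  # best scores first
    return boxes[order], scores[order]

# Hypothetical RPN output: (y0, x0, y1, x1) boxes with objectness scores.
boxes = [(0, 0, 10, 10), (5, 5, 8, 8), (2, 3, 12, 9), (1, 1, 4, 6)]
scores = [0.9, 0.2, 0.75, 0.6]

kept_boxes, kept_scores = select_proposals(boxes, scores, top_n=2)
```

With these made-up values the first and third boxes are kept, and only those regions would go on to RoI pooling and classification.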

Results of Faster R-CNN. Source: https://arxiv.org/pdf/1506.01497.pdf

Mask R-CNN is an extension of Faster R-CNN that adds pixel-level segmentation. In addition to the Faster R-CNN network, Mask R-CNN adds a fully convolutional network (FCN) branch on top of the CNN feature map, in parallel with the existing classification and bounding box branch. The FCN outputs a mask that indicates whether a given pixel is part of an object. To keep the features aligned with the corresponding pixel locations in the original image, bilinear interpolation is used to compute their values (the RoIAlign layer).
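
Bilinear interpolation itself is simple to write down. The sketch below samples a 2-D feature map at a fractional location by blending the four surrounding cells. It is a minimal illustration with a made-up 2x2 grid, and `bilinear_sample` is a name chosen for this example, not a Mask R-CNN API.

```python
import numpy as np

def bilinear_sample(feature_map, y, x):
    """Value of a 2-D feature map at the fractional position (y, x)."""
    h, w = feature_map.shape
    # Clamp the position so it stays inside the map.
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend along x on the two neighbouring rows, then blend along y.
    top = (1 - dx) * feature_map[y0, x0] + dx * feature_map[y0, x1]
    bottom = (1 - dx) * feature_map[y1, x0] + dx * feature_map[y1, x1]
    return (1 - dy) * top + dy * bottom

grid = np.array([[0.0, 10.0],
                 [20.0, 30.0]])
center = bilinear_sample(grid, 0.5, 0.5)  # 15.0, the average of all four cells
corner = bilinear_sample(grid, 0.0, 1.0)  # 10.0, exactly the top-right cell
```

Sampling at fractional positions avoids rounding RoI coordinates to whole feature map cells, which is what keeps the predicted mask lined up with the original pixels.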

Result of Mask R-CNN. Source: https://arxiv.org/pdf/1703.06870.pdf
