Convolutional Neural Networks (CNNs, or ConvNets) are a category of neural networks that have proven very effective in areas such as image recognition and classification. ConvNets have been successful in identifying faces, objects and traffic signs, as well as powering vision in robots and self-driving cars. The ConvNet shown in Figure 1 is built from four main operations: convolution, a non-linearity (ReLU), pooling (sub-sampling), and classification (fully connected layer).
Figure 1. Simple CNN model
These operations are the basic building blocks of every Convolutional Neural Network.
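As a rough illustration of how these building blocks chain together, here is a minimal pure-Python sketch of one forward pass. It assumes the usual four operations of a simple CNN (convolution, ReLU, max pooling, and a fully connected layer with softmax). The toy image, the edge-detecting kernel, and the weights are made-up values, not taken from any trained model.

```python
import math

def conv2d(image, kernel):
    """'Valid' 2-D convolution (implemented as cross-correlation, as most CNN libraries do)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

def relu(fmap):
    """Element-wise non-linearity: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [[max(fmap[i + a][j + b] for a in range(size) for b in range(size))
             for j in range(0, len(fmap[0]) - size + 1, size)]
            for i in range(0, len(fmap) - size + 1, size)]

def fully_connected_softmax(fmap, weights, biases):
    """Flatten the feature map, apply one dense layer, and normalise with softmax."""
    flat = [v for row in fmap for v in row]
    logits = [sum(w * x for w, x in zip(ws, flat)) + b
              for ws, b in zip(weights, biases)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy 6x6 image containing a vertical edge (illustrative values).
image = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
# Hypothetical 3x3 vertical-edge kernel.
kernel = [[-1, 0, 1] for _ in range(3)]
# Hypothetical dense-layer parameters for two classes.
weights = [[0.1] * 4, [-0.1] * 4]
biases = [0.0, 0.0]

pooled = max_pool(relu(conv2d(image, kernel)))
probs = fully_connected_softmax(pooled, weights, biases)
print(pooled, probs)
```

In a real network the kernel and dense-layer weights are learned during training rather than fixed by hand.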
The overall training process of the Convolutional Network can be summarized as below:
Total Error = ∑ ½ (target probability − output probability)²
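The total error formula can be checked with a small worked example. The sketch below assumes the sum runs over the output classes; the target and output probabilities are illustrative values only.

```python
def total_error(target, output):
    """Sum over classes of 1/2 * (target probability - output probability)^2."""
    if len(target) != len(output):
        raise ValueError("target and output must have the same length")
    return sum(0.5 * (t - o) ** 2 for t, o in zip(target, output))

# Illustrative one-hot target (third class is correct) and a network output.
target = [0.0, 0.0, 1.0, 0.0]
output = [0.2, 0.4, 0.1, 0.3]

error = total_error(target, output)
print(round(error, 4))  # 0.5 * (0.04 + 0.16 + 0.81 + 0.09) = 0.55
```

Training adjusts the network weights so that this value shrinks, which happens as the output probabilities move toward the target.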
In conclusion, a Convolutional Neural Network (CNN) works well for image classification, especially binary classification. Object detection is a different problem: an image may contain multiple objects, and in that setting the detection process of a plain CNN behaves more like a Logistic Regression. Its performance is weak and its accuracy is low, so other, more robust architectures are considered for object detection.
Region-based Convolutional Neural Network (R-CNN) is a modified version of CNN for object detection. It introduces region proposals, each corresponding to a Region of Interest (RoI), to define the boundaries of the target objects in the image and so help with object identification. Standard R-CNN requires a forward pass of the CNN (e.g. AlexNet) for every single region proposal in every single image. Three models are used for object identification: a CNN generates the image features, a Support Vector Machine (SVM) classifies the labels, and a regressor tightens the bounding box around the object.
Fast R-CNN is an improved version of R-CNN in which the region proposals are processed faster by sharing computation. It computes a convolutional feature map for the entire input image and then classifies each object proposal using a feature vector extracted from the shared feature map. Features extracted in the proposal region are pooled together and concatenated, as in spatial pyramid pooling.
However, it still has some drawbacks:
An example overview of the R-CNN procedure
Therefore, when an image is input to the R-CNN model, the model finds potential objects in the image and generates around 2000 images, each containing a different region proposal. Fast R-CNN instead computes all 2000 region proposals on one single image. This difference between Fast R-CNN and R-CNN reduces the computation time.
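The shared-computation idea can be sketched as follows: the feature map is produced once, and every proposal is then max-pooled from that same map into a fixed-size grid. This is a simplified pure-Python illustration, not the reference implementation. The function name `roi_max_pool`, the end-exclusive box convention, and the toy feature map are assumptions made for the example.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool one region of a shared feature map into an out_size x out_size grid.

    roi = (x0, y0, x1, y1) in feature-map coordinates, end-exclusive
    (a convention chosen for this sketch).
    """
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    if h <= 0 or w <= 0:
        raise ValueError("RoI must have positive height and width")
    pooled = []
    for i in range(out_size):
        # Bin rows: floor for the start, ceil for the end, so no bin is empty.
        ys = y0 + (i * h) // out_size
        ye = y0 + -(-((i + 1) * h) // out_size)
        row = []
        for j in range(out_size):
            xs = x0 + (j * w) // out_size
            xe = x0 + -(-((j + 1) * w) // out_size)
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# Stand-in for a CNN feature map computed ONCE for the whole image (toy values).
feature_map = [[r * 10 + c for c in range(6)] for r in range(4)]

# Hypothetical region proposals, already projected onto the feature map.
proposals = [(1, 0, 5, 4), (0, 0, 6, 4)]

# Every proposal reuses the same feature map; no extra CNN pass per region.
pooled_features = [roi_max_pool(feature_map, roi) for roi in proposals]
print(pooled_features)
```

Because each proposal yields a grid of the same size regardless of its shape, the pooled features can be fed to the same classifier layers.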
R-CNN and Fast R-CNN use an external region proposal module as a pre-processing step before running the CNN. The proposal algorithms are typically techniques such as EdgeBoxes or Selective Search, which are independent of the CNN layers. In Fast R-CNN, these search techniques become the processing bottleneck compared to running the CNN.
The major improvement of Faster R-CNN is the use of the Region Proposal Network (RPN), a fully convolutional network, on top of the CNN. With the region proposal network, the search for RoIs is based on the CNN feature map, which is already calculated for classification.
The RPN takes the CNN feature map as input and outputs potential bounding boxes, each with a score indicating the likelihood that an object is present within the box area. It then sends the information within each bounding box to the Fast R-CNN network for RoI pooling and classification.
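To make the "candidate boxes plus scores" idea concrete, here is a hedged sketch of the reference boxes (anchors) that the Faster R-CNN paper places at every feature-map location, using its default of 3 scales and 3 aspect ratios. Anchors are not described in the text above, and the stride of 16, the cell-centre convention, and the made-up scores are assumptions for illustration. A real RPN learns the scores and also regresses offsets that refine each anchor.

```python
import math

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x0, y0, x1, y1) anchors in image coordinates.

    One anchor per (location, scale, ratio), where ratio = height / width
    and each anchor has area of roughly scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Centre of this feature-map cell in image pixels (assumed convention).
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors

def top_proposals(anchors, scores, k):
    """Keep the k boxes with the highest objectness score."""
    ranked = sorted(zip(scores, anchors), key=lambda pair: pair[0], reverse=True)
    return [(box, score) for score, box in ranked[:k]]

anchors = generate_anchors(feat_h=2, feat_w=3)

# Made-up objectness scores; a trained RPN would predict these.
scores = [0.0] * len(anchors)
scores[1], scores[10] = 0.9, 0.7

best = top_proposals(anchors, scores, k=2)
print(len(anchors), best[0])
```

Only the highest-scoring boxes are passed on, which keeps the number of regions sent to RoI pooling and classification manageable.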
Results of Faster R-CNN. Source: https://arxiv.org/pdf/1506.01497.pdf
Mask R-CNN is an extension of Faster R-CNN that adds pixel-level segmentation. In addition to the Faster R-CNN network, Mask R-CNN adds a fully convolutional network (FCN) branch on top of the CNN feature map, in parallel with the existing classification and bounding-box branch. The FCN outputs a mask that indicates whether a given pixel is part of an object. When mapping a region to its corresponding locations on the feature map, bilinear interpolation is used to compute the values.
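The bilinear interpolation step can be sketched in a few lines. The function below reads a feature map at a fractional (y, x) location by blending the four surrounding cells. The name `bilinear_sample`, the clamping at the borders, and the toy feature map are assumptions made for this example.

```python
import math

def bilinear_sample(feature_map, y, x):
    """Interpolate feature_map at a fractional (y, x) position."""
    h, w = len(feature_map), len(feature_map[0])
    # Clamp to the valid range so border samples stay defined.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend horizontally along the top and bottom rows, then vertically.
    top = feature_map[y0][x0] * (1 - dx) + feature_map[y0][x1] * dx
    bottom = feature_map[y1][x0] * (1 - dx) + feature_map[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

# Toy 2x2 feature map (illustrative values).
fm = [[0.0, 1.0],
      [2.0, 3.0]]

centre = bilinear_sample(fm, 0.5, 0.5)    # equal blend of all four cells
corner = bilinear_sample(fm, 0.0, 1.0)    # falls exactly on a cell
off_grid = bilinear_sample(fm, 0.25, 0.75)
print(centre, corner, off_grid)
```

Sampling at fractional positions avoids rounding region coordinates to whole cells, which matters when the output is a per-pixel mask.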
Result of Mask R-CNN. Source: https://arxiv.org/pdf/1703.06870.pdf