Approaches

Convolutional Neural Network

A Convolutional Neural Network (CNN) is similar to a multilayer perceptron but is designed for reduced processing requirements. A CNN consists of an input layer, an output layer, and hidden layers that include convolutional layers, pooling layers, fully connected layers, and normalization layers.
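As a minimal sketch of the two core hidden-layer operations, convolution and pooling, consider the plain-loop implementation below (the kernel, image size, and pooling window are illustrative assumptions, not from any particular network):

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (cross-correlation) of a single-channel image."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling, roughly halving each spatial dimension."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop ragged edges
    return feature_map[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
edge_kernel = np.array([[1.0, -1.0], [1.0, -1.0]])  # simple vertical-edge filter
features = max_pool(conv2d(image, edge_kernel))
print(features.shape)  # (3, 3)
```

Stacking such convolution/pooling stages, followed by fully connected layers, is what lets the network compress an image into a class prediction.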

As the figure shows, after a series of operations a CNN produces a summarised result that tells the user what the input is (classification). When it comes to object detection, however, where an image contains multiple objects the user wants to identify, a CNN alone is insufficient because it cannot tell where the objects are.

A naive approach to this problem would be to take different regions of interest from the image and use a CNN to classify the object within each region. The problem is that objects can appear at different spatial locations and with different aspect ratios, so a huge number of regions must be scanned, which quickly becomes computationally prohibitive.
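To see the blow-up concretely, here is a quick count of the candidate windows produced by a hypothetical multi-scale, multi-aspect-ratio scan of a 640x480 image (the scales, ratios, and stride are illustrative assumptions):

```python
def count_windows(img_w, img_h, scales, ratios, stride):
    """Count all sliding-window positions across scales and aspect ratios."""
    total = 0
    for s in scales:
        for r in ratios:  # r = width / height
            w, h = int(s * r), s
            if w > img_w or h > img_h:
                continue
            total += ((img_w - w) // stride + 1) * ((img_h - h) // stride + 1)
    return total

n = count_windows(640, 480, scales=[64, 128, 256], ratios=[0.5, 1.0, 2.0], stride=8)
print(n)  # tens of thousands of windows, each needing a CNN forward pass
```

Even this modest configuration yields over twenty thousand windows, and each one would require its own CNN forward pass.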

Regions with CNN [1]

Regions with CNN (R-CNN) was the first approach proposed to improve on the naive CNN method. It first adopts selective search to extract 2,000 region proposals from the input image, which are the potential objects to be classified by the CNN.
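The pipeline can be sketched as below; `selective_search` and `cnn_classify` are hypothetical stand-ins for the real components (the actual algorithm merges similar superpixels, and the real classifier is a trained CNN with an SVM head):

```python
def selective_search(image):
    # Placeholder: the real algorithm merges similar superpixels; here we
    # just emit a fixed list of (x, y, w, h) boxes for illustration.
    return [(0, 0, 16, 16), (8, 8, 16, 16)]

def cnn_classify(crop):
    # Placeholder classifier: returns a (label, score) pair.
    return ("car", 0.9)

def rcnn_detect(image, max_proposals=2000):
    """One CNN forward pass per region proposal: the source of R-CNN's cost."""
    detections = []
    for (x, y, w, h) in selective_search(image)[:max_proposals]:
        crop = [row[x:x + w] for row in image[y:y + h]]  # crop/warp the region
        label, score = cnn_classify(crop)                # one CNN pass per region
        detections.append(((x, y, w, h), label, score))
    return detections

image = [[0] * 32 for _ in range(32)]
print(rcnn_detect(image))
```

The key structural point is the loop: the classifier runs once per proposal, so 2,000 proposals mean 2,000 forward passes per image.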

Although this is more efficient than the naive approach's unbounded number of regions, training the network still takes a long time because it must process 2,000 region proposals per input, let alone run in real time. Worse, selective search is a fixed algorithm that does not learn, so it can generate bad candidate region proposals.

Fast R-CNN [2]

After realising the critical defect of R-CNN (i.e., its computational cost), the author of R-CNN proposed an improved approach -- Fast R-CNN. It is similar to R-CNN; however, instead of generating region proposals and feeding each one to the CNN, it feeds the entire input image to the CNN once to generate a convolutional feature map. Fast R-CNN then identifies the region proposals on this feature map and passes each one to the next layer through an RoI pooling layer. From the resulting RoI feature vector, it predicts both the class and the location of the bounding box.
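The RoI pooling step can be sketched as follows: a simplified single-channel max-pool of an arbitrary region down to a fixed output grid (real implementations also handle batches, channels, and spatial-scale mapping from image to feature-map coordinates):

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=2):
    """Max-pool one RoI (x0, y0, x1, y1 on the feature map) to a fixed grid."""
    x0, y0, x1, y1 = roi
    region = feature_map[y0:y1, x0:x1]
    h, w = region.shape
    out = np.zeros((output_size, output_size))
    # Split the region into output_size x output_size bins and take each max.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            out[i, j] = region[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max()
    return out

fmap = np.arange(36, dtype=float).reshape(6, 6)
print(roi_pool(fmap, (1, 1, 5, 5)))  # [[14. 16.] [26. 28.]]
```

Because every RoI, whatever its size, is pooled to the same fixed shape, the downstream fully connected layers can consume any proposal.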

Although Fast R-CNN is faster, since it performs the convolution only once per input and derives the region proposals from the shared feature map afterwards, it still cannot run in real time: generating the large number of region proposals remains the bottleneck that limits its performance.

Faster R-CNN [3]

The fatal bottleneck for both R-CNN and Fast R-CNN is the computational cost of selective search for generating region proposals. Faster R-CNN therefore improves efficiency by eliminating selective search: it applies a separate network, the Region Proposal Network, to detect region proposals, reshapes them with an RoI pooling layer, and then classifies the content of each proposed region and predicts the locations of the bounding boxes.
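The proposal network starts from a fixed set of reference boxes (anchors) tiled over the feature map; a minimal sketch of their generation is below, with a stride-16 feature map and illustrative scales and ratios (the real network then scores and regresses each anchor):

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride=16, scales=(128, 256), ratios=(0.5, 1.0, 2.0)):
    """Centred (cx, cy, w, h) anchors, one set per feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell centre in image coords
            for s in scales:
                for r in ratios:
                    # Keep anchor area ~ s*s while varying the aspect ratio r = w/h.
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

anchors = generate_anchors(2, 3)
print(anchors.shape)  # (36, 4): 2*3 cells, each with 2 scales * 3 ratios
```

Scoring these fixed anchors with a small learned network is what replaces the slow, hand-crafted selective search.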

The process is much faster and can almost run in real time, but it may still fall short for a self-driving scenario, which requires very fast processing.

You Only Look Once [4][5]

R-CNN, Fast R-CNN, and Faster R-CNN all process an image in two steps -- classification (identifying classes) and regression (determining box sizes and locations) -- and none of them looks at the entire input image in a single pass. In contrast to these two-step, region-based approaches, You Only Look Once (YOLO) predicts bounding boxes and class probabilities for those boxes with a single convolutional network, as shown in the following figure.

YOLO first takes an input image and splits it into an SxS grid. Each cell takes m bounding boxes, and for each box the network predicts its location, its size, and the class probabilities. Each cell then keeps the class with the maximum probability, and each box's confidence is computed as Pr(class) * IOU(pred, truth), where IOU is the intersection over union between the predicted bounding box and the ground truth. Finally, non-maximum suppression removes overlapping and low-confidence boxes, and the remaining boxes form the final prediction.
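The confidence and suppression steps above can be sketched with a minimal IoU and greedy non-maximum suppression implementation (the (x0, y0, x1, y1) box format and the thresholds here are illustrative assumptions):

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_threshold=0.5, score_threshold=0.25):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    it too much, and repeat on the remainder."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = [i for i in order if scores[i] >= score_threshold]
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: box 1 overlaps box 0 too heavily
```

In the example, the second box is suppressed because its IoU with the highest-scoring box exceeds the threshold, while the distant third box survives.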

YOLO is faster than the other object detection algorithms above. One limitation is that it struggles with small objects in the image (e.g., it might place two cars in the same bounding box). However, this matters little in our case, since we mainly need to know whether there is an obstacle in the image: as long as YOLO can identify an obstacle and help the self-driving car avoid it, the limitation is tolerable.