Object Detection

1. Object detection models

Many approaches to object detection use convolutional neural networks, but the following two families of models are dominant:

1.1 SSD-MobileNet

SSD (Single Shot MultiBox Detector) fixes a set of default bounding boxes over different aspect ratios and scales at each feature map location. Predictions are made relative to these boxes.
The MobileNet backbone makes use of depthwise separable convolutions to reduce the model footprint.
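As a rough illustration of the default boxes, the sketch below tiles one square feature map with boxes of several aspect ratios. The function name `default_boxes`, the single `scale` value and the chosen aspect ratios are assumptions made for this example, and the extra intermediate-scale box that SSD adds for aspect ratio 1 is left out for brevity.

```python
from itertools import product
from math import sqrt

def default_boxes(fmap_size, scale, aspect_ratios):
    """Return (cx, cy, w, h) default boxes, normalized to [0, 1], for a square feature map."""
    boxes = []
    for row, col in product(range(fmap_size), repeat=2):
        # Center of the feature-map cell in normalized image coordinates.
        cx = (col + 0.5) / fmap_size
        cy = (row + 0.5) / fmap_size
        for ar in aspect_ratios:
            # Width and height chosen so that w * h == scale**2 and w / h == ar.
            boxes.append((cx, cy, scale * sqrt(ar), scale / sqrt(ar)))
    return boxes

# Hypothetical settings: a 3 x 3 feature map, one scale, three aspect ratios.
boxes = default_boxes(fmap_size=3, scale=0.4, aspect_ratios=(1.0, 2.0, 0.5))
print(len(boxes))  # 3 * 3 locations * 3 aspect ratios = 27
```

The network then predicts, for every default box, class scores and four offsets that adjust the box toward the object.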

1.2 YOLO

Uses a single neural network to predict bounding boxes and class probabilities directly from full images in one evaluation.
Divides the image into a coarse grid and directly predicts the class labels and candidate boxes for each grid cell.
For our experiments, we use an incremental improvement on YOLO, i.e. YOLOv3, which makes use of anchor boxes and predicts bounding boxes at 3 different scales. It is a faster version of the original YOLO.
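To make the grid idea concrete, the sketch below maps a normalized box center to the grid cell it falls in, at three grid resolutions. The helper `responsible_cell` is a hypothetical name, and the 13, 26 and 52 grid sizes assume a 416 x 416 input (strides 32, 16 and 8), which is a common YOLOv3 configuration but is not stated in these notes.

```python
def responsible_cell(cx, cy, grid_size):
    """Map a normalized box center (cx, cy in [0, 1]) to its (row, col) grid cell."""
    # Clamp so that a center exactly on the right/bottom edge stays inside the grid.
    col = min(int(cx * grid_size), grid_size - 1)
    row = min(int(cy * grid_size), grid_size - 1)
    return row, col

# Assumed grid sizes for a 416 x 416 input: coarse, medium and fine scales.
cells = {g: responsible_cell(0.5, 0.25, g) for g in (13, 26, 52)}
for grid_size, cell in cells.items():
    print(grid_size, cell)
```

The coarse grid is suited to large objects, while the finer grids give small objects a cell of comparable size.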

2. Post-processing: Non-Maximal Suppression

After getting bounding boxes, we often see that the same object is detected in multiple bounding boxes that are very similar in size and shifted by only small amounts. In such cases we need a method to select one of the boxes and reject the others.

Non-maximal suppression does this by finding, for a given bounding box, all other boxes that overlap it substantially (those whose intersection over union, IoU, exceeds a threshold), keeping the one box among this set (plus the original box) that has the highest confidence, and discarding the rest. This substantially improves the quality of the output, though at some computational cost.
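A minimal sketch of the usual greedy form of this procedure is shown below: boxes are visited in order of decreasing confidence, and each kept box suppresses the remaining boxes that overlap it too much. The `(x1, y1, x2, y2)` box format, the function names and the 0.5 threshold are assumptions for illustration, not the exact settings used in these experiments.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

# Two near-duplicate detections of one object plus one separate detection.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # [0, 2]
```

The pairwise IoU checks are where the computational cost comes from: in the worst case every box is compared with every other box. For multi-class detectors, this is typically run separately for each class.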

3. Tricks to improve runtime

3.1 Depthwise Separable convolution

Depthwise separable convolution is a form of factorized convolution that splits a standard convolution into 2 steps:

Depthwise convolution: Performs lightweight filtering by applying a single kernel per input channel, significantly reducing the number of multiplications required. This factorization also drastically reduces the model size.
Pointwise convolution: Builds new features along the channel dimension as linear combinations of the input channels. It uses a 1 x 1 kernel, which combines the outputs of the depthwise convolution.

Factorizing the convolution into these 2 stages significantly reduces the computational cost.
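The saving can be estimated by counting multiplications. The sketch below compares a standard convolution with its depthwise separable counterpart, assuming stride 1 and an output map of size h x w; the layer dimensions in the example are hypothetical.

```python
def standard_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a k x k standard convolution producing an h x w output map."""
    return k * k * c_in * c_out * h * w

def separable_conv_mults(k, c_in, c_out, h, w):
    """Multiplications for a depthwise (k x k) plus pointwise (1 x 1) convolution."""
    depthwise = k * k * c_in * h * w   # one k x k kernel per input channel
    pointwise = c_in * c_out * h * w   # 1 x 1 kernel mixing channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 56 x 56 output.
std = standard_conv_mults(3, 64, 128, 56, 56)
sep = separable_conv_mults(3, 64, 128, 56, 56)
ratio = sep / std
print(std, sep, ratio)
```

The ratio simplifies to 1/c_out + 1/k^2, so for a 3 x 3 kernel the separable version needs roughly 8 to 9 times fewer multiplications when the number of output channels is large.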

3.2 Quantizing weights and/or activations

Post-training quantization is a common technique for reducing the model size while also providing approximately 2 to 3 times lower latency, with little degradation in model accuracy.

Quantization can be done in 2 ways:

Weight-only quantization: Reduces the precision of the weights from floating point to 8-bit integers. Particularly useful if you wish to reduce the size of the model.
Weight and activation quantization: Quantizes a floating-point model to 8 bits by calculating the quantization parameters for all the quantities to be quantized. Since activations need to be quantized, one needs calibration data to estimate the dynamic ranges of the activations and scale them appropriately.
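The two variants can be sketched in plain Python as follows. This is an illustrative scheme only (symmetric per-tensor quantization for weights, asymmetric 8-bit quantization with min/max calibration for activations); the function names and the sample values are assumptions, and real toolchains may use different rounding, per-channel scales or other calibration methods.

```python
def quantize_weights(weights):
    """Symmetric per-tensor int8 quantization: w ~= scale * q, with q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def calibrate_activation_range(batches):
    """Track the smallest and largest activation seen over the calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi

def activation_qparams(lo, hi):
    """Asymmetric 8-bit parameters: x ~= scale * (q - zero_point), with q in [0, 255]."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)  # the range must contain zero
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point

# Weight-only quantization needs no data.
weights = [-0.5, 0.0, 0.25, 1.0]
q_weights, w_scale = quantize_weights(weights)

# Activation quantization needs calibration data (made-up values here).
calib_batches = [[0.0, 2.0], [-1.0, 3.0]]
act_lo, act_hi = calibrate_activation_range(calib_batches)
act_scale, act_zero_point = activation_qparams(act_lo, act_hi)
print(q_weights, w_scale, act_scale, act_zero_point)
```

Dequantizing with `scale * q` recovers each weight to within half a quantization step, which is why accuracy usually degrades only slightly.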
