This paper (SSD: Single Shot MultiBox Detector) presents an object detection method that uses a single deep neural network. The model takes a single shot to detect multiple objects within an image. It discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for each object category in each default box. Predictions from multiple feature maps with different resolutions are combined, which lets the network naturally handle objects of various sizes. SSD eliminates proposal generation and the subsequent pixel or feature resampling stages, encapsulating all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on several datasets confirm that SSD achieves accuracy competitive with methods that use an object-proposal step. Compared to other single-stage methods, SSD is faster, significantly more accurate, and provides a unified framework for both training and inference.
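The "default boxes over different aspect ratios and scales per feature map location" can be made concrete with a short sketch. The sizes and aspect ratios below are the SSD300 configuration from the paper (six feature maps, scales spaced regularly between 0.2 and 0.9); note that released implementations treat the first map's scale specially, so this is an illustrative approximation, not the exact production grid:

```python
from itertools import product
from math import sqrt

# SSD300 configuration from the paper: six square feature maps and the
# aspect ratios evaluated at each location of each map.
FMAP_SIZES = [38, 19, 10, 5, 3, 1]
ASPECT_RATIOS = [
    [1.0, 2.0, 0.5],
    [1.0, 2.0, 3.0, 0.5, 1.0 / 3.0],
    [1.0, 2.0, 3.0, 0.5, 1.0 / 3.0],
    [1.0, 2.0, 3.0, 0.5, 1.0 / 3.0],
    [1.0, 2.0, 0.5],
    [1.0, 2.0, 0.5],
]

def default_boxes(fmap_sizes=FMAP_SIZES, aspect_ratios=ASPECT_RATIOS,
                  s_min=0.2, s_max=0.9):
    """Return default boxes as (cx, cy, w, h) tuples, all in [0, 1]."""
    m = len(fmap_sizes)
    # Scales are spaced regularly between s_min and s_max; the extra 1.0
    # entry supports the sqrt(s_k * s_{k+1}) box on the last map.
    scales = [s_min + (s_max - s_min) * k / (m - 1) for k in range(m)] + [1.0]
    boxes = []
    for k, f in enumerate(fmap_sizes):
        for i, j in product(range(f), repeat=2):
            cx, cy = (j + 0.5) / f, (i + 0.5) / f  # center of each cell
            for ar in aspect_ratios[k]:
                boxes.append((cx, cy, scales[k] * sqrt(ar),
                              scales[k] / sqrt(ar)))
            # One extra aspect-ratio-1 box at scale sqrt(s_k * s_{k+1})
            s = sqrt(scales[k] * scales[k + 1])
            boxes.append((cx, cy, s, s))
    return boxes
```

With this configuration the grid produces 8732 default boxes per image, the count the paper reports for SSD300.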
Fig. 1 SSD real-time implementation.
The SSD is a purely convolutional neural network (CNN) organized into three parts: a base network (a standard image-classification backbone such as VGG-16) that produces lower-level feature maps, auxiliary convolutions stacked on top of the base that produce progressively smaller, higher-level feature maps, and prediction convolutions that locate and classify objects in those feature maps.
The paper demonstrates two variants of the model, SSD300 and SSD512, where the suffixes denote the input image size (300×300 and 512×512 pixels, respectively). Although the two networks differ slightly in construction, they are in principle the same; SSD512 is simply a larger network and yields marginally better performance.
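Either variant combines its per-feature-map predictions into one flat set of detections. A shape-level sketch for SSD300 (the feature-map sizes and per-location box counts are the paper's values; the random arrays are stand-ins for the real outputs of the prediction convolutions):

```python
import numpy as np

# (feature-map side length, default boxes per location) for SSD300
FMAPS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]

def assemble_predictions(num_classes=21, seed=0):
    """Concatenate per-map head outputs into one localization tensor of
    shape (N, 4) and one confidence tensor of shape (N, num_classes)."""
    rng = np.random.default_rng(seed)
    locs, confs = [], []
    for size, boxes_per_loc in FMAPS:
        n = size * size * boxes_per_loc
        # Stand-ins for the flattened output of each prediction convolution
        locs.append(rng.standard_normal((n, 4)))
        confs.append(rng.standard_normal((n, num_classes)))
    return np.concatenate(locs), np.concatenate(confs)
```

Summing over the six maps gives 8732 rows, so downstream steps (confidence thresholding, non-maximum suppression) operate on a single flat array regardless of which feature map a box came from.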
The SSD model with a 300×300 input image (SSD300) is more accurate than Fast R-CNN.
Fig. 4 PASCAL VOC2007 test detection results.
Fig. 5 PASCAL VOC2012 test detection results.
Fig. 6 COCO test-dev2015 detection results.
Fig. 7 Results on multiple datasets when the image-expansion data augmentation trick is used.
The following steps were followed to build, train, and test the SSD model on different datasets.
2. Test results on the VOC2012 test dataset: 75.1 mAP, versus the 75.8 mAP reported in the paper.
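Scoring a model in mAP starts from the Jaccard overlap (intersection over union) between predicted and ground-truth boxes, which decides whether a detection counts as a true positive. A minimal sketch, assuming boxes in corner format (xmin, ymin, xmax, ymax):

```python
def iou(a, b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    # Width and height of the intersection rectangle (clamped at zero)
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

In the PASCAL VOC protocol a detection is a true positive when its IoU with an unmatched ground-truth box of the same class exceeds 0.5; the same overlap is used at training time to match default boxes to ground truth.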
Fig. 9 mAP comparison of different methods.