Final results

Evaluation of YOLO model

To validate the implemented network architecture, we used the pre-trained weights, yolo-small.weights, which were trained on the VOC 2007 train/val dataset. We restored the weights from yolo-small.weights and trained the model on the VOC 2007 and VOC 2012 train/val sets for a further 100 epochs (~3 days) on an AWS cloud instance with an NVIDIA K80 GPU, configured with CUDA 9.0 and cuDNN 7.0.5. We employed the gradient descent optimizer offered by TensorFlow with a batch size of 64. We evaluated our model on the VOC 2007 test set and achieved 52% mAP; the average precision for each class is shown in Table 1. We also compared the performance of state-of-the-art object detection models with our YOLO-small model (Table 2). We further examined our model on a few example images, and the results indicate that our implementation produces inference results similar to the author's work (Figure 1). However, as shown in the left panel of Figure 1, our YOLO model could not detect all the objects. The likely reason is that we used the YOLO-small model, which has a smaller set of weights and tends to perform well only on less complex images.

Table 1. Average precision on 20 classes

Table 2. Performance of the state-of-the-art object detection models

Figure 1. Sample images detected by YOLO. Each bounding box is labeled with the predicted class and the corresponding confidence.

Parameter Experiment

Examine the testing parameters

To refine our model and achieve better performance, we measured YOLO's detection performance over a range of testing parameters, including the confidence threshold, the iou threshold, and the iou-overlap threshold, to arrive at a general strategy for parameter design in the YOLO approach.

Confidence threshold

The confidence threshold decides how high a confidence score must be for the model to accept the occurrence of an object. If the confidence score of a bounding box is greater than the confidence threshold, the model accepts that an object exists in the given bounding box. The effect of the confidence threshold is shown in Figure 2. We found the ideal range to be between 0.1 and 0.2.
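This filtering step can be sketched in a few lines of Python. The detection tuples and class names below are hypothetical examples, not output from our model:

```python
# Hypothetical detections: (class_name, confidence, box) tuples,
# where box is (x1, y1, x2, y2) in pixel coordinates.
def filter_by_confidence(detections, conf_threshold=0.15):
    """Keep only detections whose confidence exceeds the threshold."""
    return [d for d in detections if d[1] > conf_threshold]

detections = [
    ("dog",    0.82, (48, 60, 190, 300)),
    ("person", 0.12, (10, 20, 55, 120)),
    ("car",    0.05, (200, 40, 310, 150)),
]

# With a threshold of 0.1 (inside the ideal 0.1-0.2 range),
# "dog" and "person" survive while "car" is discarded.
kept = filter_by_confidence(detections, conf_threshold=0.1)
```

Raising the threshold trades recall for precision: a high value suppresses weak false positives but also drops genuinely low-confidence objects.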

Figure 2. The effects of confidence threshold

Overlapping threshold

The iou-overlap threshold decides how much the predicted bounding boxes may overlap. If the overlapping ratio is below the iou-overlap threshold, the model draws both predicted bounding boxes; otherwise, the bounding box with the lower confidence score is omitted. The overlapping ratio is computed as the intersection over union of the two boxes:

overlapping ratio = Area(B1 ∩ B2) / Area(B1 ∪ B2)

The effect of the iou-overlap threshold is shown in Figure 3. We found the best range for this value to be between 0.5 and 0.7.
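A minimal sketch of this suppression step in Python follows. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function names are ours, not from the YOLO code:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(boxes, scores, overlap_threshold=0.6):
    """Keep a box only if it does not overlap an already-kept,
    higher-confidence box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= overlap_threshold for j in keep):
            keep.append(i)
    return keep

# Two nearly identical boxes and one distant box: the lower-scoring
# duplicate is suppressed, the distant box survives.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept_indices = non_max_suppression(boxes, scores, overlap_threshold=0.6)
```

With a threshold in the 0.5 to 0.7 range found above, heavily overlapping duplicates are merged while distinct nearby objects are still reported separately.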

Figure 3. The effects of iou-overlapping threshold

IOU Threshold

The iou threshold is used in the testing phase, when the ground-truth information is known. It decides how much IOU between a predicted bounding box and a ground-truth bounding box is required for the model to accept the position of the predicted box around the given object. We evaluated a range of values for this parameter against mAP performance; the result is shown in Figure 4.
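The matching step behind this evaluation can be sketched as follows. This is our simplified illustration of the standard greedy matching used in VOC-style AP computation, not the exact evaluation code; the function names and sample boxes are ours:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(preds, gts, iou_threshold=0.5):
    """Label each prediction True (true positive) or False (false positive).

    Predictions are (confidence, box) pairs, processed in descending
    confidence order; each ground-truth box may be matched at most once.
    """
    preds = sorted(preds, key=lambda p: p[0], reverse=True)
    matched = set()
    labels = []
    for conf, box in preds:
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            v = iou(box, gt)
            if v > best_iou:
                best_j, best_iou = j, v
        if best_iou >= iou_threshold:
            matched.add(best_j)   # true positive: claims this ground truth
            labels.append(True)
        else:
            labels.append(False)  # false positive: no ground truth left to match
    return labels

# One well-placed prediction and one far from any ground truth.
preds = [(0.9, (0, 0, 10, 10)), (0.8, (100, 100, 110, 110))]
gts = [(1, 1, 11, 11)]
labels = match_detections(preds, gts, iou_threshold=0.5)
```

From these true/false-positive labels, precision and recall follow directly, and the average precision per class is the area under the resulting precision-recall curve; raising the iou threshold demands tighter localization and therefore lowers mAP.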

Figure 4. The effect of the iou threshold on mAP performance.

Examine the generalizability

To further examine the ability of our implemented YOLO-small model across possible use cases, we tested the model on several artwork images; the results, shown in Figure 5, indicate good generalizability.

Figure 5. YOLO running on artwork

Future Work

Improve classification capability by training with more data

As indicated in the original paper [1], the YOLO model can achieve 63.4% mAP. Therefore, we will continue to refine our YOLO model by further training on the VOC 2012 train/val dataset. We will also connect YOLO to a webcam and measure its real-time performance, including the time to fetch images from the camera and display the detections. In addition, we will add more classes to our training data to enable YOLO to detect more object types.

Using YOLO as a pre-process indicator

The strong performance of YOLO could make it useful as a pre-processing indicator in combination with other algorithms, such as R-CNN. Although the combination might need more time to classify each frame, it might also improve precision.

References

[1] J. Redmon, S. Divvala, R. Girshick, A. Farhadi. You only look once: unified, real-time object detection. arXiv preprint arXiv:1506.02640v5, 2016.

[2] The PASCAL Visual Object Classes Challenge 2007 http://host.robots.ox.ac.uk/pascal/VOC/voc2007/

[3] The PASCAL Visual Object Classes Challenge 2012 http://host.robots.ox.ac.uk/pascal/VOC/voc2012/

[4] https://github.com/pjreddie/darknet/wiki/YOLO:-Real-Time-Object-Detection