Methodology

Figure 1. YOLO object detection workflow

Detection Algorithm

As shown in Figure 1, YOLO divides the input image into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting the object. Each grid cell predicts B bounding boxes and the corresponding confidence scores. The confidence score is defined as Pr(Object) × IOU, where IOU is the intersection area divided by the union area of the predicted and ground-truth boxes. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, while (w, h) are the width and height relative to the whole image. The confidence prediction represents the IOU between the predicted box and any ground-truth box. In addition, each grid cell predicts C conditional class probabilities, Pr(Class_i | Object); only one set of class probabilities is predicted per grid cell, regardless of the number of boxes B. At test time, we multiply the conditional class probabilities by the individual box confidence predictions,

Pr(Class_i | Object) × Pr(Object) × IOU = Pr(Class_i) × IOU

which gives a class-specific confidence score for each box. In our implementation, we use S = 7 and B = 2. We trained and evaluated our model on the PASCAL VOC 2007/2012 datasets [2][3], which have 20 labelled classes, so C = 20.
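
The sketch below (plain Python/NumPy) illustrates this test-time computation on the S×S×(C+5B) output tensor. The function name and the assumed channel layout (20 class probabilities, then 2 box confidences, then 8 box coordinates per cell) are illustrative and may not match the exact ordering used in a given implementation.

    import numpy as np

    S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes

    def class_specific_scores(output):
        """Combine Pr(Class_i | Object) with each box's Pr(Object) * IOU confidence."""
        output = output.reshape(S, S, C + 5 * B)
        class_probs = output[..., :C]        # Pr(Class_i | Object), shape (S, S, C)
        confidences = output[..., C:C + B]   # Pr(Object) * IOU,     shape (S, S, B)
        # Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
        return class_probs[..., None, :] * confidences[..., :, None]   # (S, S, B, C)

    # Example: decode a dummy prediction and keep detections above a score threshold.
    dummy_output = np.random.rand(S * S * (C + 5 * B)).astype(np.float32)
    scores = class_specific_scores(dummy_output)
    keep = scores > 0.2   # boolean mask over (cell_row, cell_col, box, class)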

Loss Function

YOLO’s loss function must solve the object detection and classification tasks simultaneously: it penalizes incorrect object detections and localizations as well as incorrect class predictions. The function is as follows:
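
(Reproduced here in LaTeX notation from the original YOLO sum-squared-error formulation; hats denote predicted values.)

    \begin{aligned}
    \mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[(x_i-\hat{x}_i)^2 + (y_i-\hat{y}_i)^2\right] \\
    &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
    &+ \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left(C_i-\hat{C}_i\right)^2
      + \lambda_{\text{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} \left(C_i-\hat{C}_i\right)^2 \\
    &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{c \in \text{classes}} \left(p_i(c)-\hat{p}_i(c)\right)^2
    \end{aligned}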

Note that 1_i^obj equals one if an object appears in grid cell i, and 1_ij^obj equals one if the j-th bounding box predictor in grid cell i is responsible for that prediction. We also set λ_coord = 5 and λ_noobj = 0.5 to increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that do not contain objects.
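
To make the weighting concrete, here is a minimal NumPy sketch of how the indicator masks and the two λ values combine the individual terms; the argument names and tensor shapes are illustrative and do not reproduce our exact TensorFlow graph.

    import numpy as np

    lambda_coord, lambda_noobj = 5.0, 0.5

    def yolo_loss(pred_xywh, true_xywh, pred_conf, true_conf,
                  pred_cls, true_cls, obj_box, obj_cell):
        """Sum-squared-error YOLO loss (illustrative shapes).

        obj_box  : 1_ij^obj mask, shape (S, S, B), 1 where box j in cell i is responsible
        obj_cell : 1_i^obj mask,  shape (S, S),    1 where an object appears in cell i
        Box tensors are (S, S, B, 4); confidences (S, S, B); classes (S, S, C).
        """
        noobj_box = 1.0 - obj_box

        # Localization: only the responsible box in each object cell is penalized,
        # and the term is up-weighted by lambda_coord.
        xy_loss = np.sum(obj_box[..., None] * (pred_xywh[..., :2] - true_xywh[..., :2]) ** 2)
        wh_loss = np.sum(obj_box[..., None] *
                         (np.sqrt(pred_xywh[..., 2:]) - np.sqrt(true_xywh[..., 2:])) ** 2)

        # Confidence: boxes without objects are down-weighted by lambda_noobj.
        conf_obj = np.sum(obj_box * (pred_conf - true_conf) ** 2)
        conf_noobj = np.sum(noobj_box * (pred_conf - true_conf) ** 2)

        # Classification: one set of class probabilities per object cell.
        cls_loss = np.sum(obj_cell[..., None] * (pred_cls - true_cls) ** 2)

        return (lambda_coord * (xy_loss + wh_loss)
                + conf_obj + lambda_noobj * conf_noobj + cls_loss)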

Implementation of the YOLO Network Architecture

YOLO maps directly from image pixels to bounding boxes and associated class probabilities with a single neural network. The network has 24 convolutional layers followed by 3 fully connected layers. The input image is 448×448 because detection requires fine-grained visual information. The final output of the network is a 7×7×30 tensor of predictions. We implemented the full network in Python and TensorFlow. The architecture is shown in Figure 2, with one extra fully connected layer of dimension 512 for the YOLO-small model.

Figure 2. The YOLO Architecture.
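
A condensed TensorFlow 1.x sketch of the forward pass is shown below. It illustrates the overall pattern (stacked conv + max-pool blocks on a 448×448 input, followed by fully connected layers that produce the 7×7×30 output) rather than the full 24-layer stack; the per-block filter counts are placeholders.

    import tensorflow as tf  # TensorFlow 1.x API, matching our implementation

    S, B, C = 7, 2, 20
    leaky = lambda t: tf.nn.leaky_relu(t, alpha=0.1)

    def build_yolo(images, is_training=True):
        """Condensed sketch: the real model stacks 24 conv layers; only a few
        representative conv/pool blocks are shown here."""
        x = images  # shape (batch, 448, 448, 3)
        for filters in (64, 192, 256, 512, 1024, 1024):   # placeholder filter counts
            x = tf.layers.conv2d(x, filters, 3, padding='same', activation=leaky)
            x = tf.layers.max_pooling2d(x, 2, 2)          # 448 -> 7 after six pools
        x = tf.layers.flatten(x)
        x = tf.layers.dense(x, 512, activation=leaky)     # extra FC layer (YOLO-small)
        x = tf.layers.dense(x, 4096, activation=leaky)
        x = tf.layers.dropout(x, rate=0.5, training=is_training)
        x = tf.layers.dense(x, S * S * (C + 5 * B))       # linear output layer
        return tf.reshape(x, (-1, S, S, C + 5 * B))       # 7 x 7 x 30 prediction tensor

    images = tf.placeholder(tf.float32, (None, 448, 448, 3))
    predictions = build_yolo(images)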

Implementation Settings

  • Language & Library
    • Python 3.5
    • OpenCV 3.0 for image processing
    • TensorFlow 1.6 for building the CNN
  • Training Process
    • Load weights pre-trained on PASCAL VOC 2007
    • Train on VOC 2007 + 2012 train/val for 100 epochs
    • ~3 days and ~USD 60
  • Training Settings
    • AWS EC2 P2 instance (1 GPU)
    • Gradient descent optimizer (see the sketch after this list)
    • Batch size of 64
    • Learning rate 10⁻⁴ with exponential decay
    • Dropout rate of 0.5
  • Goal
    • Achieve mAP similar to that of the pretrained model
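
The optimizer configuration referenced in the list above can be set up as in the following TensorFlow 1.x sketch; the decay_steps and decay_rate values are placeholders, since the exact decay schedule is not specified here.

    import tensorflow as tf  # TensorFlow 1.x API

    batch_size = 64
    global_step = tf.train.get_or_create_global_step()

    # Learning rate 10^-4 with exponential decay; decay interval/factor are placeholders.
    learning_rate = tf.train.exponential_decay(
        learning_rate=1e-4, global_step=global_step,
        decay_steps=10000, decay_rate=0.9, staircase=True)

    optimizer = tf.train.GradientDescentOptimizer(learning_rate)
    # `loss` is the YOLO loss tensor built from the network output and the labels:
    # train_op = optimizer.minimize(loss, global_step=global_step)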