Vehicle detection has attracted substantial research interest in past decades. Existing approaches can be categorized into two classes: traditional machine learning methods and deep-learning-based methods. N. Laopracha and K. Sunat [1] used the histogram of oriented gradients (HOG) to extract features and applied a support vector machine (SVM) to detect vehicles. However, traditional methods usually involve complex steps and high computational cost, which makes them unsuitable for real-world applications. Thus, current research has turned to deep learning. Y. Gao et al. [2] adopted the Faster R-CNN method and selected better scales in the detected region to optimize detection performance. R-CNN is a widely used model for deep-learning object detection: it uses selective search to generate regions of interest and performs detection based on the region proposals. However, its detection accuracy and generalization ability need further improvement. Hence, many state-of-the-art studies use the YOLO model to enhance detection performance. YOLO, proposed by Redmon et al. in 2016 [3], is an end-to-end object detection model. J. Sang et al. [4] introduced an improved YOLOv2 model for vehicle detection. J. Lu et al. [5] presented a YOLOv3 model to detect vehicles in aerial images, which performs well on small objects, rotated objects, and compact, dense objects.
Redmon et al. proposed the end-to-end object detection method YOLO (You Only Look Once) [3], which is characterized by combining candidate-box generation and classification regression into a single step. YOLO divides the image into an S × S grid and predicts B bounding boxes and C class probabilities for each grid cell; if the center of a target falls into a grid cell, that cell is responsible for detecting the object.
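As a minimal sketch of this grid assignment (the function name and the default S = 7 are illustrative assumptions, not prescribed by [3]):

```python
def responsible_cell(x_center, y_center, img_w, img_h, S=7):
    """Return the (row, col) of the S x S grid cell containing the object's
    center; in YOLO, this cell is responsible for detecting the object."""
    col = min(int(x_center / img_w * S), S - 1)  # clamp centers on the right edge
    row = min(int(y_center / img_h * S), S - 1)  # clamp centers on the bottom edge
    return row, col
```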
The following figures present the flowchart of YOLO object detection and the bounding box. Each bounding box consists of five predictions: x, y, w, h, and object confidence. The values of w and h represent the width and height of the box, while x and y represent its center coordinates. The object confidence indicates the probability that the prediction box contains an object, which in practice is the IoU value between the prediction box and the ground-truth box; the class probability represents the probability of each object class.
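Since the object confidence is defined via IoU, a short illustration may help. The following sketch computes IoU for two boxes given in the (x, y, w, h) center format described above (the function name is our own):

```python
def iou(box_a, box_b):
    # Boxes are (x_center, y_center, w, h); convert to corner coordinates.
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # Width and height of the intersection rectangle (0 if the boxes are disjoint).
    ix = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0
```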
Fig1. Flowchart of YOLO detection [3]
Fig2. Bounding box [6]
In 2018, a new version, YOLOv3, was developed by Redmon and Farhadi [6]. Its backbone, Darknet-53, consists of 53 convolutional layers (Figure 3) and is mainly composed of convolutional and residual structures. In [6], the authors concluded that Darknet-53 offers high accuracy, fewer floating-point operations, and the fastest calculation speed. Therefore, the YOLOv3 algorithm is a practical tool for vehicle detection in our project.
Fig 3. Darknet-53 architecture [6]
Tracking with only spatial data association is considered the baseline of the tracking-by-detection approach, where the input of the tracker is the output of the detector. In the IOU tracker proposed by E. Bochinski et al. [7], detections from consecutive frames whose intersection-over-union (IOU) exceeds a threshold are associated into a track using a greedy algorithm. In [8], developed by A. Bewley et al., a Kalman filter is used to estimate the location of each tracked object from the last frame. The Kalman filter uses the detection measurements and the uncertain previous states of the tracks to estimate the current states. New detections are then assigned to the estimated tracks using the Hungarian algorithm [9]. This tracker achieves real-time speed and is called the Simple Online and Realtime Tracker (SORT).
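The greedy IoU association of [7] can be sketched as follows. This is a simplified illustration (the function names, the corner-format boxes, and the threshold value are our assumptions), omitting the track-finishing logic of the original:

```python
def iou_xyxy(a, b):
    """IoU for boxes in (x1, y1, x2, y2) corner format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_track_step(tracks, detections, sigma_iou=0.5):
    """One frame of greedy IoU association: each track takes the detection
    with the highest IoU against its last box, if above the threshold."""
    remaining = list(detections)
    for track in tracks:
        if not remaining:
            break
        best = max(remaining, key=lambda d: iou_xyxy(track['boxes'][-1], d))
        if iou_xyxy(track['boxes'][-1], best) >= sigma_iou:
            track['boxes'].append(best)
            remaining.remove(best)
    # Unmatched detections start new tracks.
    for det in remaining:
        tracks.append({'boxes': [det]})
    return tracks
```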
To reduce the effect of unreliable detections, adding appearance information to the data association can significantly increase the robustness of the tracker. N. Wojke et al. proposed Simple Online and Realtime Tracking with a Deep association metric (Deep SORT) [10], an extension of SORT in which the appearance of each new detection is compared with that of the previously tracked object in each track to aid the data association. By integrating appearance information, this model is able to track objects through longer periods of occlusion and effectively reduces the number of identity switches.
Deep SORT is one of the most widely used object tracking frameworks [10]. Its state vector extends the bounding box to eight variables (u, v, a, h, u', v', a', h'), where u and v are the center coordinates of the box, a is its aspect ratio, h is its height, and the remaining four variables are their respective velocities. The Kalman filter factors in the detection noise and uses the prior state to predict a good fit for the bounding boxes.
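A minimal sketch of the constant-velocity prediction step behind this model (the matrix shapes follow the eight-variable state above; the choice of dt = 1 frame and the example numbers are our assumptions):

```python
import numpy as np

def constant_velocity_model(dt=1.0, dim=4):
    # State: (u, v, a, h, u', v', a', h') — the observed box plus its velocities.
    F = np.eye(2 * dim)               # state transition matrix
    F[:dim, dim:] = dt * np.eye(dim)  # position block += velocity block * dt
    H = np.eye(dim, 2 * dim)          # measurement matrix: only (u, v, a, h) observed
    return F, H

F, H = constant_velocity_model()
x = np.array([100., 50., 0.5, 80., 2., -1., 0., 0.])  # example state
x_pred = F @ x  # predicted state: box center moves by its velocity
```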
To associate new detections with the predicted tracks, a distance metric (the squared Mahalanobis distance) is used to qualify the association, and the Hungarian algorithm is applied to assign the detections. Owing to the extension presented in [10], Deep SORT is able to track through longer periods of occlusion, making it a strong competitor to other current tracking algorithms. The following figure presents the framework of the Deep SORT algorithm.
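As an illustration of this assignment step, SciPy's `linear_sum_assignment` implements the Hungarian (Kuhn–Munkres) method; the cost matrix below is a toy example (in Deep SORT the entries would be gated squared Mahalanobis or appearance distances):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy cost matrix: rows = existing tracks, columns = new detections.
cost = np.array([
    [0.2, 5.0, 9.0],
    [4.0, 0.1, 6.0],
    [8.0, 7.0, 0.3],
])

# Minimum-cost one-to-one assignment over the matrix.
track_idx, det_idx = linear_sum_assignment(cost)
matches = list(zip(track_idx, det_idx))
```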
Fig 4. Framework of Deep SORT [10]
[1] N. Laopracha and K. Sunat, “Comparative study of computational time that HOG-based features used for vehicle detection,” Adv. Intell. Syst. Comput., vol. 566, no. July 2018, pp. 275–284, 2018, doi: 10.1007/978-3-319-60663-7_26.
[2] Y. Gao et al., "Scale optimization for full-image-CNN vehicle detection," arXiv, 2018.
[3] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2016-December, pp. 779–788, 2016, doi: 10.1109/CVPR.2016.91.
[4] J. Sang et al., “An improved YOLOv2 for vehicle detection,” Sensors (Switzerland), vol. 18, no. 12, 2018, doi: 10.3390/s18124272.
[5] J. Lu et al., “A Vehicle Detection Method for Aerial Image Based on YOLO,” J. Comput. Commun., vol. 06, no. 11, pp. 98–107, 2018, doi: 10.4236/jcc.2018.611009.
[6] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” arXiv. 2018.
[7] E. Bochinski, V. Eiselein, and T. Sikora, "High-speed tracking-by-detection without using image information," in Int. Workshop on Traffic and Street Surveillance for Safety and Security at IEEE AVSS 2017, Lecce, Italy, Aug. 2017.
[8] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in 2016 IEEE Int. Conf. on Image Processing (ICIP), pp. 3464–3468, 2016.
[9] H. W. Kuhn, "The Hungarian method for the assignment problem," in 50 Years of Integer Programming, 2010.
[10] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in 2017 IEEE Int. Conf. on Image Processing (ICIP), 2017, doi: 10.1109/ICIP.2017.8296962.