Multi Object Tracking

Overview

Multi-object tracking is a very popular area of computer vision research.

Given a discrete set of objects, can you track each one and its trajectory across the full length of a video? It is also one of the most complex tasks in computer vision, given the number of variables you need to account for.

"Visual object tracking is among the hardest problems in computer vision, as trackers have to deal with many challenging circumstances such as illumination changes, fast motion, occlusion, among others." (arXiv:2009.04787)


Data

For this task, we used the multi-object tracking data provided by VisDrone. The dataset included 96 sequences: 56 video sequences for training (24,201 frames in total), 7 sequences for validation (2,819 frames), and 33 sequences for testing (12,968 frames). (1)

Here is an example of a video sequence from their test dataset, captured by a drone.

1: http://aiskyeye.com/challenge_2021/multi-object-tracking-2/

uav0000201_00000_v.mp4

Research Papers


There are many research papers dealing with the VisDrone data, given that it was a very public challenge. Here is a paper that concisely summarizes the results of different methods:

https://openaccess.thecvf.com/content/ICCV2021W/VisDrone/papers/Chen_VisDrone-MOT2021_The_Vision_Meets_Drone_Multiple_Object_Tracking_Challenge_Results_ICCVW_2021_paper.pdf


One of the most successful approaches to VisDrone was created by researchers in China and utilized YOLO.

Here is the paper:

Zhang2021_Chapter_Pruned-YOLOLearningEfficientOb.pdf

STATE OF THE ART

Hungarian Algorithm


The current state of the art incorporates two major paradigms: object detection and object tracking. In each frame of the video, you have to detect every object and then associate each detection with its track from the previous frame. Open-source software like OpenCV has prebuilt algorithms to help accomplish this task because of how difficult it is. We use bounding boxes to extract a few pieces of information, including how similar the appearances are, how close the centers are on consecutive frames, and the sizes of the boxes. There are several methods that are considered state of the art right now -- one method is the Hungarian Algorithm:


Researchers typically use the Hungarian algorithm to match tracked boxes with new detections and find the optimal assignment. Below is a typical neural network architecture for how this problem can be approached and solved.


We did not end up using this approach, however it demonstrates the wide variety of ways people are attempting to solve this problem.
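For intuition, here is a minimal sketch of detection-to-track assignment with the Hungarian algorithm, using `scipy.optimize.linear_sum_assignment` on an IoU-based cost matrix. The box coordinates and the `iou` helper are invented for illustration; this is not the pipeline used in this project.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical boxes: tracks from the previous frame, detections in the current one.
tracks     = [(10, 10, 50, 50), (100, 100, 140, 140)]
detections = [(102, 98, 142, 138), (12, 11, 52, 52)]

# Cost = 1 - IoU, so the optimal assignment maximizes total overlap.
cost = np.array([[1 - iou(t, d) for d in detections] for t in tracks])
rows, cols = linear_sum_assignment(cost)
for r, c in zip(rows, cols):
    print(f"track {r} -> detection {c} (IoU = {iou(tracks[r], detections[c]):.2f})")
```

Real trackers typically combine IoU with appearance and motion cues in the cost matrix, but the assignment step itself looks the same.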

YOLO! (You Only Look Once)

YOLO is based on Darknet, a set of open-source neural networks written in C. Created by researchers at the University of Washington, YOLO applies a single neural network to the entire image, dividing the image into regions. From there, it predicts bounding boxes and class probabilities for each region, which are then weighted by the predicted probabilities. This approach is claimed to be "1000x" faster than other popular methods of object detection, like a Region-Based Convolutional Neural Network (R-CNN).
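The "weighted by the predicted probabilities" step can be sketched with a toy example: each predicted box carries an objectness score and per-class probabilities, and the final confidence is their product, filtered by a threshold. All numbers here are invented; real YOLO output also includes box coordinates and is followed by non-maximum suppression.

```python
import numpy as np

# Toy YOLO-style output for 3 predicted boxes and 2 classes (numbers invented).
objectness  = np.array([0.9, 0.4, 0.05])
class_probs = np.array([[0.8, 0.2],
                        [0.5, 0.5],
                        [0.1, 0.9]])

# Final confidence per (box, class) = objectness * class probability.
scores = objectness[:, None] * class_probs

# Keep only predictions above a confidence threshold.
threshold = 0.3
boxes, classes = np.where(scores > threshold)
for b, c in zip(boxes, classes):
    print(f"box {b}: class {c} with confidence {scores[b, c]:.2f}")
```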

The architecture of the feature detector is based on other well-known architectures like ResNet and FPN (Feature Pyramid Network), the latter developed by Facebook.

Here is a paper describing YOLO more in context and detail.

YOLOv3.pdf

Here is a visual representation of YOLOv3:

Thanks to: https://dev.to/afrozchakure/all-you-need-to-know-about-yolo-v3-you-only-look-once-e4m for this image

Results

Training:

The YOLO model we used followed the paper and was trained on the Common Objects in Context (COCO) dataset; we utilized a pretrained model.

This dataset was created by Microsoft (and others) and is one of the standard datasets for training machine learning models due to its challenging yet high-quality images and labels. Many libraries already have models trained on this data or are preconfigured to understand its format. The researchers who created it published their methodology and results in this paper:


1405.0312.pdf

Results:


Here is a sample video corresponding to the one above that highlights our results.


Here is a google drive folder highlighting our results (must be logged into UW Madison G-Suite to view them):


https://drive.google.com/drive/folders/1EzoL5VFJ-d3HAO75RSe31v352ADWbTCS?usp=sharing


As you can tell if you go through the videos, the model does a fairly good job of detecting objects and tracking them through the scene; however, we still needed an objective, numerical value to judge our results.


Metric:



Methodology:

  • For each pair of images i_truth, i_predicted:

    • Find predicted bounding boxes in i_predicted

    • Find the closest match in i_truth:

      • Define closest match to be: largest area of intersection between box_truth and box_predicted.

  • Because boxes are regions of pixels in the images, this intersection directly corresponds to overlapping regions in the original image space, and is therefore a good indicator of correctness

  • You can see that even if cars pass under the underpass and we lose sight of them, we can pick up their trajectories right after!
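The matching step above can be sketched as follows. Boxes are (x1, y1, x2, y2) in pixels, and the coordinates below are invented for illustration:

```python
def intersection_area(a, b):
    """Area of overlap between two boxes given as (x1, y1, x2, y2)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def closest_match(pred_box, truth_boxes):
    """Ground-truth box with the largest intersection with the prediction."""
    return max(truth_boxes, key=lambda t: intersection_area(pred_box, t))

# Hypothetical frame: two ground-truth boxes, one predicted box.
truth = [(0, 0, 40, 40), (100, 100, 160, 160)]
pred  = (98, 103, 158, 161)

match = closest_match(pred, truth)
print(match, intersection_area(pred, match))
```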

uav0000297_02761_v.mp4

ORIGINAL VIDEO


uav0000297_02761_v.mp4

PREDICTIONS


Final Results

As you can see, our model did a fairly good job. The average overlap between ground-truth boxes and their matched predictions was 67%, and the model also predicted the correct number of objects in most cases; however, a few frames with spectacularly low scores skew the results downward.

Most of the resources I encountered said that anything over 50% is quite good.

Problems Encountered

The first problem I encountered on this project was how to start -- I was initially intimidated by the breadth of literature on the topic, and the fact that nearly every paper began with a description of how hard this task is to accomplish. However, I am grateful for the many online resources that helped me get started (most of them linked above).


The second problem I encountered was with my first iteration of the project. I initially attempted to use OpenCV's toolbox for multi-object tracking. OpenCV is a set of open-source tools meant to help researchers and the general public accomplish common computer vision tasks through a set of standard APIs. However, my attempts yielded very poor results and even poorer throughput: the videos processed too slowly to make any sense of the output, and I therefore had to pivot my approach. This can be seen on the branch multiObjectTrackRoshan.


That led me to YOLOv3, which is a complicated topic, and I spent several days trying to understand the literature and how to approach the project. In doing so, however, I believe I arrived at an approach that very closely emulates the research papers above!

Code