For this project, I wrote an application to perform computer vision tasks including localization and tracking of an object. As mentioned earlier, this involves three basic steps: learning, localization, and tracking.
Learning is performed before tracking begins. Generally, a set of positive training images (e.g., people) and a set of negative training images (e.g., background scenery) is fed into a machine learning system, which builds a model of the training images. The system is then able to analyze images from video and classify them. For example, a machine learning system might be trained with images of pedestrians; images from a video containing pedestrians would then be fed into the system, which would determine which of those images depict pedestrians.
This project simplifies the learning process by building a model of only a single image. A target feature is selected from the video input, and the application generates a model that is used to locate and track the target throughout the video.
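As a rough sketch of this step (assuming NumPy image arrays, a user-selected bounding box, and the hog_descriptor generator sketched later in this report), building the model amounts to cropping the selected region and computing its descriptor:

    def build_model(first_frame, x, y, w, h):
        # The "model" is simply the descriptor of the user-selected region of
        # the first frame; hog_descriptor is sketched later in this report.
        target_patch = first_frame[y:y + h, x:x + w]
        return hog_descriptor(target_patch)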
Localization and detection were the focus of this project. Localization, determining the location of a target feature in the video, involves two steps.
Searching. My implementation uses a sliding window technique to search video images. Below is a diagram that depicts the sliding window algorithm in image processing:
The window, or search location, is swept across the image. The image data in each window location is compared to the target feature. Whichever window matches most closely is determined to be the location of the target feature.
Image Credit: Miriam Leeser and Haiqian Yu
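A minimal sliding-window sketch in Python with NumPy illustrates the idea; the window size, step, and scoring function are generic parameters here rather than values taken from my implementation:

    import numpy as np

    def sliding_windows(image, win_w, win_h, step):
        # Yield (x, y, patch) for every window position in the image.
        img_h, img_w = image.shape[:2]
        for y in range(0, img_h - win_h + 1, step):
            for x in range(0, img_w - win_w + 1, step):
                yield x, y, image[y:y + win_h, x:x + win_w]

    def best_window(image, win_w, win_h, step, score_fn):
        # Score every window against the target and keep the best match.
        best_xy, best_score = None, -np.inf
        for x, y, patch in sliding_windows(image, win_w, win_h, step):
            score = score_fn(patch)
            if score > best_score:
                best_xy, best_score = (x, y), score
        return best_xy, best_score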
Classification/Detection. When performing the sliding window search, the application must compare the image data in the window to the target feature. As the appearance of the target feature in different video frames may not be exactly the same, a direct comparison is not useful. Rather, the application uses an image descriptor called a Histogram of Oriented Gradients (HOG). To compare the HOG descriptor of each window to the HOG descriptor of the target feature (the model discussed in the previous section), the application finds their Bhattacharyya coefficient. More information on the HOG descriptor is presented in the next section.
Tracking is the process of predicting and determining the movement of a target feature through the video. This not only eliminates the need to run the computationally costly sliding window search on every frame, but also allows a computer vision system to distinguish between multiple targets that match the same model. This project uses a simple localized search to perform tracking.
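One way to express such a localized search (a sketch reusing the best_window helper above; the search_radius parameter is an illustrative assumption, not a value from my implementation) is to clamp the sliding window search to a neighborhood around the target's previous location:

    def track_step(frame, prev_x, prev_y, win_w, win_h, step, score_fn,
                   search_radius=32):
        # Restrict the search to a region around the previous location.
        img_h, img_w = frame.shape[:2]
        x0, y0 = max(0, prev_x - search_radius), max(0, prev_y - search_radius)
        x1 = min(img_w, prev_x + win_w + search_radius)
        y1 = min(img_h, prev_y + win_h + search_radius)
        (bx, by), score = best_window(frame[y0:y1, x0:x1],
                                      win_w, win_h, step, score_fn)
        return x0 + bx, y0 + by, score   # back to full-frame coordinates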
The HOG descriptor generator divides the image into a grid of sub-regions to better describe its features. Histograms are computed for each of the sub-regions and then concatenated to form the final histogram used for comparison.
To generate a histogram, the application iterates through each pixel of all three color channels in the input image. It ignores edge pixels because their gradients cannot be calculated and they are unlikely to contain important data. Each pixel's gradients are used to calculate its orientation and magnitude; a pixel's gradients are expressed as the differences between the intensities of the pixels horizontally and vertically adjacent to it.
Figure 1 illustrates the pixels used for gradient calculation. The horizontal gradient of the current pixel is expressed as the difference between x2 and x1, and the vertical gradient as the difference between y2 and y1. The orientation of each pixel is calculated using equation 1.
In equation 1, gx is the vertical gradient of the pixel and gy is the horizontal gradient. The orientation of the pixel selects the orientation bin into which the pixel's magnitude is added; the magnitude is expressed by equation 2.
The generator computes the orientation and magnitude of every pixel in the image and adds each magnitude to the appropriate bin of the histogram belonging to the sub-region that contains that pixel.
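A sketch of this histogram generation in Python/NumPy is given below. The cell grid, the number of orientation bins, and the use of unsigned orientations are illustrative choices rather than values taken from equations 1 and 2; dx and dy correspond to the x2 - x1 and y2 - y1 differences of Figure 1:

    import numpy as np

    def hog_descriptor(patch, grid=(4, 4), bins=8):
        # Returns one concatenated histogram per color channel.
        patch = patch.astype(np.float64)
        descriptor = []
        for c in range(patch.shape[2]):
            chan = patch[:, :, c]
            # Central differences; edge pixels are skipped, as described above.
            dx = chan[1:-1, 2:] - chan[1:-1, :-2]     # x2 - x1 (horizontal)
            dy = chan[2:, 1:-1] - chan[:-2, 1:-1]     # y2 - y1 (vertical)
            magnitude = np.sqrt(dx ** 2 + dy ** 2)
            orientation = np.arctan2(dy, dx) % np.pi  # unsigned, in [0, pi)
            bin_idx = np.minimum((orientation / np.pi * bins).astype(int),
                                 bins - 1)
            cell_h = magnitude.shape[0] // grid[0]
            cell_w = magnitude.shape[1] // grid[1]
            histograms = []
            for cy in range(grid[0]):
                for cx in range(grid[1]):
                    rows = slice(cy * cell_h, (cy + 1) * cell_h)
                    cols = slice(cx * cell_w, (cx + 1) * cell_w)
                    # Magnitude-weighted orientation histogram for this cell.
                    histograms.append(np.bincount(
                        bin_idx[rows, cols].ravel(),
                        weights=magnitude[rows, cols].ravel(),
                        minlength=bins))
            descriptor.append(np.concatenate(histograms))
        return descriptor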
The Bhattacharyya coefficient is crucial to the algorithm, as it provides an effective means to compare two histogram-based descriptors. It is expressed as equation 3.
In equation 3, a and b are the histograms under comparison and i is the bin index.
The Bhattacharyya coefficient is a measure of the similarity between two discrete sets of data, in this case histograms. The HOG descriptors contain three separate histograms for the three color channels of the image feature they represent. The overall Bhattacharyya coefficient across all three color channels is given as the mean of the per-channel coefficients.
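In code, the comparison can be sketched as follows, assuming each descriptor is the list of three per-channel histograms produced by the generator above. The histograms are normalized here so that identical distributions yield a coefficient of 1; that normalization is an assumption on my part, not something stated by equation 3:

    import numpy as np

    def bhattacharyya(hist_a, hist_b):
        # Equation 3: sum over bins of the square root of the bin products.
        a = hist_a / (hist_a.sum() + 1e-12)   # normalization is an assumption
        b = hist_b / (hist_b.sum() + 1e-12)
        return float(np.sum(np.sqrt(a * b)))

    def descriptor_similarity(desc_a, desc_b):
        # Mean Bhattacharyya coefficient over the three color channels.
        return float(np.mean([bhattacharyya(a, b)
                              for a, b in zip(desc_a, desc_b)]))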
The Bhattacharyya coefficient is used during localization. The coefficient between the descriptor of the image data in each window location and the target model is recorded, and the window with the highest coefficient is considered to be the target.
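Putting the pieces together, one localization pass over a frame (reusing the hypothetical helpers sketched above, with an arbitrary window step) reduces to scoring every window against the target model and keeping the best one:

    def localize(frame, target_descriptor, win_w, win_h, step=8):
        # Window with the highest Bhattacharyya coefficient against the model.
        score = lambda patch: descriptor_similarity(hog_descriptor(patch),
                                                    target_descriptor)
        return best_window(frame, win_w, win_h, step, score)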