cu-BRIEF
Abhinit Modi: abhinitm@andrew.cmu.edu Computer Science Department
Luis Fernando Fraga Gonzalez: lfragago@andrew.cmu.edu Computer Science Department
Implement an efficient keypoint detection and description algorithm and use it in an application that tracks logos on moving automobiles in real time (30 fps). Develop the entire pipeline from scratch on the GPU to allow custom optimizations, yielding a keypoint tracker that is largely independent of OpenCV.
Keypoints are spatial locations, or points in an image, that define what is interesting or what stands out. No matter how the image changes, whether it rotates, shrinks or expands, is translated, or is subject to distortion, we should be able to find the same keypoints. Matching keypoints is a common way to match two objects across images. The pipeline has three major stages: keypoint detection, keypoint description, and descriptor matching. Here is a high-level view of the algorithm we have adopted.
Detection
We use the difference-of-Gaussians (DoG) method to locate keypoints in an image: the image is smoothed with Gaussians at two nearby scales, the smoothed images are subtracted, and pixels where the response is a local extremum above a threshold are taken as keypoints.
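The detection stage can be sketched on the CPU as follows. This is a reference sketch, not the project's actual GPU code; function names, parameters, and the border-clamping policy are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <utility>
#include <vector>

// CPU reference sketch of difference-of-Gaussians keypoint detection.
// The GPU version performs the same steps, one thread per pixel.

using Image = std::vector<float>; // row-major W*H grayscale

static Image gaussian_blur(const Image& src, int W, int H, float sigma) {
    // Build a normalized 1D Gaussian kernel of radius ~3*sigma.
    int r = (int)std::ceil(3.0f * sigma);
    std::vector<float> k(2 * r + 1);
    float sum = 0.0f;
    for (int i = -r; i <= r; ++i) {
        k[i + r] = std::exp(-0.5f * i * i / (sigma * sigma));
        sum += k[i + r];
    }
    for (float& v : k) v /= sum;

    // Separable convolution: horizontal pass, then vertical pass.
    Image tmp(W * H, 0.0f), dst(W * H, 0.0f);
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                int xi = std::min(std::max(x + i, 0), W - 1); // clamp at borders
                acc += k[i + r] * src[y * W + xi];
            }
            tmp[y * W + x] = acc;
        }
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x) {
            float acc = 0.0f;
            for (int i = -r; i <= r; ++i) {
                int yi = std::min(std::max(y + i, 0), H - 1);
                acc += k[i + r] * tmp[yi * W + x];
            }
            dst[y * W + x] = acc;
        }
    return dst;
}

// Keypoints are pixels where the DoG response is a strict local
// extremum in its 3x3 neighborhood and exceeds a threshold.
std::vector<std::pair<int, int>> detect_dog(const Image& img, int W, int H,
                                            float s1, float s2, float thresh) {
    Image a = gaussian_blur(img, W, H, s1), b = gaussian_blur(img, W, H, s2);
    Image dog(W * H);
    for (int i = 0; i < W * H; ++i) dog[i] = a[i] - b[i];

    std::vector<std::pair<int, int>> pts;
    for (int y = 1; y < H - 1; ++y)
        for (int x = 1; x < W - 1; ++x) {
            float v = dog[y * W + x];
            if (std::fabs(v) < thresh) continue;
            bool extremum = true;
            for (int dy = -1; dy <= 1 && extremum; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    if (dx == 0 && dy == 0) continue;
                    float n = dog[(y + dy) * W + (x + dx)];
                    if ((v > 0 && n >= v) || (v < 0 && n <= v)) { extremum = false; break; }
                }
            if (extremum) pts.emplace_back(x, y);
        }
    return pts;
}
```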
Description
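Our descriptors are BRIEF-style binary strings: each bit records an intensity comparison between a fixed pair of offsets inside a patch around the keypoint. Below is a minimal CPU sketch; the random sampling pattern, helper names, and sizes are illustrative (real BRIEF uses a fixed precomputed pattern and smooths the image first).

```cpp
#include <cstdint>
#include <random>
#include <vector>

// Sketch of BRIEF-style binary description: bit i of the descriptor is
// 1 iff the pixel at offset pair (dx1,dy1)[i] is darker than the pixel
// at (dx2,dy2)[i], both measured relative to the keypoint.

struct PairPattern { std::vector<int> dx1, dy1, dx2, dy2; };

// Generate a fixed pattern of `bits` offset pairs inside a
// (2*half+1) x (2*half+1) patch. Seeded so the pattern is reproducible.
PairPattern make_pattern(int bits, int half, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> d(-half, half);
    PairPattern p;
    for (int i = 0; i < bits; ++i) {
        p.dx1.push_back(d(rng)); p.dy1.push_back(d(rng));
        p.dx2.push_back(d(rng)); p.dy2.push_back(d(rng));
    }
    return p;
}

// Describe keypoint (x, y). The caller must ensure the keypoint lies at
// least `half` pixels away from the image border.
std::vector<uint64_t> describe(const std::vector<float>& img, int W,
                               int x, int y, const PairPattern& p) {
    int bits = (int)p.dx1.size();
    std::vector<uint64_t> desc((bits + 63) / 64, 0);
    for (int i = 0; i < bits; ++i) {
        float a = img[(y + p.dy1[i]) * W + (x + p.dx1[i])];
        float b = img[(y + p.dy2[i]) * W + (x + p.dx2[i])];
        if (a < b) desc[i / 64] |= (uint64_t)1 << (i % 64);
    }
    return desc;
}
```

Because the descriptor is just packed comparison bits, it is cheap to compute per keypoint and very cheap to match (see the Hamming-distance matcher below in the pipeline).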
Matching
We used four different benchmark images (containing different numbers of keypoints) at eight different scales: 32x32, 64x64, 128x128, 256x256, 512x512, 1024x1024, 2048x2048, and 4096x4096.
Accuracy was verified by comparing the keypoint coordinates obtained for the data set against OpenCV's results.
After profiling the baseline version and a crude implementation, we found that keypoint detection was the slowest stage and hence the bottleneck in the pipeline.
Premature Optimization
Pitfall: too many of the computations performed were redundant. Values that had already been computed and fetched were fetched and computed again instead of being reused.
Max Speed up: 1.6x.
An interesting observation is the drop in speedup for images larger than 512x512 pixels: these no longer fit in the cache and therefore require main-memory reads and writes.
Naïve implementation
Max Speed up: 14x
Leveraging locality
Max Speed up: 18x
Loop Unrolling
The convolution and some of the loops in the CUDA kernel were optimized using loop unrolling; an unroll factor of 2 was found to be the most efficient.
Max Speed up: 21x
Adopting the convolution optimizations used in Halide.
Max Speed up: 25x
Shared Memory
Workload imbalance
Max Speed up: 70x
Reused the CUDA feature matcher provided by OpenCV, which matches binary descriptors by Hamming distance.
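Conceptually, the matcher is a brute-force nearest-neighbor search under Hamming distance (XOR the packed descriptors, count the differing bits). The CPU sketch below mirrors that behavior; the function names are illustrative, not OpenCV's API.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

using Desc = std::vector<uint64_t>; // e.g. a 256-bit descriptor = 4 words

// Hamming distance between two packed binary descriptors of equal length.
int hamming(const Desc& a, const Desc& b) {
    int d = 0;
    for (size_t i = 0; i < a.size(); ++i)
        d += (int)std::bitset<64>(a[i] ^ b[i]).count(); // bits that differ
    return d;
}

// Brute-force match: for each query descriptor, the index of the train
// descriptor at minimum Hamming distance (-1 if train is empty).
std::vector<int> match(const std::vector<Desc>& query,
                       const std::vector<Desc>& train) {
    std::vector<int> best(query.size(), -1);
    for (size_t q = 0; q < query.size(); ++q) {
        int bestDist = 1 << 30;
        for (size_t t = 0; t < train.size(); ++t) {
            int d = hamming(query[q], train[t]);
            if (d < bestDist) { bestDist = d; best[q] = (int)t; }
        }
    }
    return best;
}
```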
Below are the frame rates we measured for logo detection on videos. The videos can be found here
Real-time processing requires 30 frames per second. Since we are able to process at 62 frames per second, we can potentially detect and process keypoints in HD videos in real time.
The above graphs depict the speedups obtained by incrementally adding optimization techniques. For this application, workload balancing gives the largest speedup.
https://github.com/opencv/opencv
https://github.com/opencv/opencv_contrib/
https://gilscvblog.com/2013/08/26/tutorial-on-binary-descriptors-part-1/
http://opencv.org/platforms/cuda.html
Numerous articles on key point detectors and descriptors.
Both members contributed equally to the project.