CS240A - Applied Parallel Computing, Homework I


This webpage is part of Homework I for the CS240A course at UCSB. I, Sanmay Patil, am a first-year master's student in the Department of Computer Engineering. The problem I have chosen is analyzing implementations of the celebrated Viola-Jones face detection algorithm on the GPU platform to speed up processing.


Generic face detection finds applications in various fields today, such as surveillance, biometrics, augmented reality, human-computer interaction, and entertainment. The majority of these applications are time-sensitive and need continuous interaction with users, which forces the application to perform in real time. However, a single-threaded CPU implementation of face detection consumes a lot of time and, despite various optimizations, performs poorly in real time. One possible solution is dedicated FPGA hardware, which we generally find in advanced cameras nowadays, but it consumes more power and is a costly affair. With the advent of general-purpose GPUs (GPGPU) and growing support for parallel programming languages like CUDA and OpenCL, it has become possible to use the GPU for general computational tasks. Here I briefly describe the basic Viola-Jones face detector and some approaches currently present in the literature that implement this algorithm on the GPU by exploiting some of the parallel aspects of the algorithm.

Viola-Jones Face Detector

1. Haar Features
The features used for face detection are called Haar features and are basically Haar wavelets. They are calculated over regions of adjacent rectangles. The feature value is given by the difference between the sum of pixel intensities in the bright regions and the sum of pixel intensities in the dark regions. If the feature value is above a predefined threshold, that feature is said to be present in the search window.
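As a small illustration of this idea, the sketch below evaluates a hypothetical two-rectangle feature (left half bright, right half dark) on a tiny made-up patch; the pixel values and the threshold are illustrative, not trained values:

```python
# Sketch of a two-rectangle Haar feature: value = sum(bright) - sum(dark).
# The patch and the threshold below are made up for illustration only.

patch = [
    [200, 210, 40, 30],
    [205, 215, 35, 25],
    [198, 220, 45, 20],
]

def region_sum(img, x0, y0, w, h):
    """Sum of pixel intensities in the w-by-h rectangle at (x0, y0)."""
    return sum(img[y][x] for y in range(y0, y0 + h)
                         for x in range(x0, x0 + w))

bright = region_sum(patch, 0, 0, 2, 3)   # left half of the patch
dark   = region_sum(patch, 2, 0, 2, 3)   # right half of the patch
feature_value = bright - dark

threshold = 500                          # illustrative, not a trained value
present = feature_value > threshold
print(feature_value, present)
```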

2 Integral Images
Computing intensity sums over a rectangular region from scratch each time would be time-consuming, so integral images are used to speed up the computation. The integral value at each pixel is the sum of all the pixels above it and to its left. Starting at the top left and traversing to the right and down, the entire image can be integrated with a few integer operations per pixel.
After integration, the value at each pixel location (x,y) contains the sum of all pixel values within the rectangular region that has one corner at the top left of the image and the other at location (x,y). Suppose we want the sum of the values in region D: we can think of it as the sum of pixel values in the combined rectangle A+B+C+D, minus the sums in rectangles A+B and A+C, plus the sum of pixel values in A. In other words, D = (A+B+C+D) - (A+B) - (A+C) + A.
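A minimal sketch of this, with a toy 3×3 image: build the integral image, then recover an arbitrary rectangle sum with the four-corner lookup described above (a one-pixel zero border keeps the lookups branch-free):

```python
# Sketch: build an integral image, then recover a rectangle sum with
# four lookups, the D = (A+B+C+D) - (A+B) - (A+C) + A identity.

def integral_image(img):
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]  # zero border on top/left
    for y in range(h):
        row = 0
        for x in range(w):
            row += img[y][x]                    # running row sum
            ii[y + 1][x + 1] = ii[y][x + 1] + row
    return ii

def rect_sum(ii, x, y, w, h):
    """Sum of img[y:y+h][x:x+w] via four integral-image lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = integral_image(img)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```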

3. Cascade Architecture
The algorithm uses a series of classifiers to check whether particular features are present in a given search window. However, checking for hundreds of features would be computationally intensive. Therefore, a cascade architecture of classifiers is used: early stages have a possibly high number of false positives but very few false negatives, and are followed by stronger classifiers. If a window fails at any stage it is discarded as a non-face, and image regions that pass through all stages in the chain are classified as "Face."
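The control flow of the cascade can be sketched as follows; the two stage functions here are toy stand-ins (a cheap filter followed by a stricter one), not the trained Viola-Jones stages:

```python
# Sketch of the cascade idea: a window is rejected at the first failing
# stage, so cheap early stages discard most non-faces before the
# expensive later stages run. The stages below are hypothetical.

def classify_window(window, stages):
    for stage in stages:
        if not stage(window):
            return False          # rejected early: no further work
    return True                   # survived every stage: "Face"

stages = [
    lambda w: sum(w) / len(w) > 10,    # fast, weak filter (many false positives)
    lambda w: max(w) - min(w) > 50,    # stronger, costlier check
]

print(classify_window([20, 90, 30], stages))   # passes both stages
print(classify_window([1, 2, 3], stages))      # rejected by stage 1
```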

GPU Implementation
Here, we will briefly look at the CUDA implementation of Viola-Jones from "Towards a Robust, Real-time Face Processing System using CUDA-enabled GPUs" by Sharma et al. They mainly targeted the NVIDIA GeForce GTX 285 GPU, which features 30 multiprocessors, 16 KB of shared memory per multiprocessor, and 1 GB of device memory. There can be a maximum of 512 threads per block and 1024 active threads per multiprocessor. As described previously, face detection mainly consists of calculating integral images for fast feature evaluation and detecting faces using a cascade of classifiers. Each of these tasks is parallelized and run as a kernel on the GPU.

1. Integral Images
Despite the optimization above, computing integral images becomes a cumbersome task as the frame size increases. Therefore, to parallelize this calculation, the integral computation is broken down into a horizontal prefix sum followed by a vertical prefix sum on the image data. These two steps are implemented as two kernels.

For the horizontal prefix sum, threads compute the prefix sums of different rows of the image in parallel. Each thread adds the previous pixel value along its row to the current pixel and moves forward until the last pixel in the row is reached. The pixel values are fetched from textures, which results in high performance. The number of threads active at a time is bounded by the number of rows and the total number of threads spawned.

The vertical prefix sum is computed on the output of the horizontal pass. It is similar to the horizontal prefix sum, except that threads compute column prefix sums in parallel: each thread adds the previous pixel value along its column to the current pixel and moves forward until the last pixel in the column is reached.
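The two-pass scheme can be sketched on the CPU as follows, with Python threads standing in for CUDA threads (one worker per row, then one per column); the GPU kernels of course differ in the details, so this only shows the data decomposition:

```python
# CPU sketch of the two-kernel scheme: a horizontal prefix sum with one
# worker per row, then a vertical prefix sum with one worker per column.
# Python threads stand in for CUDA threads here; the math is the same.

from threading import Thread

def h_prefix_row(img, y):
    for x in range(1, len(img[y])):
        img[y][x] += img[y][x - 1]       # running sum along the row

def v_prefix_col(img, x):
    for y in range(1, len(img)):
        img[y][x] += img[y - 1][x]       # running sum down the column

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

# Pass 1: horizontal prefix sums, rows in parallel.
rows = [Thread(target=h_prefix_row, args=(img, y)) for y in range(len(img))]
for t in rows: t.start()
for t in rows: t.join()

# Pass 2: vertical prefix sums on the output of pass 1, columns in parallel.
cols = [Thread(target=v_prefix_col, args=(img, x)) for x in range(len(img[0]))]
for t in cols: t.start()
for t in cols: t.join()

print(img)   # now the integral image of the original data
```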
2. Cascade of Classifiers
The cascaded detection process is parallelized by letting multiple threads simultaneously compute the feature values and scores for sub-windows at different locations of the image and at different scales. For example, two threads, thread 0 and thread 1, extract sub-windows at different locations and compute their scores independently. Sub-windows are extracted with a step size of 4 pixels along the width and 2 pixels along the height. For fast feature evaluation, the previously computed integral images are used. Both the integral images and the features are stored in and retrieved from textures to enhance performance. The cascades are initially stored in textures and transferred to shared memory for faster access.
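The sub-window grid the threads divide up can be sketched as below; the image and window sizes are illustrative, but the 4×2 step sizes are the ones stated above. Each (x, y) origin would be scored by a separate thread:

```python
# Sketch of the sub-window grid: origins stepped by 4 pixels along the
# width and 2 along the height, one sub-window (and one thread) per
# origin. The image and window dimensions below are made up.

def subwindow_origins(img_w, img_h, win_w, win_h, step_x=4, step_y=2):
    return [(x, y)
            for y in range(0, img_h - win_h + 1, step_y)
            for x in range(0, img_w - win_w + 1, step_x)]

origins = subwindow_origins(img_w=32, img_h=24, win_w=20, win_h=20)
print(len(origins), origins[:4])   # one entry per thread's sub-window
```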

Results and Discussion

For this implementation, performance results were obtained by running the application on 1) a static image data set and 2) live frames. For the static set, they used the standard CMU data set. The following graph shows the performance comparison between the CPU-only version and the GPU-enabled face detector. Images at various resolutions (320 × 240, 640 × 480, and 1280 × 960) were taken from the CMU dataset, along with live frames from the camera. The results show the average detection speed (fps) at each resolution. We can see that the CUDA-based implementation is 12 to 38 times faster than the CPU version and scales much better even at higher resolutions.

I have also come across the following analysis of face detection on different NVIDIA GeForce cards by Anton Obukhov of NVIDIA.


References

1. Robust Real-Time Object Detection by Paul Viola and Michael Jones
2. Towards a Robust, Real-time Face Processing System using CUDA-enabled GPUs by Bharatkumar Sharma, Rahul Thota, Naga Vydyanathan, and Amit Kale
3. Accelerating Viola-Jones Face Detection to FPGA-Level using GPUs by Daniel Hefenbrock, Jason Oberg, Nhat Tan Nguyen Thanh, Ryan Kastner, and Scott B. Baden