Archived Projects

Viola & Jones Face Detector
We intend to accelerate the Viola & Jones [1] detection algorithm, which is well known for face detection. When published in 2001, the detector was shown to run at ~15 frames per second on 384x288 video on a 700 MHz Intel Pentium 3. It has been integrated into the OpenCV library and as a result is widely used. Unfortunately, running the algorithm at higher resolutions (VGA or above) results in low frame rates. The algorithm has previously been accelerated on an FPGA [2] as a standalone design; that work was extensive and resulted in only a modest improvement in frame rate. We hope to build a better-performing accelerated detector using the same hardware, in less time.

The majority of the algorithm's run time is spent calculating feature values. A trained face classifier, such as the one in the OpenCV library, has over 2200 features, and each feature must be extracted from every candidate window. The Viola & Jones attentional cascade limits the time spent on feature extraction by grouping the features into stages and evaluating the stages sequentially: a candidate window is rejected as soon as it fails a stage, so the later, larger stages are only evaluated for the few windows that survive the early ones.
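
The cascade's early-exit behavior is easiest to see in code. The following is a minimal C++ sketch of evaluating one candidate window; the structures are illustrative (OpenCV's actual data layout differs, and the variance normalization of feature values is omitted).

    #include <vector>

    struct Rect { int x, y, w, h; float weight; };

    struct Feature {
        std::vector<Rect> rects;   // each Haar feature is 2-3 weighted rectangles
        float threshold, failVal, passVal;
    };

    struct Stage {
        std::vector<Feature> features;
        float stageThreshold;      // learned rejection threshold for this stage
    };

    // Sum of pixels inside r via four integral-image lookups (ii is (H+1)x(W+1)).
    static float rectSum(const std::vector<std::vector<double>>& ii, const Rect& r) {
        return static_cast<float>(ii[r.y + r.h][r.x + r.w] - ii[r.y][r.x + r.w]
                                - ii[r.y + r.h][r.x] + ii[r.y][r.x]);
    }

    // Returns true only if the window survives every stage; most windows exit early.
    bool evalCascade(const std::vector<Stage>& cascade,
                     const std::vector<std::vector<double>>& ii) {
        for (const Stage& s : cascade) {
            float stageSum = 0.0f;
            for (const Feature& f : s.features) {
                float v = 0.0f;
                for (const Rect& r : f.rects) v += r.weight * rectSum(ii, r);
                stageSum += (v < f.threshold) ? f.failVal : f.passVal;
            }
            if (stageSum < s.stageThreshold) return false;  // rejected: stop here
        }
        return true;  // window is a candidate face
    }

The key point is the early return: rejected windows never pay for the expensive later stages.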

On an FPGA, one can parallelize the feature extraction, but not for all 2200+ features, as resources are limited; some of the features must be calculated sequentially. This sequential processing presents two problems. First, the features cannot be calculated on streaming data: the video frame must be buffered and then read back out slowly, using more resources. Second, because sequential hardware evaluates a different feature, and hence a different window access pattern, on each pass, no access optimizations can be made at design time. This window multiplexing ends up being the bottleneck on smaller FPGAs due to the maximum MUX size that can be synthesized.

[Figure: Features Rejected vs. Stages Processed]



To achieve a fully parallel design that can run at the data rate, we limit the number of features calculated in parallel. Specifically, we calculate only the first N stages in parallel, where N is selected at design time. The graph above plots the percentage of rejected candidate windows as a function of the number of cascade stages processed: after stage 4, 87% of the candidate windows have been rejected, at a cost of only 79 feature evaluations. This reduces the CPU workload by nearly 87%.

Our design calculates the first N cascade stages for each candidate window, at a single scale, on the FPGA. The result is output to the CPU as a 2D bitmap indicating whether the candidate window at each location was rejected within the first N stages. Along with this filter bitmap, the integral image and squared integral image of the frame are captured and output to the CPU. The software on the CPU then only needs to evaluate the Viola & Jones cascade on the candidate windows that were not rejected: on average, only 13% of them (for a single scale/resolution).
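
A rough sketch of the resulting CPU-side loop is below. The names (detectSurvivors, evalRemaining) are ours for illustration; evalRemaining stands in for evaluating the remaining cascade stages in software, using the integral images captured from the FPGA.

    #include <cstdint>
    #include <functional>
    #include <utility>
    #include <vector>

    // bitmap[y][x] == 1 means the window at (x, y) survived the FPGA's first N
    // stages and still needs the remaining stages evaluated in software.
    std::vector<std::pair<int, int>> detectSurvivors(
            const std::vector<std::vector<uint8_t>>& bitmap,
            const std::function<bool(int, int)>& evalRemaining) {
        std::vector<std::pair<int, int>> detections;
        for (int y = 0; y < (int)bitmap.size(); ++y)
            for (int x = 0; x < (int)bitmap[y].size(); ++x)
                if (bitmap[y][x] && evalRemaining(x, y))  // ~13% of windows get here
                    detections.emplace_back(x, y);
        return detections;
    }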

Results:
We implemented our design on an Intel 2.4 GHz dual-core CPU, accelerated with a Xilinx Virtex 5 on the ML506 board. The software-only implementation runs VGA video at 5 fps. We were able to fit two designs within the resource limits of the Virtex 5: a single-scale classifier calculating the first 3 stages of the cascade, and a dual-scale design in which each classifier calculates the first 2 stages. The accelerated designs boosted the frame rate to 7 fps. We also simulated a Virtex 6 design with 5 classifiers, each calculating the first 4 stages of the cascade; this boosted the frame rate to 15.8 fps.

[Video: face detector demonstration]



References:
  1. Paul Viola and Michael Jones, "Rapid Object Detection using a Boosted Cascade of Simple Features," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
  2. Junguk Cho, Shahnam Mirzaei, Jason Oberg, and Ryan Kastner, "FPGA-based Face Detection System using Haar Classifiers," Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA), February 22-24, 2009.


Particle Filter Multipoint Tracker
Particle filters are often used in tracking to approximate the Bayes-optimal estimate of the target's location as it moves. They are typically employed with non-Gaussian models, where algorithms like the Kalman filter cannot compute the optimal estimate in closed form. Particle filter tracking algorithms have been implemented on an FPGA before, but we intend to partition the tracking algorithm across CPU software and FPGA hardware.

The high-level description of the algorithm is as follows (a code sketch of the per-frame steps follows the list):
  1. Acquire initial frame.
  2. Register location of target.
  3. Capture pixels at target location, refer to this as the template.
  4. Sample locations around the target as "particles".
  5. Assign particles all equal weight.
  6. Acquire next frame.
  7. Calculate loss for each particle using new pixel frame (loss is a measure of how well the particle matches the template).
  8. Particles with high loss are discarded and replaced.
  9. Location of replacement particles is determined by sampling from weighted distribution of remaining particles.
  10. All particles are re-weighted to form a new distribution.
  11. New target location determined from weighted particle distribution (e.g. weighted average for x & y coordinates).
  12. Go to step 6.
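
Below is a minimal, single-target C++ sketch of the per-frame steps 7-11. The mean-absolute-difference loss, the exp(-loss) weighting, the jitter magnitude, and the multinomial resampling are all assumptions standing in for choices the summary above does not pin down.

    #include <cmath>
    #include <cstdint>
    #include <cstdlib>
    #include <random>
    #include <utility>
    #include <vector>

    using Frame = std::vector<std::vector<uint8_t>>;

    struct Particle { double x, y, weight; };

    // Mean absolute difference between the template and the patch at (px, py).
    // Assumes the window lies inside the frame; real code would clamp.
    double loss(const Frame& frame, const Frame& tmpl, int px, int py) {
        double sum = 0.0;
        for (size_t r = 0; r < tmpl.size(); ++r)
            for (size_t c = 0; c < tmpl[r].size(); ++c)
                sum += std::abs((int)frame[py + r][px + c] - (int)tmpl[r][c]);
        return sum / (tmpl.size() * tmpl[0].size());
    }

    // Steps 7-11 for one new frame; returns the estimated target location.
    std::pair<double, double> filterStep(std::vector<Particle>& particles,
                                         const Frame& frame, const Frame& tmpl,
                                         std::mt19937& rng) {
        // Step 7: score each particle; lower loss -> higher resampling weight.
        std::vector<double> w(particles.size());
        for (size_t i = 0; i < particles.size(); ++i)
            w[i] = std::exp(-loss(frame, tmpl,
                                  (int)particles[i].x, (int)particles[i].y) / 10.0);

        // Steps 8-10: resample. High-loss particles are rarely drawn, so they
        // are effectively discarded and replaced by jittered copies of good ones.
        std::discrete_distribution<size_t> pick(w.begin(), w.end());
        std::normal_distribution<double> jitter(0.0, 2.0);  // assumed motion noise
        std::vector<Particle> next(particles.size());
        for (Particle& p : next) {
            const Particle& src = particles[pick(rng)];
            p = {src.x + jitter(rng), src.y + jitter(rng), 1.0 / particles.size()};
        }
        particles.swap(next);

        // Step 11: weighted average of the new distribution is the location.
        double ex = 0.0, ey = 0.0;
        for (const Particle& p : particles) { ex += p.x * p.weight; ey += p.y * p.weight; }
        return {ex, ey};
    }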
The loss function accounts for most of the algorithm's run time, so we accelerate it on the FPGA. To achieve a stream-processing design, we calculate the loss for each particle in a piecewise fashion, line by line, as the video is streamed. After the entire frame has been processed, the FPGA IP core sends the scores for each particle to the CPU; the CPU never receives the video frame, just the scores. After running the algorithm iteration, the new particle locations are sent back to the IP core as parameters.
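
In software terms, the piecewise accumulation looks roughly like the following, with C++ standing in for the hardware pipeline (one such set of accumulators per tracked point, and bounds handling omitted):

    #include <cstdint>
    #include <cstdlib>
    #include <vector>

    struct TrackBox { int x, y, w, h; double acc; };  // particle window + running loss

    // Called once per video line as pixels stream in; no frame buffer is needed.
    // The tracked point's template is assumed stored row-major.
    void accumulateRow(std::vector<TrackBox>& particles,
                       const std::vector<uint8_t>& row, int rowIdx,
                       const std::vector<std::vector<uint8_t>>& tmpl) {
        for (TrackBox& p : particles) {
            if (rowIdx < p.y || rowIdx >= p.y + p.h) continue;  // row misses window
            const std::vector<uint8_t>& tRow = tmpl[rowIdx - p.y];
            for (int c = 0; c < p.w; ++c)
                p.acc += std::abs((int)row[p.x + c] - (int)tRow[c]);
        }
    }
    // After the frame's last line, each p.acc holds that particle's total loss;
    // these scores, not the pixels, are what the IP core sends to the CPU.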

Results:
Our software implementation of the particle filter tracker runs at 2 fps when tracking 6 points (100 particles each). We are able to fit 6 tracking IP cores on a Xilinx Virtex 5 ML506. The FPGA-accelerated version runs at 60 fps (without dropping any frames) while using only 25% of our 2.4 GHz Intel dual-core CPU.

[Video: tracker operating on 640x480 video @ 60 Hz]