This project was a part of a class on reconfigurable computing. The system was designed on the Xilinx Zedbaord which has a dual core ARM and a Xilinx FPGA. The idea of the project was to accelerate binary searches on huge trees using binary using a hybrid CPU-FPGA system a.k.a. Programmable System on Chip. The entire project was built using Vivado tools. The HLS (high level synthesis) tool was used to work with the FPGA and the SDK was used to work with the ARM processors. The two communicate using AXI buses.
A 16 million node binary tree was initialized in the main memory of the system. 6 parallel 255-node binary search kernels were implemented on the FPGA.
The system was bench-marked against sequential software search by ARM processor.
Kernel synthesized from Vivado HLS featured stage skipping using parallel comparators making it faster than sequential search, but full tree search on FPGA was slowed down by DMA communication costs, sequential nature of binary search, and log(n) nature of search problem making ARM processor fast.
In order to achieve faster searches using blocking, the data was pre-processed to be stored in the fashion shown below . Normally, tree structures are stored as shown in the figure on the left . After pre-processing the data was stored as shown on the left. This was done in order to send the data faster in blocked form to the FPGA.
Before and After Preprocessing the binary tree data
This block diagram shows how the processing system (ARM) communicates with the FPGA. There are 6 independent binary tree search blocks (BTS_x) that can parallely perform six tree search on a trees smaller in size as compared to the main tree.
The FPGA kernel execution time for a leaf node search in an 8-level tree was 0.43 us, while ARM core took 0.6us for same search. Despite this fast kernel, the FPGA-assisted solution was ~4.6 times SLOWER than pure sequential search by ARM processor (70us vs 15 us), for 6 leaf node searches on the main tree.
This was memory Bound problem. There is hardly any computation that could be parallelized, it is the fast access to data residing in main memory which is critical. A leaf node search in a 16 million node tree would require 24 sequential comparisons which is not computationally intensive. ARM runs at 600Mhz, FPGA at 250 Mhz. ARM also enjoys direct cached-access to the main memory (Disabling cache, resulted in ARM taking almost twice as much time).
DMA communication costs from DRAM to FPGA space hurt the FPGA based solution to the maximum (~68% of total execution time) along with the AXI-interconnect latencies incurred in ARM-to-FPGA control signaling.