Completed Projects

Acceleration of Graph Processing Engines (PSDD) [J7, C14]

Probabilistic Sentential Decision Diagrams (PSDDs) provide efficient methods for modeling and reasoning with probability distributions in the presence of massive logical constraints. PSDDs can also be synthesized from graphical models such as Bayesian networks (BNs), thereby offering a new set of tools for performing inference on these models in time linear in the PSDD size. Despite these favorable characteristics, we have found multiple challenges in accelerating PSDDs on FPGAs, including limited parallelism, data dependencies, and small pipeline iterations. This project proposes several optimization techniques that address these issues with novel pipeline scheduling and parallelization schemes.
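
To make the linear-time claim concrete, the sketch below evaluates a small circuit bottom-up, visiting each node once. It is a simplified illustration only: it flattens a PSDD into plain sum and product nodes and is not the accelerator design from this project. The names Node and evaluate, the node kinds, and the two-variable example are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                                       # "lit", "prod", or "sum" (hypothetical encoding)
    var: str = ""                                   # variable name, used by literals
    positive: bool = True                           # literal polarity
    children: list = field(default_factory=list)    # indices of earlier nodes
    weights: list = field(default_factory=list)     # one weight per child, used by sum nodes

def evaluate(nodes, assignment):
    """Return the probability of a complete assignment.

    Nodes must be listed children-first, so a single pass suffices and the
    work is linear in the number of nodes and edges.
    """
    values = [0.0] * len(nodes)
    for i, n in enumerate(nodes):
        if n.kind == "lit":
            values[i] = 1.0 if assignment[n.var] == n.positive else 0.0
        elif n.kind == "prod":
            v = 1.0
            for c in n.children:
                v *= values[c]
            values[i] = v
        else:  # weighted sum over the children
            values[i] = sum(w * values[c] for w, c in zip(n.weights, n.children))
    return values[-1]  # the last node is the root

# Hypothetical example: P(A) = 0.6; B is uniform when A holds and is forced
# to be true when A does not hold (a logical constraint).
nodes = [
    Node("lit", var="A", positive=True),                    # 0
    Node("lit", var="A", positive=False),                   # 1
    Node("lit", var="B", positive=True),                    # 2
    Node("lit", var="B", positive=False),                   # 3
    Node("sum", children=[2, 3], weights=[0.5, 0.5]),       # 4
    Node("prod", children=[0, 4]),                          # 5
    Node("prod", children=[1, 2]),                          # 6
    Node("sum", children=[5, 6], weights=[0.6, 0.4]),       # 7 (root)
]
p_a_and_b = evaluate(nodes, {"A": True, "B": True})
```

Each node reads values written by earlier nodes, which gives a rough picture of the data dependencies mentioned above.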

Rapid cycle-accurate simulator for FPGA HLS (FLASH) [J6, C11]

Low-level (on-board or RTL) simulation often takes a long time to complete and is difficult to understand. Software-based HLS simulators can help solve these problems; however, they may produce incorrect results and inaccurate performance estimates. To solve these issues while maintaining the high speed of a software-based simulator, this project proposes a new HLS simulation flow named FLASH. The main idea is to extract the scheduling information from the HLS tool and automatically construct an equivalent cycle-accurate simulation model while preserving C semantics. Experimental results show that FLASH runs three orders of magnitude faster than RTL simulation. We are also looking into applying FLASH to rapid C-level power estimation and simulation-based design space exploration.
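
As a rough illustration of how scheduling information can be combined with software execution, the sketch below runs a loop functionally while tracking cycles for a pipelined schedule. It is not the FLASH flow itself. It assumes the schedule of a single loop is summarized by an initiation interval (ii) and a pipeline depth, and the function name run_pipelined_sum is hypothetical.

```python
def run_pipelined_sum(data, ii, depth):
    """Sum `data` as the C source would, and count cycles for a pipelined loop.

    Assumption: iteration i enters the pipeline at cycle i * ii and leaves
    `depth` cycles later, with no stalls.
    """
    total = 0
    cycle = 0
    for i, x in enumerate(data):
        issue_cycle = i * ii          # iteration i enters the pipeline
        total += x                    # functional behavior, unchanged from software
        cycle = issue_cycle + depth   # iteration i leaves the pipeline
    return total, cycle

total, cycles = run_pipelined_sum(range(100), ii=1, depth=5)
```

Under these assumptions the cycle count reduces to depth + ii * (N - 1) for N iterations, which is the usual latency estimate for a stall-free pipelined loop.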

Performance debugging for HLS-based FPGA designs (HLScope) [C10, C9, C8] (Cisco Outstanding Graduate Student Research Award)

FPGA designers often spend considerable time trying to identify performance bottlenecks. This project addresses this difficulty by automating the performance debugging process based on HLS. The proposed high-level analysis traces the cause of stalls at the function or loop level, which provides more intuitive feedback for pinpointing the performance bottleneck. This project discusses the various challenges in constructing an HLS-based FPGA performance debugging framework and presents novel solutions to overcome those challenges. In particular, we propose HLScope-S, a performance estimator that automatically instruments code that models the hardware execution behavior and interprets the information from the HLS software simulation. We also present HLScope-M, an on-board monitoring flow for automated cycle extraction and stall analysis. Moreover, we describe a design space exploration framework that finds a set of HLS-based optimization directives for applications with variable loop bounds.
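
The idea of attributing stalls to a cause can be illustrated with a toy model. The sketch below steps a producer and a consumer connected by a FIFO and counts, for each side, the cycles in which it was ready but blocked. It is only an illustration of stall attribution, not the HLScope-S or HLScope-M implementation; the function name, the parameters, and the two stall labels are assumptions.

```python
from collections import deque

def simulate_stalls(n_items, producer_interval, consumer_interval, fifo_depth):
    """Cycle-stepped model of a producer and a consumer linked by a FIFO.

    Returns (total_cycles, stalls), where stalls counts the cycles in which
    a stage was ready to work but was blocked by the FIFO.
    """
    fifo = deque()
    produced = consumed = 0
    producer_ready_at = consumer_ready_at = 0
    stalls = {"producer_fifo_full": 0, "consumer_fifo_empty": 0}
    cycle = 0
    while consumed < n_items:
        # Consumer side: reads one item when it is ready and data is available.
        if cycle >= consumer_ready_at:
            if fifo:
                fifo.popleft()
                consumed += 1
                consumer_ready_at = cycle + consumer_interval
            else:
                stalls["consumer_fifo_empty"] += 1
        # Producer side: writes one item when it is ready and space is available.
        if produced < n_items and cycle >= producer_ready_at:
            if len(fifo) < fifo_depth:
                fifo.append(produced)
                produced += 1
                producer_ready_at = cycle + producer_interval
            else:
                stalls["producer_fifo_full"] += 1
        cycle += 1
    return cycle, stalls

# Slow producer: the consumer spends most of its time waiting on an empty FIFO.
slow_producer = simulate_stalls(4, producer_interval=3, consumer_interval=1, fifo_depth=2)
# Slow consumer: the producer runs ahead and is blocked by a full FIFO.
slow_consumer = simulate_stalls(4, producer_interval=1, consumer_interval=3, fifo_depth=2)
```

In this toy model, a dominant consumer_fifo_empty count points at the producer as the bottleneck, while producer_fifo_full stalls point at the consumer.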

CPU-FPGA heterogeneous platform analysis and optimization [J5, C7] (Second most cited paper in DAC'16)

Heterogeneous architectures that feature specialized hardware accelerators are widely considered a promising paradigm for continued performance and energy improvement. As a result, a variety of CPU-FPGA acceleration platforms with diversified microarchitectural features have been supplied by industry vendors. For example, Alpha Data boards and Amazon AWS F1 represent traditional PCIe-based platforms with private device memory; IBM CAPI and Intel HARP1 represent coherent shared-memory platforms; and Intel HARP2 represents a hybrid non-coherent/coherent shared-memory system. Such diversity, however, poses a serious challenge to application developers in selecting the appropriate platform for a specific application or application domain. This project aims to address this challenge by quantitatively analyzing the microarchitectural characteristics that affect performance and by providing guidance for optimizing accelerator designs.

Acceleration of 3D CT reconstruction [J4, C16]

Reducing radiation doses is one of the key concerns in CT-based 3D reconstruction. Although the expectation-maximization (EM) algorithm can be used to address this issue, applying it in practice is difficult due to its long execution time. The goal of this project is to reduce the execution time of EM-based 3D CT reconstruction to the order of a few minutes, so that low-dose CT can be performed even in time-critical events. This project introduces several FPGA-based acceleration strategies, such as a novel FPGA-friendly parallelization scheme, an external memory bandwidth reduction strategy, and a customized processing engine. Experiments on actual patient data show that a 27X speedup can be achieved over a 16-thread multicore CPU implementation.
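
For readers unfamiliar with EM-style reconstruction, the sketch below shows the classic multiplicative MLEM update on a toy linear system. The text above does not state which EM variant the project uses, so this is a generic illustration and may differ from the actual algorithm. The names A (system matrix), y (measurements), and mlem are assumptions, and the three-ray, two-voxel system is made up for the example.

```python
def mlem(A, y, iterations=50):
    """Classic MLEM update for y ~ A x with non-negative x.

    Each iteration performs one forward projection (A x) and one
    back-projection (A^T applied to the measured/estimated ratio).
    """
    n_rays, n_vox = len(A), len(A[0])
    x = [1.0] * n_vox                                   # uniform initial image
    # Sensitivity of each voxel: column sums of A.
    sens = [sum(A[i][j] for i in range(n_rays)) for j in range(n_vox)]
    for _ in range(iterations):
        # Forward projection of the current estimate.
        proj = [sum(A[i][j] * x[j] for j in range(n_vox)) for i in range(n_rays)]
        # Ratio of measured to estimated projections (guarding against zero).
        ratio = [y[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(n_rays)]
        # Back-project the ratio and apply the multiplicative update.
        x = [
            x[j] * sum(A[i][j] * ratio[i] for i in range(n_rays)) / sens[j]
            if sens[j] > 0 else 0.0
            for j in range(n_vox)
        ]
    return x

# Toy system: three rays through two voxels, with noise-free measurements
# generated from the image [2, 3].
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [2.0, 3.0, 5.0]
estimate = mlem(A, y)
```

Every iteration sweeps over all rays twice, once for the forward projection and once for the back-projection, which hints at why execution time grows quickly for full 3D volumes.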

GPGPU implementation of large-scale MRF for stereo matching using graph cuts / belief propagation [C5, C4, J3, C3]

FPGA/VLSI implementation of large vocabulary continuous speech recognizer [J2, J1, C2, C1]