Hardware-efficient Machine Learning System and Architecture Design
Advised by Prof. Callie Hao at Georgia Tech
Current Research Project
DGNN-Booster: A Generic FPGA Accelerator Framework For Dynamic Graph Neural Network Inference
Optimized FPGA implementations of DGNN-Booster: V1 is based on EvolveGCN, and V2 is based on GCRN-M2.
Contributions and Innovations
Generic and open-source. DGNN-Booster is a model-generic framework, developed using High-Level Synthesis (HLS) for ease of use.
Multi-level parallelism. At the higher level, DGNN-Booster V1 parallelizes the GNN and RNN across adjacent time steps, while V2 parallelizes them within a single time step. At the lower level, message passing and node transformation are executed in a streaming (dataflow) fashion.
Hardware-efficient architecture design. We propose a task-scheduling scheme that assigns each task to the better-suited device, CPU or FPGA. In addition, we apply graph renumbering and format transformation to make the design more hardware efficient, and we utilize different types of on-chip RAM to achieve memory efficiency.
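To make the graph preprocessing concrete, here is a minimal Python sketch (illustrative only, not the DGNN-Booster implementation) of graph renumbering and format transformation: sparse node IDs are remapped to a contiguous range, and the edge list (COO) is converted to CSR, a layout that streams well on FPGA because each node's neighbors occupy one contiguous block.

```python
# Illustrative sketch, not the actual DGNN-Booster code.

def renumber(edges):
    """Graph renumbering: map arbitrary node IDs to contiguous IDs 0..n-1."""
    old_ids = sorted({v for e in edges for v in e})
    new_id = {old: new for new, old in enumerate(old_ids)}
    return [(new_id[s], new_id[d]) for s, d in edges], new_id

def to_csr(edges, num_nodes):
    """Format transformation: convert a COO edge list to CSR
    (row pointers + column indices)."""
    edges = sorted(edges)                  # group edges by source node
    indptr = [0] * (num_nodes + 1)
    indices = []
    for s, d in edges:
        indptr[s + 1] += 1                 # count out-degree of each node
        indices.append(d)
    for i in range(num_nodes):             # prefix sum -> row pointers
        indptr[i + 1] += indptr[i]
    return indptr, indices

# Example: a snapshot whose node IDs are sparse (7, 42, 100)
edges = [(42, 7), (7, 100), (42, 100)]
renumbered, mapping = renumber(edges)      # IDs become 0, 1, 2
indptr, indices = to_csr(renumbered, len(mapping))
```

With CSR, the neighbors of node `i` are `indices[indptr[i]:indptr[i+1]]`, so the message-passing stage can fetch them with a single burst read.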
Rapid-INR: Storage Efficient CPU-free DNN Training Using Implicit Neural Representation
What is special about Rapid-INR
CPU-free decoding with flexibility. INR decoding runs on a general-purpose GPU with high parallelism, requires no specialized hardware, and is faster than optimized JPEG decoding on GPU. Because INR models the image as a continuous, smooth signal, it can also decode images at arbitrary resolutions.
Reduced off-device data communication. INR weights are transferred from disk to GPU once, before training starts, instead of being transferred repeatedly during training.
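The flexibility claim can be illustrated with a toy sketch (assumed structure, not the Rapid-INR code): an image is stored as the weights of a small MLP mapping a normalized coordinate (x, y) to an RGB value. Every pixel is an independent MLP evaluation, which is why a GPU can decode with massive parallelism and why the same weights can be decoded at any requested resolution.

```python
import math

# Illustrative toy INR decoder; the real Rapid-INR network may differ.

def mlp_forward(params, x, y):
    """Tiny 2-layer MLP with a sine activation (SIREN-style assumption)."""
    (w1, b1), (w2, b2) = params
    hidden = [math.sin(sum(w * c for w, c in zip(row, (x, y))) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]

def decode(params, height, width):
    """Decode the INR at any resolution: one MLP call per pixel,
    all pixels independent (hence pixel-level parallelism on GPU)."""
    image = []
    for i in range(height):
        row = []
        for j in range(width):
            x = i / (height - 1) if height > 1 else 0.0  # normalize to [0, 1]
            y = j / (width - 1) if width > 1 else 0.0
            row.append(mlp_forward(params, x, y))
        image.append(row)
    return image

# Toy weights: 2 inputs -> 4 hidden -> 3 outputs (RGB)
params = (([[0.5, -0.3], [0.1, 0.9], [-0.7, 0.2], [0.4, 0.4]], [0.0] * 4),
          ([[0.2, 0.1, -0.1, 0.3]] * 3, [0.5] * 3))
img = decode(params, 8, 8)    # the same params could decode 64 x 64, etc.
```

On a GPU the two coordinate loops would be a single batched tensor operation, which is what makes the decoding CPU-free.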
A high-level overview of three training pipelines.
Rapid-INR encoder-decoder architecture
INR encoding. An offline training step encodes the image dataset into INR weights; each image is encoded by a separate MLP. For storage efficiency, we also apply dynamic pruning and layer-wise quantization.
INR decoding. All images' INR weights are first loaded into CUDA memory. During training, batches of images are decoded on the fly into RGB format for backbone training, with pixel-level parallelism.
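The two storage-side techniques named above can be sketched as follows; this is a minimal illustration of the general ideas, and the thresholds, bit widths, and exact schemes in Rapid-INR may differ.

```python
# Illustrative sketch of dynamic pruning and layer-wise quantization.

def prune(weights, threshold):
    """Pruning: zero out weights whose magnitude falls below a threshold."""
    return [0.0 if abs(w) < threshold else w for w in weights]

def quantize_layer(weights, bits=8):
    """Layer-wise linear quantization: one scale per layer, chosen from
    that layer's own maximum magnitude."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize_layer(q, scale):
    """Recover approximate float weights at decode time."""
    return [v * scale for v in q]

layer = [0.9, -0.02, 0.4, -0.63, 0.01]
pruned = prune(layer, threshold=0.05)   # -> [0.9, 0.0, 0.4, -0.63, 0.0]
q, scale = quantize_layer(pruned)       # 8-bit ints + one float scale
restored = dequantize_layer(q, scale)
```

Storing one small integer per weight plus a single scale per layer is what shrinks the on-disk footprint relative to full-precision MLP weights.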