Tools

Open-source tools created by my PhD students, as well as by collaborating PhD students at other institutions

AXI4MLIR: User-Driven Automatic Host Code Generation for Custom AXI-Based Accelerators

AXI4MLIR is an extension to the MLIR compiler framework for describing AXI-based accelerators, supporting a range of features including accelerator opcodes. It provides attribute extensions and compiler transformations that describe and automatically generate host code able to leverage the different flows of flexible accelerators, allowing designers to break away from simple offload HW/SW co-design models. AXI4MLIR is effective at generating host code that uses CPU resources and accelerator features efficiently, yielding measurable runtime improvements over manual implementations for all tested accelerators while adding automation and convenience to the co-design cycle. Finally, because the user-driven host code generation is entirely automated, it provides a significant productivity and maintainability advantage, especially during the early stages of the co-design process.
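
Although AXI4MLIR's real interface consists of MLIR attributes and passes, the flavor of the host code it automates can be sketched in a few lines. The Python mock below (the driver class, method names, and opcode values are hypothetical stand-ins, not AXI4MLIR's actual API) shows a tiled matrix multiply whose operand transfers, compute trigger, and result read-back are all tagged with accelerator opcodes:

```python
# Hypothetical opcodes; real designs define their own encodings.
OP_SEND_A, OP_SEND_B, OP_COMPUTE, OP_RECV_C = 0x1, 0x2, 0x4, 0x8

class MockAxiDriver:
    """Software stand-in for a memory-mapped AXI accelerator driver:
    buffers operand tiles and multiplies them on OP_COMPUTE."""
    def __init__(self):
        self.a, self.b, self.c = None, None, None

    def send(self, opcode, tile=None):
        if opcode == OP_SEND_A:
            self.a = tile
        elif opcode == OP_SEND_B:
            self.b = tile
        elif opcode == OP_COMPUTE:
            t = len(self.a)
            self.c = [[sum(self.a[i][k] * self.b[k][j] for k in range(t))
                       for j in range(t)] for i in range(t)]

    def recv(self, opcode):
        assert opcode == OP_RECV_C
        return self.c

def tiled_matmul_offload(drv, A, B, C, tile=2):
    """Host-side loop of the kind AXI4MLIR generates from a linalg-style
    matmul: stream tiles with opcodes, trigger compute, accumulate results."""
    n = len(A)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                drv.send(OP_SEND_A, [row[k:k + tile] for row in A[i:i + tile]])
                drv.send(OP_SEND_B, [row[j:j + tile] for row in B[k:k + tile]])
                drv.send(OP_COMPUTE)
                acc = drv.recv(OP_RECV_C)
                for ii in range(tile):
                    for jj in range(tile):
                        C[i + ii][j + jj] += acc[ii][jj]

# Example: 4x4 matmul in 2x2 tiles against the mock driver.
A = [[1, 2, 3, 4] for _ in range(4)]
B = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
C = [[0] * 4 for _ in range(4)]
tiled_matmul_offload(MockAxiDriver(), A, B, C)
assert C == A  # B is the identity, so C should equal A
```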

STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators

STONNE (Simulation TOol of Neural Network Engines) is a cycle-accurate, highly modular, and highly extensible simulation framework that enables end-to-end evaluation of flexible accelerator architectures running complete contemporary DNN models. We use STONNE to model the recently proposed MAERI architecture and show that it closely approaches the performance of the publicly available BSV-coded MAERI implementation. We then conduct a comprehensive evaluation and demonstrate that the folding strategy implemented for MAERI results in very low compute unit utilization (25% on average across 5 DNN models), which ultimately translates into poor performance.
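
The utilization figure quoted above is the kind of ratio a cycle-accurate simulator can report per layer. A minimal sketch of the metric (the function and variable names are ours for illustration, not STONNE's, and the numbers are made up):

```python
def compute_unit_utilization(active_mac_cycles, num_multipliers, total_cycles):
    """Fraction of multiplier-cycle slots spent on useful MACs; a folding
    strategy that leaves multipliers idle drives this ratio down."""
    return active_mac_cycles / (num_multipliers * total_cycles)

# Illustrative only: 64 multipliers busy for 160k of 640k slots -> 25%.
util = compute_unit_utilization(active_mac_cycles=160_000,
                                num_multipliers=64,
                                total_cycles=10_000)
print(f"{util:.0%}")
```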

NaviSim: A Highly Accurate GPU Simulator for AMD RDNA

NaviSim is the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim accurately models GPU kernel execution time, coming within 9.92% of hardware execution on average, as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU.
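
The 9.92% figure is an average relative error between simulated and measured kernel execution times. The metric itself is simple to state; a sketch follows (the timings below are made-up illustrations, not NaviSim's data):

```python
def mean_abs_pct_error(sim_ms, hw_ms):
    """Average |simulated - measured| / measured across a set of kernels,
    the style of metric behind NaviSim's reported 9.92% average error."""
    return 100.0 * sum(abs(s - h) / h for s, h in zip(sim_ms, hw_ms)) / len(hw_ms)

# Illustrative kernel timings in milliseconds (not NaviSim measurements):
print(mean_abs_pct_error(sim_ms=[1.05, 2.2, 0.95],
                         hw_ms=[1.0, 2.0, 1.0]))  # -> ~6.7
```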

GNNMark: A Benchmark Suite to Characterize Graph Neural Network Training on GPUs 

GNNMark is a feature-rich benchmark suite of Graph Neural Network (GNN) training workloads that exercise a variety of graph-based data structures, including the homogeneous, dynamic, and heterogeneous graphs commonly used across a range of application domains. We use this benchmark suite to explore and characterize GNN training behavior on GPUs.
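
Since GNNMark's workloads are PyTorch-based, a characterization harness in the style it enables can be sketched with standard CUDA event timing. In this sketch, `model`, `graph`, `labels`, `optimizer`, and `loss_fn` are user-supplied placeholders rather than GNNMark components:

```python
import torch

def time_training_step(model, optimizer, loss_fn, graph, labels, iters=100):
    """Average GPU time per GNN training step (forward + backward + update),
    the kind of per-phase measurement used when characterizing GNN training."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        optimizer.zero_grad()
        out = model(graph)           # forward pass over the graph structure
        loss = loss_fn(out, labels)
        loss.backward()              # backward pass
        optimizer.step()             # weight update
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per step
```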

TAP-2.5D: A Thermally-Aware Chiplet Placement Methodology for 2.5D Systems

TAP-2.5D is the first open-source network routing and thermally-aware chiplet placement methodology for heterogeneous 2.5D systems. TAP-2.5D strategically inserts spacing between chiplets to jointly minimize temperature and total wirelength, in turn increasing the thermal design power envelope of the overall system. We present three case studies demonstrating the usage and efficacy of TAP-2.5D.
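
At its core, the placement search trades peak temperature against total wirelength: more spacing cools the system but lengthens inter-chiplet wires. A toy version of such a joint objective and spacing sweep is sketched below; the weighting, normalization constants, and model callables are our assumptions, not TAP-2.5D's actual formulation:

```python
def placement_cost(peak_temp_c, wirelength_mm, alpha=0.5,
                   t_ref=100.0, wl_ref=500.0):
    """Weighted sum of normalized peak temperature and total wirelength.
    Constants and weighting here are illustrative assumptions."""
    return alpha * peak_temp_c / t_ref + (1 - alpha) * wirelength_mm / wl_ref

def best_spacing(candidates_mm, temp_model, wl_model, alpha=0.5):
    """Pick the inter-chiplet spacing that minimizes the joint cost.
    temp_model and wl_model are user-supplied callables, e.g. a thermal
    simulator wrapper and a wirelength estimator."""
    return min(candidates_mm,
               key=lambda s: placement_cost(temp_model(s), wl_model(s), alpha))
```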

TFLITE-SOC: Design Space Exploration of Accelerators and End-to-End DNN Evaluation with TFLITE-SOC

TFLITE-SOC (System On Chip) is a new framework that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of TensorFlow Lite (TFLite), a highly popular framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed with TFLITE-SOC can be benchmarked for inference with any TFLite-compatible DNN model, enabling end-to-end DNN processing and detailed (i.e., per-layer) performance analysis.
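
The TFLite half of this flow can be exercised from Python with the stock interpreter API; the SystemC accelerator models themselves are C++ and out of scope here. A minimal end-to-end inference timing harness, where the model path and input array are user-supplied:

```python
import time
import numpy as np
import tensorflow as tf

def run_tflite(model_path, input_array, iters=50):
    """Average end-to-end TFLite inference latency for one input: the host
    flow into which a framework like TFLITE-SOC plugs its accelerator models."""
    interp = tf.lite.Interpreter(model_path=model_path)
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    interp.set_tensor(inp["index"], input_array.astype(inp["dtype"]))
    t0 = time.perf_counter()
    for _ in range(iters):
        interp.invoke()
    latency_s = (time.perf_counter() - t0) / iters
    return interp.get_tensor(out["index"]), latency_s
```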

MGPUSim: A Multi-GPU Simulator based on AMD GCN3

MGPUSim is a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from actual GPU hardware. The simulator achieves a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation.
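
MGPUSim itself is written in Go; its parallel execution model, though, can be rendered conceptually in a few lines: components of the same cycle advance concurrently, then synchronize before the next cycle, preserving serial-simulation accuracy. The Python sketch below is only a conceptual rendering, not MGPUSim's actual API:

```python
from concurrent.futures import ThreadPoolExecutor

class Component:
    """A simulated hardware block (compute unit, cache, DMA engine, ...);
    tick() advances its state by exactly one cycle."""
    def tick(self, cycle):
        pass  # component-specific per-cycle behavior goes here

def simulate(components, num_cycles, workers=4):
    """Barrier-synchronized parallel ticking: components of the same cycle
    run concurrently, and no component starts cycle N+1 until all finish N."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for cycle in range(num_cycles):
            list(pool.map(lambda c: c.tick(cycle), components))
```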

InsideNet: A tool for characterizing convolutional neural networks

InsideNet is a tool built on top of the Caffe DL framework that assists researchers in exploring the values generated during the inference procedure of a Convolutional Neural Network (CNN). More precisely, InsideNet enables in-depth analysis of the filter and fmap values within the convolutional layers of a trained CNN during an ongoing inference procedure. To do so, InsideNet features three main operation modes. First, the Fmap Visualization Mode (FVM) lets users visually examine the fmap channels generated while inferring over a set of images. Second, the Statistic Collector Mode (SCM) offers a rich set of statistics for the fmap channels and weights of every convolutional layer. Third, the Histogram Collector Mode (HCM) enables deeper exploration of value-based patterns by generating value histograms across the network.
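
The SCM and HCM modes boil down to reductions over activation tensors captured during inference. A NumPy sketch of the kind of per-layer statistics involved (the array layout and function name are our assumptions, not InsideNet's interface):

```python
import numpy as np

def fmap_stats(fmap, bins=64):
    """Statistics and a value histogram for one convolutional layer's fmap,
    captured during inference as an (N, C, H, W) activation array."""
    per_channel_mean = fmap.mean(axis=(0, 2, 3))   # one mean per fmap channel
    zero_fraction = float((fmap == 0).mean())      # e.g. ReLU-induced sparsity
    hist, bin_edges = np.histogram(fmap, bins=bins)
    return per_channel_mean, zero_fraction, hist, bin_edges
```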