Tools
Open-source tools created by my PhD students, as well as by PhD students I collaborate with at other institutions.
FIDESlib is the first open-source server-side CKKS GPU library that is fully interoperable with well-established client-side OpenFHE operations. Unlike other existing open-source GPU libraries, FIDESlib provides the first implementation featuring heavily optimized GPU kernels for all CKKS primitives, including bootstrapping. The library also integrates robust benchmarking and testing, ensuring it remains adaptable to further optimization, and its software architecture is designed to support extensions to a multi-GPU backend for additional acceleration. Our experiments across various GPU systems, comparing against Phantom, the leading open-source GPU CKKS library to date, show that FIDESlib offers superior performance and scalability. For bootstrapping, FIDESlib achieves up to a 74x speedup over the AVX-optimized OpenFHE implementation.
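To illustrate the intended client/server split, the sketch below prepares a CKKS ciphertext on the client with OpenFHE's Python bindings and hands it off for server-side GPU evaluation. The OpenFHE calls are real; the fideslib module and its Context/load/bootstrap names are hypothetical placeholders, since FIDESlib itself exposes a C++/CUDA API.

```python
from openfhe import *  # OpenFHE Python bindings

# --- Client side: standard OpenFHE CKKS setup and encryption ---
params = CCParamsCKKSRNS()
params.SetMultiplicativeDepth(10)
params.SetScalingModSize(59)
cc = GenCryptoContext(params)
cc.Enable(PKESchemeFeature.PKE)
cc.Enable(PKESchemeFeature.KEYSWITCH)
cc.Enable(PKESchemeFeature.LEVELEDSHE)

keys = cc.KeyGen()
cc.EvalMultKeyGen(keys.secretKey)
ct = cc.Encrypt(keys.publicKey, cc.MakeCKKSPackedPlaintext([0.5, 1.5, 2.5]))

# --- Server side: GPU evaluation (hypothetical FIDESlib-style calls) ---
# import fideslib                     # C++/CUDA library; names below are
# ctx = fideslib.Context(cc)          # placeholders for illustration only
# gpu_ct = ctx.load(ct)               # move the ciphertext to the GPU
# gpu_ct = ctx.bootstrap(gpu_ct)      # GPU-accelerated CKKS bootstrapping
# ct = ctx.store(gpu_ct)              # return the refreshed ciphertext

print(cc.Decrypt(ct, keys.secretKey))  # client decrypts the result
```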
SAFFE is a novel methodology for the flexible and scalable composition of multimodal models, specifically tailored to evolving end-user downstream tasks. In contrast with existing multimodal models and state-of-the-art fusion techniques, SAFFE-derived models eliminate the need for expensive end-to-end training or full fine-tuning to achieve high accuracy on target datasets. SAFFE leverages per-modality off-the-shelf frozen encoders, readily available from major AI providers, by selectively integrating only those components necessary for the downstream task. This targeted selection avoids over-parameterization and significantly reduces the model's memory footprint. Since these pre-trained frozen encoders are often trained independently and not within a unified multimodal context, their output embeddings may be semantically misaligned. To resolve this, we propose the FusionAlign Module (FAM), a lightweight bottleneck mid-fusion unit trained solely on the target end-user dataset (see the sketch after this entry). FAM aligns the semantic spaces across modalities, enabling effective multimodal integration without updating the parameters of the frozen encoders. Our results show that SAFFE can flexibly and efficiently compose high-accuracy bimodal models, achieving improved prediction performance compared to state-of-the-art methods while significantly reducing computational costs.
Principal Developer: Maithri Kulasekara, PhD candidate -- University of Murcia
Published as an open-access article at Journal of Supercomputing'25. SAFFE can be found HERE.
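A minimal PyTorch sketch of the FAM idea, assuming two frozen unimodal encoders whose embeddings are projected into a shared bottleneck and fused; the layer sizes, the classifier head, and the stand-in encoders are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class FusionAlignModule(nn.Module):
    """Lightweight bottleneck mid-fusion unit: projects two frozen embedding
    spaces into a shared space and fuses them for the downstream task."""
    def __init__(self, dim_a, dim_b, bottleneck, num_classes):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, bottleneck)   # align modality A
        self.proj_b = nn.Linear(dim_b, bottleneck)   # align modality B
        self.head = nn.Sequential(
            nn.LayerNorm(2 * bottleneck),
            nn.Linear(2 * bottleneck, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, num_classes),
        )

    def forward(self, emb_a, emb_b):
        fused = torch.cat([self.proj_a(emb_a), self.proj_b(emb_b)], dim=-1)
        return self.head(fused)

# Stand-ins for off-the-shelf frozen encoders (e.g., a vision and a text model).
encoder_a, encoder_b = nn.Linear(128, 512), nn.Linear(64, 768)
for enc in (encoder_a, encoder_b):
    enc.requires_grad_(False)          # encoders stay frozen

fam = FusionAlignModule(512, 768, 256, num_classes=10)
optimizer = torch.optim.AdamW(fam.parameters(), lr=1e-4)  # train FAM only
logits = fam(encoder_a(torch.randn(8, 128)), encoder_b(torch.randn(8, 64)))
```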
Graph Neural Networks (GNNs) are emerging as a formidable tool for processing non-Euclidean data across various domains, ranging from social network analysis to bioinformatics. Despite their effectiveness, their adoption has not been pervasive because of scalability challenges associated with large-scale graph datasets, particularly when leveraging message passing. To tackle these challenges, we introduce NeuraChip, a novel GNN spatial accelerator based on Gustavson's algorithm. NeuraChip decouples the multiplication and addition computations in sparse matrix multiplication, allowing their unique data dependencies to be exploited independently and facilitating efficient resource allocation. We introduce a rolling eviction strategy to mitigate data idling in on-chip memory and to address the prevalent issue of memory bloat in sparse graph computations. Furthermore, compute resource load balancing is achieved through dynamic reseeding hash-based mapping, ensuring uniform utilization of computing resources regardless of sparsity patterns. Finally, we present NeuraSim, an open-source, cycle-accurate, multi-threaded, modular simulator for comprehensive performance analysis.
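For reference, the sketch below shows Gustavson's row-wise sparse matrix multiplication in plain Python, with the multiply and accumulate steps kept as separate phases to mirror the decoupling NeuraChip exploits in hardware; the dict-of-rows representation is an illustrative stand-in for CSR.

```python
from collections import defaultdict

def gustavson_spgemm(A, B):
    """Row-wise (Gustavson) SpGEMM: row C[i] is the sum over k of
    A[i][k] * B[k]. A and B map row index -> {column index: value}."""
    C = defaultdict(dict)
    for i, row in A.items():
        # Phase 1 (multiply): generate all partial products for row i.
        partials = [(j, a_ik * b_kj)
                    for k, a_ik in row.items()
                    for j, b_kj in B.get(k, {}).items()]
        # Phase 2 (accumulate): merge partial products by output column;
        # in NeuraChip this merge is distributed via hash-based mapping.
        for j, value in partials:
            C[i][j] = C[i].get(j, 0.0) + value
    return dict(C)

# Example: a 2x2 sparse multiply.
A = {0: {0: 1.0, 1: 2.0}, 1: {1: 3.0}}
B = {0: {0: 4.0}, 1: {0: 5.0, 1: 6.0}}
print(gustavson_spgemm(A, B))  # {0: {0: 14.0, 1: 12.0}, 1: {0: 15.0, 1: 18.0}}
```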
AXI4MLIR is an extension to the MLIR compiler framework to describe AXI-based accelerators with a range of features, including accelerator opcodes. It includes attribute extensions and compiler transformations to describe and automatically generate host code that can leverage different flows of flexible accelerators, allowing us to break away from simple offload HW/SW co-design models. AXI4MLIR is effective in generating host code that efficiently uses CPU resources and accelerator features, yielding measurable runtime improvements over manual implementations for all tested accelerators while providing automation and convenience during the co-design cycle. Finally, the user-driven host code generation is entirely automated, providing a significant advantage in productivity and maintainability, especially during the early stages of the co-design process.
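The generated host code follows an opcode-driven driver pattern along these lines. AXI4MLIR itself emits C host code from MLIR attributes; the Python sketch below, including the mmio_write/dma_send/dma_recv helpers, the register offsets, and the tile sizes, is purely illustrative of the flow, not the tool's actual output.

```python
# Hypothetical helpers standing in for memory-mapped I/O and AXI DMA.
def mmio_write(offset, value): ...
def dma_send(buffer): ...
def dma_recv(nbytes): ...

OPCODE_REG = 0x00     # illustrative register offsets
START_REG = 0x04
TILE_BYTES = 4096     # illustrative output tile size

def offload_tiled_matmul(tiles_a, tiles_b, opcode_mm=0x2):
    """Drive a flexible accelerator: select an operation via its opcode,
    then stream input tiles and collect each result tile over AXI."""
    results = []
    mmio_write(OPCODE_REG, opcode_mm)        # configure the accelerator flow
    for ta, tb in zip(tiles_a, tiles_b):
        dma_send(ta)                         # stream operand tiles
        dma_send(tb)
        mmio_write(START_REG, 1)             # kick off the computation
        results.append(dma_recv(TILE_BYTES)) # read back the output tile
    return results
```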
STONNE (Simulation TOol of Neural Network Engines) is a cycle-accurate, highly modular and highly extensible simulation framework that enables end-to-end evaluation of flexible accelerator architectures running complete contemporary DNN models. We use STONNE to model the recently proposed MAERI architecture and show how it can closely approach the performance results of the publicly available BSV-coded MAERI implementation. We then conduct a comprehensive evaluation and demonstrate that the folding strategy implemented for MAERI results in very low compute unit utilization (25% on average across 5 DNN models), which ultimately translates into poor performance.
NaviSim is the first cycle-level GPU simulator framework that models AMD RDNA GPUs. NaviSim faithfully emulates the new RDNA ISA. We extensively tune and validate NaviSim using several microbenchmarks and 10 full workloads. Our evaluation shows that NaviSim accurately models the GPU's kernel execution time, coming within 9.92% of hardware execution on average, as measured on an AMD RX 5500 XT GPU and an AMD Radeon Pro W6800 GPU.
GNNMark is a feature-rich benchmark suite consisting of Graph Neural Network workloads that utilize a variety of graph-based data structures, including the homogeneous, dynamic, and heterogeneous graphs commonly used across a number of application domains. We use this benchmark suite to explore and characterize GNN training behavior on GPUs.
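A minimal example of the kernel mix GNN training exercises on a GPU: sparse neighbor aggregation (irregular memory access) followed by a dense transform (regular GEMM). This generic GCN-style layer in PyTorch is for illustration only; GNNMark's actual workloads are full training applications.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One message-passing step: aggregate neighbor features with a sparse
    adjacency matmul, then apply a dense linear transform."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, adj, x):  # adj: sparse [N, N], x: dense [N, in_dim]
        return torch.relu(self.lin(torch.sparse.mm(adj, x)))

# Toy graph: 3 nodes with edges 0->1, 1->2, 2->0 (unnormalized adjacency).
idx = torch.tensor([[0, 1, 2], [1, 2, 0]])
adj = torch.sparse_coo_tensor(idx, torch.ones(3), (3, 3))
out = GCNLayer(16, 8)(adj, torch.randn(3, 16))
```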
TAP-2.5D is the first open-source network routing and thermally-aware chiplet placement methodology for heterogeneous 2.5D systems. TAP-2.5D strategically inserts spacing between chiplets to jointly minimize the temperature and total wirelength, and in turn, increases the thermal design power envelope of the overall system. We present three case studies demonstrating the usage and efficacy of TAP-2.5D.
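To make the optimization concrete, below is a generic simulated-annealing placement loop of the kind such a methodology can build on; the move set, the cost function interface, and all constants are illustrative assumptions, and TAP-2.5D's actual thermal evaluation is far more detailed.

```python
import math
import random

def anneal_placement(chiplets, cost, steps=20000, t0=1.0):
    """Minimize cost(positions) over chiplet (x, y) coordinates, where cost
    combines a thermal estimate and total wirelength, e.g.
    alpha * peak_temperature(pos) + beta * wirelength(pos)."""
    pos = {c: (random.random(), random.random()) for c in chiplets}
    cur = best = cost(pos)
    best_pos = dict(pos)
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9     # cooling schedule
        c = random.choice(chiplets)
        old = pos[c]
        pos[c] = (old[0] + random.uniform(-0.05, 0.05),
                  old[1] + random.uniform(-0.05, 0.05))  # perturb one chiplet
        new = cost(pos)
        # Accept improvements, or worse moves with Boltzmann probability.
        if new < cur or random.random() < math.exp((cur - new) / temp):
            cur = new
            if new < best:
                best, best_pos = new, dict(pos)
        else:
            pos[c] = old                             # revert rejected move
    return best_pos
```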
TFLITE-SOC (System On Chip) is a new framework that integrates a lightweight system modeling library (SystemC) for fast design space exploration of custom ML accelerators into the build/execution environment of TensorFlow Lite (TFLite), a highly popular framework for ML inference. Using this approach, we are able to model and evaluate new accelerators developed in SystemC by leveraging the language's hierarchical design capabilities, resulting in faster design prototyping. Furthermore, any accelerator designed using TFLITE-SOC can be benchmarked for inference with any DNN model compatible with TFLite, which enables end-to-end DNN processing and detailed (i.e., per DNN layer) performance analysis.
Main Developer: Nicolas B. Agostini, PhD candidate at Northeastern University (Boston, USA).
Presented at SBAC-PAD'20
TFLITE-SOC can be found HERE
MGPUSim is a cycle-accurate, extensively validated, multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from actual GPU hardware. The simulator achieves a 3.5x average speedup in functional emulation and a 2.5x average speedup in detailed timing simulation on a 4-core CPU, while delivering the same accuracy as serial simulation.
InsideNet is a tool built on top of the Caffe DL framework that assists researchers in exploring the values generated during the inference procedure of a Convolutional Neural Network (CNN). More precisely, InsideNet allows in-depth analysis of the values of the filters and fmaps within the convolution layers of a trained CNN during an ongoing inference procedure. To do so, InsideNet features three main operation modes. First, the Fmap Visualization Mode (FVM) lets users visually examine the fmap channels generated during the inference of a set of images. Second, the Statistic Collector Mode (SCM) offers a rich set of statistics for the fmap channels and weights of every convolutional layer. Third, the Histogram Collector Mode (HCM) enables deeper exploration of value-based patterns by generating the corresponding histograms for the network.
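InsideNet itself instruments Caffe; as an analogous illustration of the SCM/HCM idea, the sketch below uses PyTorch forward hooks to collect per-layer fmap statistics and histograms during inference (the statistics chosen here are assumptions, not InsideNet's exact output).

```python
import torch
import torchvision

def attach_fmap_collectors(model, bins=64):
    """Record per-convolution-layer fmap statistics and histograms during
    inference, in the spirit of InsideNet's SCM and HCM modes."""
    stats = {}

    def make_hook(name):
        def hook(module, inputs, output):
            fmap = output.detach().float().cpu()
            stats[name] = {
                "min": fmap.min().item(),
                "max": fmap.max().item(),
                "mean": fmap.mean().item(),
                "hist": torch.histc(fmap, bins=bins),  # HCM-style histogram
            }
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Conv2d):
            module.register_forward_hook(make_hook(name))
    return stats

# Usage: run any inference pass, then inspect the collected statistics.
model = torchvision.models.resnet18().eval()
collected = attach_fmap_collectors(model)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
```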