Mamba offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration framework is currently optimized for deploying it in such environments. We present eMamba, an end-to-end hardware acceleration framework designed explicitly for deploying Mamba models on edge platforms. We quantize the entire eMamba pipeline and implement it on an AMD ZCU102 FPGA and as an ASIC in GlobalFoundries (GF) 22 nm technology.
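As a concrete illustration of the kind of quantization step such a pipeline involves, the sketch below applies symmetric per-tensor int8 quantization to a weight matrix. It is a minimal stand-in under our own assumptions, not the eMamba quantization scheme; all function names are placeholders.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * w_q."""
    scale = np.max(np.abs(w)) / 127.0 if np.any(w) else 1.0
    w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    return w_q.astype(np.float32) * scale

# Example: quantize a random projection matrix and check the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_q, s = quantize_int8(w)
err = np.max(np.abs(w - dequantize(w_q, s)))
print(f"scale={s:.5f}, max abs error={err:.5f}")
```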
Traditional point cloud detection models, such as those operating on LiDAR or 4D radar data, are accurate but computationally intensive and ill-suited for low-power edge environments. To address these shortcomings, we propose EdgePillars, which combines a fast, simple voxel-based encoder with the low-latency backbone of pillar-based models.
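For intuition, the sketch below shows a generic PointPillars-style pillarization step: points are scattered into a 2D grid and pooled per pillar into a pseudo-image that a 2D backbone can consume. This is an illustration under our own assumptions, not the EdgePillars encoder; real encoders pool learned per-point features rather than raw height.

```python
import numpy as np

def pillarize(points: np.ndarray, grid=(128, 128), extent=50.0):
    """Scatter points (N, 3) into a 2D pillar grid, max-pooling height per pillar."""
    h, w = grid
    # Map x, y in [-extent, extent) to integer pillar indices.
    ij = ((points[:, :2] + extent) / (2 * extent) * [h, w]).astype(int)
    valid = (ij >= 0).all(axis=1) & (ij[:, 0] < h) & (ij[:, 1] < w)
    ij, z = ij[valid], points[valid, 2]
    bev = np.full(grid, -np.inf, dtype=np.float32)
    np.maximum.at(bev, (ij[:, 0], ij[:, 1]), z)  # max height per pillar
    return np.where(np.isinf(bev), 0.0, bev)     # empty pillars -> 0

pts = np.random.uniform(-50, 50, size=(10000, 3)).astype(np.float32)
bev_map = pillarize(pts)
print(bev_map.shape)  # (128, 128) pseudo-image fed to the 2D backbone
```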
Current large language models (LLMs) typically run on servers equipped with high-performance GPUs. However, this requires sending user data to the server, which raises security concerns and incurs high costs. To address these issues, we are researching hardware architectures and software techniques that enable efficient on-device LLM inference.
The proposed system integrates edge AI-based image classification models capable of real-time wildfire recognition directly on embedded devices, thereby reducing reliance on cloud processing and minimizing latency.
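A minimal sketch of what such on-device inference can look like, using the tflite_runtime interpreter with a quantized classifier; the model file name, input size, and class index here are placeholders, not the system's actual artifacts.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter  # or tf.lite on a workstation

# "wildfire.tflite" is a placeholder for a quantized two-class
# (fire / no-fire) classifier exported for the embedded target.
interpreter = Interpreter(model_path="wildfire.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(frame_u8: np.ndarray) -> float:
    """Run one camera frame (H, W, 3, uint8) through the model."""
    x = np.expand_dims(frame_u8, 0).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"])[0]
    return float(scores[1])  # assumed index of the "fire" class

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in camera frame
print("fire score:", classify(frame))
```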
The hardware architecture consists of solar-powered modules with battery backup, enabling continuous operation in remote environments. Each observation unit integrates a processor, camera, weather sensors, and a LoRa communication module, forming a low-power distributed monitoring network that reports environmental and visual data to a central control unit.
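Conceptually, each observation unit can be pictured as a duty-cycled sense-classify-report loop such as the sketch below; every function name, payload format, and interval here is an illustrative placeholder, not the deployed firmware.

```python
import time

REPORT_PERIOD_S = 600  # assumed 10-minute reporting interval

def read_weather():      # placeholder for the weather-sensor driver
    return {"temp_c": 23.1, "rh_pct": 41.0}

def capture_and_score(): # placeholder: camera capture + edge classifier
    return 0.02          # fire probability

def lora_send(payload: bytes):  # placeholder for the LoRa radio driver
    print("tx:", payload)

while True:
    sample = read_weather()
    sample["fire_score"] = capture_and_score()
    # A compact key=value payload keeps LoRa airtime (and energy) low.
    msg = ",".join(f"{k}={v}" for k, v in sample.items()).encode()
    lora_send(msg)
    time.sleep(REPORT_PERIOD_S)  # real nodes would deep-sleep here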
To evaluate system efficiency, power consumption analysis was conducted using the NI PXIe-6363 DAQ and shunt resistors, with GPIO synchronization ensuring accurate temporal alignment between data acquisition and system operation. LabVIEW was used to implement the power measurement interface, while MATLAB-based preprocessing (including moving-average filtering) enabled detailed analysis and comparison of energy usage across devices.
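The post-processing reduces to recovering current from the shunt voltage drop, converting it to power, and smoothing with a moving average. The sketch below reproduces that arithmetic on a synthetic trace; the shunt value, supply voltage, and sampling rate are assumptions, not the measured setup.

```python
import numpy as np

R_SHUNT = 0.1   # ohms, assumed shunt value
V_SUPPLY = 5.0  # volts, assumed supply rail
FS = 10_000     # Hz, assumed DAQ sampling rate

def power_from_shunt(v_shunt: np.ndarray) -> np.ndarray:
    """P = V_supply * I, with I recovered from the shunt drop."""
    return V_SUPPLY * (v_shunt / R_SHUNT)

def moving_average(x: np.ndarray, win: int) -> np.ndarray:
    return np.convolve(x, np.ones(win) / win, mode="same")

# Synthetic trace standing in for a PXIe-6363 capture.
t = np.arange(0, 1.0, 1 / FS)
v_shunt = 0.05 + 0.005 * np.random.randn(t.size)
p = moving_average(power_from_shunt(v_shunt), win=100)
energy_j = np.sum(p) / FS  # energy over the 1 s capture window
print(f"mean power {p.mean():.3f} W, energy {energy_j:.3f} J")
```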
Additionally, LoRa communication was implemented using an STM32 microcontroller, validating the system’s long-range, low-power data transmission capabilities.
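Because airtime dominates a LoRa node's transmit energy, it is worth budgeting explicitly. The sketch below computes LoRa time-on-air with the standard formula from the Semtech SX127x datasheet; the radio parameters are example values, not the system's actual configuration.

```python
import math

def lora_time_on_air(payload_len, sf=7, bw=125_000, cr=1,
                     preamble=8, crc=True, implicit_header=False,
                     low_dr_opt=False):
    """LoRa time-on-air (seconds) per the Semtech SX127x datasheet formula."""
    t_sym = (2 ** sf) / bw
    de = 1 if low_dr_opt else 0
    ih = 1 if implicit_header else 0
    num = 8 * payload_len - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    n_payload = 8 + max(math.ceil(num / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble + 4.25) * t_sym + n_payload * t_sym

# Example: 24-byte sensor report at SF7 / 125 kHz, coding rate 4/5.
toa = lora_time_on_air(24)
print(f"time on air: {toa * 1e3:.1f} ms")  # ~61.7 ms
```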
A parallel computing structure built on processing-element (PE) arrays and vector-unit (VU) arrays has strong potential to maximize computational efficiency for large-scale sequence models. In this study, we propose a chiplet-oriented hardware architecture that accelerates state-space model (SSM)-based LLMs such as Mamba2 by explicitly separating the processing characteristics of the prefill and decode stages. The proposed architecture employs a PE array to efficiently handle the prefill stage, which processes long input sequences in parallel, while the VU array is optimized for the decode stage, which performs token-by-token state updates sequentially. A high-level simulator is developed to analyze cycle efficiency and minimize precision loss across both stages. The architecture ultimately targets implementation at the chip or chiplet level, providing an effective hardware solution for next-generation edge AI accelerators.
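To make the prefill/decode split concrete, the toy model below mimics the kind of first-order cycle estimate a high-level simulator might start from: prefill saturates the PE array with sequence-parallel matrix multiplies, while decode is a token-serial stream of elementwise state updates on the VU array. All throughput numbers and dimensions are illustrative assumptions, not results from the proposed simulator.

```python
import math

# Assumed, illustrative parameters (not measured hardware numbers).
PE_MACS_PER_CYCLE = 4096   # PE-array throughput for dense matmuls
VU_OPS_PER_CYCLE = 512     # VU-array throughput for elementwise updates

def prefill_cycles(seq_len, d_model, d_state):
    """Prefill: the whole prompt is processed in parallel, matmul-bound."""
    macs = seq_len * d_model * d_state  # rough projection + scan cost
    return math.ceil(macs / PE_MACS_PER_CYCLE)

def decode_cycles(new_tokens, d_model, d_state):
    """Decode: one token at a time, each step updating the SSM state."""
    ops_per_token = d_model * d_state  # elementwise h = A*h + B*x per channel
    return new_tokens * math.ceil(ops_per_token / VU_OPS_PER_CYCLE)

d_model, d_state = 2048, 128
print("prefill (1k tokens):", prefill_cycles(1024, d_model, d_state), "cycles")
print("decode  (1k tokens):", decode_cycles(1024, d_model, d_state), "cycles")
```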