Mamba offers competitive accuracy with superior computational efficiency compared to state-of-the-art transformer models. While this advantage makes Mamba particularly promising for resource-constrained edge devices, no hardware acceleration framework is currently optimized for deploying it in such environments. We present eMamba, an end-to-end hardware acceleration framework designed explicitly for deploying Mamba models on edge platforms. We quantize the entire eMamba pipeline and implement it on an AMD ZCU102 FPGA and as an ASIC in GlobalFoundries (GF) 22 nm technology.
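As a concrete illustration of the kind of quantization step such a pipeline involves, the sketch below applies symmetric per-tensor int8 quantization to a weight matrix. It is a minimal stand-in under our own assumptions, not the eMamba quantization scheme; all function names are placeholders.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w ~ scale * w_q."""
    scale = np.max(np.abs(w)) / 127.0 if np.any(w) else 1.0
    w_q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return w_q, scale

def dequantize(w_q: np.ndarray, scale: float) -> np.ndarray:
    return w_q.astype(np.float32) * scale

# Example: quantize a random projection matrix and check the error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
w_q, s = quantize_int8(w)
err = np.max(np.abs(w - dequantize(w_q, s)))
print(f"scale={s:.5f}, max abs error={err:.5f}")
```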
Traditional point cloud detection models, such as those operating on LiDAR or 4D radar data, are accurate but computationally intensive and ill-suited for low-power edge environments. To address these shortcomings, we propose EdgePillars, which combines a fast, simple voxel-based encoder with the low-latency backbone of pillar-based models.
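For intuition, the sketch below shows a generic PointPillars-style pillarization step: points are scattered into a 2D grid and pooled per pillar into a pseudo-image that a 2D backbone can consume. This is an illustration under our own assumptions, not the EdgePillars encoder; real encoders pool learned per-point features rather than raw height.

```python
import numpy as np

def pillarize(points: np.ndarray, grid=(128, 128), extent=50.0):
    """Scatter points (N, 3) into a 2D pillar grid, max-pooling height per pillar."""
    h, w = grid
    # Map x, y in [-extent, extent) to integer pillar indices.
    ij = ((points[:, :2] + extent) / (2 * extent) * [h, w]).astype(int)
    valid = (ij >= 0).all(axis=1) & (ij[:, 0] < h) & (ij[:, 1] < w)
    ij, z = ij[valid], points[valid, 2]
    bev = np.full(grid, -np.inf, dtype=np.float32)
    np.maximum.at(bev, (ij[:, 0], ij[:, 1]), z)  # max height per pillar
    return np.where(np.isinf(bev), 0.0, bev)     # empty pillars -> 0

pts = np.random.uniform(-50, 50, size=(10000, 3)).astype(np.float32)
bev_map = pillarize(pts)
print(bev_map.shape)  # (128, 128) pseudo-image fed to the 2D backbone
```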
Current large language models (LLMs) typically run on servers equipped with high-performance GPUs. However, this requires sending user data to the server, which raises security concerns and incurs high costs. To address these issues, we are researching hardware architectures and software techniques that enable efficient on-device LLM inference.
The proposed system integrates edge AI-based image classification models capable of real-time wildfire recognition directly on embedded devices, thereby reducing reliance on cloud processing and minimizing latency.
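A minimal sketch of what such on-device inference can look like, using the tflite_runtime interpreter with a quantized classifier; the model file name, input size, and class index here are placeholders, not the system's actual artifacts.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter  # or tf.lite on a workstation

# "wildfire.tflite" is a placeholder for a quantized two-class
# (fire / no-fire) classifier exported for the embedded target.
interpreter = Interpreter(model_path="wildfire.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

def classify(frame_u8: np.ndarray) -> float:
    """Run one camera frame (H, W, 3, uint8) through the model."""
    x = np.expand_dims(frame_u8, 0).astype(inp["dtype"])
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
    scores = interpreter.get_tensor(out["index"])[0]
    return float(scores[1])  # assumed index of the "fire" class

frame = np.zeros((224, 224, 3), dtype=np.uint8)  # stand-in camera frame
print("fire score:", classify(frame))
```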
The hardware architecture consists of solar-powered modules with battery backup, enabling continuous operation in remote environments. Each observation unit integrates a processor, camera, weather sensors, and a LoRa communication module, forming a low-power distributed monitoring network that reports environmental and visual data to a central control unit.
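Conceptually, each observation unit can be pictured as a duty-cycled sense-classify-report loop such as the sketch below; every function name, payload format, and interval here is an illustrative placeholder, not the deployed firmware.

```python
import time

REPORT_PERIOD_S = 600  # assumed 10-minute reporting interval

def read_weather():      # placeholder for the weather-sensor driver
    return {"temp_c": 23.1, "rh_pct": 41.0}

def capture_and_score(): # placeholder: camera capture + edge classifier
    return 0.02          # fire probability

def lora_send(payload: bytes):  # placeholder for the LoRa radio driver
    print("tx:", payload)

while True:
    sample = read_weather()
    sample["fire_score"] = capture_and_score()
    # A compact key=value payload keeps LoRa airtime (and energy) low.
    msg = ",".join(f"{k}={v}" for k, v in sample.items()).encode()
    lora_send(msg)
    time.sleep(REPORT_PERIOD_S)  # real nodes would deep-sleep here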
To evaluate system efficiency, power consumption analysis was conducted using the NI PXIe-6363 DAQ and shunt resistors, with GPIO synchronization ensuring accurate temporal alignment between data acquisition and system operation. LabVIEW was used to implement the power measurement interface, while MATLAB-based preprocessing (including moving-average filtering) enabled detailed analysis and comparison of energy usage across devices.
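The post-processing reduces to recovering current from the shunt voltage drop, converting it to power, and smoothing with a moving average. The sketch below reproduces that arithmetic on a synthetic trace; the shunt value, supply voltage, and sampling rate are assumptions, not the measured setup.

```python
import numpy as np

R_SHUNT = 0.1   # ohms, assumed shunt value
V_SUPPLY = 5.0  # volts, assumed supply rail
FS = 10_000     # Hz, assumed DAQ sampling rate

def power_from_shunt(v_shunt: np.ndarray) -> np.ndarray:
    """P = V_supply * I, with I recovered from the shunt drop."""
    return V_SUPPLY * (v_shunt / R_SHUNT)

def moving_average(x: np.ndarray, win: int) -> np.ndarray:
    return np.convolve(x, np.ones(win) / win, mode="same")

# Synthetic trace standing in for a PXIe-6363 capture.
t = np.arange(0, 1.0, 1 / FS)
v_shunt = 0.05 + 0.005 * np.random.randn(t.size)
p = moving_average(power_from_shunt(v_shunt), win=100)
energy_j = np.sum(p) / FS  # energy over the 1 s capture window
print(f"mean power {p.mean():.3f} W, energy {energy_j:.3f} J")
```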
Additionally, LoRa communication was implemented using an STM32 microcontroller, validating the system’s long-range, low-power data transmission capabilities.
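Because airtime dominates a LoRa node's transmit energy, it is worth budgeting explicitly. The sketch below computes LoRa time-on-air with the standard formula from the Semtech SX127x datasheet; the radio parameters are example values, not the system's actual configuration.

```python
import math

def lora_time_on_air(payload_len, sf=7, bw=125_000, cr=1,
                     preamble=8, crc=True, implicit_header=False,
                     low_dr_opt=False):
    """LoRa time-on-air (seconds) per the Semtech SX127x datasheet formula."""
    t_sym = (2 ** sf) / bw
    de = 1 if low_dr_opt else 0
    ih = 1 if implicit_header else 0
    num = 8 * payload_len - 4 * sf + 28 + 16 * int(crc) - 20 * ih
    n_payload = 8 + max(math.ceil(num / (4 * (sf - 2 * de))) * (cr + 4), 0)
    return (preamble + 4.25) * t_sym + n_payload * t_sym

# Example: 24-byte sensor report at SF7 / 125 kHz, coding rate 4/5.
toa = lora_time_on_air(24)
print(f"time on air: {toa * 1e3:.1f} ms")  # ~61.7 ms
```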
A parallel computing structure built on processing-element (PE) arrays and vector-unit (VU) arrays has strong potential to maximize computational efficiency for large-scale sequence models. In this study, we propose a chiplet-oriented hardware architecture that accelerates state-space model (SSM)-based LLMs such as Mamba2 by explicitly separating the processing characteristics of the prefill and decode stages. The proposed architecture employs a PE array to efficiently handle the prefill stage, which processes long input sequences in parallel, while the VU array is optimized for the decode stage, which performs token-by-token state updates sequentially. A high-level simulator is developed to analyze cycle efficiency and minimize precision loss across both stages. The architecture ultimately targets implementation at the chip or chiplet level, providing an effective hardware solution for next-generation edge AI accelerators.
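To make the prefill/decode split concrete, the toy model below mimics the kind of first-order cycle estimate a high-level simulator might start from: prefill saturates the PE array with sequence-parallel matrix multiplies, while decode is a token-serial stream of elementwise state updates on the VU array. All throughput numbers and dimensions are illustrative assumptions, not results from the proposed simulator.

```python
import math

# Assumed, illustrative parameters (not measured hardware numbers).
PE_MACS_PER_CYCLE = 4096   # PE-array throughput for dense matmuls
VU_OPS_PER_CYCLE = 512     # VU-array throughput for elementwise updates

def prefill_cycles(seq_len, d_model, d_state):
    """Prefill: the whole prompt is processed in parallel, matmul-bound."""
    macs = seq_len * d_model * d_state  # rough projection + scan cost
    return math.ceil(macs / PE_MACS_PER_CYCLE)

def decode_cycles(new_tokens, d_model, d_state):
    """Decode: one token at a time, each step updating the SSM state."""
    ops_per_token = d_model * d_state  # elementwise h = A*h + B*x per channel
    return new_tokens * math.ceil(ops_per_token / VU_OPS_PER_CYCLE)

d_model, d_state = 2048, 128
print("prefill (1k tokens):", prefill_cycles(1024, d_model, d_state), "cycles")
print("decode  (1k tokens):", decode_cycles(1024, d_model, d_state), "cycles")
```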