One-line description: an algorithm-first ASIC for energy-efficient inference of MatMul-Free large language models, pairing a RISC-V control core with addition-centric hardware accelerators.
Keywords - Large Language Models (LLMs), Energy-Efficient Inference, Hardware Acceleration
In recent years, Large Language Models (LLMs) have emerged as the core technical backbone of natural language processing, enabling advanced capabilities such as text generation, translation, summarization, and reasoning. Models like ChatGPT, DeepSeek, and others are profoundly reshaping human experiences worldwide. At the epicenter of this revolution lies the Transformer architecture.
With superior hardware parallelism and long-range dependency modeling capabilities, the Transformer rapidly superseded traditional Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) in model design nearly a decade ago. Its fundamental computational units, multi-head self-attention and feedforward layers, give language models their core text-understanding capability. Critically, these modules rely on dense matrix multiplication (MatMul) for feature transformation and fusion. In essence, the Transformer converts input text into sequential encodings, projects them into linear algebraic vectors, and iteratively reprojects, weights, and fuses these representations through extensive matrix operations to generate outputs.
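To make the cost concrete, here is a minimal C sketch of a single dense projection, the primitive that attention and feedforward layers repeat many times per token; the dimensions and names are illustrative, not taken from any specific model.

```c
#include <stddef.h>

/* One dense projection y = W * x, the building block of attention and
 * feedforward layers. For hidden sizes d_out and d_in, this single call
 * already costs d_out * d_in multiply-accumulate (MAC) operations and
 * reads d_out * d_in weights, and a Transformer repeats it many times
 * per token. */
void dense_projection(const float *W,   /* d_out x d_in, row-major */
                      const float *x,   /* d_in input activations  */
                      float *y,         /* d_out outputs           */
                      size_t d_out, size_t d_in)
{
    for (size_t i = 0; i < d_out; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < d_in; j++) {
            acc += W[i * d_in + j] * x[j];   /* one MAC per weight */
        }
        y[i] = acc;
    }
}
```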
However, this architecture incurs massive computational overhead: Matrix multiplication exhibits high computational complexity and extreme memory bandwidth dependency, creating critical bottlenecks in energy consumption and data transfer latency during deployment. As models scale from millions to billions of parameters, the Transformer’s underlying computational paradigm reveals inherent scalability limitations - particularly on local devices and low-power edge platforms.
To break this computational barrier, researchers have proposed several paradigm shifts. Our focus centers on State Space Models (SSMs), which introduce linear recurrent structures to replace global attention computations. SSMs not only support long-sequence modeling but also drastically reduce memory access costs. Building on this, the recently prominent MatMul-Free architecture eliminates matrix multiplication entirely, instead utilizing ternary weighting, element-wise operations, and gated linear units for all modeling tasks. Empirical results demonstrate that at billion-parameter scales, MatMul-Free models retain Transformer-level semantic modeling capabilities while significantly reducing computational complexity and model size.
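As a simplified sketch of the idea (not the reference implementation), the same projection becomes multiplication-free once the weights are constrained to {-1, 0, +1}:

```c
#include <stddef.h>
#include <stdint.h>

/* Ternary "dense" layer: weights are restricted to {-1, 0, +1}, so every
 * multiply collapses into an add, a subtract, or a skip. The activation
 * scaling/quantization used by real ternary (BitLinear-style) layers is
 * omitted here for clarity. */
void ternary_dense(const int8_t *W,   /* d_out x d_in, values in {-1, 0, +1} */
                   const float  *x,   /* d_in input activations              */
                   float        *y,   /* d_out outputs                       */
                   size_t d_out, size_t d_in)
{
    for (size_t i = 0; i < d_out; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < d_in; j++) {
            int8_t w = W[i * d_in + j];
            if (w > 0)      acc += x[j];   /* +1: add      */
            else if (w < 0) acc -= x[j];   /* -1: subtract */
            /*  0: skip, no arithmetic at all */
        }
        y[i] = acc;
    }
}
```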
Yet a critical challenge persists: These efficient algorithms lack native support in mainstream AI hardware🥲.
Modern high-performance hardware (GPUs, TPUs) is architecturally bound to matrix multiplication—evidenced by tensor core arrays and high-throughput multiply-accumulate (MAC) units. While hardware-algorithm co-evolution historically propelled AI advancements, it has also created path dependency, forcing algorithms to conform to hardware constraints. This fundamentally limits the energy efficiency potential of novel models like MatMul-Free and SSMs, creating an "invisible ceiling" for efficient AI deployment.
This is the foundation of our work.
We propose an Application-Specific Integrated Circuit (ASIC) architecture designed algorithm-first, entirely decoupled from multiplicative arrays. Our solution natively executes models through addition-centric operations, sign-flipping, and element-wise gating. This dedicated hardware aims to fully unleash the energy efficiency, bandwidth savings, and cost advantages of MatMul-Free and SSM paradigms, enabling future LLM deployment in localized, personalized, and environmentally sustainable scenarios.
We believe this represents not merely a technical innovation, but a pivotal step toward a sustainable future for artificial intelligence.🌳
Keywords - FPGA, Loihi 2
The MatMul-Free model innovatively employs ternary-weight dense layers and gated recurrent units (GRUs) as token mixers. Experiments demonstrate that their 2.7B-parameter model achieves performance comparable to Transformer++ across various zero-shot benchmarks, while reducing inference memory usage by over 10× (4.19 GB vs. 48.5 GB) and lowering latency by 4.6× (695.5 ms vs. 3183.1 ms).
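To give a sense of what element-wise token mixing means in practice, the sketch below shows a simplified gated recurrence step in C; the gate equations are illustrative and do not reproduce the paper's exact token-mixer formulation.

```c
#include <math.h>
#include <stddef.h>

/* Simplified element-wise gated recurrence in the spirit of a GRU-style
 * token mixer: the hidden state is updated with only element-wise
 * operations, one lane per hidden channel. In the real model, f and c
 * would come from ternary projections of the current token; here they
 * are passed in directly. Illustrative sketch only. */
void gated_recurrence_step(const float *f,   /* forget-gate pre-activations, size d */
                           const float *c,   /* candidate values, size d            */
                           float *h,         /* hidden state, updated in place      */
                           size_t d)
{
    for (size_t i = 0; i < d; i++) {
        float gate = 1.0f / (1.0f + expf(-f[i]));    /* sigmoid gate        */
        h[i] = gate * h[i] + (1.0f - gate) * c[i];   /* element-wise blend  */
    }
}
```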
To further exploit hardware-level advantages, the team deployed their model on an FPGA platform (D5005 Stratix 10). Through a custom RTL implementation and a fused computational kernel combining RMSNorm and BitLinear operations, they effectively mitigated memory bandwidth bottlenecks. Running at just 13W of power, the system achieved a generation speed of 62 tokens per second—approaching the energy efficiency of the human brain. This work highlights the enormous potential of algorithm–hardware co-design.
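As a rough illustration of the fusion idea (our own C sketch, not the authors' RTL kernel), the activations are normalized and consumed directly by the ternary accumulation instead of being written back to memory in between; the function names and epsilon value are illustrative.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Fused RMSNorm + ternary (BitLinear-style) layer, sketched at C level:
 * normalized activations feed straight into the add/sub accumulation,
 * so no intermediate normalized tensor is written to memory. A real
 * fused kernel would hold the normalized values in on-chip registers
 * or BRAM rather than recompute them per output row. */
void fused_rmsnorm_ternary(const float  *x,   /* d_in activations            */
                           const float  *g,   /* d_in RMSNorm gains          */
                           const int8_t *W,   /* d_out x d_in ternary weights */
                           float *y, size_t d_out, size_t d_in)
{
    /* Pass 1: root-mean-square of the input */
    float ss = 0.0f;
    for (size_t j = 0; j < d_in; j++) ss += x[j] * x[j];
    float inv_rms = 1.0f / sqrtf(ss / (float)d_in + 1e-6f);

    /* Pass 2: normalize on the fly and accumulate with add/sub only */
    for (size_t i = 0; i < d_out; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < d_in; j++) {
            int8_t w = W[i * d_in + j];
            if (w == 0) continue;               /* zero weight: skip        */
            float xn = x[j] * inv_rms * g[j];   /* normalized input element */
            acc += (w > 0) ? xn : -xn;          /* sign select, no multiply */
        }
        y[i] = acc;
    }
}
```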
In a parallel direction, Abreu et al. (2025) ported this architecture to Intel's neuromorphic platform, Loihi 2. By converting all key operations—including sigmoid and inverse square root—into fixed-point implementations, they achieved up to a 3× increase in throughput and a 2× reduction in energy consumption compared to an edge GPU (Jetson Orin Nano) during generation tasks. Thanks to Loihi 2's in-memory computing and asynchronous event-driven execution, the model maintained constant power and latency during long-sequence prefill tasks, achieving over 6600 tokens per second with an ultra-low energy cost of just 3.7 mJ per token.
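As an example of the kind of conversion this requires (our own Q16.16 sketch, not the Loihi 2 implementation), an inverse square root can be computed with integer arithmetic and a few Newton-Raphson iterations:

```c
#include <stdint.h>

/* Q16.16 fixed-point helpers: an illustrative sketch of the kind of
 * integer-only arithmetic needed on such hardware. */
#define Q16_ONE (1 << 16)

static int32_t q16_mul(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * (int64_t)b) >> 16);
}

/* 1/sqrt(x) for x > 0 in Q16.16 via Newton-Raphson: y <- y*(3 - x*y*y)/2.
 * Assumes x lies in a moderate range (roughly 2^-14 to 2^14, as for the
 * mean of squares inside RMSNorm), so intermediates fit in Q16.16.
 * More iterations give more precision. */
static int32_t q16_rsqrt(int32_t x)
{
    /* Initial guess from the integer log2: real(x) ~ 2^e  =>  y ~ 2^(-e/2) */
    int e = -16;                              /* exponent of the real value */
    for (int32_t t = x; t > 1; t >>= 1) e++;
    int32_t y = (e >= 0) ? (Q16_ONE >> ((e + 1) / 2))
                         : (Q16_ONE << ((-e) / 2));

    for (int i = 0; i < 4; i++) {
        int32_t xy2 = q16_mul(x, q16_mul(y, y));
        y = q16_mul(y, 3 * Q16_ONE - xy2) >> 1;
    }
    return y;
}
```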
Despite these promising results, both FPGA and neuromorphic implementations fall short of fully unlocking the theoretical potential of MatMul-free architectures. FPGAs, while reconfigurable, are fundamentally general-purpose and constrained in their ability to achieve optimal power-performance trade-offs for specific workloads. Neuromorphic chips, though highly efficient, are tailored for event-driven spiking neural networks; adapting LLMs to such architectures requires significant translation and optimization, and may not align naturally with the computational flow of these models.
Keywords - SRAM Usage, Clock Cycle, Block Diagram
We first implemented a nano MatMul-Free language model inspired by NanoGPT, with the primary objective of validating the dataflow within the inference pipeline. This initial implementation allowed us to trace how data propagates through embedding layers, accumulation blocks, normalization steps, and activation functions. By reproducing the full forward path in software, we ensured that our system correctly handles intermediate states and memory interactions on embedded targets.
Then, to evaluate memory layout and embedded feasibility, we ported the entire inference pipeline to the SiFive S21 RISC-V processor, implementing it in C. We used Valgrind to monitor SRAM usage. To assess performance, we inserted labeled memory writes at the end of key modules and captured execution timelines using Cadence Xcelium. This allowed precise measurement of per-module cycle counts and identification of compute bottlenecks.
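The mechanism can be sketched as follows; the marker address, macro, and stage names are placeholders chosen for illustration rather than the project's actual symbols.

```c
#include <stdint.h>

/* Hypothetical marker address reserved in the memory map for profiling;
 * the real address and macro name in our code base may differ. */
#define PROFILE_MARK_ADDR  0x20000000u

/* Writing a module ID to the marker address creates a store that is easy
 * to locate in the Xcelium waveform/log, giving a cycle-accurate timestamp
 * for the end of each pipeline stage. 'volatile' keeps the compiler from
 * optimizing the store away. */
#define PROFILE_MARK(module_id) \
    (*(volatile uint32_t *)PROFILE_MARK_ADDR = (uint32_t)(module_id))

enum { STAGE_EMBEDDING = 1, STAGE_TERNARY_ACC, STAGE_RMSNORM, STAGE_ACTIVATION };

void run_layer_stub(void)
{
    /* ... embedding lookup ... */
    PROFILE_MARK(STAGE_EMBEDDING);
    /* ... ternary accumulation ... */
    PROFILE_MARK(STAGE_TERNARY_ACC);
    /* ... normalization, then activation ... */
    PROFILE_MARK(STAGE_RMSNORM);
    PROFILE_MARK(STAGE_ACTIVATION);
}
```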
We propose a custom system-on-chip (SoC) architecture for MatMul-Free LLM inference, combining a RISC-V processor core with dedicated hardware accelerators. The RISC-V core handles control flow and irregular computations, while major arithmetic operations are offloaded to dedicated accelerators. The design targets a 28 nm process and sub-GHz clock rate for energy-efficient edge deployment.
The core acceleration unit is a ternary matrix accumulator that replaces conventional MAC arrays. It performs large-scale additions and subtractions based on ternary weights, using parallel adder trees and sign-multiplexing logic. Zero weights are skipped to reduce redundant computation. Configurable parallelism enables performance-area trade-offs.
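A behavioral C model of this datapath (a sketch of the intended dataflow, not the RTL) is shown below; the lane count P and the input quantization are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>

#define P 8   /* configurable number of parallel lanes (illustrative value) */

/* Behavioral model of the ternary accumulator: P lanes each apply a
 * sign-select instead of a multiply, and the lane results are reduced as
 * an adder tree would. Zero weights contribute nothing and, in hardware,
 * can be gated off entirely. */
int32_t ternary_accumulate_row(const int8_t *w_row,   /* d_in ternary weights  */
                               const int32_t *x,      /* d_in quantized inputs */
                               size_t d_in)
{
    int32_t acc = 0;
    for (size_t j = 0; j < d_in; j += P) {
        int32_t lane_sum = 0;                /* one adder-tree reduction */
        for (size_t k = 0; k < P && j + k < d_in; k++) {
            int8_t w = w_row[j + k];
            if (w == 0) continue;            /* zero-skip: no operand fetch */
            lane_sum += (w > 0) ? x[j + k] : -x[j + k];   /* sign mux */
        }
        acc += lane_sum;
    }
    return acc;
}
```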
Other key modules include:
Ternary Accumulator Banks: Multiple parallel accumulators enable pipelined or channel-parallel processing, improving data reuse and throughput for sequential inference steps.
Nonlinearity Accelerator: Supports activation functions such as Sigmoid and SiLU using piecewise-linear approximation or LUT-based evaluation for low-latency response (see the sketch after this list).
RMSNorm/LayerNorm Block: Implements normalization via pipelined square, mean, and reciprocal square root operations. May share computation resources with the main processor or use dedicated datapaths.
Element-wise Processor: Executes operations such as vector addition, gating, masking, or conditional branching. Can be programmable (e.g., micro-op style) to support GLU/GRU behavior.
GLU Gate Unit: Computes Hadamard products required in gated layers using either bit-parallel multiplier arrays or RISC-V instruction-level processing.
AXI Bus Interface: Standardized memory-mapped interconnect enabling synchronization and communication between RISC-V CPU and accelerator blocks. Supports pipelined reads/writes with burst and response channels.
SiFive S21 RISC-V Controller: Manages sequencing and coordination of hardware tasks, including DMA triggering, status monitoring, and accelerator scheduling.
SPI Interfaces: Provide serial connectivity for external configuration, debugging, or lightweight data input/output during emulation and testing.
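As a software reference for how the Nonlinearity Accelerator and the GLU Gate Unit behave, the sketch below uses a three-segment piecewise-linear sigmoid and element-wise gating; the breakpoints, slope, and function names are illustrative choices rather than final hardware parameters.

```c
#include <stddef.h>

/* Piecewise-linear sigmoid with three segments (saturate / linear / saturate).
 * Breakpoints and slope are illustrative; the hardware block may use more
 * segments or a LUT for tighter accuracy. */
static float pwl_sigmoid(float x)
{
    if (x <= -4.0f) return 0.0f;
    if (x >=  4.0f) return 1.0f;
    return 0.5f + 0.125f * x;        /* linear segment through (0, 0.5) */
}

/* SiLU(x) = x * sigmoid(x), built from the PWL sigmoid above. */
static float pwl_silu(float x)
{
    return x * pwl_sigmoid(x);
}

/* GLU-style gating: element-wise (Hadamard) product of a value path and an
 * activated gate path, the operation handled by the GLU gate unit. */
void glu_gate(const float *value, const float *gate_in, float *out, size_t d)
{
    for (size_t i = 0; i < d; i++) {
        out[i] = value[i] * pwl_silu(gate_in[i]);
    }
}
```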
We conducted an operation-count and memory-access analysis across multiple hidden dimensions (D = 128, 256, 512, 1024, 2048) for the MatMul-Free language model, where D = 1024 and D = 2048 correspond to the 340M- and 1.3B-parameter models in the original MatMul-Free article. The goal was to uncover the computational bottlenecks and resource demands of the architecture.
The first chart illustrates the total number of operations per model dimension, categorized into three types: traditional multiply-accumulate (MAC), ternary operations (i.e., addition/subtraction with weights in {-1, 0, +1}), and nonlinear functions such as activations and normalization.
As the model width increases, ternary operations grow rapidly and become the dominant contributor to overall computation. At D=2048, ternary operations account for over 60% of the total instruction volume, far surpassing MAC and nonlinear components combined. This trend is consistent with the structural design of MatMul-Free networks, where large-scale additions replace expensive multiplications, leading to simpler but denser arithmetic patterns.
This data clearly underscores a key architectural insight: in hardware implementations, accelerating MatMul-Free models requires prioritizing add/sub-based accumulation units over multipliers. Consequently, in our SoC design, we dedicate a significant portion of the datapath to parallel ternary accumulators, which are capable of handling high-throughput, sparse, and low-precision computations efficiently.
The second chart shows memory access volumes on a logarithmic scale, broken down into weight reads, activation reads, and activation writes. As the model width increases, total access volume grows steeply, particularly for weight and activation reads.
At D=2048, the number of activation reads exceeds 10^8 per token inference, while write operations remain relatively modest. This “read-heavy, write-light” profile has important implications for system design: we must provision sufficiently large, multi-port on-chip SRAM to prevent frequent and costly external memory accesses. Furthermore, DMA-driven bulk transfers and local buffer reuse within accelerator modules are essential for hiding memory latency and sustaining high throughput.
Keywords - RTL Design, Synthesis
Having completed a detailed performance characterization of the MatMul-Free language model, we have identified key computational and memory access patterns that will guide our hardware design. In particular, the dominance of ternary accumulation operations and the high memory read intensity strongly motivate architectural specialization.
Our next step is to transition from high-level profiling to register-transfer level (RTL) implementation in SystemVerilog. This phase will involve developing the dedicated accelerator modules described above, including the ternary accumulator banks, the nonlinearity accelerator, the RMSNorm/LayerNorm block, and the element-wise/GLU units.
All modules will be implemented in SystemVerilog, verified via module-level simulation (e.g., Cadence Xcelium), and later integrated into a full SoC design. The RISC-V core will serve as a control processor, issuing commands to hardware accelerators and managing data flow. Careful attention will be paid to pipeline balancing, resource sharing, and area-performance trade-offs in the RTL design.
Ultimately, this RTL implementation will serve as the foundation for FPGA prototyping and potential ASIC tape-out, validating the viability of a low-power, memory-efficient inference engine for MatMul-Free large language models.