The co-design of hardware and software is essential to overcoming the "Memory Wall" in today's von Neumann architectures. My research aims to break this barrier by developing Processing-in-Memory (PIM) architectures designed from the ground up around the mathematical requirements of AI workloads.
A primary bottleneck in deploying state-of-the-art Large Language Models (LLMs) is the enormous energy cost of data movement. Conventional architectures spend much of their energy moving parameters between memory and processor. My goal is to minimize this movement through non-volatile memory technologies and specialized, lightweight compute logic.
Traditional PIM designs often handle integer arithmetic inefficiently. My research introduces a Heterogeneous PIM Architecture that exploits the numerical properties of low-bit quantized LLMs, such as 1-bit or ternary weights.
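To illustrate why ternary weights are attractive for lightweight compute logic, here is a minimal Python sketch of a matrix-vector product whose weights are restricted to -1, 0, and +1. It is a software illustration only, not the hardware data path itself; the function name `ternary_matvec` and the example values are hypothetical.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary weight matrix by an integer activation vector.

    Every weight is assumed to be -1, 0, or +1, so each output element is
    built from additions and subtractions only; no multiplier is needed.
    """
    out = []
    for row in weights:
        acc = 0
        for w, a in zip(row, x):
            if w == 1:
                acc += a          # +1 weight: add the activation
            elif w == -1:
                acc -= a          # -1 weight: subtract the activation
            elif w != 0:
                raise ValueError(f"non-ternary weight: {w}")
            # 0 weight: the activation is skipped entirely
        out.append(acc)
    return out


# Illustrative values only (not measured data).
W = [[1, 0, -1, 1],
     [0, -1, -1, 0]]
x = [3, 5, 2, 7]
y = ternary_matvec(W, x)
print(y)  # [8, -7]
```

Because each weight only selects "add", "subtract", or "skip", the accumulate step needs no multiplier, which is the property that simple digital logic can exploit.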
Exploiting Structural Sensitivity: My research identifies how sensitive each transformer layer is to computational precision. A heterogeneous compute data path then routes different parts of the network through specialized, optimized hardware paths according to that sensitivity.
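The routing idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Layer` class, the `sensitivity` score, the path names, and the threshold are all hypothetical placeholders, and the numbers are made up for the example rather than taken from any measurement.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    # Hypothetical score: how much accuracy degrades when this layer
    # is executed at reduced precision (higher = more sensitive).
    sensitivity: float


def assign_paths(layers, threshold):
    """Map each layer name to one of two hypothetical compute paths."""
    return {
        layer.name: "high_precision" if layer.sensitivity >= threshold else "low_bit"
        for layer in layers
    }


# Placeholder layers and scores, for illustration only.
layers = [
    Layer("attn.qkv", 0.02),
    Layer("attn.out", 0.15),
    Layer("mlp.up", 0.01),
]
plan = assign_paths(layers, threshold=0.10)
print(plan)
```

A real design would likely derive the sensitivity scores from profiling and might use more than two paths; the single threshold here just keeps the sketch short.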
Digital RRAM Optimization: The architecture is designed for RRAM crossbars but operates entirely in the digital domain. This removes the need for Analog-to-Digital Converters (ADCs), whose power and area overheads are a common bottleneck in analog PIM designs.
Hardware-Validated Modeling: The architecture was validated at two levels:
Circuit Level: 22nm SPICE simulations calibrated with published foundry data.
Model Level: Validated accuracy retention on modern quantized transformer models such as BitNet b1.58.
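For readers unfamiliar with how ternary weights like those in BitNet b1.58 arise, the sketch below shows absmean quantization as described for that model: each weight is divided by the mean absolute weight, rounded, and clipped to the range -1 to +1. It is an illustrative pure-Python version, not the validation flow used in this work; the function name `absmean_ternarize`, the `eps` default, and the sample matrix are assumptions for the example.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize a 2-D list of floats to ternary values {-1, 0, +1}.

    Returns the ternary matrix and the scale gamma (mean absolute weight),
    which can be used to rescale outputs afterwards.
    """
    magnitudes = [abs(w) for row in weights for w in row]
    gamma = sum(magnitudes) / len(magnitudes)
    scale = gamma + eps  # eps guards against division by zero
    quantized = [
        [max(-1, min(1, round(w / scale))) for w in row]  # round, then clip
        for row in weights
    ]
    return quantized, gamma


# Made-up full-precision weights, for illustration only.
W = [[0.9, -0.05, -1.2],
     [0.1, 0.6, -0.4]]
Q, gamma = absmean_ternarize(W)
print(Q)  # [[1, 0, -1], [0, 1, -1]]
```

Small weights collapse to 0 and large ones saturate at -1 or +1, which is why accuracy retention has to be checked on the quantized model rather than assumed.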
I am always eager to collaborate with fellow researchers and engineers on the future of PIM and Hardware-Software Co-design. If you're working on similar challenges in AI acceleration, let's connect and explore how we can push the boundaries of efficient inference together.