HW/SW co-optimizations
VLSI signal processing
Digital VLSI design
Embedded processor architecture
ASIC implementation
Neural Processing Unit (NPU) Design for Emerging Models
LLM acceleration in GPU systems
Chiplet Processor Design
Forward Error Correction (FEC) Decoders for Next-generation Wireless Communications
To achieve both high accuracy and hardware efficiency in large-scale DNN inference, we developed the first Asymmetrically-Quantized bit-Slice GEMM (AQS-GEMM). Whereas previous bit-slice computing skips only the operations on zero slices, AQS-GEMM compresses the frequent nonzero slices generated by asymmetric quantization and skips their operations. A specialized NPU efficiently executes tiled AQS-GEMM workloads, maximizing data reuse and minimizing memory accesses.
Related works: HPCA '25
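The following sketch illustrates the general idea behind the AQS-GEMM summary above: why asymmetric quantization produces a frequent nonzero slice, and how operations on that slice can be skipped without changing the result. It is not the published AQS-GEMM algorithm or NPU dataflow. The 8-bit codes, the 4-bit slices, the function names `quantize_asymmetric` and `sliced_dot`, and the factoring identity used for skipping are all assumptions chosen for illustration.

```python
from collections import Counter


def quantize_asymmetric(values, num_bits=8):
    """Map floats to unsigned codes with a zero point (assumes max > min)."""
    qmax = (1 << num_bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    codes = [min(qmax, max(0, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def sliced_dot(weights, codes, slice_bits=4):
    """Integer dot product computed per bit slice, skipping the most
    frequent high-order slice value (illustrative assumption)."""
    mask = (1 << slice_bits) - 1
    high = [c >> slice_bits for c in codes]
    low = [c & mask for c in codes]

    # Values near zero quantize to codes near the zero point, so one
    # nonzero high slice tends to dominate; plain zero skipping misses it.
    frequent = Counter(high).most_common(1)[0][0]

    # sum(w*h) = sum(w*(h - r)) + r*sum(w): terms with h == r vanish,
    # and r*sum(w) needs no per-element multiplication.
    high_acc = frequent * sum(weights)
    skipped = 0
    for w, h in zip(weights, high):
        if h == frequent:
            skipped += 1
            continue
        high_acc += w * (h - frequent)

    low_acc = sum(w * l for w, l in zip(weights, low))
    return high_acc * (1 << slice_bits) + low_acc, skipped


if __name__ == "__main__":
    activations = [0.02, -0.01, 0.0, 0.03, 1.5, -0.9, 0.01, -0.02]
    weights = [3, -2, 5, 1, -4, 2, 7, -1]
    codes, scale, zero_point = quantize_asymmetric(activations)
    total, skipped = sliced_dot(weights, codes)
    print(codes, zero_point, total, skipped)
```

In this toy setting, the skipped count shows how many high-slice multiplications drop out once the frequent slice is factored out. A hardware design would also need to encode which positions were skipped, and the sketch does not model that.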
As Moore’s law approaches its physical and architectural limits, continued scaling of binary CMOS technology faces challenges in performance, routing, and energy efficiency. To address these limitations, this research explores ternary computing as a promising alternative for post-Moore-era architectures. We propose comprehensive software and hardware frameworks that enable the efficient design of fully functional ternary processors. Using these frameworks, we developed and experimentally validated ART-9, a 9-trit RISC-based ternary processor.
Related works: DATE '23
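To make the 9-trit word size in the ternary-processor summary above concrete, the sketch below encodes integers into nine trits and back. The summary does not state which number representation is used, so the balanced ternary encoding here (trits in {-1, 0, +1}) is an assumption, and the helper names are hypothetical.

```python
TRITS = 9
MAX_VALUE = (3**TRITS - 1) // 2  # largest magnitude a 9-trit balanced word holds


def to_balanced_ternary(n, width=TRITS):
    """Return `width` trits in {-1, 0, 1}, least significant trit first."""
    limit = (3**width - 1) // 2
    if not -limit <= n <= limit:
        raise ValueError(f"{n} does not fit in {width} balanced trits")
    trits = []
    for _ in range(width):
        r = n % 3          # Python's % yields 0, 1, or 2 even for negative n
        if r == 2:
            r = -1         # digit 2 becomes -1 with a carry into the next trit
        trits.append(r)
        n = (n - r) // 3
    return trits


def from_balanced_ternary(trits):
    """Inverse of to_balanced_ternary."""
    return sum(t * 3**i for i, t in enumerate(trits))


def negate(trits):
    """In balanced ternary, negation flips every trit; no carry is needed."""
    return [-t for t in trits]


if __name__ == "__main__":
    word = to_balanced_ternary(5)
    print(word, from_balanced_ternary(word), from_balanced_ternary(negate(word)))
```

Under this assumption a 9-trit word covers a symmetric range around zero, and negation is a trit-wise operation with no carry chain, a property often cited in favor of balanced ternary arithmetic.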
To enable emerging mission-critical applications such as healthcare monitoring, remote surgery, and autonomous driving, 5G/6G ultra-reliable low-latency communication (URLLC) devices must simultaneously deliver ultra-high reliability, low latency, and low power, particularly for short data transmissions. We designed several URLLC FEC decoders that meet all three requirements at the same time.
Related works: ISSCC '24, TCAS-II '24, TCAS-I '23, S.VLSI '22, ASSCC '21