Computer architecture and systems
AI accelerators (GPU, NPU, and PIM)
Hardware-software co-design for AI/ML
Large language models (LLMs) have advanced significantly, but their scale imposes substantial memory demands during inference. To address this, we propose an algorithm-system co-design that reduces GPU memory usage while improving inference performance and maintaining model accuracy, making it feasible to deploy large-scale LLMs efficiently on a single GPU (an illustrative sketch follows the related work below).
Related work: [ASPLOS'24], [ISCA'24]
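The summary above does not name the specific memory-reduction mechanism, so the following is only an illustrative sketch of one common algorithm-level ingredient for fitting large LLMs on a single GPU: group-wise low-bit weight quantization. All function names, parameters, and sizes here are hypothetical, not the method of the papers above.

```python
import numpy as np

def quantize_groupwise(w: np.ndarray, group_size: int = 128, bits: int = 4):
    """Symmetric group-wise quantization: each group of `group_size` weights
    shares one floating-point scale, so 4-bit storage cuts weight memory
    roughly 8x versus fp32 while bounding the error per group."""
    qmax = 2 ** (bits - 1) - 1                       # e.g. 7 for 4-bit
    groups = w.reshape(-1, group_size)               # assumes size % group_size == 0
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(groups / scales), -qmax - 1, qmax).astype(np.int8)
    return q, scales

def dequantize_groupwise(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    """Recover an approximate fp32 weight matrix for (or during) inference."""
    return (q.astype(np.float32) * scales).reshape(shape)

# Hypothetical usage: quantize one 4096x4096 projection matrix.
w = np.random.randn(4096, 4096).astype(np.float32)
q, s = quantize_groupwise(w)
w_hat = dequantize_groupwise(q, s, w.shape)
print("max abs error:", np.abs(w - w_hat).max())
```

Because each group carries its own scale, the quantization error is bounded locally, which is why schemes of this family can shrink the resident weight footprint without sacrificing much accuracy.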
Graph neural networks (GNNs) have emerged as a key technology in application domains where the input data is relational. However, their reliance on sparse matrix multiplication causes inefficient data movement and creates a significant performance bottleneck. To address this, we present an NPU accelerator built around a row-wise product dataflow, co-designing hardware and software to balance locality and parallelism in GNN workloads. This approach achieves substantial energy-efficiency improvements over state-of-the-art NPU accelerators.
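To make the row-wise product concrete, below is a minimal software analogue (Gustavson-style SpMM over a CSR adjacency matrix). It only illustrates why this ordering keeps one output row local until it is complete; the accelerator's actual buffering and scheduling are hardware-specific, and all names here are illustrative.

```python
import numpy as np

def rowwise_spmm(indptr, indices, values, x):
    """Row-wise product SpMM: out[i, :] = sum over nonzeros A[i, k] * x[k, :].
    Each output row is produced by streaming one sparse row of A and
    accumulating scaled dense rows of x, so the partial result for row i
    stays resident until finished (locality) while rows remain independent
    (parallelism)."""
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, x.shape[1]), dtype=x.dtype)
    for i in range(n_rows):                          # rows are independent
        for p in range(indptr[i], indptr[i + 1]):
            out[i] += values[p] * x[indices[p]]
    return out

# Toy CSR graph: 3 nodes, edges 0->1, 1->0, 1->2, 2->2.
indptr  = np.array([0, 1, 3, 4])
indices = np.array([1, 0, 2, 2])
values  = np.ones(4, dtype=np.float32)
x = np.arange(6, dtype=np.float32).reshape(3, 2)     # node features
print(rowwise_spmm(indptr, indices, values, x))
```

Row-wise ordering trades the output reuse of inner-product schemes for streaming access to the sparse operand, which is the locality-versus-parallelism balance the paragraph above refers to.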
Personalized recommendation powers major applications such as ads, video, and e-commerce. However, recommendation models face two key performance bottlenecks: memory-intensive embedding layers and compute-intensive multi-layer perceptron (MLP) layers. To address both, we propose a chiplet-based hybrid accelerator that tackles the memory-throughput limits of the embedding lookups and the compute demands of the MLP layers. We implement and evaluate our design on Intel HARPv2, a package-integrated CPU+FPGA device, achieving significant speedups and energy-efficiency improvements.
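The two bottlenecks map onto the two stages of a DLRM-style model. The sketch below is a generic forward pass (not our accelerator's design) that makes the memory-bound versus compute-bound split concrete; table shapes and function names are hypothetical.

```python
import numpy as np

def recommend_forward(tables, sparse_ids, dense_x, mlp_weights):
    """DLRM-style forward pass split into its two characteristic phases."""
    # Memory-intensive phase: random-access gathers into large embedding
    # tables, then pooling -- dominated by memory throughput, not FLOPs.
    pooled = [tables[t][ids].mean(axis=0) for t, ids in enumerate(sparse_ids)]
    h = np.concatenate([dense_x] + pooled)
    # Compute-intensive phase: dense matrix-vector products through the MLP.
    for w in mlp_weights[:-1]:
        h = np.maximum(h @ w, 0.0)                   # ReLU hidden layers
    return h @ mlp_weights[-1]                       # final score, no activation

# Hypothetical sizes: two embedding tables, 16-dim embeddings, 8 dense features.
rng = np.random.default_rng(0)
tables = [rng.standard_normal((1000, 16)), rng.standard_normal((500, 16))]
sparse_ids = [np.array([3, 42, 7]), np.array([10])]
dense_x = rng.standard_normal(8)
mlp_weights = [rng.standard_normal((40, 64)), rng.standard_normal((64, 1))]
print(recommend_forward(tables, sparse_ids, dense_x, mlp_weights))
```

This split, bandwidth-bound gathers feeding compute-bound dense layers, is presumably why a package-integrated CPU+FPGA platform such as HARPv2 is a natural evaluation vehicle for a hybrid design.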