Processing-in-Memory (PIM) and DRAM Microarchitecture
To overcome the memory bottleneck in large-scale AI and data-intensive applications, we focus on next-generation memory architectures such as Processing-in-Memory (PIM). PIM performs computation directly within memory devices, significantly reducing data-movement overhead. We design and optimize PIM solutions based on modern DRAM technologies, including LPDDR5 and HBM4, taking into account the physical constraints and timing behavior of real memory systems. Our research spans low-level DRAM microarchitecture improvements, the design of PIM-compatible command protocols, and memory scheduling policies that enable efficient collaboration between the processor and PIM units. We also investigate how PIM can be integrated into existing software and system stacks without major changes to application code or memory management. Recent works include cost-efficient LPDDR5-based PIM for on-device small language models and compatible PIM command protocol designs tailored to server-scale systems; a simplified command-encoding sketch follows the publications below.
[CAL 2025] Fold-PIM Architecture
[CAL 2024] Compatible PIM Protocol
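To make the idea of a compatible command protocol concrete, the sketch below shows one way a host might pack a PIM command into reserved address bits of an ordinary DRAM write so that it traverses an unmodified memory controller. The opcode set, field names, and bit layout here are illustrative assumptions, not the encoding used in the published designs.

```cpp
#include <cstdint>

// Hypothetical PIM opcodes; a real protocol defines its own command set.
enum class PimOp : uint8_t { kMac = 0x1, kAdd = 0x2, kCopy = 0x3, kBarrier = 0x4 };

// Assumed command fields: one bank-local operation at a row/column location.
struct PimCommand {
    PimOp    op;    // operation to execute near the bank
    uint8_t  bank;  // target DRAM bank
    uint16_t row;   // row to activate
    uint16_t col;   // starting column
};

// Pack the command into high address bits of a normal write so it passes
// through a conventional controller. The bit layout is purely illustrative.
inline uint64_t EncodePimCommand(const PimCommand& c) {
    return (static_cast<uint64_t>(c.op)   << 40) |
           (static_cast<uint64_t>(c.bank) << 32) |
           (static_cast<uint64_t>(c.row)  << 16) |
            static_cast<uint64_t>(c.col);
}
```

Keeping PIM commands inside the standard command/address path is what makes a protocol "compatible": the controller continues to enforce DRAM timing constraints without knowing that a particular write triggers computation.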
Domain-Specific Acceleration using NPU and PIM
Some applications, such as Fully Homomorphic Encryption (FHE), demand specialized hardware support due to their high computational complexity. We profile these workloads to identify their unique performance bottlenecks and design custom acceleration strategies using NPUs or memory-centric architectures, depending on the operational characteristics of each application. In particular, we are optimizing FHE workloads for commercially available NPU and PIM platforms, leveraging their architectural strengths to improve throughput and energy efficiency. Our goal is to make FHE-based privacy-preserving AI services practical and scalable in real-world cloud environments.
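As a concrete illustration of the kind of kernel that dominates these workloads, the sketch below shows an element-wise modular multiply-accumulate over polynomial coefficients in the NTT domain, the hot loop behind FHE ciphertext multiplication. The choice of modulus (the 64-bit "Goldilocks" prime) and the function names are illustrative and not tied to any specific FHE library.

```cpp
#include <cstdint>
#include <vector>

// Example NTT-friendly prime (2^64 - 2^32 + 1); real FHE schemes use a set
// of such primes in RNS (residue number system) form.
constexpr uint64_t kQ = 0xFFFFFFFF00000001ULL;

// Modular arithmetic via 128-bit intermediates (GCC/Clang extension).
inline uint64_t MulMod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((static_cast<__uint128_t>(a) * b) % kQ);
}
inline uint64_t AddMod(uint64_t a, uint64_t b) {
    return static_cast<uint64_t>((static_cast<__uint128_t>(a) + b) % kQ);
}

// c[i] += a[i] * b[i] (mod q) in the NTT domain: pointwise polynomial
// multiplication, the throughput-critical inner loop of FHE ciphertext
// operations.
void NttPointwiseMac(std::vector<uint64_t>& c,
                     const std::vector<uint64_t>& a,
                     const std::vector<uint64_t>& b) {
    for (size_t i = 0; i < c.size(); ++i)
        c[i] = AddMod(c[i], MulMod(a[i], b[i]));
}
```

Because this loop is embarrassingly parallel and streams large coefficient vectors with little reuse, it rewards architectures that place MAC units close to memory, which is exactly the property such profiling looks for when mapping FHE onto NPUs or PIM.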
Advanced Memory System Management
Modern memory systems are becoming increasingly complex, combining heterogeneous components such as CXL-attached memory, NUMA-aware processors, and tiered memory hierarchies. We investigate system-level techniques to manage these resources efficiently, aiming to balance performance, cost, and scalability. Our research interests include dynamic data placement, latency-aware scheduling, and intelligent migration strategies driven by workload behavior and memory characteristics. In particular, we analyze how the latency and bandwidth of CXL memory affect system throughput and propose software-level optimizations to mitigate the overheads. We also study NUMA-related performance bottlenecks and explore runtime mechanisms to improve memory locality; a minimal page-migration sketch appears after the topics below.
CXL Device Type and Protocol
Memory Access Tracking
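As a minimal sketch of the migration strategies described above, the snippet below promotes pages that access tracking has marked hot from a CXL-attached NUMA node into local DRAM using the Linux move_pages(2) interface. The node numbering (node 0 as local DRAM) is an assumption about the platform, and the hot-page list is supplied by whatever tracking mechanism is in use.

```cpp
#include <numaif.h>   // move_pages(2); link with -lnuma
#include <vector>

// Assumed topology: node 0 = local DRAM; CXL-attached memory appears as a
// CPU-less NUMA node with a higher ID. Adjust for the real platform.
constexpr int kFastNode = 0;

// Promote hot pages (page-aligned addresses gathered by access tracking,
// e.g. PTE scanning or hardware counters) to the fast local node.
bool PromoteHotPages(std::vector<void*>& pages) {
    std::vector<int> target(pages.size(), kFastNode);
    std::vector<int> status(pages.size());  // per-page result node or -errno
    long rc = move_pages(/*pid=*/0, pages.size(), pages.data(),
                         target.data(), status.data(), MPOL_MF_MOVE);
    // 0 on success; on newer kernels a positive return counts pages that
    // could not be migrated (e.g. the fast tier is full).
    return rc == 0;
}
```

The per-page status array reports either the node each page landed on or a negative errno, which is useful for deciding whether to retry a page later or to back off when local DRAM is under pressure.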