Research
The overarching goal of our research is to build fundamentally fast, scalable, cost-effective, and consistent systems that will have a positive impact on human lives and society.
Research Question: Can current systems fully exploit hardware resources?
Operating System, File and Storage System, Parallel and Distributed System,
Database System, System for AI, AI for System
In this Big Data era, computing systems need to satisfy three key requirements:
1) high performance, to enable timely processing of large amounts of data;
2) high cost-efficiency, to achieve higher performance per cost and make computing systems more sustainable;
3) high consistency, to avoid data loss and system failures that can have catastrophic consequences for lives and a society that rely on digital data.
To meet these requirements, our research methodology focuses on developing fundamental concurrent and parallel techniques, such as lock-free data structures, parallel thread models, per-core and delegation schemes, and decentralized, scalable locking and architectures. We pair these techniques with rigorous verification methods, including linearizability checks and systematic validation of consistency invariants. We then build next-generation systems that integrate these techniques across operating systems, parallel and distributed systems, database systems, System for AI, AI for System, and various simulation platforms.
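As a rough illustration of what a linearizability check involves, the following Python sketch brute-forces the question for a single read/write register: does some total order of the recorded operations respect real-time order and match a sequential register? The `Op` record, the `is_linearizable` function, and the register model are assumptions made for this example and are not taken from any system described here. The search is factorial in the history length, so it is only suitable for very small histories.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    """One completed operation in a recorded history (hypothetical format)."""
    name: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    start: float           # invocation time
    end: float             # response time


def is_linearizable(history, initial=None):
    """Return True if some legal sequential order explains the history."""
    for order in permutations(history):
        # Real-time order: an operation that finished before another one
        # started must not be placed after it.
        respects_time = all(
            not (later.end < earlier.start)
            for i, earlier in enumerate(order)
            for later in order[i + 1:]
        )
        if not respects_time:
            continue
        # Replay the candidate order against a sequential register.
        state, ok = initial, True
        for op in order:
            if op.name == "write":
                state = op.value
            elif op.value != state:
                ok = False
                break
        if ok:
            return True
    return False
```

For example, a history in which a read overlapping a write of 1 returns 1 is accepted, while a read that returns 1 after a later write of 2 has already completed is rejected.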
We pursue System for AI research by designing next-generation memory and storage systems for efficient and reliable AI infrastructure. In large-scale AI training and inference, data movement across GPU memory, host memory, SSDs, and distributed storage has a significant impact on overall performance and cost. To address this challenge, we exploit emerging hardware technologies such as CXL, NVMe SSDs, ZNS SSDs, and GPUDirect Storage. Based on these technologies, we investigate AI workload-aware memory and storage hierarchies, scalable swap systems, high-throughput data loading systems, and fault-tolerant checkpointing systems. Our ultimate goal is to build system infrastructure that enables AI workloads to run faster, more efficiently, and more reliably.
Memory and Storage System Design for Large-Scale AI Infrastructure
AI Workload-Aware Data Movement across Heterogeneous Memory and Storage
Emerging Hardware-Driven System Software for AI Training and Inference
Scalable Data Loading, Swapping, and Checkpointing for AI Workloads
Publication: ScaleSwap (USENIX FAST'26)
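To make the idea of fault-tolerant checkpointing concrete, here is a minimal Python sketch of one common building block: writing a checkpoint to a temporary file, syncing it, and atomically renaming it over the previous checkpoint, so that a crash leaves either the old or the new file intact. This is a generic pattern shown for illustration only; it is not drawn from ScaleSwap or any other system listed here, and the function names `save_checkpoint` and `load_checkpoint` and the JSON format are assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(state: dict, path: str) -> None:
    """Write `state` so that `path` always holds a complete checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays
    # on one filesystem and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())       # data reaches stable storage
        os.replace(tmp_path, path)     # atomic swap of old and new
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    # On POSIX systems, also sync the directory so the rename is durable.
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)


def load_checkpoint(path: str, default=None):
    """Return the last complete checkpoint, or `default` if none exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default
```

A training loop could call `save_checkpoint({"step": step}, "model.ckpt")` periodically and resume from `load_checkpoint("model.ckpt")` after a failure. Real AI checkpoints are far larger and usually binary, which is where the storage-system questions above come in.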
At SysLab, we aim to advance AI for System research by leveraging AI techniques to improve the performance, scalability, and reliability of operating systems, file and storage systems, and distributed systems. Modern computing environments are rapidly evolving with high-performance SSDs, CXL-based memory, GPU/AI accelerators, and large-scale datacenter infrastructures. In such environments, traditional static and rule-based system designs are often insufficient to capture diverse workload behaviors and complex hardware characteristics. To address these challenges, we design intelligent system software that uses machine learning and reinforcement learning to monitor runtime system states, predict performance bottlenecks and potential failures, and automatically optimize resource management policies.
Leveraging AI and Machine Learning for System Performance and Reliability
AI-Driven Optimization for Operating, Memory, and Storage Systems
Designing Intelligent Storage Systems for Emerging Hardware and AI Infrastructure
Reinforcement Learning-based Resource Management for Scalable Systems
Publication: RL-Watchdog (USENIX ATC'24)
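As a toy illustration of learning a resource management policy from runtime feedback, the sketch below uses an epsilon-greedy multi-armed bandit, one of the simplest reinforcement learning settings. It is not the method used in RL-Watchdog. The class `EpsilonGreedyTuner`, the helper `tune`, and the idea of choosing among a few candidate settings (for instance, readahead sizes) are hypothetical and serve only to show the monitor, decide, and update loop.

```python
import random


class EpsilonGreedyTuner:
    """Pick one of several candidate settings based on observed rewards."""

    def __init__(self, actions, epsilon=0.1, seed=None):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {a: 0 for a in self.actions}
        self.values = {a: 0.0 for a in self.actions}  # running mean reward

    def choose(self):
        # Explore a random setting with probability epsilon,
        # otherwise exploit the best setting seen so far.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.best()

    def best(self):
        return max(self.actions, key=lambda a: self.values[a])

    def update(self, action, reward):
        # Incremental update of the mean reward for this setting.
        self.counts[action] += 1
        n = self.counts[action]
        self.values[action] += (reward - self.values[action]) / n


def tune(tuner, measure, steps):
    """Run the decide/measure/update loop; `measure` returns a reward."""
    for _ in range(steps):
        action = tuner.choose()
        tuner.update(action, measure(action))
    return tuner.best()
```

In practice `measure` would apply the setting and return a metric such as normalized throughput. A full reinforcement learning approach would also condition its decisions on the observed system state, which this sketch omits.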
TBD
Publication:
ScaleCache: A concurrent and parallel page cache to scale I/O performance on multiple SSDs
Devising a concurrent XArray to provide a lock-free data structure for the page cache
Devising direct page flush (dflush) to enable parallel I/O operations
Big data analysis of a large-scale production HPC system
Found strong correlations using various correlation analysis algorithms
Proposed a prediction scheme using machine learning approaches such as random forests and CNNs
Streaming Service for New House Using Unreal Engine and Virtual Machine