Our group builds fast and efficient AI systems. We currently focus on LLM serving systems that optimize resource management on commodity GPUs and accelerators (e.g., NPUs and PIMs). Our interests include multi-GPU parallelism, prefill/decoding disaggregation, speculative decoding, test-time scaling, and reasoning.
Keywords: LLM, attention, parallelism