As the demand for on-device AI assistants (e.g., ChatGPT, Gemini, Copilot) continues to grow on smartphones, laptops, and edge devices, several new system-level challenges arise, including:
- Data privacy concerns
- The need for offline availability
- Limited hardware resources (e.g., memory and compute capacity)
- Constraints on model size and complexity
These challenges call for novel techniques that enable efficient and reliable execution of large AI models in resource-constrained environments. Research directions include:
- Analyzing and adapting emerging AI techniques for on-device execution, such as speculative decoding and Mixture-of-Experts (MoE) inference
- Designing system-level mechanisms for dynamic memory management, task scheduling, and runtime optimization
- Enabling cross-layer optimization of computation, memory, and storage to maximize efficiency under tight resource budgets
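To make one of these techniques concrete, the sketch below illustrates the core idea of speculative decoding: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, so the output is identical to decoding with the target model alone while the target runs fewer sequential steps. The `draft_model` and `target_model` functions here are toy numeric stand-ins for real LLMs, and all names are illustrative, not from any particular system.

```python
def draft_model(context):
    # Cheap approximation of the target: same recurrence but mod 9,
    # so it agrees with the target on small values and drifts on large ones.
    return (context[-1] + context[-2]) % 9

def target_model(context):
    # Expensive "ground truth" model: next token = sum of last two, mod 10.
    return (context[-1] + context[-2]) % 10

def speculative_decode(context, n_tokens, k=4):
    out = list(context)
    while len(out) - len(context) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(out + proposal))
        # 2. Target model verifies the proposals; in a real system these
        #    k checks run as one batched forward pass.
        for i in range(k):
            expected = target_model(out + proposal[:i])
            if proposal[i] != expected:
                # Keep the accepted prefix and substitute the target's own
                # token at the first mismatch, so the output matches pure
                # target-model decoding exactly.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # all k proposals accepted
    return out[len(context):][:n_tokens]
```

The key systems question for on-device use is sizing the draft model and the speculation depth `k` so that verification batches fit the device's memory and compute budget.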
On-device Retrieval-Augmented Generation (RAG) enables context-aware, personalized responses by combining LLM inference with local document retrieval, all without relying on cloud connectivity. While this approach offers clear benefits for privacy, responsiveness, and offline capability, it also introduces unique system-level challenges due to the limited compute, memory, and storage resources of edge devices. Research directions include:
- Profiling and characterizing RAG workloads to understand performance and resource-usage patterns on edge platforms
- End-to-end optimization of the RAG pipeline, including tight coordination between LLM inference and vector database operations
- System-level support for:
  - Efficient memory and storage management
  - Parallel execution and resource scheduling
  - Latency reduction and caching strategies
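The retrieval side of such a pipeline can be sketched end to end with a toy implementation. This sketch assumes a bag-of-words embedding in place of a learned embedding model and an in-memory list in place of an on-disk vector index; the class and function names (`LocalVectorStore`, `RAGPipeline`) are hypothetical, not from any existing library.

```python
from collections import Counter, OrderedDict
import math

def embed(text):
    # Toy embedding: sparse bag-of-words counts (stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LocalVectorStore:
    def __init__(self, docs):
        # Embeddings are precomputed once; on a real device this index
        # would be memory-mapped from storage rather than held in RAM.
        self.docs = [(d, embed(d)) for d in docs]

    def top_k(self, query, k=2):
        qv = embed(query)
        ranked = sorted(self.docs, key=lambda de: cosine(qv, de[1]),
                        reverse=True)
        return [d for d, _ in ranked[:k]]

class RAGPipeline:
    def __init__(self, store, cache_size=32):
        self.store = store
        self.cache = OrderedDict()  # query -> prompt, simple LRU cache
        self.cache_size = cache_size

    def build_prompt(self, query):
        # Caching assembled prompts cuts latency for repeated queries,
        # one of the strategies listed above.
        if query in self.cache:
            self.cache.move_to_end(query)
            return self.cache[query]
        context = "\n".join(self.store.top_k(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        self.cache[query] = prompt
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least-recently used
        return prompt
```

The returned prompt would then be fed to the local LLM; the coordination challenge noted above is overlapping this retrieval and prompt assembly with model loading and inference under a shared memory budget.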