As the demand for on-device AI assistants (e.g., ChatGPT, Gemini, Copilot) continues to grow on smartphones, laptops, and edge devices, several new system-level challenges arise, including:
- Data privacy concerns
- The need for offline availability
- Limited hardware resources (e.g., memory and compute capacity)
- Constraints on model size and complexity
These challenges call for novel techniques that enable efficient and reliable execution of large AI models in resource-constrained environments. Research directions include:
- Analyzing and adapting emerging AI techniques for on-device execution, such as speculative decoding and Mixture-of-Experts (MoE) inference
- Designing system-level mechanisms for dynamic memory management, task scheduling, and runtime optimization
- Enabling cross-layer optimization of computation, memory, and storage to maximize efficiency under tight resource budgets
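To make one of these techniques concrete, the sketch below illustrates the core idea of speculative decoding: a cheap draft model proposes several tokens ahead, and the expensive target model verifies them, so the output is identical to decoding with the target model alone while the target runs fewer sequential steps. The `draft_model` and `target_model` functions here are toy numeric stand-ins for real LLMs, and all names are illustrative, not from any particular system.

```python
def draft_model(context):
    # Cheap approximation of the target: same recurrence but mod 9,
    # so it agrees with the target on small values and drifts on large ones.
    return (context[-1] + context[-2]) % 9

def target_model(context):
    # Expensive "ground truth" model: next token = sum of last two, mod 10.
    return (context[-1] + context[-2]) % 10

def speculative_decode(context, n_tokens, k=4):
    out = list(context)
    while len(out) - len(context) < n_tokens:
        # 1. Draft model proposes k tokens autoregressively (cheap).
        proposal = []
        for _ in range(k):
            proposal.append(draft_model(out + proposal))
        # 2. Target model verifies the proposals; in a real system these
        #    k checks run as one batched forward pass.
        for i in range(k):
            expected = target_model(out + proposal[:i])
            if proposal[i] != expected:
                # Keep the accepted prefix and substitute the target's own
                # token at the first mismatch, so the output matches pure
                # target-model decoding exactly.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # all k proposals accepted
    return out[len(context):][:n_tokens]
```

The key systems question for on-device use is sizing the draft model and the speculation depth `k` so that verification batches fit the device's memory and compute budget.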
On-device Retrieval-Augmented Generation (RAG) enables context-aware, personalized responses by combining LLM inference with local document retrieval, all without relying on cloud connectivity. While this approach offers clear benefits for privacy, responsiveness, and offline capability, it also introduces unique system-level challenges due to the limited compute, memory, and storage resources of edge devices. Research directions include:
- Profiling and characterizing RAG workloads to understand performance and resource-usage patterns on edge platforms
- End-to-end optimization of the RAG pipeline, including tight coordination between LLM inference and vector database operations
- System-level support for:
  - Efficient memory and storage management
  - Parallel execution and resource scheduling
  - Latency reduction and caching strategies
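The retrieval side of such a pipeline can be sketched end to end with a toy implementation. This sketch assumes a bag-of-words embedding in place of a learned embedding model and an in-memory list in place of an on-disk vector index; the class and function names (`LocalVectorStore`, `RAGPipeline`) are hypothetical, not from any existing library.

```python
from collections import Counter, OrderedDict
import math

def embed(text):
    # Toy embedding: sparse bag-of-words counts (stand-in for a real model).
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LocalVectorStore:
    def __init__(self, docs):
        # Embeddings are precomputed once; on a real device this index
        # would be memory-mapped from storage rather than held in RAM.
        self.docs = [(d, embed(d)) for d in docs]

    def top_k(self, query, k=2):
        qv = embed(query)
        ranked = sorted(self.docs, key=lambda de: cosine(qv, de[1]),
                        reverse=True)
        return [d for d, _ in ranked[:k]]

class RAGPipeline:
    def __init__(self, store, cache_size=32):
        self.store = store
        self.cache = OrderedDict()  # query -> prompt, simple LRU cache
        self.cache_size = cache_size

    def build_prompt(self, query):
        # Caching assembled prompts cuts latency for repeated queries,
        # one of the strategies listed above.
        if query in self.cache:
            self.cache.move_to_end(query)
            return self.cache[query]
        context = "\n".join(self.store.top_k(query))
        prompt = f"Context:\n{context}\n\nQuestion: {query}"
        self.cache[query] = prompt
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)  # evict least-recently used
        return prompt
```

The returned prompt would then be fed to the local LLM; the coordination challenge noted above is overlapping this retrieval and prompt assembly with model loading and inference under a shared memory budget.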