Abstract
Modern AI systems can talk, but very few can truly understand how a human feels. Most interactions still ignore facial expressions, vocal tone, and emotional context, making AI responses feel robotic and impersonal. LUMINA is designed to bridge this gap by creating an emotionally intelligent human–AI interaction system that listens, observes, understands, and responds with expressive awareness.
LUMINA combines voice, visual cues, and natural language understanding to perceive user emotions and generate responses that are not only meaningful, but emotionally appropriate. The system interprets a user’s emotional state and delivers replies through a realistic virtual avatar that expresses emotions through synchronized facial expressions and emotionally aware speech. This creates a more natural, engaging, and human-like interaction experience.
By unifying emotion detection, intelligent response generation, and expressive avatar-based communication, LUMINA demonstrates how AI can move beyond simple conversation toward empathetic interaction. This research area has significant implications in human–computer interaction, affective AI, virtual avatars, and emotionally responsive intelligent systems.
The LUMINA Paradigm
LUMINA transcends traditional conversational agents by embodying Large Language Models (LLMs) within a real-time, 3D-rendered framework. By converging neural speech synthesis with GPU-accelerated animations, the system architecture delivers a low-latency, emotionally resonant interface that bridges the gap between digital intelligence and human presence.
Introduction
Human communication is inherently emotional, shaped by facial expressions, vocal tone, and subtle non-verbal cues that convey empathy, intent, and understanding. In recent years, particularly following the global shift toward remote interaction during and after the COVID-19 pandemic, digital systems have become a primary medium for communication and companionship. While these technologies enable connectivity at scale, they have also increased feelings of emotional distance and detachment, as most digital interactions lack genuine emotional awareness.
This period highlighted the limitations of existing AI-driven conversational systems and virtual assistants. Although such systems are effective in providing information and performing task-oriented functions, they remain emotionally neutral and are unable to recognize or respond to users’ emotional states, including stress, sadness, or disengagement. As a result, their effectiveness is reduced in scenarios where users seek empathetic support or emotionally sensitive interaction rather than purely informational responses.
Advancements in artificial intelligence, particularly in large language models, speech processing, and computer vision, have created opportunities for developing emotionally intelligent systems. However, most current assistants rely primarily on text or voice input and fail to consider rich emotional cues from facial expressions, vocal patterns, or behavioral signals. Research in affective computing and human–computer interaction indicates that systems capable of emotionally appropriate responses foster stronger user engagement, trust, and long-term interaction.
LUMINA is proposed as a research and development project focused on designing an emotionally intelligent virtual companion capable of understanding and responding to users’ emotional states through multimodal perception and intelligent response generation. By integrating speech, facial, and behavioral analysis with expressive avatar-based interaction, LUMINA delivers responses that are empathetic, supportive, and human-like, providing users with a sense of engagement and emotional alignment.
Rather than replacing human relationships, LUMINA is designed to complement them by offering emotionally aware interaction in digital environments. This project contributes to research in affective computing and embodied conversational agents and investigates how emotionally responsive AI companions can enhance user experience, emotional engagement, and trust in future human–AI interactions.
Problem Statement
Despite the growing interest in interactive avatars and AI companions, current solutions face several significant limitations. Open-source, browser-based talking avatar systems are rare, while commercial alternatives often rely on expensive cloud APIs or server-side processing. Real-time applications frequently experience imprecise lip synchronization, breaking the natural flow of conversation. Integrating multiple AI components, such as text-to-speech engines, avatar animation, and large language models, remains technically complex. In addition, support for multiple languages is limited, reducing accessibility and usability for diverse user groups.
LUMINA addresses these challenges by providing a fully integrated, open-source, and cost-effective avatar system that delivers expressive, real-time interaction. By combining advanced lip-syncing, multimodal emotion perception, and intelligent response generation, LUMINA creates a lifelike and emotionally engaging digital companion. This system not only enhances user experience but also establishes a new standard for realistic and responsive AI avatars in real-time human–computer interaction.
Goals and Objectives
Primary Objective:
To develop an open-source, browser-based 3D talking avatar that serves as a friendly, emotionally intelligent companion, delivering real-time, natural conversations with lifelike lip-sync, expressive facial animations, and empathetic interactions to support communication, learning, and emotional connection.
Specific Objectives:
Deliver real-time speech synthesis and avatar animation with minimal delay.
Achieve accurate lip-sync and natural facial expressions for immersive interaction.
Integrate seamlessly with AI-driven conversational modules for intelligent responses.
Support multi-language communication and allow modular expansion.
Provide a user-friendly interface for both end-users and developers.
Offer a cost-effective, browser-based solution without relying on expensive cloud APIs.
Scope
Enable real-time interaction through a 3D avatar with voice and facial expressions.
Ensure realistic lip synchronization with speech for natural conversation.
Generate emotion-driven responses based on multimodal inputs, including voice and facial cues.
Implement a browser-based platform for wide accessibility and ease of use.
Maintain a modular architecture to integrate various AI and TTS engines.
Allow real-time speech recognition or transcription through external modules.
Support advanced conversational capabilities, such as context memory and personalized dialogue management.
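The emotion-driven response capability above can be pictured as a late-fusion step that merges per-modality emotion scores. The following sketch is illustrative only: LUMINA's actual fusion strategy, label set, and weights are not specified here, so the names and numbers below are assumptions.

```javascript
// Hypothetical late-fusion emotion scorer. The label set and the fixed
// 50/50 weighting are illustrative assumptions, not LUMINA's actual design.
const EMOTIONS = ["happy", "sad", "angry", "neutral"];

// Combine per-modality probability maps (voice and face) with fixed weights
// and return the label with the highest fused score.
function fuseEmotions(voiceScores, faceScores, wVoice = 0.5, wFace = 0.5) {
  const fused = {};
  for (const e of EMOTIONS) {
    fused[e] = wVoice * (voiceScores[e] ?? 0) + wFace * (faceScores[e] ?? 0);
  }
  return EMOTIONS.reduce((best, e) => (fused[e] > fused[best] ? e : best));
}
```

In a real system the two score maps would come from a speech-emotion model and a facial-expression model respectively, and the weights could be tuned or learned per user.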
System Architecture
1. View & Controller
UI & Rendering: The Presentation Layer manages the user interface and the WebGL Canvas for rendering the 3D environment.
Logic Bridge: The Application Layer bridges the gap between the visual assets and the backend logic.
2. AI & Synthesis
Input Handling: The Processing Layer normalizes data via the Input Processor and detects sentiment using Emotion Analysis.
Generation: The Core Engines drive the interaction, where the AI Response Engine creates the text and Speech Synthesis generates the audio.
3. Interface & Data
Performance: The Inference Layer ensures low latency using ONNX Runtime and WebGPU / WASM for browser-based acceleration.
Resources: The system relies on a robust Data Assets library, containing optimized Avatar Models, Voice Samples, and language modules.
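The Inference Layer's WebGPU-with-WASM-fallback behavior can be sketched as a small provider-selection helper. The detection heuristic below is an assumption; onnxruntime-web itself accepts an ordered executionProviders list and also falls back internally.

```javascript
// Sketch of choosing a browser execution provider for ONNX Runtime Web.
// Checking for navigator.gpu is a simple heuristic, not the library's
// own detection logic.
function pickExecutionProviders(env) {
  // Prefer WebGPU when the browser exposes navigator.gpu, else pure WASM.
  return env && env.gpu ? ["webgpu", "wasm"] : ["wasm"];
}

// Hypothetical usage with onnxruntime-web (not executed here):
// const session = await ort.InferenceSession.create("kokoro.onnx", {
//   executionProviders: pickExecutionProviders(navigator),
// });
```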
Operational Workflow
Phase 1: Data Acquisition
The cycle begins with multimodal User Input (Text or Voice). The system performs real-time Context Detection to identify the user's language and intent, ensuring the data is localized and optimized for the AI core.
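The Context Detection step can be sketched as a small router that tags the input's modality and guesses its language. The script-based heuristic below is purely illustrative; a production system would use a proper language-identification model.

```javascript
// Minimal sketch of Phase 1 Context Detection. The Latin-script check is
// a naive stand-in for real language identification.
function detectContext(input) {
  const isVoice = input.audio != null;
  const text = input.text ?? "";
  // Treat text made only of Latin-range characters, digits, whitespace,
  // and punctuation as English; anything else is flagged for other handling.
  const language = /^[\u0000-\u024F\s\p{P}\d]*$/u.test(text) ? "en" : "other";
  return { modality: isVoice ? "voice" : "text", language };
}
```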
Phase 2: Neural Processing
The AI Response Engine evaluates the input to generate a contextually relevant reply. This text is then processed by the Neural Speech Synthesis module to create high-fidelity, expressive audio data.
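The Phase 2 hand-off from response generation to speech synthesis can be sketched as a simple async pipeline. The function names and signatures are assumptions standing in for LUMINA's actual engine modules.

```javascript
// Sketch of the Phase 2 pipeline: generate a text reply, then synthesize
// audio from it. generateReply and synthesize are hypothetical stand-ins
// for the AI Response Engine and Neural Speech Synthesis modules.
async function respond(userText, generateReply, synthesize) {
  const replyText = await generateReply(userText); // AI Response Engine
  const audio = await synthesize(replyText);       // Neural Speech Synthesis
  return { replyText, audio };
}
```

Keeping the two stages behind plain async functions is what makes the architecture modular: either engine can be swapped (local model vs. cloud API) without touching the pipeline.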
Phase 3: Multimodal Rendering
This phase executes the Synchronization Loop, mapping audio streams to phoneme-level visemes. The Lip-Sync data drives the 3D Animation via Three.js/WebGL, ensuring the avatar's performance is precisely aligned with the audio before being delivered to the Final UI Display.
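The phoneme-to-viseme mapping at the heart of the Synchronization Loop can be sketched as a lookup over timestamped phonemes. The table below is a simplified assumption; real viseme sets (e.g. the Oculus/ARKit morph targets used by Ready Player Me avatars) are considerably larger.

```javascript
// Illustrative phoneme-to-viseme lookup table (deliberately tiny; a real
// deployment would cover the full phoneme inventory).
const PHONEME_TO_VISEME = {
  p: "PP", b: "PP", m: "PP",
  f: "FF", v: "FF",
  aa: "aa", ae: "aa",
  s: "SS", z: "SS",
};

// Map timestamped phonemes to viseme keyframes that can drive the
// avatar's morph targets; unknown phonemes fall back to silence.
function toVisemeTrack(phonemes) {
  return phonemes.map(({ phoneme, start, end }) => ({
    viseme: PHONEME_TO_VISEME[phoneme.toLowerCase()] ?? "sil",
    start,
    end,
  }));
}
```

Each keyframe's start/end times come from the synthesized audio's phoneme alignment, which is what keeps the mouth shapes locked to the speech.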
System Technology Layers
The Development Toolchain
The LUMINA development toolchain comprises an integrated ecosystem of state-of-the-art AI frameworks, 3D rendering engines, and high-performance inference runtimes optimized for real-time web execution.
Datasets & Resources
Current Resources
Kokoro Voice Model (ONNX): Link
Ready Player Me Avatars: Link
Mixamo Animations: Link
English Phoneme Dictionary: Link
Language Modules: Link
Datasets for Future Development
RAVDESS: Link
TESS: Link
Emotions Dataset for NLP: Link
Mental Health Conversational Data: Link
Counsel-Chat: Link
Code & Documentation
Applications & Use Cases
Education: Enables interactive, personalized learning through virtual tutoring, video lectures, and language practice, making quality education accessible anytime, anywhere.
Customer Service & Support: Delivers natural voice interactions to handle inquiries efficiently, reducing response times and operational costs while enhancing customer satisfaction.
Mental Health & Wellness: Provides empathetic, supportive conversations for users experiencing stress, anxiety, or loneliness, complementing professional care and expanding access to emotional support.
Content Creation & Media Production: Empowers creators to produce high-quality animated videos with talking avatars, eliminating the need for expensive studios and accelerating production timelines.
Corporate Training & Development: Offers consistent, fatigue-free training sessions that improve onboarding, skill development, and engagement for distributed teams.
Accessibility & Assistive Technology: Supports inclusive interactions for individuals with disabilities through voice commands, text input, and visual/audio output, enhancing usability for all users.
Entertainment & Gaming: Powers dynamic, interactive avatars for NPCs and virtual environments, creating immersive, personalized experiences for players.
Interactive Presentations & Events: Transforms webinars, presentations, and product demonstrations into engaging, memorable experiences that capture audience attention and drive retention.
Usage Guidelines
To enjoy the best experience with LUMINA:
Use a laptop or desktop with a working camera and microphone.
Ensure a quiet environment for clearer voice recognition.
Access LUMINA through a modern browser (Chrome, Edge, or Firefox).
Currently, English is supported; multilingual support is planned for future updates.
Speak naturally and clearly to help the avatar respond accurately.
Note: These guidelines reflect LUMINA’s current development phase. Future updates will expand language support, offline features, and overall performance.
Challenges
While LUMINA demonstrates advanced capabilities, it is currently in the development phase and faces several limitations and challenges:
Hardware Dependency: Performance varies with the user’s computer and GPU. Lower-end systems may experience reduced animation quality or slower response times.
Speech Input Limitations: Voice recognition accuracy is influenced by microphone quality, background noise, and user accents, affecting the naturalness of interactions.
Internet Dependency: Some features, such as cloud-based AI models, require internet access, although offline use is supported with local models at reduced capability.
Initial Model Loading Time: First-time users may experience a 10–30 second delay while neural models load; subsequent interactions benefit from caching.
Single Avatar Limitation: Current design supports one-on-one interactions, limiting group or multi-avatar conversational scenarios.
AI Response Quality: Generated responses depend on the AI model and may sometimes be inaccurate, biased, or inappropriate, requiring oversight in sensitive contexts.
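The caching that mitigates the initial model-loading delay can be sketched as a promise-memoizing loader. fetchModel is a hypothetical loader function; a browser build could back the Map with the Cache API or IndexedDB for persistence across sessions.

```javascript
// Sketch of an in-memory model cache that deduplicates downloads.
// fetchModel is a hypothetical stand-in for the real model loader.
const modelCache = new Map();

async function loadModel(url, fetchModel) {
  if (!modelCache.has(url)) {
    // Store the in-flight promise so concurrent callers share one fetch
    // instead of each triggering a separate download.
    modelCache.set(url, fetchModel(url));
  }
  return modelCache.get(url);
}
```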
The Team
Dr. Usama Ijaz Bajwa
Co-PI, Video Analytics Lab, National Centre in Big Data and Cloud Computing,
HEC Approved PhD Supervisor,
Tenured Associate Professor
Department of Computer Science,
COMSATS University Islamabad, Lahore Campus, Pakistan
References
[1] Google DeepMind, "Gemini: A Family of Highly Capable Multimodal Models," Google Research, Dec. 2023. [Online]. Available: https://blog.google/technology/ai/google-gemini-ai/
[2] hexgrad, "Kokoro-82M: High-Fidelity Neural Text-to-Speech Model," Hugging Face Repository, 2024. [Online]. Available: https://huggingface.co/hexgrad/Kokoro-82M
[3] R. Cabello, "Three.js: A Cross-browser JavaScript Library for 3D Computer Graphics," GitHub Documentation, 2010–2024. [Online]. Available: https://threejs.org/
[4] Microsoft, "ONNX Runtime: Cross-platform, High-performance ML Inferencing and Training Accelerator," ONNX Official Documentation, 2024. [Online]. Available: https://onnxruntime.ai/
[5] Ready Player Me, "Cross-Platform 3D Avatar System for Web and Mobile Integration," Ready Player Me Technical Docs, 2024. [Online]. Available: https://readyplayer.me/
[6] J. Massey and O. Lewis, "Transformers.js: State-of-the-art Machine Learning for the Web Browser," Hugging Face Documentation, 2023. [Online]. Available: https://huggingface.co/docs/transformers.js/
[7] Adobe Systems, "Adobe Mixamo: Predictive Rigging and Skeletal Animation Library for 3D Assets," Adobe Technical Support, 2024. [Online]. Available: https://www.mixamo.com/
[8] A. Vaswani et al., "Attention Is All You Need," in Advances in Neural Information Processing Systems (NIPS), 2017.
[9] Khronos Group, "WebGL 2.0 Specification and WebGPU API standards," 2017–2024. [Online]. Available: https://www.khronos.org/webgl/
[10] W3C, "Web Audio API: Advanced Audio Processing and Synthesis in the Browser," 2021. [Online]. Available: https://www.w3.org/TR/webaudio/