Heads Up eXperience (HUX)

Your Always-On AI Companion for Human Computer Environment Interaction.

Abstract

While current personal smart devices excel in digital domains, they fall short in assisting users during human-environment interaction. This paper proposes Heads Up eXperience (HUX), an AI system designed to bridge this gap by serving as a constant companion across extended reality (XR) environments. By tracking the user's eye gaze, analyzing the surrounding environment, and interpreting verbal context, the system captures and enhances multi-modal data, providing holistic context interpretation and memory storage in real-time, task-specific situations. This comprehensive approach enables more natural, empathetic, and intelligent interactions between the user and HUX AI, laying the groundwork for human computer environment interaction. Intended for deployment in smart glasses and extended reality headsets, HUX AI aims to become a personal and useful AI companion for daily life. By integrating digital assistance with enhanced physical-world interaction, this technology has the potential to revolutionize human-AI collaboration in both personal and professional spheres, paving the way for the future of personal smart devices.

Demonstration-1: Hello World with HUX_AI-Mk.1

Demonstrated Features

Sukanth K et al., 2024

HUX AI Can Perceive Your Environment


"What do you see?"

"Do you see any fruits?" 

"How many foods are present?"


Just ask questions about your surroundings, and let HUX AI provide the answers.

Ask Questions with Your Eye Gaze and Speech in Real Time


“What am I looking at?”

Simply look at what intrigues you and ask multiple questions without naming the object aloud. HUX understands - just pure, intuitive interaction.

Ask About Changes in Your Environment in Real Time


“Did you observe anything change in the environment?”

We all miss things. It's human nature (selective attention). But what if you had an attentive companion?

HUX is always watching, always alert. It catches what you might miss in your immediate environment, keeping you informed and aware when you are focused on your task.

In a world where a single detail can make all the difference, HUX ensures you're always in the know.

Ask Multimodal Contextual Questions


"Does this go well with a pizza?"

(without referring to the object by speech)


"What are some recipes with it?"

(without referring to the object by speech)


By combining multiple streams of information, HUX provides answers that are relevant and insightful.

This isn't just a chatbot. It's a multimodal context-aware collaborator, enhancing your understanding and decision-making like never before.

Recollect Multimodal Context Rich Conversations in Real-Time


"Can you repeat the recipe suggestion you gave me again?"


"Go through all the multi-modal data you collected during our conversation."


HUX doesn't just listen and answer. It goes beyond instance-based multimodal question answering by building on both previous and current multimodal data, including speech, eye gaze, and the scene around you.

Need to reference something from five minutes ago and build questions on top of it? HUX has got you covered. It enriches your current conversation with immediate, relevant context, making every interaction more meaningful and productive.

Store Long Term Multimodal Context Rich Memories


"Snap person, his name is Imam. He is a new engineer in our lab."

Now, let's think bigger. Much bigger.

HUX doesn't just remember your current conversation; it creates a vast library of the experiences you choose to store - just say "Snap."

Trying to recall that person you met last year in a crucial meeting? Or that brilliant idea you had on vacation while watching the sunset? Just ask. HUX will find it, complete with all the rich, multimodal context.

It's about having a powerful, searchable archive of your life's most important moments. Why search images and texts manually when HUX can create and retrieve multimodal memories from mere contextual cues?

Enhance Your Visual Feed for Task-Specific Activities and Ask Questions About It


"Are there any defects in the circuit?"


In the world of HUX, your environment adapts to you.

Whether you're performing surgery, assembling a rocket, or tinkering with a PCB, HUX transforms your visual world to highlight what matters most.

And the best part? It's always ready to answer your questions about what you see.

This isn't just augmented reality - it's your reality, optimized.

Interaction:

1. User: "What am I seeing?"  

Agent: The image depicts a complex printed circuit board (PCB) with various electronic components, including resistors, capacitors, and integrated circuits (ICs). The PCB is green, typical for PCBs, and features soldered connections, wires, and joints.


2. User: "Are there any defects in the circuit?" 

Agent Activates Task Specific Perception (Right Image) using the above interaction context.

Agent: There appear to be three defects, all of which are mouse bites located at the bottom right and top left corners of the circuit board.


3. User: "What does this mean?"  

Agent: A mouse bite is a type of defect that occurs when there’s an imperfection or damage to the copper foil on the circuit board. It can be caused by physical stress, such as bending or flexing, which can lead to cracking or lifting of the copper layer. Mouse bites can cause electrical connections to fail, leading to errors or malfunctions in the system.


4. User: "How many problems did you see?"  

Agent: There are three defects identified as "Mouse bite" located at the bottom right and top left of the circuit board.

Architecture and Processes


System Architecture

Heads Up eXperience (HUX): Always-On AI Companion for Human Computer Environment Interaction

HUX AI Architecture: the origin of multi-modal data, data processing, multi-modal context processing, and the generation of outputs. "LIOU Stack" stands for "Last-In-Only-Used Stack".
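The name suggests a buffer in which newer data simply replaces older data, so the consumer always reads only the freshest item. A minimal sketch of such a Last-In-Only-Used stack (the class and method names are hypothetical; the paper does not publish an implementation):

```python
class LIOUStack:
    """Last-In-Only-Used stack: only the most recently pushed item
    is retained and consumed; anything older is discarded."""

    def __init__(self):
        self._item = None

    def push(self, item):
        # Newer data overwrites whatever was waiting to be used.
        self._item = item

    def pop(self):
        # Consume the latest item, leaving the stack empty.
        item, self._item = self._item, None
        return item
```

This shape fits a real-time pipeline: if the downstream model is slower than the sensor, stale frames are dropped rather than queued.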

Real-Time Video Feed Filtering based on Events

Real-time Event-based Video Context Processing using VLMs for detected events of interest (EOIs). 
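As a rough sketch of event-based filtering, one could gate frames on pixel-level change before handing them to a VLM. The differencing scheme and threshold below are illustrative assumptions, not the paper's method:

```python
def filter_event_frames(frames, threshold=0.1):
    """Keep only frames that differ enough from the last kept frame.

    `frames` is an iterable of flat pixel lists; the kept frames are
    the candidate events of interest (EOIs) forwarded to the VLM.
    """
    kept = []
    prev = None
    for frame in frames:
        if prev is None:
            changed = True  # always keep the first frame
        else:
            # Mean absolute pixel difference as a crude change score.
            diff = sum(abs(a - b) for a, b in zip(frame, prev)) / len(frame)
            changed = diff > threshold
        if changed:
            kept.append(frame)
            prev = frame
    return kept
```

A production system would use a smarter detector, but the gating idea is the same: the expensive model only sees frames where something happened.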

Task Specific Scene Enhancement

Task-Specific Scene Enhancement and Question Answering. Task-1, Task-2, ... Task-5 indicate the different task-specific modes (below).


Left: the PCB in its unlabelled condition. Right: the PCB labelled by a computer vision model whose sole goal is to find defects in the PCB.
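One way to read this pipeline is as a registry of task modes, each bound to a specialized detector that runs only while its mode is active. The registry shape and the mode names below are illustrative assumptions:

```python
def enhance_scene(frame, task_mode, detectors):
    """Run only the detector registered for the active task mode.

    `detectors` maps a mode name to a callable returning detections;
    a real system would also overlay the detections on the frame.
    """
    detector = detectors.get(task_mode)
    if detector is None:
        return frame, []  # no enhancement outside a known mode
    return frame, detector(frame)

# Hypothetical mode echoing the PCB demo: a defect detector that
# would normally wrap a trained vision model.
detectors = {"pcb_inspection": lambda frame: ["mouse bite"] * 3}
```

Keeping one narrow detector per mode matches the caption's point: the labelling model's only goal is defect finding, not general scene understanding.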

Eye Gaze Processing

Processing of eye-gaze-based Regions of Interest in parallel with Scene Processing (Events of Interest).
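A minimal sketch of turning a gaze point into a region of interest is a fixed crop window centred on the fixation and clamped to the frame; the window size and clamping policy are assumptions for illustration:

```python
def gaze_roi(frame_w, frame_h, gaze_x, gaze_y, size=128):
    """Return a size x size crop box (x0, y0, x1, y1) centred on the
    gaze point, clamped so the box stays fully inside the frame."""
    half = size // 2
    x0 = max(0, min(gaze_x - half, frame_w - size))
    y0 = max(0, min(gaze_y - half, frame_h - size))
    return (x0, y0, x0 + size, y0 + size)
```

The resulting crop is what a "What am I looking at?" query would hand to the vision model, so the user never has to name the object.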

Multi-Modal Contextual Memory Creation

Process of Multi-Modal Contextual Memory Pipeline (below)

Data Structure of a multi-modal Contextual Memory (below)
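The paper's figure is not reproduced here, but a plausible shape for one "Snap" memory record, with simple cue-based retrieval, might look like the following (all field names are assumptions, not the published schema):

```python
from dataclasses import dataclass, field
import time


@dataclass
class MultimodalMemory:
    """One stored 'Snap' moment with its multimodal context."""
    label: str            # e.g. "person", as spoken in the Snap command
    transcript: str       # speech captured around the Snap
    scene_caption: str    # VLM description of the scene
    gaze_object: str      # object under the user's gaze at capture time
    timestamp: float = field(default_factory=time.time)

    def matches(self, cue: str) -> bool:
        """True if the contextual cue appears in any stored modality."""
        cue = cue.lower()
        return any(cue in text.lower() for text in
                   (self.label, self.transcript,
                    self.scene_caption, self.gaze_object))
```

Retrieval then reduces to scanning (or indexing) stored records for a cue like "Imam" or "sunset", which is what lets HUX answer from a vague contextual hint rather than an exact query.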

Cite this Work - BibTeX


@article{rajan2024heads,
  title={Heads Up eXperience (HUX): Always-On AI Companion for Human Computer Environment Interaction},
  author={Sukanth, K and Rajan, Sudhiksha Kandavel and Rajashekhar, V S and Prabhakar, Gowdham},
  journal={arXiv preprint arXiv:2407.19492},
  year={2024}
}

