Our Product:
AURA
How does it work?
Our virtual human comprises three modalities: language generation and response, text-to-speech, and visual generation. All three modalities are essential, and each is built from AI models that work together to create a seamless, interactive experience.
The first modality is language generation and response, which takes in the user's verbal input and generates a human-like language response. For this modality, we used the Llama 2.1 large language model (LLM).
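As a rough illustration of this step, the sketch below generates a reply with a Hugging Face Llama-family checkpoint. The model ID, prompt format, and generation settings are assumptions for illustration only, not the project's actual configuration.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"  # assumed checkpoint, not necessarily AURA's

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def generate_reply(user_utterance: str) -> str:
    """Turn the user's transcribed utterance into a companion reply."""
    prompt = (
        "You are AURA, a friendly virtual companion.\n"
        f"User: {user_utterance}\nAURA:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs, max_new_tokens=128, do_sample=True, temperature=0.7
    )
    # Strip the prompt tokens and decode only the newly generated reply.
    reply_ids = output_ids[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(reply_ids, skip_special_tokens=True).strip()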
The second modality is text-to-speech, converting the output of the LLM to audio, making verbal communication possible. For this modality, we chose to use VITS, an AI model that combines transformer sequence models and variational inference to generate high-quality, natural-sounding speech. Using a conditional variational autoencoder (CVAE), VITS creates a probabilistic representation of voice based on the input text.
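A minimal sketch of this text-to-speech step is shown below, using a pretrained VITS checkpoint available through Hugging Face Transformers. The specific checkpoint (an MMS English VITS model) and output handling are assumptions, not necessarily the voice pipeline used in AURA.

import torch
import scipy.io.wavfile
from transformers import VitsModel, AutoTokenizer

tts_model = VitsModel.from_pretrained("facebook/mms-tts-eng")  # assumed checkpoint
tts_tokenizer = AutoTokenizer.from_pretrained("facebook/mms-tts-eng")

def synthesize(text: str, out_path: str = "reply.wav") -> str:
    """Convert the LLM's text reply into a waveform saved to disk."""
    inputs = tts_tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        waveform = tts_model(**inputs).waveform  # shape: (1, num_samples)
    scipy.io.wavfile.write(
        out_path,
        rate=tts_model.config.sampling_rate,
        data=waveform.squeeze(0).numpy(),
    )
    return out_path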
The last modality is visual generation, which uses the audio output to create an animation of a talking head. For this modality, we chose to use NeRF (neural radiance fields), a deep learning method that learns a continuous volumetric representation of a scene in order to synthesize novel views of complex scenes. Once trained, NeRF can render 2D views of the underlying 3D scene from new camera angles.
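The following is a highly simplified sketch of a NeRF-style network: an MLP that maps a 3D point, plus a per-frame audio feature (since the head animation is driven by speech), to color and density. The audio conditioning scheme and all layer sizes are illustrative assumptions, not the exact architecture used in AURA.

import torch
import torch.nn as nn

def positional_encoding(x: torch.Tensor, num_freqs: int = 10) -> torch.Tensor:
    """Map coordinates to sin/cos features, as in the original NeRF paper."""
    feats = [x]
    for i in range(num_freqs):
        feats.append(torch.sin((2.0 ** i) * x))
        feats.append(torch.cos((2.0 ** i) * x))
    return torch.cat(feats, dim=-1)

class AudioConditionedNeRF(nn.Module):
    def __init__(self, audio_dim: int = 64, hidden: int = 256, num_freqs: int = 10):
        super().__init__()
        in_dim = 3 * (2 * num_freqs + 1) + audio_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB color (3) + volume density (1)
        )

    def forward(self, xyz: torch.Tensor, audio_feat: torch.Tensor) -> torch.Tensor:
        h = torch.cat([positional_encoding(xyz), audio_feat], dim=-1)
        out = self.mlp(h)
        rgb = torch.sigmoid(out[..., :3])   # colors in [0, 1]
        sigma = torch.relu(out[..., 3:])    # non-negative density
        return torch.cat([rgb, sigma], dim=-1)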
Progress
To validate the product, we conducted machine validation tests and human experiments. We used three metrics to evaluate the quality and realism of the model. PSNR measures image reconstruction quality; it improved smoothly throughout training, starting at 35.2 and gradually increasing, which reflects the model's ability to reduce noise and produce better reconstructions. LPIPS quantifies the perceptual similarity between generated and reference images; it started at 0.09 and decreased throughout training to a low of 0.045, indicating that as the model continued to learn, the perceptual quality of the generated images became closer to the reference.
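For reference, the two image metrics can be computed as in the sketch below; the report does not specify the exact evaluation pipeline, so this uses plain PyTorch for PSNR and the standard lpips package for LPIPS.

import torch
import lpips

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return (20 * torch.log10(torch.tensor(max_val)) - 10 * torch.log10(mse)).item()

# LPIPS expects NCHW tensors scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")

def perceptual_distance(pred: torch.Tensor, target: torch.Tensor) -> float:
    """LPIPS distance for images given in [0, 1]."""
    return lpips_fn(pred * 2 - 1, target * 2 - 1).item()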
The Sound Loss measures the discrepancy between predicted and actual acoustic features. It started at 7000 and decreased continuously throughout training, plateauing at roughly 6500 in the final stage. Taken together, these results indicate that as training progressed, the model's output became increasingly realistic and closer to the appearance of a real human.
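The report does not give the Sound Loss formula, so the following is only an illustrative assumption: an L1 discrepancy between predicted and reference mel-spectrogram frames, summed over the utterance.

import torch

def sound_loss(pred_mel: torch.Tensor, ref_mel: torch.Tensor) -> torch.Tensor:
    """pred_mel, ref_mel: (frames, mel_bins) acoustic features."""
    return torch.sum(torch.abs(pred_mel - ref_mel))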
After this, we conducted human experiments with autistic children. Each participant completed three tasks: reading text, interacting with the model, and engaging in a Rubik's Cube activity. During these tasks, we tracked their eye gaze and quantified engagement with three metrics: average gaze frames per second, average seconds per gaze, and average seconds per gaze transfer. In nearly all experiments, all three metrics increased during the second and third tasks, indicating that engagement also increased during these tasks. From this, we conclude that the model can effectively engage autistic children and can therefore help provide companionship.
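The report names these three gaze metrics but not their precise definitions, so the computations below are assumptions for illustration: gaze_frames counts frames in which the participant's gaze landed on the screen, gazes is a list of fixation intervals in seconds, and transfers counts gaze shifts between regions of interest.

from typing import List, Tuple

def engagement_metrics(gaze_frames: int,
                       gazes: List[Tuple[float, float]],
                       transfers: int,
                       task_duration_s: float) -> dict:
    """Summarize one task's eye-tracking log into the three engagement metrics."""
    total_gaze_time = sum(end - start for start, end in gazes)
    return {
        "avg_gaze_frames_per_second": gaze_frames / task_duration_s,
        "avg_seconds_per_gaze": total_gaze_time / max(len(gazes), 1),
        "avg_seconds_per_gaze_transfer": task_duration_s / max(transfers, 1),
    }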
The field tests and findings of five distinct eye-gazing patterns demonstrate significant progress toward achieving the minimum viable product for the AI-driven virtual companion. Tests with 17 autistic participants highlight the system’s effectiveness. The identification of five eye-gazing patterns—dense, linear, scattered, repetitive circular, and mixed—demonstrates the system’s analytical strength and supports its use as both a companion and a diagnostic tool. These patterns offer actionable insights into engagement and cognitive focus, enabling real-time adjustments to optimize user experience. The interface further supports these capabilities with a functional, user-friendly design tailored for autistic children.
While the system fulfills core MVP functionalities, including engagement tracking and adaptive feedback, some areas require refinement. Reducing video generation latency from one minute is critical for real-time interaction. Strategies such as edge computing and pre-generating common responses are promising solutions. Our minimum viable product is a demo website featuring the completed Rubik's Cube activity.
Test Results
AURA can enhance cognitive and motor skill development through personalized interaction for children with autism. In our Rubik's Cube activity test, users' engagement increased by 14% in gaze frames per second, 128% in seconds per gaze, and 580% in seconds per gaze transfer compared to passive activities. Socially, the system fosters communication and emotional connections by creating realistic, human-like interactions. Its affordability and scalability address gaps in access to professional therapy, particularly in underserved communities. Scientifically, it introduces advanced diagnostic tools, such as eye-gazing metrics, which identify five distinct engagement patterns, offering robust data for interventions. Trials with 17 participants demonstrate the system's feasibility and scalability for broader application.
This innovation complements traditional therapies by reinforcing learned skills outside structured sessions and addresses the gap in consistent, high-quality companionship. By merging scientific principles with cutting-edge AI, it empowers children with ASD to achieve greater social and emotional growth, transforming their quality of life and fostering a more inclusive society.