SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers 

Abstract

Blind or Low-Vision (BLV) users often rely on audio descriptions (AD) to access video content. However, conventional static ADs can leave out detailed information in videos, impose a high mental load, neglect the diverse needs and preferences of BLV users, and lack immersion. To tackle these challenges, we introduce SPICA, an AI-powered system that enables BLV users to interactively explore video content. Informed by prior empirical studies on BLV video consumption, SPICA offers novel interactive mechanisms for supporting temporal navigation of frame captions and spatial exploration of objects within key frames. Leveraging an audio-visual machine learning pipeline, SPICA augments existing ADs by adding interactivity, spatial sound effects, and individual object descriptions without requiring additional human annotation. Through a user study with 14 BLV participants, we evaluated the usability and usefulness of SPICA and explored user behaviors, preferences, and mental models when interacting with augmented ADs.

Frontend Web App

🎥 Explore frames without original audio descriptions

SPICA lets users press the arrow keys to jump to the previous or next key frame, that is, a frame that features a distinct visual scene. The video automatically seeks to the corresponding timestamp.
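The key-frame navigation described here can be illustrated with a minimal sketch. It is not SPICA's actual implementation: it assumes the key frames are available as a sorted list of timestamps in seconds, and the names `next_key_frame`, `key_times`, and `direction` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def next_key_frame(key_times, current, direction):
    """Return the key-frame timestamp to seek to, or None at either end.

    key_times: sorted list of key-frame timestamps in seconds (assumed input).
    current:   current playback position in seconds.
    direction: +1 for the right arrow key, -1 for the left arrow key.
    """
    if direction > 0:
        # First key frame strictly after the current position.
        i = bisect_right(key_times, current)
        return key_times[i] if i < len(key_times) else None
    # Last key frame strictly before the current position.
    i = bisect_left(key_times, current) - 1
    return key_times[i] if i >= 0 else None


key_times = [0.0, 4.2, 9.5, 15.0]
forward = next_key_frame(key_times, 5.0, +1)   # seek target for the right arrow
backward = next_key_frame(key_times, 5.0, -1)  # seek target for the left arrow
```

Using strict comparisons in both directions means that pressing an arrow key while sitting exactly on a key frame always moves to a different frame instead of staying in place.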

🔎 Go deep into the frame

SPICA lets users spatially explore and examine the objects within a video frame. They can either use the arrow keys on a keyboard or interact directly with the video frame on a touch device.
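One plausible way to map arrow keys onto spatial exploration is to move the selection to the nearest object lying in the pressed direction. The sketch below is an assumption for illustration only, not the rule SPICA uses: it assumes each object is reduced to a center point in normalized frame coordinates with the origin at the top-left, and the function name `neighbor` and the sample objects are hypothetical.

```python
def neighbor(objects, current, direction):
    """Return the name of the nearest object in the given direction, or None.

    objects:   dict mapping object name -> (x, y) center, origin at top-left
               (assumed representation).
    current:   name of the currently selected object.
    direction: one of "left", "right", "up", "down".
    """
    dx, dy = {"left": (-1, 0), "right": (1, 0),
              "up": (0, -1), "down": (0, 1)}[direction]
    cx, cy = objects[current]
    best, best_dist = None, float("inf")
    for name, (x, y) in objects.items():
        if name == current:
            continue
        # Offset projected onto the arrow direction; skip objects behind us.
        along = (x - cx) * dx + (y - cy) * dy
        if along <= 0:
            continue
        dist = (x - cx) ** 2 + (y - cy) ** 2
        if dist < best_dist:
            best, best_dist = name, dist
    return best


objects = {"dog": (0.2, 0.6), "ball": (0.5, 0.7), "tree": (0.8, 0.3)}
selected = neighbor(objects, "dog", "right")
```

On a touch device the same object table could instead be queried with the touch position, selecting whichever object's segmentation mask contains that point.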

🎧 Feel object-aware spatial sound and contrastive color mask

When an object is selected, SPICA plays the associated spatial sound effect of the object based on the type and the estimated 3D position of the object. Meanwhile, a contrastive color mask will be rendered on the object. 
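To make the idea of position-dependent sound concrete, the sketch below derives left and right channel gains from an object's horizontal position and estimated depth using equal-power panning. This is a deliberately simplified stand-in, not SPICA's audio renderer; a real system would more likely rely on an HRTF-based spatializer. The function name `stereo_gains`, the normalized ranges, and the `min_gain` floor are all assumptions.

```python
import math


def stereo_gains(x, depth, min_gain=0.2):
    """Return (left_gain, right_gain) for an object in the frame.

    x:        horizontal center in [0, 1], 0 = left edge (assumed convention).
    depth:    estimated depth in [0, 1], 0 = nearest (assumed convention).
    min_gain: loudness kept for the farthest objects (hypothetical parameter).
    """
    x = min(max(x, 0.0), 1.0)
    depth = min(max(depth, 0.0), 1.0)
    # Equal-power pan: total power stays constant as the source moves sideways.
    angle = x * math.pi / 2
    # Farther objects are quieter, but never fully silent.
    loudness = 1.0 - (1.0 - min_gain) * depth
    return loudness * math.cos(angle), loudness * math.sin(angle)


left, right = stereo_gains(0.5, 0.0)  # centered, nearest object
```

Equal-power panning is chosen here because the perceived loudness stays roughly constant as an object moves across the frame, so volume can be reserved for conveying depth.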

Check out the demo video. We recommend wearing headphones to hear the spatial sound.


Backend ML Pipeline

1️⃣ Key frame detection

The pipeline adopts the following heuristics to segment the video:

2️⃣ Object depth estimation

3️⃣ Object segmentation and description generation

Examples of the Improved Descriptions from the ML pipeline

Video talk

BibTeX



@inproceedings{ning2024spica,
  title     = {SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers},
  author    = {Zheng Ning and Brianna L. Wimer and Kaiwen Jiang and Keyi Chen and Jerrick Ban and Yapeng Tian and Yuhang Zhao and Toby Jia-Jun Li},
  year      = {2024},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3613904.3642632},
  isbn      = {979-8-4007-0330-0/24/05},
  booktitle = {Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems}
}