SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers
Abstract
Blind or Low-Vision (BLV) users often rely on audio descriptions (AD) to access video content. However, conventional static ADs can leave out detailed information in videos, impose a high mental load, neglect the diverse needs and preferences of BLV users, and lack immersion. To tackle these challenges, we introduce SPICA, an AI-powered system that enables BLV users to interactively explore video content. Informed by prior empirical studies on BLV video consumption, SPICA offers novel interactive mechanisms for supporting temporal navigation of frame captions and spatial exploration of objects within key frames. Leveraging an audio-visual machine learning pipeline, SPICA augments existing ADs by adding interactivity, spatial sound effects, and individual object descriptions without requiring additional human annotation. Through a user study with 14 BLV participants, we evaluated the usability and usefulness of SPICA and explored user behaviors, preferences, and mental models when interacting with augmented ADs.
Frontend Web App
🎥 Explore frames without original audio descriptions
SPICA lets users press the left and right arrow keys to move to the preceding or subsequent key frame that features a distinct visual scene. As they do, the video automatically seeks to the corresponding timestamp.
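A minimal sketch of this temporal navigation, assuming the frontend receives an ordered list of key-frame timestamps and captions from the backend pipeline; the element IDs, the `keyFrames` array, and the `announce` helper are illustrative names, not the released code.

```typescript
// Left/Right arrow keys move between key frames; the <video> element is then
// seeked to the selected frame's timestamp and its caption is announced.
const video = document.querySelector<HTMLVideoElement>("#spica-video")!;
const liveRegion = document.querySelector<HTMLElement>("#aria-live")!;

// Ordered key-frame metadata, assumed to come from the backend ML pipeline.
const keyFrames: { time: number; caption: string }[] = [];
let current = 0;

function announce(text: string): void {
  liveRegion.textContent = text; // aria-live region, so the screen reader reads it aloud
}

document.addEventListener("keydown", (e) => {
  if (e.key === "ArrowRight" && current < keyFrames.length - 1) current += 1;
  else if (e.key === "ArrowLeft" && current > 0) current -= 1;
  else return;
  video.currentTime = keyFrames[current].time; // jump the video to the key frame
  announce(keyFrames[current].caption);        // read the frame-level description
});
```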
🔎 Go deep into the frame
SPICA lets users spatially explore and examine the objects within a video frame. They can either use the arrow keys on a keyboard or interact directly with the video frame on a touch device.
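As an illustration, the spatial exploration could be wired up roughly as below, assuming each key frame comes with a list of detected objects and their bounding boxes; `FrameObject`, `selectObject`, and the canvas ID are hypothetical names.

```typescript
// Up/Down arrow keys cycle through the objects detected in the current key
// frame; on a touch device, the touched point is mapped to the object whose
// bounding box contains it.
interface FrameObject {
  label: string;
  description: string;
  bbox: { x: number; y: number; w: number; h: number }; // in frame pixels
}

declare function selectObject(obj: FrameObject): void; // plays the cue, draws the mask, reads the description

const frameCanvas = document.querySelector<HTMLCanvasElement>("#frame-canvas")!;
let objects: FrameObject[] = []; // objects in the current key frame
let selected = 0;

document.addEventListener("keydown", (e) => {
  if (objects.length === 0) return;
  if (e.key === "ArrowDown") selected = (selected + 1) % objects.length;
  else if (e.key === "ArrowUp") selected = (selected - 1 + objects.length) % objects.length;
  else return;
  selectObject(objects[selected]);
});

frameCanvas.addEventListener("pointerdown", (e) => {
  const hit = objects.find(
    (o) =>
      e.offsetX >= o.bbox.x && e.offsetX <= o.bbox.x + o.bbox.w &&
      e.offsetY >= o.bbox.y && e.offsetY <= o.bbox.y + o.bbox.h
  );
  if (hit) selectObject(hit);
});
```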
🎧 Feel object-aware spatial sound and contrastive color mask
When an object is selected, SPICA plays a spatial sound effect based on the object's type and estimated 3D position, and renders a contrastive color mask over the object.
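A minimal sketch of this feedback using the Web Audio API, assuming each object carries an estimated 3D position relative to the viewer and a sound clip chosen by its type; `soundForType` and the overlay canvas are assumptions, not the released implementation.

```typescript
// Play a binaural (HRTF) cue at the object's estimated 3D position and draw a
// contrastive color mask on an overlay canvas.
declare function soundForType(type: string): string; // hypothetical: object type -> sound-effect URL

const audioCtx = new AudioContext();

async function playSpatialCue(type: string, pos: { x: number; y: number; z: number }) {
  const res = await fetch(soundForType(type));
  const buffer = await audioCtx.decodeAudioData(await res.arrayBuffer());

  const panner = new PannerNode(audioCtx, {
    panningModel: "HRTF", // binaural rendering, best experienced with headphones
    positionX: pos.x,
    positionY: pos.y,
    positionZ: pos.z,
  });

  const source = new AudioBufferSourceNode(audioCtx, { buffer });
  source.connect(panner).connect(audioCtx.destination);
  source.start();
}

function drawContrastiveMask(overlay: CanvasRenderingContext2D, mask: ImageData) {
  overlay.clearRect(0, 0, overlay.canvas.width, overlay.canvas.height);
  overlay.putImageData(mask, 0, 0); // mask pixels pre-colored to contrast with the frame
}
```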
Check out the demo video on the right. We recommend wearing headphones to hear the spatial sound.
Backend ML Pipeline
1️⃣ Key frame detection
The pipeline adopts the following heuristics to segment the video:
Native audio descriptions exist: The video is segmented at frames containing audio descriptions built into the original video.
Visual scene changes: A segment boundary is added if the visual scene description of a frame differs significantly from that of the previous frame.
Object information changes: A segment boundary is added if the current frame and its predecessor differ significantly in the number and types of detected objects.
Maximum segment interval reached: We cap the segment duration at 5 seconds; if this threshold is reached without a new cut, the video is segmented automatically (see the sketch after this list).
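A sketch of how these heuristics can be combined into a single segmentation decision, assuming per-frame metadata (caption, object labels, and a native-AD flag) from earlier pipeline stages; `captionSimilarity` and the threshold values are assumptions, not the paper's exact parameters.

```typescript
// Decide whether the current frame starts a new segment, applying the four
// heuristics above in order.
interface FrameInfo {
  time: number;              // seconds from the start of the video
  hasNativeAD: boolean;      // an original audio description starts here
  caption: string;           // visual scene description of the frame
  objectLabels: string[];    // detected object types in the frame
}

const MAX_SEGMENT_SECONDS = 5;
const CAPTION_SIM_THRESHOLD = 0.6;   // assumed value
const OBJECT_CHANGE_THRESHOLD = 0.5; // assumed value

declare function captionSimilarity(a: string, b: string): number; // hypothetical, returns a score in [0, 1]

function isNewSegment(prev: FrameInfo, curr: FrameInfo, lastCutTime: number): boolean {
  if (curr.hasNativeAD) return true;                                 // native AD exists
  if (captionSimilarity(prev.caption, curr.caption) < CAPTION_SIM_THRESHOLD) {
    return true;                                                     // visual scene changes
  }
  const shared = curr.objectLabels.filter((l) => prev.objectLabels.includes(l)).length;
  const union = new Set([...prev.objectLabels, ...curr.objectLabels]).size;
  if (union > 0 && shared / union < OBJECT_CHANGE_THRESHOLD) {
    return true;                                                     // object information changes
  }
  return curr.time - lastCutTime >= MAX_SEGMENT_SECONDS;             // maximum interval reached
}
```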
2️⃣ Object depth estimation
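As a hedged illustration, assuming a monocular depth estimator has produced a per-pixel relative depth map for the key frame, the object's estimated 3D position (used by the spatial sound above) could be derived roughly as follows; the coordinate conventions and names are assumptions.

```typescript
// Reduce a per-frame depth map to a per-object 3D position.
interface BBox { x: number; y: number; w: number; h: number; }

function objectPosition(
  depthMap: Float32Array, width: number, height: number, bbox: BBox
): { x: number; y: number; z: number } {
  // Median depth over the object's bounding box is more robust than the mean.
  const samples: number[] = [];
  for (let row = bbox.y; row < bbox.y + bbox.h; row++) {
    for (let col = bbox.x; col < bbox.x + bbox.w; col++) {
      samples.push(depthMap[row * width + col]);
    }
  }
  samples.sort((a, b) => a - b);
  const depth = samples[Math.floor(samples.length / 2)];

  // Map the bounding-box center to [-1, 1] screen coordinates, with the viewer
  // at the origin facing -z (the Web Audio listener's default orientation).
  const cx = ((bbox.x + bbox.w / 2) / width) * 2 - 1;
  const cy = 1 - ((bbox.y + bbox.h / 2) / height) * 2;
  return { x: cx, y: cy, z: -depth };
}
```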
3️⃣ Object segmentation and description generation
Examples of Improved Descriptions from the ML Pipeline
Video talk
BibTeX
@inproceedings{ning2024spica,
  title     = {SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers},
  author    = {Zheng Ning and Brianna L. Wimer and Kaiwen Jiang and Keyi Chen and Jerrick Ban and Yapeng Tian and Yuhang Zhao and Toby Jia-Jun Li},
  year      = {2024},
  booktitle = {Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  doi       = {10.1145/3613904.3642632},
  isbn      = {979-8-4007-0330-0/24/05}
}