Look-Hear:
Gaze Prediction for Speech-directed Human Attention
For computer systems to effectively interact with humans using spoken language, they need to understand how the words being generated affect the users' moment-by-moment attention. Our study focuses on the incremental prediction of attention as a person views an image while hearing a referring expression that specifies the object in the scene to be fixated. To predict the gaze scanpaths in this incremental object referral task, we developed the Attention in Referral Transformer model, or ART, which predicts the human fixations spurred by each word in a referring expression.
The overall architecture of ART is shown in the figure below. The input to ART consists of an image and each word from a referring expression. The model jointly learns gaze behavior and the corresponding object grounding tasks through a multimodal transformer encoder, referred to as the "Visuo-Linguistic Encoder." The output is a sequence of fixations, which we dub a "pack," generated by the transformer decoder ("Pack Decoder"). ART employs an autoregressive design, in which the Pack Decoder predicts a variable number of fixations for each new word in the referring expression, conditioned on the history of previous fixations. Code and model weights are available at our official repository.
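To make the autoregressive "pack" decoding concrete, here is a minimal sketch (not the official implementation) of the idea: for each new word, the Pack Decoder attends to the multimodal memory produced by the Visuo-Linguistic Encoder and to the fixation history, and emits a variable-length group of fixations. All module names, dimensions, and the end-of-pack convention below are illustrative assumptions; please refer to the official repository for the actual architecture.

import torch
import torch.nn as nn

D = 256          # hidden size (assumed)
MAX_PACK = 6     # cap on fixations emitted per word (assumed)

class ARTSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Visuo-Linguistic Encoder: fuses image patch features and word
        # embeddings into a joint multimodal memory (stand-in transformer).
        enc_layer = nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True)
        self.vl_encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # Pack Decoder: attends to the multimodal memory and the fixation
        # history to produce the next fixation.
        dec_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
        self.pack_decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.fix_embed = nn.Linear(2, D)      # embed (x, y) of a past fixation
        self.fix_head = nn.Linear(D, 2)       # predict (x, y) of the next fixation
        self.stop_head = nn.Linear(D, 1)      # probability that the pack ends here

    def forward(self, image_feats, word_feats, fixation_history):
        # image_feats: (B, P, D) patch features; word_feats: (B, W, D) words heard so far
        memory = self.vl_encoder(torch.cat([image_feats, word_feats], dim=1))
        fixations = list(fixation_history)    # list of (B, 2) tensors
        # Autoregressively grow the pack for the newest word.
        for _ in range(MAX_PACK):
            tgt = self.fix_embed(torch.stack(fixations, dim=1))   # (B, T, D)
            h = self.pack_decoder(tgt, memory)[:, -1]             # last-step state
            fixations.append(self.fix_head(h))                    # next (x, y)
            if torch.sigmoid(self.stop_head(h)).mean() > 0.5:     # end-of-pack signal
                break
        return fixations[len(fixation_history):]                  # the newly emitted pack

# Toy usage with random features standing in for real image/word encodings.
model = ARTSketch()
img = torch.randn(1, 49, D); words = torch.randn(1, 3, D)
pack = model(img, words, fixation_history=[torch.zeros(1, 2)])
print(len(pack), "fixations predicted for the latest word")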
To train ART, we created RefCOCO-Gaze, a large-scale, lab-quality dataset in which we collected the eye movements of humans searching for a target object in an image while hearing a referring expression describing that target. The dataset includes 19,738 human gaze scanpaths corresponding to 2,094 unique image-expression pairs, collected from 220 participants performing our referral task. The size of this dataset parallels that of previous large-scale gaze datasets we collected, such as COCO-Search18 and COCO-Freeview. However, unlike those datasets, in RefCOCO-Gaze the target is designated by a complex referring expression (e.g., “red baseball glove on the desk”) and the image may contain other objects belonging to the same category as the target, making it even more challenging to precisely localize the target without the broader descriptive context. Our more ecologically valid incremental object referral task therefore contributes to cognitive science by enabling the study of natural referring expressions in real-world image contexts and by generating testable hypotheses about how humans integrate language and vision.
Although there is still room for improvement, ART not only outperforms current state-of-the-art models in scanpath prediction, but also effectively captures several human attention patterns, including waiting, scanning, and verification. For more detailed results, including model comparisons and ablation studies, please see our paper!
Dataset:
RefCOCO-Gaze consists of 19,738 scanpaths that were recorded while 220 participants with normal or corrected-to-normal vision viewed 2,094 COCO images and listened to the associated referring expressions from the RefCOCO dataset.
The gaze data, recorded with an EyeLink 1000 eye tracker, includes the location and duration of each fixation, the bounding box of the search target, audio recordings of the referring expressions, the timing of the target word, and the synchronization between the spoken words and the sequence of fixations (i.e., which word triggered which fixations). RefCOCO-Gaze covers a diverse range of linguistic and visual complexity, making it a valuable dataset for researchers studying how humans integrate vision and language, as well as for HCI researchers.
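As an illustration of how such scanpath records might be consumed, the snippet below iterates over dataset entries and groups fixations by the word that triggered them. The file name and all field names (refcoco_gaze.json, "fixations", "word_index", "target_bbox", etc.) are assumptions made for this example; please consult the dataset repository for the actual file layout and schema.

import json

with open("refcoco_gaze.json") as f:          # hypothetical filename
    records = json.load(f)

for rec in records[:5]:
    fixations = rec["fixations"]              # assumed: list of {"x", "y", "duration_ms", "word_index"}
    words = rec["expression"].split()         # the spoken referring expression
    bbox = rec["target_bbox"]                 # assumed: [x, y, w, h] of the search target
    # Group fixations by the word that triggered them, using the
    # word/fixation synchronization described above.
    per_word = {}
    for fix in fixations:
        per_word.setdefault(fix["word_index"], []).append((fix["x"], fix["y"]))
    for i, word in enumerate(words):
        print(f"'{word}' -> {len(per_word.get(i, []))} fixations")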
For download links and details of our RefCOCO-Gaze dataset, please visit our dedicated dataset repository.
@InProceedings{mondal2024look,
author = {Mondal, Sounak and Ahn, Seoyoung and Yang, Zhibo and Balasubramanian, Niranjan and Samaras, Dimitris and Zelinsky, Gregory and Hoai, Minh},
title = {Look Hear: Gaze Prediction for Speech-directed Human Attention},
booktitle = {The European Conference on Computer Vision (ECCV)},
month = {September},
year = {2024}
}