Human gaze behavior prediction is important for behavioral vision and for computer vision applications. Most models focus on predicting free-viewing behavior using saliency maps, but do not generalize to goal-directed behavior, such as when a person searches for a visual target object. We propose the first inverse reinforcement learning (IRL) model to learn the internal reward function and policy used by humans during visual search. We modeled the viewer's internal belief states as dynamic contextual belief maps of object locations. These maps were learned and then used to predict behavioral scanpaths for multiple target categories. To train and evaluate our IRL model we created COCO-Search18, which is now the largest dataset of high-quality search fixations in existence. COCO-Search18 has 10 participants searching for each of 18 target-object categories in 6202 images, yielding about 300,000 goal-directed fixations. When trained and evaluated on COCO-Search18, the IRL model outperformed baseline models in predicting search fixation scanpaths, both in terms of similarity to human search behavior and search efficiency. Finally, reward maps recovered by the IRL model reveal distinctive target-dependent patterns of object prioritization, which we interpret as a learned object context.
Yupei Chen, Zhibo Yang, Seoyoung Ahn, Dimitris Samaras, Minh Hoai, and Gregory Zelinsky. COCO-Search18 fixation dataset for predicting goal-directed attention control. Scientific Reports, 11 (1), 1-11, 2021. https://www.nature.com/articles/s41598-021-87715-9
Zhibo Yang, Lihan Huang, Yupei Chen, Zijun Wei, Seoyoung Ahn, Gregory Zelinsky, Dimitris Samaras, & Minh Hoai. Predicting Goal-directed Human Attention Using Inverse Reinforcement Learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), oral, 2020.
Yupei Chen & Gregory Zelinsky. Predicting Goal-directed Attention Control Using Inverse Reinforcement Learning and COCO-Search18. Vision Sciences Society (VSS), oral, 2020.
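
As a simplified illustration of how a reward function can be recovered from fixation demonstrations, the sketch below runs tabular maximum-entropy IRL on an image discretized into a small grid of candidate fixation locations. It is not the published model, which operates on dynamic contextual belief maps with a learned neural policy; the grid size, scanpath horizon, learning rate, and toy demonstration scanpaths are all illustrative assumptions.

```python
# Minimal tabular MaxEnt-IRL sketch: fit a per-cell reward so that a soft-optimal
# policy reproduces the empirical fixation statistics of (toy) human scanpaths.
import numpy as np

GRID = 8                      # 8 x 8 grid of candidate fixation cells
N = GRID * GRID               # number of states; an action = choosing the next fixation cell
T = 6                         # scanpath length (number of chosen fixations)

def soft_policy(reward):
    """Finite-horizon soft value iteration; transitions are deterministic:
    choosing action a moves the fixation to cell a, regardless of the current cell."""
    V = np.zeros(N)
    policies = []
    for _ in range(T):
        Q = reward + V                    # Q_t(s, a) is the same for every s in this toy setup
        m = Q.max()
        logZ = m + np.log(np.exp(Q - m).sum())
        policies.append(np.exp(Q - logZ)) # soft policy over next-fixation cells
        V = np.full(N, logZ)
    return policies[::-1]                 # reorder to t = 0 .. T-1

def expected_visits(policies):
    """Expected number of times each cell is fixated under the current policy."""
    visits = np.zeros(N)
    for pi in policies:
        visits += pi                      # state-independent policy -> marginal = pi
    return visits

def empirical_visits(scanpaths):
    visits = np.zeros(N)
    for path in scanpaths:
        for cell in path:
            visits[cell] += 1
    return visits / len(scanpaths)

# Toy "demonstrations": scanpaths that favour the lower-right quadrant of the grid.
rng = np.random.default_rng(0)
target_cells = [r * GRID + c for r in range(4, 8) for c in range(4, 8)]
scanpaths = [rng.choice(target_cells, size=T).tolist() for _ in range(50)]

reward = np.zeros(N)
for step in range(200):                   # gradient ascent on the MaxEnt log-likelihood
    grad = empirical_visits(scanpaths) - expected_visits(soft_policy(reward))
    reward += 0.05 * grad

print(np.round(reward.reshape(GRID, GRID), 2))   # recovered reward favours the demonstrated region
```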
People spend a significant amount of their time freely viewing the world in the absence of a task. The dominant class of models attempting to explain this free-viewing behavior computes saliency, a measure of local feature contrast in an image, to obtain a strictly bottom-up attention priority map. Our contention is that the directionality of attention control may be exactly opposite; that free viewing may be guided by a top-down control process that we refer to as multiple-object search. Unlike standard search in which there is typically only a single target, multiple-object search distributes the target goal over several objects, thereby diluting the contribution of any one and creating a diffuse object-priority signal. To compute this signal we borrowed computer vision methods for localizing a trained object class in an image by backpropagating activity from a high layer of a deep network to lower layers. For each scene, the location of maximum multiple-object-map activity was selected for fixation, followed by spatial inhibition and the iterative selection of the next most active location until six-fixation scanpaths were obtained. We evaluated our method by predicting the free-viewing fixations in the OSIE and MIT-ICCV datasets. Compared to the predictions from several bottom-up saliency models, the predictions from multiple-object priority maps were superior.
Yupei Chen & Gregory Zelinsky. Multiple-object Control Predicts Movements of Attention During Free Viewing. Vision Sciences Society (VSS), 2019.
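
The fixation-selection loop described above, winner-take-all followed by spatial inhibition, can be sketched as follows. The priority map, inhibition radius, and scanpath length used here are placeholders rather than values taken from the study.

```python
# Winner-take-all fixation selection with Gaussian inhibition of return.
import numpy as np

def generate_scanpath(priority_map, n_fixations=6, ior_sigma=30.0):
    pmap = priority_map.astype(float).copy()
    h, w = pmap.shape
    ys, xs = np.mgrid[0:h, 0:w]
    scanpath = []
    for _ in range(n_fixations):
        y, x = np.unravel_index(np.argmax(pmap), pmap.shape)   # winner-take-all
        scanpath.append((int(x), int(y)))
        # Inhibition of return: multiplicatively suppress activity near the fixation.
        dist2 = (ys - y) ** 2 + (xs - x) ** 2
        pmap *= 1.0 - np.exp(-dist2 / (2.0 * ior_sigma ** 2))
    return scanpath

# Placeholder priority map; in the model this would come from backpropagating
# object-detector activations into the image.
rng = np.random.default_rng(1)
priority = rng.random((240, 320))
print(generate_scanpath(priority))
```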
Attention controls the selective routing of visual inputs for classification. This “spotlight” of attention has been assumed to be a Gaussian, but here we propose that this routing occurs in the form of a shape. We show that a model of attention control that spatially averages saliency values over proto-objects (POs), fragments of feature-similar visual space, is better able to predict the fixation density maps and scanpaths made during the free viewing of 384 natural scenes by 12 participants than comparable saliency models that do not consider shape. We further show that this image-computable PO model is nearly as good in predicting fixations (density and scanpaths) as a model of fixation prediction that uses hand-segmented object labels. We interpret these results as suggesting that the spotlight of attention has a shape, and that these shapes can be quantified as regions of space that we refer to as proto-objects.
Yupei Chen & Gregory Zelinsky. Is there a shape to the attention spotlight? Computing saliency over proto-objects predicts fixations during scene viewing. Journal of Experimental Psychology: Human Perception and Performance, 2018.
Yupei Chen & Gregory Zelinsky. Computing Saliency over Proto-Objects Predicts Fixations During Scene Viewing. Vision Sciences Society (VSS), oral, 2017.
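
The core proto-object step, spatially averaging pixel-level saliency within each feature-similar fragment, can be sketched as below. The saliency map and fragment labels are random placeholders; any saliency model and any proto-object segmentation could supply them.

```python
# Average saliency within each proto-object fragment to build a PO priority map.
import numpy as np

def proto_object_priority(saliency, segments):
    """Replace each pixel's saliency with the mean saliency of its proto-object."""
    priority = np.zeros_like(saliency, dtype=float)
    for label in np.unique(segments):
        mask = segments == label
        priority[mask] = saliency[mask].mean()
    return priority

rng = np.random.default_rng(2)
saliency = rng.random((120, 160))                      # placeholder pixel-level saliency map
segments = rng.integers(0, 40, size=(120, 160))        # placeholder proto-object labels
po_map = proto_object_priority(saliency, segments)
# Fixations can then be selected from po_map, e.g. with a winner-take-all +
# inhibition-of-return loop like the one sketched earlier.
print(po_map.shape, float(po_map.max()))
```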
Traditional saliency models predict fixations during scene viewing by computing local contrast between low-level color, intensity, and orientation features; the higher the summed contrast, the greater the probability of fixation. The literature suggests that shape may also be a basic guiding feature. A computational saliency model with shape contrast added was shown to outperform models without shape contrast in a search task in which participants searched for a simple target among distractors. This suggests that shape should be treated as a guiding feature and that adding shape contrast to saliency models is beneficial.
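
A minimal sketch of adding a shape channel to a center-surround saliency computation is given below; the channels used (intensity, a red-green opponency stand-in, and smoothed edge density as a crude shape proxy) and the filter scales are illustrative assumptions, not the model evaluated in that work.

```python
# Center-surround contrast summed over feature channels, with an extra "shape" channel.
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def center_surround(channel, center_sigma=2.0, surround_sigma=8.0):
    """Local contrast = |fine-scale response - coarse-scale (surround) response|."""
    return np.abs(gaussian_filter(channel, center_sigma) -
                  gaussian_filter(channel, surround_sigma))

def saliency_with_shape(rgb):
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    intensity = (r + g + b) / 3.0
    red_green = r - g                                   # simple colour-opponency stand-in
    edges = np.hypot(sobel(intensity, axis=0), sobel(intensity, axis=1))
    shape = gaussian_filter(edges, 2.0)                 # crude local-shape / contour channel
    channels = [intensity, red_green, shape]
    return sum(center_surround(c) for c in channels)    # summed contrast ~ fixation priority

rng = np.random.default_rng(3)
image = rng.random((120, 160, 3))                       # placeholder RGB image in [0, 1]
smap = saliency_with_shape(image)
print(smap.shape, float(smap.max()))
```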