Research

Research Interests:

  • Computational Cognitive Neuroscience
  • Machine Learning; Artificial Intelligence; Computer Vision
  • Visual Cognition; Computational Modeling; Visual Attention; Object Recognition
  • Assistive Technology; Robotics



Learned Region Sparsity and Diversity Also Predict Visual Attention

Learned region sparsity has achieved state-of-the-art performance in classification tasks by selecting and integrating a sparse set of informative local regions into global decisions. The underlying mechanism resembles how people sample information from an image with their eye movements when making similar decisions. In this paper we incorporate the biologically plausible mechanism of Inhibition of Return into the learned region sparsity model, thereby imposing diversity on the selected regions. We investigate how these mechanisms of sparsity and diversity relate to visual attention by testing our model on three different types of visual search tasks. We report state-of-the-art results in predicting the locations of human gaze fixations, even though our model is trained only on image-level labels without object location annotations. Notably, the classification performance of the extended model remains the same as that of the original. This work suggests a new computational perspective on visual attention mechanisms, and shows how the inclusion of attention-based mechanisms can improve computer vision techniques.
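
For illustration, a minimal sketch of the core idea is given below: greedily keep a few high-scoring candidate regions while an Inhibition-of-Return-style overlap constraint enforces diversity, then pool the kept scores into an image-level decision. The region proposals, scores, and overlap threshold are placeholders, not the published model.

    # Minimal sketch: sparse region selection with a diversity (IoR-like) constraint.
    # Boxes, scores, and the IoU threshold below are illustrative placeholders.
    import numpy as np

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / float(area_a + area_b - inter)

    def select_sparse_diverse(boxes, scores, k=3, max_iou=0.3):
        # Greedily keep the k highest-scoring regions, skipping any region that
        # overlaps an already-selected one (the diversity / IoR constraint).
        order = np.argsort(scores)[::-1]
        keep = []
        for i in order:
            if all(iou(boxes[i], boxes[j]) <= max_iou for j in keep):
                keep.append(i)
            if len(keep) == k:
                break
        return keep

    # Toy example: the image-level score pools the selected sparse set, and the
    # selected boxes double as a fixation prediction.
    boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [80, 20, 140, 90], [30, 100, 90, 160]])
    scores = np.array([0.9, 0.85, 0.7, 0.4])
    keep = select_sparse_diverse(boxes, scores)
    print("selected regions:", keep, "image score:", scores[keep].mean())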

  • Wei*, Z., Adeli*, H., Zelinsky, G., Samaras, D., Hoai, M. (2016) Learned region sparsity and diversity also predicts visual attention. In Neural Information Processing Systems (NIPS) (pp. 1894-1902). *Equal Contribution. [Link] [PDF] [Scholar]
  • Wei, Z., Adeli, H., Zelinsky, G., Samaras, D., Hoai, M. (2017) Region Ranking and Selection for Image Classification and Attention Prediction. Submitted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Predicting Scanpath Agreement during Scene Viewing using Deep Neural Networks

Eye movements are a widely used measure of overt shifts of attention, but this measure is often limited by poor agreement in people's gaze, which can vary significantly in the context of free viewing. In this work we ask whether the level of scanpath agreement among participants during scene viewing, quantified using a modified version of the MultiMatch method (Dewhurst et al., 2012), can be predicted using a Deep Neural Network (DNN). Specifically, using image features extracted from the last convolutional layer of a DNN trained for object recognition, we learned a linear weighting in which positive regressor weights indicated image features whose presence resulted in greater gaze agreement among viewers. Image regions corresponding to these features were then found by back-propagating the features to the image space using the probabilistic Selective Tuning Attention model (Zhang et al., 2016, ECCV). Combining these regions from all positively weighted features yielded an activation map reflecting the image features important for predicting scanpath consistency among people freely viewing scenes. The model was trained on a randomly selected 80% of the MIT1003 dataset (Judd et al., 2009) and tested on the remaining 20%, repeated 10 times. We found that this linear regressor model was able to predict for each image the level of agreement in the viewers' scanpaths (r = 0.3, p < .01). Consistent with previous findings, our qualitative analyses also showed that the features of text, faces, and bodies were especially important in predicting gaze agreement. This work introduces a novel method for predicting scanpath agreement, and for identifying the underlying image features important for creating agreement in collective viewing behavior. Future work will extend this approach to identify those features of a target goal that are important for producing uniformly strong attentional guidance in the context of visual search tasks.
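
A minimal sketch of the regression step is given below, assuming per-image DNN features and per-image agreement scores are already in hand (random placeholders here; in the study the features came from the last convolutional layer of an object-recognition DNN and the scores from the modified MultiMatch comparison).

    # Ridge regression from image features to a scanpath-agreement score,
    # evaluated with repeated 80/20 splits. All data below are placeholders.
    import numpy as np

    rng = np.random.default_rng(0)
    n_images, n_features = 1003, 512              # sizes are illustrative
    X = rng.normal(size=(n_images, n_features))   # pooled conv-layer features (placeholder)
    y = rng.normal(size=n_images)                 # agreement score per image (placeholder)

    def ridge_fit(X, y, lam=1.0):
        # Closed-form ridge regression: w = (X^T X + lam*I)^(-1) X^T y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    rs = []
    for _ in range(10):                           # repeated random 80/20 splits
        idx = rng.permutation(n_images)
        split = int(0.8 * n_images)
        tr, te = idx[:split], idx[split:]
        w = ridge_fit(X[tr], y[tr])
        rs.append(np.corrcoef(X[te] @ w, y[te])[0, 1])
    print("mean Pearson r over splits:", np.mean(rs))

    # Positively weighted features (w > 0) are those whose presence predicts
    # greater gaze agreement; in the study these were traced back to image
    # regions with the Selective Tuning attention model.
    positive_features = np.flatnonzero(ridge_fit(X, y) > 0)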

Conference Abstracts:

  • Wei, Z., Adeli, H., Zelinsky, G., Samaras, D., Hoai, M., Predicting Scanpath Agreement during Scene Viewing using Deep Neural Networks, Vision Sciences Society Meeting (VSS) 2017, FL USA


A Computational Biased Competition Model of Visual Attention using Deep Neural Networks

“Biased competition theory” proposes that visual attention reflects competition among bottom-up signals at multiple stages of processing, and the biasing of this competition by top-down spatial, feature, and object-based modulations. Our work advances this theory in two key respects: by instantiating it as a computational model having an image-based “front-end”, thereby enabling predictions using real-world stimuli, and by using an 8-layer deep neural network to model ventral pathway visual processing. A categorical cue (object name) activates a specific frontal node (goal state; layer 8), which feeds activation back to modulate Inferior Temporal cortex (IT; layers 7-6) and V4 (layer 5) using the same feedforward weights trained for object classification. This feedback is multiplied by the feedforward bottom-up activation, biasing the competition in favor of target features (feature-based attention). Reentrant connectivity between V4 and FEF selects a spatial location (spatial attention), causing the selective routing (attentional gating) of object information at that location. This routing constricts the receptive fields of IT units to a single object and makes possible its verification as a member of the cued category. Biased retinotopic V4 activation and spatial biases from FEF and LIP (which maintains an Inhibition-of-Return map) project to the superior colliculus, where they are integrated into a priority map used to direct movements of overt attention. We tested our model using a categorical search task (15 subjects, 25 categories of common objects, 5 set sizes), where it predicted almost perfectly the number of fixations and the saccade distance travelled to search targets (attentional guidance), as well as recognition accuracy following target fixation. In conclusion, this biologically plausible biased competition model, built using a deep neural network, not only can predict attention and recognition performance in the context of categorical search, but can also serve as a computational framework for testing predictions of brain activity throughout the cortico-collicular attention circuit.
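
The sketch below illustrates, in highly simplified form, only the feature-based biasing step: feedback for a cued category reuses the feedforward classification weights as multiplicative gains on the bottom-up activation. All sizes and values are placeholders, not the full model.

    # Feature-based attention as multiplicative top-down gain (illustrative only).
    import numpy as np

    rng = np.random.default_rng(1)
    n_units, n_categories = 256, 25
    W_class = rng.normal(size=(n_categories, n_units))   # feedforward weights to category nodes (placeholder)
    bottom_up = rng.random(n_units)                       # feedforward activation at one retinotopic location

    def feature_bias(bottom_up, W_class, cued_category):
        # Feedback gain: the cued category's rectified, normalized weights,
        # multiplied into the bottom-up activation (biased competition).
        gain = np.maximum(W_class[cued_category], 0.0)
        gain = gain / (gain.max() + 1e-8)
        return bottom_up * gain

    biased = feature_bias(bottom_up, W_class, cued_category=7)

    # A priority value for this location could then be the summed biased
    # activation, further modulated by spatial biases (e.g., an Inhibition-of-
    # Return map) before being read out by the saccade map.
    print(biased.sum())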

Conference Abstracts:

  • Adeli, H., Zelinsky, G., A Computational Biased Competition Model of Visual Attention using Deep Neural Networks, Vision Sciences Society Meeting (VSS) 2016, FL USA [Slides]






A Computational Model of Attention in the Superior Colliculus

Modern image-based models of search prioritize fixation locations using target maps that capture visual evidence for a target goal. But while many such models are biologically plausible, none have looked to the oculomotor system for design inspiration or parameter specification. These models also focus disproportionately on specific target exemplars, ignoring the fact that many important targets are categories (e.g., weapons, tumors). We introduce MASC, a Model of Attention in the Superior Colliculus (SC). MASC differs from other image-based models in that it is grounded in the neurophysiology of the SC, a midbrain structure implicated in programming saccades, the very behaviors to be predicted. It first creates a target map in one of two ways: by comparing a target image to objects in a search display (exemplar search), or by using an SVM classifier trained on the target category to estimate the probability of search display objects being target category members (categorical search).
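
The two target-map modes can be sketched as follows over a grid of candidate display locations; the features, grid size, and sigmoid calibration are illustrative assumptions rather than the published implementation.

    # Exemplar vs. categorical target maps over a grid of locations (illustrative).
    import numpy as np

    rng = np.random.default_rng(2)
    grid_h, grid_w, n_feat = 8, 8, 64
    scene_feats = rng.normal(size=(grid_h, grid_w, n_feat))   # features at each display location (placeholder)

    def exemplar_target_map(scene_feats, target_feat):
        # Exemplar search: cosine similarity between the target image's features
        # and the features at every location.
        norm = np.linalg.norm(scene_feats, axis=-1) * np.linalg.norm(target_feat)
        return (scene_feats @ target_feat) / (norm + 1e-8)

    def categorical_target_map(scene_feats, svm_w, svm_b):
        # Categorical search: signed distance to a trained SVM boundary, squashed
        # to a probability of being a category member (Platt-style sigmoid).
        dist = scene_feats @ svm_w + svm_b
        return 1.0 / (1.0 + np.exp(-dist))

    target_feat = rng.normal(size=n_feat)
    svm_w, svm_b = rng.normal(size=n_feat), 0.0
    print(exemplar_target_map(scene_feats, target_feat).shape)      # (8, 8) target map
    print(categorical_target_map(scene_feats, svm_w, svm_b).shape)  # (8, 8) target map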

  • Adeli, H., Vitu, F., & Zelinsky, G. J. (2017). A model of the superior colliculus predicts fixation locations during scene viewing and visual search. Journal of Neuroscience, 37(6), 1453-1467.
  • Vitu, F., Casteau, S., Adeli, H., Zelinsky, G. J., & Castet, E. (2017). The magnification factor accounts for the greater hypometria and imprecision of larger saccades: Evidence from a parametric human-behavioral study. Journal of Vision, 17(4):2, 1–38. [Link] [PDF]

Conference Abstracts:

  • Zelinsky, G., Adeli, H., Vitu, F., Modeling attention and saccade programming in real-world contexts. European Conference on Visual Perception (ECVP) 2016, Barcelona Spain
  • Cooper, B., Adeli, H., Zelinsky, G., McPeek, R., Macaque monkey use of categorical target templates to search for real-world objects. European Conference on Visual Perception (ECVP) 2016, Barcelona Spain
  • Cooper, B., Adeli, H., Zelinsky, G., McPeek, R., Macaque monkey use of categorical target templates to search for real-world objects. Society for Neuroscience Meeting (SFN) 2016, CA USA
  • Zelinsky, G. J., Adeli, H., Vitu, F., The new best model of visual search can be found in the brain, Vision Sciences Society Meeting (VSS) 2016, FL USA [Poster]
  • Adeli, H., Casteau, S., Vitu, F., Zelinsky, G., An image-based population model of human saccade programming in the Superior Colliculus, Vision Sciences Society Meeting (VSS) 2014, FL USA



Reading without a lexicon

Most models of eye-movement control during reading assume that saccadic behavior primarily reflects ongoing word-identification processing. Here we show, in contradiction to this view, that an image-based model of saccade programming in the superior colliculus (SC) can predict the highly stereotyped saccadic behavior observed during reading, simply by averaging early visual signals. Twenty-nine native French speakers read 316 French sentences presented one at a time on a computer screen, while their eye movements were recorded. Images of the sentences were input to the model. Like participants, the model initially fixated the beginning of each sentence. On each fixation, it first performed gaze-contingent blurring of the sentence to reflect visual acuity limitations. A luminance-contrast saliency map was then computed on the blurred image and projected onto the fovea-magnified space of the SC, where neural population activity was averaged first over the visual map and then over the motor map. Averaging over the most active motor population determined the subsequent saccade vector. The new fixation location was in turn inhibited to prevent later oculomotor return. Results showed that the model, like participants, mainly made left-to-right, forward saccades, with relatively few regressive saccades (21% and 20%, respectively). The model also successfully captured benchmark word-based eye-movement patterns, replicated here: a greater likelihood to skip shorter and nearer words, a preferred landing position near the centers of words, a linear relationship between a saccade's launch site and its landing site, a greater likelihood to refixate a word when the initial fixation deviated from the word's center, and more regressions following word skipping. Thus, eye movements during reading primarily reflect fundamental visuo-motor principles rather than ongoing language-related processes. The proof is that a model of the SC, which treats sentences as a meaningless visual stimulus, reproduces readers' eye-movement patterns, despite being unable to recognize words!
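
A highly simplified, illustrative sketch of one fixation-to-saccade step is given below; it omits the fovea-magnified collicular projection and uses stand-in choices for the gaze-contingent blur, the contrast-saliency measure, the population-averaging window, and the Inhibition-of-Return region.

    # One simplified fixation-to-saccade step: blur, saliency, smooth, pick peak.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def next_fixation(image, fixation, inhibition, pool_sigma=5.0):
        # 1. Gaze-contingent blur: blur more with distance from the current fixation.
        ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
        w = np.hypot(ys - fixation[0], xs - fixation[1])
        w = w / w.max()
        blurred = gaussian_filter(image, sigma=2.0) * w + image * (1 - w)
        # 2. Luminance-contrast saliency: local deviation from the local mean.
        local_mean = gaussian_filter(blurred, sigma=3.0)
        saliency = gaussian_filter((blurred - local_mean) ** 2, sigma=3.0)
        # 3. Population averaging: smooth the inhibited map and take its peak as the
        #    saccade target, standing in for averaging over the most active motor population.
        motor = gaussian_filter(saliency * (1 - inhibition), sigma=pool_sigma)
        return np.unravel_index(np.argmax(motor), motor.shape)

    image = np.random.default_rng(3).random((60, 400))   # placeholder "sentence" image
    inhibition = np.zeros_like(image)
    yy, xx = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    fix = (30, 5)                                         # start at the sentence beginning
    for _ in range(5):
        fix = next_fixation(image, np.array(fix, float), inhibition)
        inhibition[np.hypot(yy - fix[0], xx - fix[1]) < 15] = 1.0   # Inhibition of Return
        print("fixation:", fix)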

Conference Abstracts:

  • Vitu, F., Adeli, H., Zelinsky, G., Reading without a lexicon: An illiterate model of saccade programming in the superior colliculus predicts where readers move their eyes!, Vision Sciences Society Meeting (VSS) 2016, FL USA


Modelling eye movements in a categorical search task

We introduce a model of eye movements during categorical search, the task of finding and recognizing categorically defined targets. It extends a previous model of eye movements during search (target acquisition model, TAM) by using distances from a support vector machine classification boundary to create probability maps indicating pixel-by-pixel evidence for the target category in search images. Other additions include functionality enabling target-absent searches, and a fixation-based blurring of the search images now based on a mapping between visual and collicular space. We tested this model on images from a previously conducted variable set-size (6/13/20) present/absent search experiment where participants searched for categorically defined teddy bear targets among random category distractors. The model not only captured target-present/absent set-size effects, but also accurately predicted for all conditions the numbers of fixations made prior to search judgements. It also predicted the percentages of first eye movements during search landing on targets, a conservative measure of search guidance. Effects of set size on false negative and false positive errors were also captured, but error rates in general were overestimated. We conclude that visual features discriminating a target category from non-targets can be learned and used to guide eye movements during categorical search.
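
The fixation loop can be sketched as below, with assumed acceptance and rejection thresholds; it is meant only to illustrate how a pixel-wise target-probability map could drive fixations toward a present/absent judgement, not to reproduce the published model.

    # Threshold-driven search over a target-probability map (illustrative only).
    import numpy as np

    def categorical_search(prob_map, yes_thresh=0.85, no_thresh=0.15, max_fix=20):
        # Fixate the peak of the evidence map; respond "present" if the evidence
        # there is high enough, inhibit it and continue otherwise; respond
        # "absent" once no location exceeds the rejection threshold.
        prob_map = prob_map.copy()
        fixations = []
        for _ in range(max_fix):
            loc = np.unravel_index(np.argmax(prob_map), prob_map.shape)
            fixations.append(loc)
            if prob_map[loc] >= yes_thresh:
                return "present", fixations
            if prob_map[loc] <= no_thresh:
                return "absent", fixations
            prob_map[loc] = 0.0        # inhibit the rejected location
        return "absent", fixations

    # Toy map: one strong candidate among weak distractors.
    rng = np.random.default_rng(4)
    prob_map = rng.random((10, 10)) * 0.4
    prob_map[6, 3] = 0.9
    print(categorical_search(prob_map))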

  • Zelinsky, G., Adeli, H., Peng, Y., Samaras, D. (2013). Modelling eye movements in a categorical search task. Phil. Trans. R. Soc. B, 368(1628), 20130058.


Modelling Salient Object-Object Interactions to Generate Textual Descriptions

In this paper, we propose a new method for automatically generating textual descriptions of images. Our method consists of two main steps: using saliency maps, it detects the areas of interest in the image, and then creates the description by recognizing the interactions between the detected objects within those areas. These interactions are modeled using the pose (body-part configuration) of the objects as a reference. To create sentences, a syntactic model is used that builds sub-trees around the detected objects and then combines those sub-trees using the recognized interactions. Our results show improved accuracy and naturalness of the generated descriptions.
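
A toy sketch of the sentence-assembly step is given below, assuming the salient objects, their attributes, and their interaction have already been recognized; the flat phrase templates are a stand-in for the paper's syntactic sub-trees.

    # Combine per-object phrases with a recognized interaction (toy example).
    def noun_phrase(obj):
        # Build a small sub-tree (here just a flat phrase) around a detected object.
        words = ["a"] + obj.get("attributes", []) + [obj["noun"]]
        return " ".join(words)

    def describe(subject, interaction, obj):
        # Combine the two object sub-trees with the recognized interaction.
        return f"{noun_phrase(subject).capitalize()} is {interaction} {noun_phrase(obj)}."

    subject = {"noun": "person", "attributes": ["young"]}
    obj = {"noun": "horse", "attributes": ["brown"]}
    print(describe(subject, "riding", obj))   # "A young person is riding a brown horse."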

  • Adeli, H., Yadranjiaghdam, B., Pool, N., & Tabrizi, N. (2016, December). Modelling Salient Object-Object Interactions to Generate Textual Descriptions for Natural Images. In 2016 International Conference on Computational Science and Computational Intelligence (CSCI), (pp. 1220-1225). IEEE.