Invited Speakers


James Elder

Department of Electrical Engineering and Computer Science

Department of Psychology

York University, Toronto, Canada

Human and machine perception of 3D shape from contour

Humans see the world in 3D, and matching the reliability and generality of human 3D perception remains a goal of computer vision research. Most of this research focuses on multi-view 3D reconstruction, and humans do indeed compute depth from stereopsis and motion parallax. But human 3D perception persists even when the input is a single image, and at longer viewing distances for which stereopsis and motion cues are weak. How does the brain do this? In contrast to computer vision systems for single-view 3D reconstruction, which are generally pixel-based, we suspect that the human system for single-view 3D reconstruction is based on a sparse contour encoding. Here we study both local and global components of this shape-from-contour system, and report results of computer vision algorithms based on this representation.
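To make the idea of a sparse contour encoding concrete, below is a minimal Python sketch that reduces an image to a sparse set of polyline contours using OpenCV. The edge thresholds and the polygon-approximation tolerance are illustrative assumptions, not the representation actually studied in this work.

import cv2

def sparse_contour_encoding(image_path, canny_lo=50, canny_hi=150, eps_frac=0.01):
    """Reduce an image to a sparse set of polyline contours.

    Illustrative sketch only: the Canny thresholds and the
    approximation tolerance are assumptions, not the parameters
    from the talk.
    """
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, canny_lo, canny_hi)
    contours, _ = cv2.findContours(edges, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    sparse = []
    for c in contours:
        # Approximate each contour with a few vertices (sparse encoding).
        eps = eps_frac * cv2.arcLength(c, False)
        sparse.append(cv2.approxPolyDP(c, eps, False))
    return sparse  # each entry is an (N, 1, 2) array of polyline vertices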

Greg Zelinsky

Department of Psychology

Stony Brook University

Predicting goal-directed attention control: A tale of two deep networks

The ability to control the allocation of attention underlies all goal-directed behavior. Here two recent efforts are summarized that apply deep learning methods to model this core perceptual-cognitive ability.

The first of these is Deep-BCN, the first deep neural network implementation of the widely accepted biased-competition theory (BCT) of attention control. Deep-BCN is an 8-layer deep network pre-trained for object classification, one whose layers and their functional connectivity are mapped to early-visual (V1, V2/V3, V4), ventral (PIT, AIT), and frontal (PFC) brain areas as informed by BCT. Deep-BCN also has a superior colliculus and a frontal eye field, and can therefore make eye movements. We compared Deep-BCN's eye movements to those made by 15 people performing a categorical search for one of 25 target categories of common objects, and found that it predicted both the number of fixations during search and the saccade distance travelled before search termination. With Deep-BCN, a DNN implementation of BCT now exists that can be used to predict the neural and behavioral responses of an attention control mechanism as it mediates a goal-directed behavior: in our study, the eye movements made in search of a target goal.
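As a rough illustration of the kind of architecture described above, the PyTorch sketch below labels a small convolutional network's stages with the brain areas BCT maps them to, and applies a multiplicative top-down goal bias to the ventral features. The layer sizes, the sigmoid gating, and the embedding-based goal state are assumptions made for illustration, not Deep-BCN's actual design.

import torch
import torch.nn as nn

class DeepBCNSketch(nn.Module):
    """Schematic only: stage names follow the BCT mapping in the
    abstract (V1, V2/V3, V4 -> PIT, AIT -> PFC); channel sizes and
    the multiplicative top-down bias are illustrative assumptions."""
    def __init__(self, n_classes=25):
        super().__init__()
        self.v1   = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU())
        self.v2v3 = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.v4   = nn.Sequential(nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.pit  = nn.Sequential(nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.ait  = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(256, 512), nn.ReLU())
        self.pfc  = nn.Linear(512, n_classes)
        # Top-down goal state: one bias vector per target category,
        # applied multiplicatively to ventral features (biased competition).
        self.goal_bias = nn.Embedding(n_classes, 256)

    def forward(self, image, target_category):
        x = self.v4(self.v2v3(self.v1(image)))
        bias = self.goal_bias(target_category)[:, :, None, None]
        x = self.pit(x * torch.sigmoid(bias))  # goal biases the competition
        return self.pfc(self.ait(x))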

The second model of attention control is ATTNet, a deep network model of the ATTention Network. ATTNet is similar to Deep-BCN in that both have layers mapped to early-visual and ventral brain structures in the attention network, and both are aligned with BCT. However, they differ in two key respects. ATTNet includes layers mapped to dorsal structures, enabling it to learn how to prioritize the selection of visual inputs for the purpose of directing a high-resolution attention window. But a more fundamental difference is that ATTNet learns to shift its attention as it greedily seeks out reward. Under deep reinforcement learning, an attention shift to a target object elicits a reward that makes all of the network's states leading up to that covert action more likely to recur. ATTNet also learns to prioritize the visual input so as to efficiently control the direction of its focal routing window (the colloquial "spotlight" of attention). It does this not only to find reward faster, but also to restrict its visual inputs to potentially rewarding patterns so as to improve classification success. This selective routing behavior was quantified as a "priority map" and used to predict the gaze fixations made by 30 subjects searching 240 images from Microsoft COCO (the dataset used to train ATTNet) for a target from one of three object categories. Both the subjects and ATTNet showed evidence for attention being preferentially directed to target goals, behaviorally measured as oculomotor guidance to the targets, and other well-established findings from the search literature were also observed.
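The reward-driven attention shifts can be sketched as a minimal REINFORCE-style policy over spatial locations: a convolutional head produces a "priority map", a location is sampled as a covert attention shift, and a reward is delivered when that location falls on the target. The head, the binary reward, and the update rule below are illustrative assumptions rather than ATTNet's actual training procedure.

import torch
import torch.nn as nn

class PriorityMapPolicy(nn.Module):
    """Minimal REINFORCE-style sketch of reward-trained attention
    shifts; not ATTNet itself."""
    def __init__(self, in_channels=256):
        super().__init__()
        self.head = nn.Conv2d(in_channels, 1, 1)  # per-location priority

    def forward(self, features):
        logits = self.head(features).flatten(1)  # (B, H*W) priority map
        return torch.distributions.Categorical(logits=logits)

def reinforce_step(policy, optimizer, features, target_mask):
    # target_mask: (B, H*W) binary map, 1 where the target object lies
    # (an assumed stand-in for "reward on fixating the target").
    dist = policy(features)
    loc = dist.sample()                              # covert attention shift
    reward = target_mask.gather(1, loc[:, None]).squeeze(1).float()
    loss = -(dist.log_prob(loc) * reward).mean()     # policy gradient
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward.mean().item()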

In summary, ATTNet is the first behaviorally validated model of attention control that uses deep reinforcement learning to learn to shift a focal routing window to select image patterns. This is theoretically important in that it shows how a reward-based mechanism might be used by the brain to learn how to shift attention. Deep-BCN is also theoretically important in being the first deep network designed to capture the core tenet of BCT: that a top-down goal state biases a competition among object representations for the selective routing of a visual input, with the purpose of this selective routing being greater classification success. Together, Deep-BCN and ATTNet begin to explore the space of ways that cognitive neuroscience and machine learning can blend to form a new computational neuroscience, one harnessing the power and promise of deep learning.


David Crandall

School of Informatics, Computing, and Engineering

Indiana University

Studying visual learning in children using computer vision

Despite all of the exciting recent progress in computer vision and machine learning, algorithms still pale in comparison to the best known visual learning system: the human child. For example, compared to computers, children are amazingly efficient one-shot, weakly-supervised learners, able to recognize new object categories given experience with just a single object instance. A better understanding of how children learn could help to improve visual learning in machines. Moreover, insight into which kinds of learning strategies work well for machines may help develop better learning curricula for children, especially those who are experiencing developmental difficulties. In this talk, I'll give an overview of a collaboration with developmental psychologists that is applying computer vision to better understand visual learning in children. Using head-mounted, wearable cameras and eye gaze trackers on children and parents, we can collect good approximations of children's visual fields -- their "training data" -- during learning moments, and use computer vision techniques to analyze and characterize this data. We hypothesize that the unique properties of this training data may help lay the groundwork for more efficient visual learning.
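As a hedged illustration of this kind of analysis, the sketch below runs a pretrained object detector (torchvision's Faster R-CNN) over head-camera frames and tallies which object categories occupy the child's visual field. The frame paths, score threshold, and tallying scheme are assumptions for illustration, not the lab's actual pipeline.

import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.io import read_image
from torchvision.transforms.functional import convert_image_dtype

# Pretrained COCO detector used as a stand-in analysis tool.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def objects_in_view(frame_paths, score_thresh=0.7):
    """Count detected object categories across head-camera frames."""
    counts = {}
    with torch.no_grad():
        for path in frame_paths:
            img = convert_image_dtype(read_image(path), torch.float)
            out = model([img])[0]
            for label, score in zip(out["labels"], out["scores"]):
                if score >= score_thresh:
                    counts[int(label)] = counts.get(int(label), 0) + 1
    return counts  # COCO category id -> frequency across frames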