Object and scene vision

We can often look at a person's face and guess where they are from, yet we may not know how we did it! This question drove our curiosity, and we wondered which features are used in such a fine-grained face categorisation task. Over a hundred participants categorised close to two thousand faces as North or South Indian with an accuracy of 63.7%, and their responses were highly similar across individuals, suggesting that despite varied visual experience, everyone had learned similar face features. We then trained computer vision models using local part shapes, global face information, and representations from deep convolutional networks. Our analysis revealed that humans relied on mouth shape more than on other face parts for this task, and that the computer models had qualitatively different representations from humans. In a follow-up experiment using faces with occluded parts, we established that mouth shape can causally affect human performance more than other face-part shapes. In another follow-up experiment, we found that humans use more generic features, akin to deep convolutional representations, when categorising inverted faces.

H Katti, SP Arun, Are you from North or South India? A hard face-classification task reveals systematic representational differences between humans and machines, Journal of Vision 19(7):1 (2019), doi:10.1167/19.7.1
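
A minimal sketch of this kind of model-human comparison, in Python: a classifier is trained on face-part shape descriptors and its cross-validated per-face predictions are correlated with human responses. All arrays below (part_shape, labels, human_pct_south) are hypothetical placeholders, not the study's data.

```python
# Sketch: compare a part-shape classifier's per-face scores with human
# responses. The arrays here are random placeholders; the study used
# measured face-part shapes and labelled face photographs.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_faces = 200
part_shape = rng.normal(size=(n_faces, 20))    # e.g. mouth/eye/nose shape descriptors
labels = rng.integers(0, 2, size=n_faces)      # 0 = North, 1 = South (ground truth)
human_pct_south = rng.random(n_faces)          # fraction of observers answering "South"

# Cross-validated probability that each face is "South" under the model
clf = LogisticRegression(max_iter=1000)
model_pct_south = cross_val_predict(clf, part_shape, labels,
                                    cv=5, method='predict_proba')[:, 1]

# Agreement between model and human per-face judgements
r, p = pearsonr(model_pct_south, human_pct_south)
print(f"model-human correlation: r = {r:.2f} (p = {p:.3g})")
```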

H Katti, MV Peelen, SP Arun, Machine vision benefits from human contextual expectations, Scientific Reports 9(1), 2112

We expect a plate and spoon to be on a dining table but not on a highway. More importantly, we often see the same scene with and without its objects, as when a car pulls into a parking lot or when someone clears the plate and spoon from a table. We have reliable expectations not only about whether an object should be present, but also about its likely location and size, so human vision has an opportunity to learn independent representations of a scene and the objects in it. How systematic are these expectations across humans? Do state-of-the-art machine vision algorithms learn similar representations? In this study we found that these expectations are indeed highly systematic across humans, and that even good deep convolutional networks do not learn them. We then showed that augmenting the decisions of deep convolutional networks with human-like priors improves their performance, which gives new guidelines and constraints for training such machine vision models.
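
As an illustration of the idea (not the paper's exact method), the sketch below rescores a detector's candidate boxes with a Gaussian prior over expected object location and size; the proposals and prior parameters are made up for the example.

```python
# Sketch: rescoring detector proposals with a contextual prior over object
# location and size. Proposals and prior parameters are hypothetical; the
# study derived such priors from human expectations about scenes.
import numpy as np
from scipy.stats import norm

def context_prior(box, mu, sigma):
    """Score a box (x, y, scale) under a Gaussian expectation about
    where, and how big, the object should be in this scene."""
    x, y, s = box
    return (norm.pdf(x, mu[0], sigma[0]) *
            norm.pdf(y, mu[1], sigma[1]) *
            norm.pdf(s, mu[2], sigma[2]))

# Candidate detections: (x, y, scale) in normalised image coordinates,
# with the raw detector confidence for "car".
proposals = [((0.5, 0.8, 0.3), 0.60),   # low in the frame, plausible car size
             ((0.5, 0.1, 0.3), 0.65)]   # in the sky -- contextually unlikely

mu, sigma = (0.5, 0.75, 0.3), (0.2, 0.1, 0.1)   # scene-specific expectation

for box, conf in proposals:
    rescored = conf * context_prior(box, mu, sigma)
    print(box, f"detector={conf:.2f} rescored={rescored:.3f}")
```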

RT Pramod*, H Katti*, SP Arun, Human peripheral blur is optimal for object recognition, Vision Research 200, 108083

It is well known that humans have the highest visual acuity (resolution) at the location they are currently looking at, with acuity falling off towards the periphery. We asked whether this peripheral blur plays a bigger role than simply being an information bottleneck, and whether it is in fact optimal for object recognition. How would one test whether the human peripheral blur profile is optimal? We trained deep convolutional networks on inputs with varying levels of spatial blur, using a very large (~1 million image) natural image dataset. The key insights were: (a) blur profiles matched to human peripheral blur gave an advantage over other kinds of blur, and even over seeing the full-resolution intact scene; (b) deep convolutional networks learned faster and reached higher accuracy when their input was subjected to the human blur profile; and (c) this advantage arose because object identity is more discriminable at coarser spatial frequencies in the periphery than at the centre of the visual field.
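
A rough sketch of how such eccentricity-dependent blur can be applied to an input image, by blending between progressively blurred copies; the acuity falloff constant and the function name foveate are hypothetical stand-ins, not the profile or code used in the paper.

```python
# Sketch: eccentricity-dependent ("foveated") blur, approximated by picking,
# per pixel, the level of a blur pyramid that best matches a target blur
# that grows with distance from fixation. Falloff constant is assumed.
import numpy as np
from scipy.ndimage import gaussian_filter

def foveate(img, fixation, sigma_per_degree=0.5, px_per_degree=30):
    h, w = img.shape
    yy, xx = np.mgrid[0:h, 0:w]
    ecc = np.hypot(yy - fixation[0], xx - fixation[1]) / px_per_degree
    target_sigma = sigma_per_degree * ecc          # blur grows with eccentricity

    # Precompute a small pyramid of uniformly blurred images ...
    sigmas = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
    stack = np.stack([img if s == 0 else gaussian_filter(img, s) for s in sigmas])

    # ... and select, per pixel, the level closest to the target blur.
    idx = np.abs(target_sigma[None] - sigmas[:, None, None]).argmin(axis=0)
    return np.take_along_axis(stack, idx[None], axis=0)[0]

img = np.random.rand(240, 320)                     # placeholder grayscale image
out = foveate(img, fixation=(120, 160))
print(out.shape)
```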

We can detect real-world objects even from a brief glimpse of a complex natural scene! How do different kinds of visual features influence this decision-making process? In this study, we explained variability in human responses in a naturalistic object detection task by training state-of-the-art computer vision models that captured target-object, coarse background, and non-target-object information. We found that both task-specific and task-independent components played a role.

H Katti, MV Peelen, SP Arun, How do targets, nontargets, and scene context influence real-world object detection? Attention, Perception, & Psychophysics 79(7), 2021-2036
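
To illustrate the style of this analysis (on assumed, synthetic data), the sketch below regresses per-image human accuracy on three model-derived scores, for the target, the coarse scene background, and non-target objects, and reports the cross-validated variance explained.

```python
# Sketch: decomposing per-image human detection accuracy into target,
# scene-context and non-target contributions via linear regression.
# The three predictor columns stand in for scores from separately trained
# vision models; all values here are synthetic.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_images = 500
X = rng.normal(size=(n_images, 3))           # [target, scene, non-target] scores
true_w = np.array([0.6, 0.25, -0.15])        # hypothetical contributions
human_acc = X @ true_w + 0.3 * rng.normal(size=n_images)

reg = LinearRegression().fit(X, human_acc)
r2 = cross_val_score(reg, X, human_acc, cv=5, scoring='r2').mean()
print("weights (target, scene, non-target):", np.round(reg.coef_, 2))
print(f"cross-validated R^2 = {r2:.2f}")
```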