ANR-16-CE28-0016

A computable account of human image reconstruction from natural scenes (acronym: ACCURATE)

Funded by Agence Nationale de la Recherche (ANR-16-CE28-0016)

Consider the image in Fig. 1. The region indicated by the yellow dashed oval in A (left) presents a sharp transition from bright (red circle) to dark (blue circle): it is a well-defined edge that causes neurons in our primary visual cortex to fire vigorously. Now consider the region indicated by the yellow dashed oval in B (right): it is poorly defined by luminance changes, and will therefore elicit little response from primary visual cortex. This is the state of affairs for our visual system before it makes sense of the image.

Figure 1 An iconic picture from David Marr's Vision (1982). The characteristics of early visual operators (encircled by yellow dashed ovals) differ before (A) vs after (B) the content of the scene is interpreted: following image interpretation (magenta), contextual signals interact with the local operator and reshape its properties (compare red/blue ovals in B vs A). Although we know that this is true, we do not know exactly how it happens or what its implications for perception are.

At some stage of image understanding, we become aware that this image represents a Dalmatian dog approaching the ground; once this knowledge is gained, the well-defined edge in A becomes uninteresting, as it probably reflects a shadow, i.e. a feature of the environment that changes with illumination and does not necessarily reflect scene layout. By contrast, the poorly defined edge in B is recognized to mark the boundary between the main object in the scene (the dog) and the background: it is now one of the most interesting features. How did we go from the black/white blobs in the image to the understanding that it contains a dog, and back to a re-evaluation of the blobs in light of the newly acquired representation? In the 1980s Marr regarded this as an intractable problem: as he famously put it when discussing pictures like Fig. 1, "such images are not considered here" (Marr 1982).

Thirty years later we know a great deal more about this fundamental problem. In the 1980s, vision research was dominated by "bottom-up" or "feed-forward" accounts of visual processing (Lamme et al 1998), prompted by early electrophysiological recordings from primary visual cortex (Hubel & Wiesel 1959; 1968). Those measurements had established that individual neurons can be remarkably selective for fundamental image features such as local orientation; the characterization of single-cell properties delivered by these discoveries was complemented by perceptual measurements in humans demonstrating potential links between neuronal selectivity and sensory discrimination (Blakemore & Campbell 1969; Graham 1989). From the bottom-up perspective, feature detectors extract image fragments and feed them forward to subsequent layers, where they are assembled into increasingly complex representations (Serre et al 2007). Over the following decades, vision science has established that human image interpretation is not purely bottom-up: when we extract lines/edges and piece them together to make sense of an image, our evolving interpretation of the image is fed back to the line/edge extraction stage and can alter its operation (Bar 2004); further, this top-down process is shaped by the statistics encountered in natural scenes (Gilbert & Li 2013). If we are to understand vision, we must therefore secure an adequate account of how top-down and bottom-up signals interact with each other while we view natural scenes.
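As a concrete point of reference for the bottom-up scheme just described, the Python sketch below passes an image through a bank of oriented filters and pools the resulting responses into a coarser representation; in a purely feed-forward account, information would travel only in this direction. It is a generic illustration under our own simplifying assumptions (Gabor filters, max pooling, the parameter values shown), not the model proposed in this project.

```python
# Minimal, purely bottom-up sketch: oriented filtering followed by pooling.
# All choices below (filter type, pooling rule, parameters) are illustrative
# assumptions, not the operations proposed in this project.
import numpy as np
from scipy.signal import convolve2d


def gabor_kernel(theta, frequency=0.2, sigma=3.0, size=15):
    """Oriented Gabor filter: a standard stand-in for a V1-like edge detector."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    x_t = x * np.cos(theta) + y * np.sin(theta)                # rotated coordinate
    envelope = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))   # isotropic Gaussian envelope
    carrier = np.cos(2 * np.pi * frequency * x_t)              # oriented carrier
    return envelope * carrier


def bottom_up_responses(image, n_orientations=4):
    """First stage: local oriented filtering, one response map per orientation."""
    thetas = np.arange(n_orientations) * np.pi / n_orientations
    return np.stack([np.abs(convolve2d(image, gabor_kernel(t), mode="same"))
                     for t in thetas])


def pool(responses, block=8):
    """Second stage: local max pooling, a crude 'assembly' into a coarser representation."""
    n, h, w = responses.shape
    h2, w2 = h // block, w // block
    r = responses[:, :h2 * block, :w2 * block]
    return r.reshape(n, h2, block, w2, block).max(axis=(2, 4))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.standard_normal((64, 64))   # placeholder for a natural image
    early = bottom_up_responses(image)      # local edge measurements
    later = pool(early)                     # coarser, more 'complex' representation
    print(early.shape, later.shape)         # (4, 64, 64) (4, 8, 8)
```

The question raised in the text is precisely how a top-down signal, derived from the evolving interpretation of the scene, would reach back into the first stage of such a pipeline and modify its output.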

Although it is well known that bottom-up processing operates alongside top-down processing, it is not known exactly how these processes affect each other. We use the term "exactly" in a computational sense: it is not currently possible to write down a computational model, consisting of fully specified operations such as template-matching or gain control (illustrated in the sketch below), that captures the effect of top-down signals on bottom-up image processing. This limitation is exacerbated by the fact that natural vision cannot be studied by simply swapping Gabor patches for natural scenes: when brought into the laboratory, natural images raise numerous difficulties because they are not under the experimenter's full control, and analysis tools that were appropriate for simple stimuli become inapplicable, irrelevant, or uninterpretable. Our current understanding of top-down visual processing therefore suffers from two serious limitations: 1) we do not have a computational account of how it operates in relation to bottom-up processing; 2) we do not possess a set of tools that enables us to study both bottom-up and top-down processing within the context of natural scenes (i.e. the context where the issue of bottom-up/top-down interaction matters most).
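To make concrete what "fully specified" means here, the sketch below writes out two candidate operations of the kind named above: template matching, implemented as normalized cross-correlation, and divisive gain control carrying a multiplicative top-down term. The formulas, names, and parameter values are illustrative assumptions on our part, not the operations the project will ultimately adopt.

```python
# Two "fully specified" operations written out explicitly. The particular
# formulas, and the way the top-down signal enters (a multiplicative gain),
# are illustrative assumptions, not the model proposed in this project.
import numpy as np


def template_match(patch, template):
    """Template matching as normalized cross-correlation between a local
    image patch and a stored template of the same size."""
    p = (patch - patch.mean()) / (patch.std() + 1e-8)
    t = (template - template.mean()) / (template.std() + 1e-8)
    return float(np.mean(p * t))


def gain_control(response, surround, top_down_gain=1.0, sigma=1.0):
    """Divisive gain control: a local response is normalized by the pooled
    activity of its surround; a top-down signal scales the resulting gain."""
    return top_down_gain * response ** 2 / (sigma ** 2 + np.mean(surround ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    patch = rng.standard_normal((7, 7))
    surround = rng.standard_normal((9, 9))
    edge_evidence = template_match(patch, patch)   # 1.0 when the patch matches its template
    # The same local measurement under two hypothetical top-down settings,
    # e.g. "probably a shadow, de-emphasize" vs "object boundary, emphasize".
    print(gain_control(edge_evidence, surround, top_down_gain=0.5),
          gain_control(edge_evidence, surround, top_down_gain=2.0))
```

A model of the kind we aim for would specify every such operation, together with the route by which the top-down term is computed, at this level of explicitness.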

ANR-16-CE28-0016 has a specific goal: to develop a computational model that describes not only bottom-up image analysis, but also how this process is affected by top-down signals. By this we mean a model specified in enough detail that it can be written in software. Achieving this goal would rectify limitation #1 listed above. To achieve it, we also propose an entirely new set of tools (all unpublished) that will enable an extremely detailed understanding of how image features are extracted from natural scenes by the human visual system, and that will constrain our computational models to the degree of specification described above. These tools will go a long way toward rectifying limitation #2 listed above.

Why does it matter? We cannot treat visual impairments without an understanding of how vision operates under healthy conditions. At the same time, if we are to construct effective therapeutic tools, e.g. an artificial retina or cortical implant, we cannot "understand" healthy vision merely in terms of vague notions such as top-down and bottom-up: we must explain how these processes are to be implemented within the artificial device, in electronic hardware. To do this, we must be able to describe top-down and related processes in the language of circuits and mathematical operations; in other words, we need a computational understanding of these phenomena. This is a necessary step toward translating basic vision research into applied technology for restoring function in the visually impaired. We are still a long way from this level of understanding of higher-level vision, and the goal of this proposal is to push forward in this direction.