The very notion of what it means to "understand" a neural operation is disputed. In our work, understanding takes the form of computational modelling constrained by empirical characterization. To illustrate this statement we draw on our recent work on motion processing (Neri 2014).
Moving objects change spatial position over time, so that their trajectory can be represented as spatiotemporal orientation (Fig. 1A-C). Motion-sensitive neurons in visual cortex display spatiotemporal response characteristics that are similarly oriented (Fig. 1D). Related measurements have been attempted in humans (Anderson & Burr 1989; Anderson et al 1991); however, it had not previously been possible to perform those measurements in the native dimensions of space and time. Instead, the human characteristics were measured in Fourier frequency space and back-projected onto space-time (Burr et al 1986). Although this procedure demonstrated spatiotemporal orientation (Burr & Ross 1986; Braddick 1986), it cannot resolve the dynamics of space-time exhaustively. Using targeted technical developments, we mapped human motion detectors in their native space-time coordinates (Neri 2014) and uncovered a previously unobserved dynamic characteristic: amplitude steadily increases over the first 90 ms, as demonstrated by the red ridge in Fig. 1E (amplitude increases as time progresses along the x axis).
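The equivalence between motion and spatiotemporal orientation can be sketched in a few lines of code (a toy illustration, not code from Neri 2014; grid size and speed are arbitrary assumptions): a dot translating at constant speed traces an oriented streak in a space-time matrix, and the slope of that streak recovers the speed.

```python
import numpy as np

# Toy space-time representation of motion: rows index time, columns
# index space, and a dot moving at constant speed traces an oriented
# streak. Grid sizes and speed are illustrative assumptions.
n_x, n_t = 32, 16
speed = 2                          # pixels per time step (assumed)

xt = np.zeros((n_t, n_x))
for t in range(n_t):
    xt[t, t * speed] = 1.0         # position advances linearly with time

# The orientation (slope) of the streak in the (t, x) plane is the speed.
ts, xs = np.nonzero(xt)
slope = np.polyfit(ts, xs, 1)[0]   # recovered speed, pixels per step
```

The trajectory is thus fully captured by a single oriented structure in space-time, which is what the filters discussed below are built to detect.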
The gold standard for modelling motion processing is the motion energy detector (Fig. 2A, magenta). It carries spatiotemporally oriented characteristics (Fig. 2E), but does not generate the dynamic amplitude modulation exposed by human data (Neri 2014). We consider two ways in which this model may be augmented to reproduce that trend, with the goal of illustrating the difference between fitting a model to data on the one hand (Fig. 2), and building a model that enhances understanding on the other (Fig. 3).
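The core computation of a motion energy detector can be sketched as follows (a simplified toy in the spirit of the standard model, not the specific implementation in Fig. 2; filter frequency, envelope, and grid are illustrative assumptions): a quadrature pair of space-time oriented filters is applied to the stimulus, and their squared outputs are summed, yielding a phase-invariant, direction-selective energy signal.

```python
import numpy as np

# Toy motion-energy computation on a small space-time grid.
# All parameter values below are illustrative assumptions.
x = np.linspace(-1, 1, 32)
t = np.linspace(-1, 1, 32)
T, X = np.meshgrid(t, x, indexing='ij')

def energy(stimulus, v):
    """Quadrature pair of filters oriented for velocity v, squared and summed."""
    phase = 2 * np.pi * 3 * (X - v * T)      # spatiotemporal orientation
    env = np.exp(-(X**2 + T**2) / 0.5)       # localized space-time envelope
    f_even = env * np.cos(phase)
    f_odd = env * np.sin(phase)
    return np.sum(f_even * stimulus) ** 2 + np.sum(f_odd * stimulus) ** 2

# A grating drifting at v = +1 excites the matched detector far more
# strongly than the mirror-image (opposite-direction) detector.
stim = np.cos(2 * np.pi * 3 * (X - T))
e_pref = energy(stim, v=+1)
e_anti = energy(stim, v=-1)
```

Direction selectivity falls out of the oriented filters; note, however, that nothing in this computation produces amplitude that grows over time, which is the feature missing from the standard model.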
It is easy to generate amplitude dynamics via the application of a temporally modulated read-out envelope. When temporal read-out is modified from uniform (Fig. 2B) to decreasing/increasing (Fig. 2C/D), the associated space-time map unsurprisingly presents temporally decreasing/increasing amplitude (Fig. 2F/G). This exercise is not helpful: as implemented in Fig. 2, the change in read-out profile simply reproduces the dynamic effect observed experimentally without genuine explanatory power. The question remains as to what may cause read-out to modulate over time, and why this modulation should disappear in the presence of additional experimental manipulations (Neri 2014). No mechanistic explanation for these effects is offered by the modelling effort in Fig. 2.
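The triviality of this manipulation is easy to demonstrate (toy values throughout, not the actual kernels of Fig. 2): multiplying a flat-amplitude space-time kernel by an increasing temporal envelope reproduces the rising trend by construction, without saying anything about its cause.

```python
import numpy as np

# Toy illustration of a temporally modulated read-out envelope.
n_t, n_x = 10, 16
kernel = np.ones((n_t, n_x))             # flat amplitude across time
envelope = np.linspace(0.5, 1.5, n_t)    # assumed increasing read-out weight
modulated = kernel * envelope[:, None]   # apply envelope along time

# Amplitude at each time step now rises -- purely because we put it there.
amplitude = modulated.max(axis=1)
```

The output restates the data rather than explaining them, which is precisely the limitation of the scheme in Fig. 2.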
A different approach involves modelling schemes that retain physiological plausibility while delivering a more informative account of the results (Neri 2010; Carandini 2012). Unlike the approach in Fig. 2, these models can be validated via further experimentation. More specifically, the dynamic trend can be captured by adding a simple component to the motion energy model. In this additional component, indicated by orange diagrams in Fig. 3A, outputs from the front-end convolution stage are summed, delayed in time (τ), and fed back onto the front-end oriented filters by positively (X) controlling their gain. The associated space-time map (Fig. 3B) captures the trend observed in the human data (Fig. 1E).
Importantly, the model in Fig. 3A can be exploited to make (and test) specific predictions. Because the increasing trend is produced by the gain-control module, any manipulation that reduces the overall input to this module should result in smaller dynamic effects. Lowering contrast provides a relevant example: the simulated space-time map presents reduced amplitude dynamics (Fig. 3C). We tested this prediction experimentally by repeating the measurements with low-contrast stimuli (see Figure 8 in Neri 2014), which substantially reduced the magnitude and reliability of dynamic trend effects, consistent with the properties of the delayed gain-control circuit (Fig. 3A,C).
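The logic of the delayed gain-control loop, and of its contrast prediction, can be sketched with a toy discrete-time simulation (all parameter values are illustrative assumptions, not values fitted to the data in Neri 2014):

```python
import numpy as np

def simulate(contrast, n_steps=12, tau=2, k=0.5):
    """Toy version of the delayed gain-control loop in Fig. 3A."""
    # Front-end convolution output: assumed constant over time and
    # proportional to stimulus contrast.
    drive = contrast * np.ones(n_steps)
    out = np.zeros(n_steps)
    for t in range(n_steps):
        # Summed output, delayed by tau, positively controls the gain
        # of the front-end filters (orange module in Fig. 3A).
        fb = out[t - tau] if t >= tau else 0.0
        out[t] = drive[t] * (1.0 + k * fb)
    return out

high = simulate(contrast=1.0)    # high-contrast stimulus
low = simulate(contrast=0.2)     # low-contrast stimulus

rise_high = high[-1] - high[0]   # amplitude rise over time
rise_low = low[-1] - low[0]
```

Because the feedback term scales with the front-end drive, reducing contrast weakens the feedback and therefore the rise, which is the qualitative prediction tested experimentally in Neri (2014).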
The difference between the two modelling approaches outlined above is clarified by the following consideration: the 'model' in Fig. 2 does not support predictions of the kind just discussed in relation to the gain-control model (Fig. 3C). Rather, it is analogous to curve fitting: given a set of data points displaying an increasing trend, we simply fit a line through those points. But the fit does not explain the trend, nor does it allow us to make useful predictions. Only once we have gained sufficient understanding of the underlying mechanism can we build a useful model and use it to make predictions. This level of understanding is now being evaluated as possibly the most fruitful approach for bridging neural mechanisms to behaviour (Carandini 2012; Hass & Durstewitz 2014).
The class of modelling scheme in Fig. 3A may be termed a small-scale 'circuit' or 'algorithm' for sensory processing (Neri 2010). We have tackled several problems using this approach; Fig. 4 summarizes examples spanning roughly a decade. Panel A shows a variant of the disparity energy model (Ohzawa et al 1990), which successfully predicts changes in disparity tuning of both single neurons (Cumming & Parker 1997) and human observers (Neri et al 1999) when contrast polarity is reversed in one eye. Panel B presents a multiplicative gain-control model that incorporates both simple-cell-like and complex-cell-like operators (Carandini 2006) as linear and nonlinear components of a small circuit (Neri & Heeger 2002), with the latter boosting the gain of the former (after delay τ). Using this circuit it is possible to capture the bulk of both human and electrophysiological data (Neri & Levi 2006). Panel C shows a minimal model involving linear directional filters (tilted ovals) followed by a delayed self-normalization stage (÷). This model captures two important features of directional tuning in human motion processing: 1) tuning is time-invariant over the neuronal timescale of 300 ms, as first demonstrated by our group (Neri & Levi 2008) and later confirmed by others (Busse et al 2008); 2) the amplitude of the directional filter peaks during the first 100 ms, then decreases (next 100 ms), and finally shows a tendency to increase again (final 100 ms). The latter dynamic property is accounted for by delayed self-normalization (Neri & Levi 2008).
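The dynamic signature attributed to delayed self-normalization in panel C can be illustrated with a toy recursion (delay, gain, and drive are arbitrary assumptions): a constant drive divided by a delayed copy of the output produces an early peak, a dip, and a partial rebound.

```python
import numpy as np

def self_normalize(n_steps=12, tau=3, k=2.0, drive=1.0):
    """Toy delayed divisive self-normalization (the '÷' stage in Fig. 4C)."""
    r = np.zeros(n_steps)
    for t in range(n_steps):
        past = r[t - tau] if t >= tau else 0.0   # output delayed by tau
        r[t] = drive / (1.0 + k * past)          # divisive feedback
    return r

r = self_normalize()
# Early response is unsuppressed (peak), the delayed division then
# drives it down (dip), and the weakened feedback lets it recover
# partially (rebound) before settling.
```

The non-monotonic time course arises purely from the delay in the divisive loop, with no need for a hand-crafted temporal envelope.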
Of particular significance for settling an important issue in pattern vision is the model in panel D, which implements a coarse-to-fine strategy for image analysis (Marr 1982): the image is initially analyzed at a coarse spatial scale (large red/blue ovals in D), and this analysis is then used to guide subsequent inspection at a finer scale (smaller ovals). Neurons in visual cortex often display response characteristics consistent with this hypothesis (Bredfeldt & Ringach 2002; Mazer et al 2002; Menz & Freeman 2003; Tanabe et al 2011). Although measurements in human observers (Watt 1987; Mareschal et al 2006) had initially failed to demonstrate similar coarse-to-fine dynamics for human pattern vision, we were able to expose a clear coarse-to-fine structure in image processing dynamics (Neri 2011) using tools from nonlinear system analysis (Neri 2010). The coarse-to-fine transition occurs over ~100 ms and maps onto a transition from a nonlinear to a linear processing mode, as demonstrated by our earlier results (Neri & Heeger 2002; Neri 2009) and subsequently replicated by others (Nagai et al 2007; Megna et al 2011). Interestingly, single-unit measurements in monkey cortex (Tanabe et al 2011) published shortly after our article exposed a coarse/fine dichotomy between nonlinear/linear descriptors that is remarkably similar to the one exposed by our psychophysical experiments. Furthermore, an electrophysiological study in cat area 17 from the same year (Fournier et al 2011) reported a temporal asynchrony between the peaks of nonlinear and linear descriptors (the former preceding the latter) that is consistent with our earlier psychophysical measurements.
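A coarse-to-fine strategy of this kind can be sketched as follows (the one-dimensional signal, noise level, scales, and window size are all illustrative assumptions): a heavily smoothed analysis first localizes a feature approximately, and a fine-scale analysis is then restricted to that neighbourhood.

```python
import numpy as np

# Toy 1-D coarse-to-fine localization: a single feature embedded in noise.
rng = np.random.default_rng(0)
n = 256
signal = rng.normal(0.0, 0.1, n)
true_pos = 140                       # hypothetical feature location
signal[true_pos] += 3.0

def smooth(x, sigma):
    """Gaussian smoothing at spatial scale sigma."""
    support = np.arange(-3 * sigma, 3 * sigma + 1)
    k = np.exp(-0.5 * (support / sigma) ** 2)
    return np.convolve(x, k / k.sum(), mode='same')

# Stage 1: coarse scale gives a rough position estimate.
coarse = smooth(signal, sigma=8)
guess = int(np.argmax(coarse))

# Stage 2: fine scale, restricted to the coarse neighbourhood.
lo, hi = max(0, guess - 24), min(n, guess + 24)
fine = smooth(signal, sigma=1)
refined = lo + int(np.argmax(fine[lo:hi]))
```

The coarse stage trades precision for robustness to noise; the fine stage recovers precision once the search has been narrowed, mirroring the nonlinear-to-linear transition described above.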
Possibly the most important reason why models like those in Fig. 4 are effective is that the addition of only one or two nonlinear elements, such as cross-correlation (Fig. 4A) or divisive normalization (Fig. 4D), confers a great degree of flexibility on their behaviour, often beyond what may be expected from their relatively simple structure. To illustrate this point, Fig. 5B plots a large-scale dataset of empirical measurements from human visual operators (Neri 2015) as a compact aggregate (x axis) versus corresponding predictions (y axis) from three small-scale circuits (Fig. 5A). Perfect agreement between empirical measurements and model simulations is indicated by the diagonal solid line running from bottom-left to top-right of B. As a starting point, we consider the most common model in visual neuroscience: the template matcher (Hauske et al 1976; Brunelli & Poggio 1997) (blue in Fig. 5A). This model is the minimal circuit capable of supporting (at least in principle) visual detection/discrimination (Green & Swets 1966), so it is a natural baseline for any computational attempt at simulating human vision. Although its predictions span the diagonal line (see light-blue contour in Fig. 5B), there are several departures (see the numerous blue data points falling above the diagonal line).
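As a reference point, the template matcher reduces to an inner product between stimulus and template (a minimal sketch; the sinusoidal template and noise level are arbitrary assumptions):

```python
import numpy as np

# Toy template matcher: the decision variable is the inner product
# between the incoming stimulus and a stored template of the target.
rng = np.random.default_rng(1)
template = np.sin(np.linspace(0, 2 * np.pi, 64))   # hypothetical target

def decide(stimulus, template):
    """Matched-filter read-out: project the stimulus onto the template."""
    return float(np.dot(stimulus, template))

signal_trial = template + rng.normal(0, 0.5, 64)   # target + noise
noise_trial = rng.normal(0, 0.5, 64)               # noise alone

d_signal = decide(signal_trial, template)
d_noise = decide(noise_trial, template)
```

Thresholding the decision variable supports detection, but the computation is strictly linear up to the decision stage, which is where its predictions begin to fail against the human data.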
Several such failures can be remedied by the addition of a gain-control element (green diagram in Fig. 5A). It implements an operation known as divisive normalization (Heeger et al 1996), currently one of the most common modelling tools in the vision literature (Beaudot & Mullen 2006) and generally acknowledged as a canonical computation (Carandini & Heeger 2011). The associated predictions cluster more tightly around the equality line than those from template matching (green versus blue data points in Fig. 5B). Notice that this result is not restricted to one particular experiment: Fig. 5B spans several experimental conditions employing different stimuli and tasks (Neri 2015). The introduction of the gain-control module brings all of those into line with the human estimates, demonstrating the potential and flexibility of this small-scale circuit (Reynolds & Heeger 2009).
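In its canonical form, divisive normalization divides each unit's drive by the pooled drive of the population; a minimal sketch (exponent and semi-saturation constant are assumed values) shows its signature compressive behaviour:

```python
import numpy as np

def normalize(drive, sigma=1.0, n=2.0):
    """Canonical divisive normalization (Carandini & Heeger 2011 form).

    Each unit's exponentiated drive is divided by a semi-saturation
    constant plus the summed drive of the normalization pool.
    """
    d = np.asarray(drive, dtype=float) ** n
    return d / (sigma ** n + d.sum())

weak = normalize([1.0, 0.5, 0.25])    # low-contrast input pattern
strong = normalize([4.0, 2.0, 1.0])   # same pattern at 4x contrast
```

Because the pool grows with the input, absolute responses saturate while response ratios across units are preserved, and it is this particular kind of flexibility, rather than added complexity per se, that brings the predictions into register with the human data.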
Critically, it is not the case that any modification of the template-matcher would produce the above outcome. In other words, although the addition of computing elements in and of itself is expected to add some degree of flexibility, it must be the right kind of flexibility to capture the human data (Reynolds & Heeger 2009; Carandini & Heeger 2011). We demonstrate this notion using another popular wiring scheme from the vision literature, the push-pull opponent-energy scheme (Neri 2011) indicated by the red diagram in Fig. 5A. Predictions from this augmented circuit are worse, not better, than those associated with the template matcher (compare red versus blue data points in Fig. 5B). Divisive normalization (green circuit in Fig. 5A) therefore confers not only enhanced flexibility, but more importantly the right kind of flexibility that appears to operate at the level of human visual pattern analyzers.
Relevant publications:
• Neri P (2015) The elementary operations of human vision are not reducible to template matching. PLoS Computational Biology 11(11): e1004499
• Neri P (2014) Dynamic engagement of human motion detectors across space-time coordinates. Journal of Neuroscience 34: 8423-8461
• Neri P (2011) Coarse to fine dynamics of monocular and binocular processing in human pattern vision. PNAS 108: 10726-10731
• Neri P (2010) Stochastic characterization of small-scale algorithms for human sensory processing. Chaos 20: 045118
• Neri P, Levi DM (2008) Temporal dynamics of directional selectivity in human vision. Journal of Vision 8(1):22, 1-11
• Neri P, Heeger DJ (2002) Spatiotemporal mechanisms for detecting and identifying image features in human vision. Nature Neuroscience 5: 812-816
• Neri P, Parker AJ, Blakemore C (1999) Probing the human stereoscopic system with reverse correlation. Nature 401: 695-698