Machine learning

We have developed HELIOS (Human ELementary Input Output Specification), a repository of psychophysical data that is structured and formatted with machine learning applications in mind.

Feature extraction from natural scenes: humans versus networks

When human observers are challenged to reconstruct noisy local elements embedded within natural scenes, their discrimination performance correlates with the availability of object-specific information (see map in Fig. 1B and green data point in Fig. 1D), not with bottom-up image-contrast information (see map in Fig. 1C and black data point in Fig. 1D). When we tested a selection of computer vision models for constructing object-anchored maps, we found that only deep networks produce measurable correlations with the human data (red data points in Fig. 1D). These results (published in Neri 2017) have prompted further research into the application of neural networks to biological vision (see below).
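
The logic of this comparison can be sketched in a few lines: given one map value per probed image region, each map (top-down and bottom-up) is correlated with human discrimination across regions. The sketch below uses synthetic stand-in data; all names and values are illustrative and are not taken from the actual study.

```python
# Illustrative sketch: correlate per-region human discrimination with
# top-down (object-anchored) versus bottom-up (contrast) map values.
# All data below are synthetic placeholders.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

human_dprime  = rng.normal(1.0, 0.3, size=200)               # human sensitivity per region
top_down_map  = human_dprime + rng.normal(0, 0.2, size=200)  # object-specific information
bottom_up_map = rng.normal(0.5, 0.3, size=200)               # local image contrast

r_td, p_td = pearsonr(human_dprime, top_down_map)
r_bu, p_bu = pearsonr(human_dprime, bottom_up_map)
print(f"top-down map:  r = {r_td:.2f} (p = {p_td:.1e})")
print(f"bottom-up map: r = {r_bu:.2f} (p = {p_bu:.1e})")
```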

Figure 1 Our lab introduced a novel technique for labelling local image regions as driven by top-down (B) or bottom-up (C) information, and demonstrated that human discrimination aligns with the top-down map (D). Deep networks produce maps that capture human behaviour (red points in D); other computer vision models fail (yellow/blue points in D). Please refer to Neri (2017) for details.

In recent (unpublished) work, we have derived human 2D perceptual filters for detecting a target bar embedded within natural scenes. The last column in Fig. 2 shows aggregate perceptual filters from 9 human observers (~130K trials), computed separately from noise samples for which the target bar was either aligned (rows 1 and 3) or misaligned (rows 2 and 4) with respect to the local orientation structure defined by the scene, and either embedded within artificial noise (top two rows) or 'noise' associated with the scene itself (bottom two rows). Our interest in examining filters separately for aligned and misaligned configurations stems from the conceptual significance of this distinction: because it is defined by the structure of the scene, it holds the potential to provide insights into how scene structure is analyzed by the human visual system.
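
At its core, reverse correlation averages the noise fields conditioned on the observer's response, separately for each stimulus configuration. The sketch below illustrates this computation on a simulated observer; dimensions, trial counts and the decision rule are placeholders rather than our experimental parameters.

```python
# Minimal reverse-correlation sketch (not the lab's actual pipeline).
import numpy as np

rng = np.random.default_rng(1)
n, H, W = 10_000, 16, 16
noise   = rng.normal(size=(n, H, W))  # noise fields in bar-aligned coordinates
aligned = rng.random(n) < 0.5         # bar aligned with local scene orientation?

# Simulated observer: matches a vertical template, plus internal noise
template = np.zeros((H, W))
template[:, W // 2] = 1.0
resp = (noise * template).sum(axis=(1, 2)) + rng.normal(0, 2, size=n) > 0

def classification_image(noise, resp):
    # First-order perceptual filter: mean noise on 'yes' minus 'no' trials
    return noise[resp].mean(axis=0) - noise[~resp].mean(axis=0)

filt_aligned    = classification_image(noise[aligned],  resp[aligned])
filt_misaligned = classification_image(noise[~aligned], resp[~aligned])
```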

Figure 2 Perceptual filters returned by psychophysical reverse correlation applied to artificial (Gaussian white) noise (top two rows) or 'natural' noise (bottom two rows: local elements of the scene that interfere with the probe). Filters are computed separately for trials on which the target bar is aligned with the local orientation structure of the scene (first and third rows), as opposed to trials on which it is misaligned (second and fourth rows). Filters are plotted after rotating noise samples so that vertical corresponds to the orientation of the target bar. The last column plots human data; other columns plot filters returned by deep networks trained/tested on the same detection task.

We can compare these descriptors to those returned by a range of models challenged with the same stimuli/task used with human observers. Classic networks for modelling human vision (e.g. Alexnet) produce entirely different patterns (fifth column in Fig. 2). Squeezenet (second column) appears to resemble the human data. VGG (sixth column) produces descriptors that are similar in all 4 cases (vertical energy), as expected from a template-matching strategy. In general, human behaviour on the detailed scale afforded by our measurements is not easily captured by deep network models. We are currently exploring other modelling strategies to identify a suitable architecture for explaining the human data, alongside additional experimental approaches to further constrain candidate models.
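
One simple way to quantify the resemblance between model-derived and human filters is a pixel-wise correlation, yielding one similarity score per condition in Fig. 2. The metric below is illustrative rather than the comparison actually used in our analyses, and the variable names in the usage comment are hypothetical.

```python
# Illustrative similarity metric between two perceptual filters.
import numpy as np

def filter_similarity(f_model, f_human):
    # Pearson correlation computed over pixels
    a = (f_model - f_model.mean()) / f_model.std()
    b = (f_human - f_human.mean()) / f_human.std()
    return float((a * b).mean())

# e.g. one score per row of Fig. 2 (hypothetical variable names):
# score = filter_similarity(filt_vgg_aligned, filt_human_aligned)
```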

Deep networks do not capture human behaviour

We have also carried out research that demonstrates some fundamental limitations of deep convolutional networks (Neri 2022). Although these networks are capable of excellent performance on certain complex tasks, we asked whether the specific manner in which they perform those tasks mirrors the strategy adopted by humans.

Figure 5 The target bright bar can appear either to the left (A) or to the right (B) of the midline; the task is to report the side on which the bar appeared. This task is made difficult by adding bar-noise to both left and right regions (C).

We designed a task of extreme simplicity that is nevertheless relevant to the fundamental operations of human vision. The detection task merely involves deciding at which of two specified spatial locations a bright bar appeared (Fig. 5A vs Fig. 5B). This task is rendered more difficult by the addition of visual noise (Fig. 5C).
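
A toy version of such a stimulus can be generated in a few lines; the geometry and parameter values below are placeholders, not our actual display settings.

```python
# Toy stimulus generator for the two-location bar-detection task.
import numpy as np

rng = np.random.default_rng(2)

def make_stimulus(side, contrast=1.0, noise_sd=0.5, n_bars=12):
    """Two candidate regions (rows) of luminance bars, with the bright
    target bar added to one side at a fixed position."""
    stim = rng.normal(0, noise_sd, size=(2, n_bars))  # bar noise, both regions
    stim[side, n_bars // 2] += contrast               # bright target bar
    return stim

left_stim, right_stim = make_stimulus(0), make_stimulus(1)
```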

We then characterize the human visual process at three different 'depths' of specification: a zeroth-order level based on classic metrics such as sensitivity d' (green elements in Fig. 6); a first-order level associated with classification images (first-order kernels), meant to reflect the portion of behaviour that is akin to template-matching (blue elements in Fig. 6); a second-order level of characterization (second-order kernels), meant to capture behaviour that is not adequately represented by first-order descriptors (red elements in Fig. 6). Our laboratory has pioneered the derivation of first- and second-order kernels from human behaviour (Neri 2004, Neri 2010). 
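
In schematic form, the three depths correspond to progressively richer statistics of the stimulus noise conditioned on the observer's response. The estimators below are simplified placeholders; see Neri (2004, 2010) for the actual derivations.

```python
# Simplified estimators for the three 'depths' of characterization.
import numpy as np
from scipy.stats import norm

def dprime(hit_rate, fa_rate):
    # Zeroth order: classic signal-detection sensitivity
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

def first_order_kernel(noise, resp):
    # First order: template-like component (classification image)
    return noise[resp].mean(axis=0) - noise[~resp].mean(axis=0)

def second_order_kernel(noise, resp):
    # Second order: behaviour not captured by a template, here estimated
    # as the difference of noise covariance matrices across response classes
    flat = noise.reshape(len(noise), -1)
    return np.cov(flat[resp], rowvar=False) - np.cov(flat[~resp], rowvar=False)
```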

Fig. 6 plots the outcome of applying these procedures to human data from the detection task (Fig. 6A-D), alongside equivalent results produced by a classic neural network for digit/scene classification (Lenet-5; Fig. 6E-H). Each panel demonstrates how a given descriptor (whether zeroth, first or second order) varies with two basic stimulus parameters that were manipulated during the experiments: contrast and signal-to-noise ratio (SNR).

Figure 6 Human versus network behaviour at different depths of characterization. Human performance in the detection task was kept at threshold throughout (A). Lenet-5 easily learnt this task during training (E). Human sensitivity (indicated by radius of green elements in B) varies substantially with stimulus contrast/SNR; this variation is well captured by Lenet-5 (F). C shows human first-order kernels (spectral centroid is indicated by radius of blue elements). Corresponding results from Lenet-5 (G) do not match the human pattern. D,H show second-order kernels (bright/dark for positive/negative values); radius of red circles reflects RMS ratio between second-order and first-order kernels. This descriptor again demonstrates poor alignment between human (D) and network (H).

Lenet-5 easily learns the detection task during training (Fig. 6E). If characterization is restricted to sensitivity alone (green elements), human and network behaviour appear in excellent alignment across the entire matrix of pairwise contrast-SNR configurations (compare Fig. 6B with Fig. 6F). However, the first-order characterization (black traces and blue elements in Fig. 6) looks substantially different between humans (Fig. 6C) and model (Fig. 6G), particularly in relation to the manner in which it varies with contrast/SNR. This mismatch becomes even more evident for the second-order characterization (Fig. 6D and Fig. 6H). Similar results apply to other detection tasks and popular network architectures, and the discrepancies are not ameliorated by different training regimes (e.g. transfer-learning). We are currently fine-tuning network models towards more accurate approximations of these biological processes.
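
The same psychophysical pipeline can be run on any model that maps a stimulus to a binary decision, which is what makes network and human descriptors directly comparable across the contrast/SNR matrix. Below is a schematic sweep with a stand-in decision rule; the real analysis uses trained networks together with the kernel estimators sketched earlier.

```python
# Schematic contrast x SNR sweep with a stand-in decision rule.
import numpy as np

rng = np.random.default_rng(3)

def make_stimulus(side, contrast, snr, n_bars=12):
    stim = rng.normal(0, contrast / snr, size=(2, n_bars))  # bar noise, both sides
    stim[side, n_bars // 2] += contrast                     # target bar
    return stim

def network_decision(stim):
    # Stand-in for a trained network: report the side with the larger peak
    return int(stim[1].max() > stim[0].max())

for contrast in (0.5, 1.0, 2.0):
    for snr in (1.0, 2.0, 4.0):
        sides = rng.integers(0, 2, size=2000)
        pc = np.mean([network_decision(make_stimulus(s, contrast, snr)) == s
                      for s in sides])
        print(f"contrast={contrast}, SNR={snr}: proportion correct = {pc:.2f}")
```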

Deep-learning framework for human perception of abstract art

We have also applied machine learning to a different class of 'natural' images: paintings (Lelièvre & Neri 2021). Although paintings obviously differ from environmental scenes, they are products of human perception and must therefore be informative about the perceptual operations of the human brain. Artistic graphical composition can be roughly defined as the structural organization of pictorial elements on a canvas. Art history offers some basic rules and heuristics for understanding the qualitative characteristics of this phenomenon; however, it does not codify processes such as segmentation/interaction of pictorial elements to the degree of specification required by quantitative analysis.

Figure 3 Schematic architecture of the multi-level orientation classification model employed in this study. Each of 5 convolutional blocks is associated with a classifier (indicated by classifier-n with n=1 to 5). The output dimensionality of each classifier is indicated by (x,x,4), where x is the number of samples across each spatial dimension (see density of circle array within insets overlaying local filters onto painting), and 4 is the number of orientation labels {up,90,180,270}. The 4 values within [ ] show one example of the categorical distribution generated by the network for Komposition VIII by Wassily Kandinsky (1923). In the legend, k/s stand for kernel/stride size.

Our model is structured around a general architecture not originally devised for application to art material (Fig. 3). We exploit a large database of paintings to train the model. Although it was not hand-engineered for our specific problem of interest, the model outperforms previous applications and extends to a greater variety of painting styles, mirroring human performance (see below).
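
For concreteness, the sketch below shows one way such a multi-level design could be assembled in PyTorch: each convolutional block feeds an auxiliary classifier that outputs a spatial grid of 4-way orientation distributions, as in Fig. 3. All channel counts, kernel/stride sizes and the input resolution here are illustrative, not those of the published model.

```python
# Illustrative multi-level orientation classifier (placeholder dimensions).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, c_in, c_out, k=3, s=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, stride=s, padding=k // 2),
            nn.ReLU())
        # classifier-n: per-location logits over the 4 orientation labels
        self.classifier = nn.Conv2d(c_out, 4, kernel_size=1)

    def forward(self, x):
        x = self.conv(x)
        return x, self.classifier(x)

class MultiLevelOrientationNet(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [3, 32, 64, 128, 256, 256]
        self.blocks = nn.ModuleList(
            Block(chans[i], chans[i + 1]) for i in range(5))

    def forward(self, x):
        logits = []
        for block in self.blocks:
            x, level_logits = block(x)
            logits.append(level_logits)  # one (batch, 4, x, x) map per level
        return logits

net = MultiLevelOrientationNet()
for n, l in enumerate(net(torch.randn(1, 3, 224, 224)), start=1):
    print(f"classifier-{n}: {tuple(l.shape)}")  # spatial grid shrinks with depth
```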

Figure 4 Human versus model performance for whole paintings and fragments. In A, model performance from classifier-5 is plotted alongside human performance on whole paintings (dark versus light bars, respectively), grouped by style. In B-C, model performance from different classifiers (1 to 5) is plotted alongside human performance on image fragments, separately for abstract (B) and figurative styles (C).

Human observers were asked to perform the orientation-judgement task on whole paintings as well as on fragments of different sizes. Human data for different fragment sizes (light-gray bars in Fig. 4B-C) are plotted alongside network data from classifiers 1-5 at the corresponding receptive-field sizes (dark-gray bars). The model provides a satisfactory account of the human process, although we did identify some discrepancies between human and simulated results (not shown) that serve as useful starting points for future work.