Machine learning
We have developed a repository of psychophysical data that is structured and formatted with machine learning applications in mind, called HELIOS (Human ELementary Input Output Specification).
Feature extraction from natural scenes: humans vs. networks
When challenged with image reconstruction of noisy local elements embedded within natural scenes, human discrimination correlates with the availability of object-specific information (see map in Fig. 1B and green data point in Fig. 1D), not bottom-up image-contrast information (see map in Fig. 1C and black data point in Fig. 1D). When testing a selection of computer vision models for constructing object-anchored maps, we found that only deep networks produce measurable correlation with the human data (red data points in Fig. 1D). These results (published in Neri 2017) have prompted further research into the application of neural networks to biological vision (see below).
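At its core, the comparison summarized by the data points in Fig. 1D is a pixelwise correlation between a model-derived information map and a map of human discrimination performance. A minimal sketch (the function name and the plain Pearson formulation are our own; the published analysis may normalize or weight the maps differently):

```python
import numpy as np

def map_correlation(model_map, human_map):
    """Pixelwise Pearson correlation between a model-derived
    information map and a map of human discrimination performance
    (illustrative sketch, not the published estimator)."""
    a, b = np.ravel(model_map), np.ravel(human_map)
    return np.corrcoef(a, b)[0, 1]
```

A model whose map is a monotonic linear transform of the human map would score 1; an unrelated map would score near 0.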
In recent (unpublished) work, we have derived human 2D perceptual filters for detecting a target bar embedded within natural scenes. The last column in Fig. 2 shows aggregate perceptual filters from 9 human observers (~130K trials), computed separately from noise samples for which the target bar was either aligned (rows 1 and 3) or misaligned (rows 2 and 4) with respect to the local orientation structure defined by the scene, and either embedded within artificial noise (top two rows) or within the 'noise' associated with the scene itself (bottom two rows). Our interest in examining filters separately for aligned and misaligned configurations stems from the conceptual significance of this distinction: because it is defined by the structure of the scene itself, it holds the potential to reveal how scene structure is analyzed by the human visual system.
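The derivation of such perceptual filters rests on reverse correlation: averaging the noise fields accompanying each response class and taking the difference. A minimal sketch with a simulated template-matching observer (dimensions, trial counts and the simulated decision rule are illustrative, not those of the actual experiment):

```python
import numpy as np

def classification_image(noise_fields, responses):
    """Reverse correlation: mean noise accompanying 'target present'
    responses minus mean noise accompanying 'target absent' responses."""
    noise_fields = np.asarray(noise_fields, dtype=float)
    responses = np.asarray(responses, dtype=bool)
    return (noise_fields[responses].mean(axis=0)
            - noise_fields[~responses].mean(axis=0))

# Simulated observer: matches a vertical-bar template against each noise field
rng = np.random.default_rng(0)
H, W, n_trials = 16, 16, 20000
template = np.zeros((H, W))
template[:, W // 2] = 1.0
noise = rng.normal(size=(n_trials, H, W))
responses = (noise * template).sum(axis=(1, 2)) > 0

ci = classification_image(noise, responses)
# ci recovers the simulated template: a bright vertical bar at column W//2
```

With a real observer the responses come from button presses rather than a known template, and the recovered filter is precisely what Fig. 2 visualizes.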
We can compare these descriptors to those returned by a range of models challenged with the same stimuli/task used with human observers. Classic networks for modelling human vision (e.g. AlexNet) produce entirely different patterns (fifth column in Fig. 2). SqueezeNet (second column) appears to resemble the human data. VGG (sixth column) produces descriptors that are similar in all four cases (vertical energy), as expected from a template-matching strategy. In general, human behaviour at the detailed scale afforded by our measurements is not easily captured by deep network models. We are currently exploring other modelling strategies to identify an architecture suitable for explaining the human data, alongside additional experimental approaches to further constrain candidate models.
Deep networks do not capture human behaviour
We have also carried out research that demonstrates some fundamental limitations of deep convolutional networks (Neri 2022). Although these networks are capable of excellent performance in certain complex tasks, we ask whether the specific manner in which they perform those tasks mirrors the strategy adopted by humans.
We designed a minimal task of extreme simplicity that is nevertheless relevant to the fundamental operations of human vision: the observer merely decides whether a bright bar appears at one of two specified spatial locations (Fig. 5A vs Fig. 5B). The task is made more difficult by the addition of visual noise (Fig. 5C).
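The stimulus construction can be sketched as follows (a toy version: bar geometry, candidate locations and the exact contrast/SNR parameterization are illustrative assumptions, not the published specification):

```python
import numpy as np

def make_stimulus(location, contrast=1.0, snr=2.0, size=16, rng=None):
    """Toy two-location bar-detection stimulus: a bright vertical bar
    at one of two candidate positions, plus Gaussian luminance noise
    scaled so that bar amplitude / noise SD equals `snr`."""
    if rng is None:
        rng = np.random.default_rng()
    noise_sd = contrast / snr           # lower SNR -> noisier image
    img = rng.normal(0.0, noise_sd, size=(size, size))
    col = size // 4 if location == 0 else 3 * size // 4
    img[:, col] += contrast             # the bar brightens one column
    return img
```

Sweeping `contrast` and `snr` over a grid of pairwise configurations yields the stimulus matrix over which the descriptors below are measured.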
We then characterize the human visual process at three different 'depths' of specification: a zeroth-order level based on classic metrics such as sensitivity d' (green elements in Fig. 6); a first-order level associated with classification images (first-order kernels), meant to reflect the portion of behaviour that is akin to template-matching (blue elements in Fig. 6); a second-order level of characterization (second-order kernels), meant to capture behaviour that is not adequately represented by first-order descriptors (red elements in Fig. 6). Our laboratory has pioneered the derivation of first- and second-order kernels from human behaviour (Neri 2004, Neri 2010).
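The three levels can be made concrete with simplified estimators (a sketch in the spirit of, but not identical to, the published derivations in Neri 2004, 2010; the second-order kernel is reduced here to its diagonal):

```python
import numpy as np
from statistics import NormalDist

z = NormalDist().inv_cdf  # inverse cumulative normal

def dprime(hit_rate, fa_rate):
    """Zeroth-order descriptor: sensitivity d' = z(hits) - z(false alarms)."""
    return z(hit_rate) - z(fa_rate)

def kernels(noise, responses):
    """First-order kernel: difference of mean noise between response
    classes (the template-like component of behaviour).
    Second-order kernel (diagonal only here): difference of noise
    variance between response classes, capturing behaviour that a
    template cannot express."""
    yes, no = noise[responses], noise[~responses]
    return yes.mean(0) - no.mean(0), yes.var(0) - no.var(0)
```

Because all three descriptors are computed from the same trial-by-trial record (stimuli plus binary responses), they can be derived identically for a human observer and for a network, which is what makes the comparison in Fig. 6 possible.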
Fig. 6 plots the outcome of applying these procedures to human data from the detection task (Fig. 6A-D), alongside equivalent results produced by a classic neural network for digit/scene classification (LeNet-5; Fig. 6E-H). Each panel demonstrates how a given descriptor (whether zeroth, first or second order) varies with two basic stimulus parameters that were manipulated during the experiments: contrast and signal-to-noise ratio (SNR).
LeNet-5 easily learns the detection task during training (Fig. 6E). If one restricts the characterization to sensitivity alone (green elements), human and network behaviour appear in excellent alignment across the entire matrix of pairwise contrast-SNR configurations (compare Fig. 6B with Fig. 6F). However, the first-order characterization (black traces and blue elements in Fig. 6) looks substantially different between humans (Fig. 6C) and model (Fig. 6G), particularly with respect to the manner in which it varies with contrast/SNR. The mismatch becomes even more evident for the second-order characterization (Fig. 6D and Fig. 6H). Similar results apply to other detection tasks and popular network architectures, and the discrepancies are not ameliorated by different training regimes (e.g. transfer learning). We are currently fine-tuning network models towards more accurate approximations of these biological processes.
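The logic of this dissociation (matched sensitivity, mismatched kernels) can be illustrated with two simulated observers that respond at the same rate to identical noise yet rely on different computations (a toy demonstration, not a reproduction of the published analysis):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_pix, idx = 40000, 8, 2
noise = rng.normal(size=(n_trials, n_pix))

# Two simulated observers monitoring the same pixel on noise-only trials:
template = noise[:, idx] > 0            # linear (template-matching) rule
energy = noise[:, idx] ** 2 > 0.455     # nonlinear (energy) rule; this
                                        # threshold makes both observers
                                        # respond 'present' on ~50% of trials

def first_second(noise, resp):
    """Difference-of-means (first order) and difference-of-variances
    (second order) between the two response classes."""
    k1 = noise[resp].mean(0) - noise[~resp].mean(0)
    k2 = noise[resp].var(0) - noise[~resp].var(0)
    return k1, k2

k1_t, k2_t = first_second(noise, template)
k1_e, k2_e = first_second(noise, energy)
# Response rates match, yet the template observer leaves a first-order
# signature (k1_t) while the energy observer leaves a second-order one (k2_e).
```

Coarse metrics cannot tell the two observers apart; the kernels can, which is why agreement at the level of d' alone (Fig. 6B vs 6F) is weak evidence of shared strategy.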
Deep-learning framework for human perception of abstract art
We have also applied machine learning to a different class of 'natural' images: paintings (Lelièvre & Neri 2021). Although paintings obviously differ from environmental scenes, they are produced by and for human observers, and should therefore be informative about the perceptual operations of the human brain. Artistic graphical composition can be roughly defined as the structural organization of pictorial elements on a canvas. Art history offers some basic rules and heuristics for understanding the qualitative characteristics of this phenomenon; however, it does not codify processes such as the segmentation of, and interaction between, pictorial elements to the degree of specification required by quantitative analysis.
Our model is built around a general-purpose architecture not originally devised for application to art material (Fig. 3), which we train on a large database of paintings. Despite not being hand-engineered for our specific problem of interest, the model outperforms previous applications and extends to a greater variety of painting styles, mirroring human performance (see below).
Human observers were asked to perform the orientation-judgement task on whole paintings as well as on fragments of different sizes. Human data for different fragment sizes, indicated by light-gray bars, are plotted in Fig. 4B-C alongside network data for the corresponding receptive-field sizes of classifiers 1-5, indicated by dark-gray bars. The model provides a satisfactory account of the human process, although we did identify some discrepancies between human and simulated results (not shown) that serve as useful starting points for future work.
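The fragment manipulation can be sketched as follows (the non-overlapping grid sampling and the four-way 90-degree rotation labelling are our own illustrative assumptions about the protocol, not its published specification):

```python
import numpy as np

def fragments(image, size):
    """Cut non-overlapping square fragments of a given side length
    from an image, mimicking the fragment-size conditions of the task."""
    h, w = image.shape[:2]
    return np.stack([image[r:r + size, c:c + size]
                     for r in range(0, h - size + 1, size)
                     for c in range(0, w - size + 1, size)])

def rotated_example(frag, rng):
    """Rotate a fragment by a random multiple of 90 degrees; the
    rotation index serves as the orientation label to be judged."""
    k = int(rng.integers(4))
    return np.rot90(frag, k), k
```

Presenting smaller fragments removes global compositional cues, so performance as a function of fragment size (Fig. 4B-C) indexes the spatial scale at which orientation-relevant structure is carried.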