How much confidence can we place in our classification accuracies? If our system judges a painting’s author to be Rembrandt with a particular confidence level, what does that mean? Certainly this confidence level must be discounted by the inherent accuracy of the system itself: a classification probability of 100% reported by a system that’s 85% reliable means the true confidence level is 85%. Moreover, the inherent accuracy will differ for each artist studied. How reliably can we estimate this inherent accuracy?
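To make the arithmetic concrete, the sketch below (Python, purely illustrative) discounts a reported classification probability by the system’s measured accuracy; it is a simplification of the idea rather than a full calibration procedure.

```python
# Illustrative only: discount a reported probability by the system's
# measured out-of-sample accuracy. Real calibration can be more involved.
def discounted_confidence(reported_prob: float, system_accuracy: float) -> float:
    """A 100% reported probability from a system that is right 85% of the
    time should be trusted no more than 85%."""
    return reported_prob * system_accuracy

print(discounted_confidence(1.00, 0.85))  # 0.85
print(discounted_confidence(0.92, 0.85))  # 0.782
```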
The easy answer – just run some test images and tally up the errors – can mislead us into overestimating our accuracy. The problem of “algorithmic shortcutting” plagues virtually all efforts to analyze and classify images using artificial intelligence (AI). A recent article in Nature described how “deep learning” models like ours can make seemingly reasonable predictions based on misleading visual cues. The researchers tested whether a convolutional neural network (CNN) could be trained to predict, from a knee X-ray, whether the patient eats refried beans or drinks beer. The predictions were surprisingly accurate despite the absence of any real relationship between our knees and our fondness for beans and beer. What’s happening, the authors explain, is that the CNN is “exploiting unintended or simpler patterns in the data rather than learning the more complex, underlying relationships it was intended to learn. … CNNs automatically detect features, features we know we wanted as well as the ones we didn’t know we needed, but it also means we get the features we never wanted and shouldn’t have.”
The risks of algorithmic shortcutting are even greater for art. In a medical image we generally have some idea of what features the CNN should be analyzing, although our understanding may be incomplete. In judging artwork, by contrast, we often have only a general idea of the intrinsic visual features that distinguish an artist’s work from that of close imitators or forgers. And the CNN can’t tell us what features it relies on; as we have said, we can tell where a CNN is looking but not what it sees. Moreover, whereas the supply of medical images for training is essentially unlimited, only a few Leonardos and Vermeers grace the planet, and even a prolific artist’s body of work is small relative to a typical machine-learning training set.
It’s possible, therefore, that a CNN trained to distinguish the work of an artist from imitators will provide superficially accurate results even though it’s relying on the wrong visual features. How can we guard against this?
One important step is to test on images completely removed from, and ideally more challenging than, the images used to train the system. As shown in the figure below, taken from an article we published on our work with Raphael’s drawings, the “in-sample” accuracy is quite good – too good – across different tile sizes. (We decompose artwork images into tile regions small enough for our CNN to process, then aggregate the tile-level predictions into an overall classification.) Out-of-sample testing with more challenging images reveals one tile size to be superior, and indeed, we find this to be true across the artists we’ve studied. The superiority of a particular tile size may mean that we’re focusing the CNN on the most salient artist features, and may also mean we’re elevating the predictive features over the spurious shortcut ones.
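The tile-and-aggregate idea can be sketched in a few lines of Python. The `model` object and its `predict_proba` method are hypothetical stand-ins for a trained CNN, and the mean is only one possible aggregation rule:

```python
import numpy as np

def tile_image(image: np.ndarray, tile_size: int):
    """Split an H x W x 3 image array into non-overlapping square tiles."""
    h, w = image.shape[:2]
    for y in range(0, h - tile_size + 1, tile_size):
        for x in range(0, w - tile_size + 1, tile_size):
            yield image[y:y + tile_size, x:x + tile_size]

def classify_painting(image: np.ndarray, model, tile_size: int = 256) -> float:
    """Aggregate tile-level artist probabilities into one painting-level score."""
    probs = [model.predict_proba(tile) for tile in tile_image(image, tile_size)]
    return float(np.mean(probs))  # simple mean; other aggregation rules are possible
```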
Thus, rigorous out-of-sample testing across different tile sizes represents an important tool in the anti-shortcutting toolkit. Out-of-sample testing also gives us the maximum inherent accuracy of our system for the artist under study.
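A tile-size sweep of the kind described above might look like the following sketch, which reuses `classify_painting` from the previous example and assumes a hypothetical `train_cnn` helper; the candidate sizes are illustrative:

```python
def best_tile_size(train_set, holdout_set, candidate_sizes=(128, 256, 512)):
    """Choose the tile size that scores best on held-out, more challenging images."""
    accuracy = {}
    for size in candidate_sizes:
        model = train_cnn(train_set, tile_size=size)  # hypothetical training helper
        correct = sum(
            (classify_painting(img, model, size) >= 0.5) == is_by_artist
            for img, is_by_artist in holdout_set
        )
        accuracy[size] = correct / len(holdout_set)
    return max(accuracy, key=accuracy.get), accuracy
```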
Another tool is dividing the training images into multiple “cross-validation” sets having different mixes of images used for training and in-sample testing (i.e., validation). We might, for example, train each of four independent CNNs on a different cross-validation set. We test new, unknown images on all four CNNs and compare the resulting prediction probabilities. They should be very close to each other. If they aren’t, the predictions can’t be trusted. One reason they may differ is that the overall number of training images is too small to reliably judge the new test image. This deficiency would not be apparent had we used a single CNN trained on all images, producing a single prediction probability.
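The agreement check can be sketched as follows; the models are assumed to have been trained independently on different cross-validation splits, and the 0.10 spread threshold is illustrative rather than a recommendation:

```python
import numpy as np

def ensemble_check(image, models, tile_size=256, max_spread=0.10):
    """Compare the probabilities returned by independently trained CNNs."""
    probs = np.array([classify_painting(image, m, tile_size) for m in models])
    spread = float(probs.max() - probs.min())
    return float(probs.mean()), spread, spread <= max_spread  # mean, spread, trustworthy?
```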
Still another tool is probability mapping. This is different from class activation or “saliency” mapping, which shows us what parts of an image most strongly affect the CNN’s prediction. Such maps can be very useful for understanding the source of a prediction. They may tell us, for example, where a CNN looks to distinguish cats from dogs or Persian from Siamese cats – e.g., is it the ears? This approach is far less useful for artwork, however, since the salient features are spread all over the canvas and the distinctions between the works of the artist and an imitator are subtle – the better the imitation, the more subtle the distinctions will be. The fact that the CNN places greater weight on certain image features does not tell us much about how it makes a prediction.
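For reference, a bare-bones gradient saliency map can be computed as in the sketch below (PyTorch here, chosen only for illustration); it highlights where the network looks, which, as noted, says little about what it sees:

```python
import torch

def saliency_map(model, image_tensor):
    """image_tensor: 1 x 3 x H x W, normalized as the model expects."""
    model.eval()
    image_tensor = image_tensor.clone().requires_grad_(True)
    score = model(image_tensor).max()      # score of the top class
    score.backward()                       # gradients of the score w.r.t. the pixels
    return image_tensor.grad.abs().max(dim=1)[0].squeeze(0)  # H x W saliency
```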
Probability mapping, by contrast, shows us the probability levels the CNN assigns to different examined regions of an image:
Our analysis of the Salvator Mundi (above), controversially attributed to Leonardo da Vinci, highlights regions (in blue) likely painted by someone other than Leonardo, with red and yellow regions associated more strongly with Leonardo’s hand. Like class activation and saliency maps, our probability maps cannot tell us how the CNN judges what is “Leonardesque.” They can, however, reveal region-level probabilities that allow us to evaluate the credibility of the overall prediction. If art-historical evidence of collaboration or rework is available, for example, we can evaluate its consistency with our probability map. The Salvator Mundi underwent extensive restoration before its sale in 2017 for over $450 million, including in regions our CNN strongly attributes to Leonardo. Whether this reveals a flaw in our system or a very faithful restoration is difficult to say, but at least the basis for the prediction is exposed for consideration. Regional probabilities that conflict with art-historical evidence may suggest algorithmic shortcutting.
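In sketch form, a probability map simply lays the tile-level probabilities out spatially, using the same hypothetical `predict_proba` interface as before; the resulting grid can then be rendered as a heatmap over the painting:

```python
import numpy as np

def probability_map(image: np.ndarray, model, tile_size: int = 256) -> np.ndarray:
    """Grid of P(artist) values, one per examined tile region."""
    rows, cols = image.shape[0] // tile_size, image.shape[1] // tile_size
    pmap = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            tile = image[r * tile_size:(r + 1) * tile_size,
                         c * tile_size:(c + 1) * tile_size]
            pmap[r, c] = model.predict_proba(tile)
    return pmap  # low values suggest another hand; high values, the artist's
```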
Tile sifting also enables us to avoid regions of an artwork likely to promote shortcutting. Image regions that correspond to the background or otherwise contain little visual content may be identified and excluded from consideration using image entropy. Such regions are at best largely irrelevant for classification, since the more limited their visual differentiation, the less likely they are to reflect artist-specific features. At worst, however – and precisely for this reason – they offer fertile ground for shortcutting.
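A minimal version of entropy-based sifting is sketched below; the 4-bit threshold is purely illustrative and would need tuning:

```python
import numpy as np

def tile_entropy(tile: np.ndarray) -> float:
    """Shannon entropy (in bits) of a tile's grayscale intensity histogram."""
    gray = tile.mean(axis=2).astype(np.uint8)
    hist, _ = np.histogram(gray, bins=256, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def sift_tiles(tiles, min_entropy=4.0):
    """Discard low-content tiles (flat backgrounds, bare canvas) before classification."""
    return [t for t in tiles if tile_entropy(t) >= min_entropy]
```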
Finally, basic data hygiene is important. Our training images are all high-quality digital photographs rescaled to a consistent resolution – e.g., 25 pixels per centimeter of canvas. That means, for example, that similar brushstrokes will span similar numbers of pixels across paintings. Because CNNs are not scale-invariant, they can’t recognize that small and large images of a horse represent the same animal. Mixing image resolutions – i.e., feature scales – degrades CNN performance because it obscures scale-specific features (such as a brushstroke) that may distinguish an artist’s work patterns. Similarly, drawing training images from low-quality sources such as book illustrations will introduce spurious artifacts that can serve as fodder for shortcutting. Halftone patterns, moiré patterns, color shifts due to processing, ink selections and aging – all of these may affect printed images to varying degrees. Suppose, for example, that a training set of images includes equal numbers of artist works and comparative works. If a larger proportion of the artist’s works are drawn from books, the CNN may learn to associate printing artifacts with authorship by the artist. Algorithmic shortcutting can then lead the CNN to classify a photograph of a genuine work as fake because of the absence of the expected printing artifacts.
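Resolution normalization can be sketched with Pillow as below; the physical canvas width must be known for each painting, and the 25 pixels-per-centimeter figure simply follows the example above:

```python
from PIL import Image

def rescale_to_canvas_resolution(photo_path: str, canvas_width_cm: float,
                                 pixels_per_cm: float = 25.0) -> Image.Image:
    """Rescale a photograph so one centimeter of canvas spans a fixed pixel count."""
    img = Image.open(photo_path)
    target_w = round(canvas_width_cm * pixels_per_cm)
    target_h = round(img.height * target_w / img.width)  # preserve aspect ratio
    return img.resize((target_w, target_h), Image.LANCZOS)
```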
Algorithmic shortcutting may represent a permanent, ineradicable weakness of AI when applied to visual features. It is particularly elusive in the domain of artwork. While doctors can recognize spurious relationships between visual cues and medical conditions based on fixed principles of biology, art experts often disagree about an artist’s stylistic signatures and sometimes revise their own attributions. Not knowing what features a CNN uses to distinguish Rembrandt from Flinck or from a forger leaves us only with the CNN’s established record of accuracy. We must avoid being misled by apparent success. With enough care, we can minimize algorithmic shortcutting; and with enough out-of-sample testing, we can gauge the effectiveness of our efforts. But it is the nature of machine learning to resist certainties. Our prediction probabilities will never be guarantees.