Inside the black box

What is a 'black box'?

Deep learning is often referred to as a 'black box' because, aside from selecting the model and its hyperparameters (like the learning rate or the activation functions), users don't really know or control what happens inside the model. We can tell that it works by measuring the model's accuracy. But, beyond that, it is difficult to 'see' how the model arrives at its decision.

From a researcher's perspective, this can make deep learning seem less appealing. How can we know why the model arrives at particular outputs? Is there something the model has 'learned' that we should learn too, in order to contribute to scientific understanding?

While much work needs to be done to make it easier for us to 'peek' inside the black box for deep learning in biology, what do we currently know about how these methods work in other domains?

So how do we 'look' inside?

In recent years, the computer vision community has developed a number of approaches for understanding the inner workings of the convolutional neural nets (CNNs) used to categorize images by their content. Since similar illustrations have yet to be developed for deep learning in biology, we will use these examples to help us develop an intuition.

Approach 1: What portion of the 'input' contributes most strongly to the 'output'?

Suppose you have a particular image that is successfully classified by its content, and you want to understand which features in the input most strongly impact that decision. How can we see this relationship between the input and the output classification?

One approach is to generate synthetic inputs, for instance by blocking out portions of the image, and measuring whether this changes the classification. Imagine that you took an image that the model classified correctly. Then you dragged a gray box across the image, generating lots of synthetic images in which a portion of the original was blocked out. Then you ran these synthetic images through the model to see if the correct output was still returned. From this, for every pixel in the image, you can assess how important that pixel is for returning the correct classification. Below is a representation of how this approach works on 3 different images.

On the left is the original image, with an example 'blocked out' portion. On the right is a visualization of the 'impact' each portion of the input has on the output. If a pixel is blue, this means that when it is blocked out the model tends to return an incorrect classification, suggesting this pixel is very important to the classification. If a pixel is red, it can be blocked out without significantly changing the final classification result.

In the case of the Pomeranian, the face is the most important portion of the input space. In the case of the car, the wheel area seems to be the most important part. In the case of the Afghan hound on the bottom, the body of the dog is clearly important for arriving at the correct class, but blocking out much of the rest of the image is neutral to the final result, and blocking out the faces of the owners may even help the classifier, by leaving primarily the dog for classification.
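
To make this concrete, here is a minimal sketch of the occlusion idea in Python. The pretrained model, patch size, and stride below are illustrative assumptions, not the exact setup used to produce the figures above, and the input image is assumed to already be preprocessed for the model.

```python
# Occlusion sensitivity sketch.
# Assumptions: a pretrained torchvision classifier and an input tensor
# `image` of shape [3, 224, 224], already normalized for that model.
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

def occlusion_map(image, target_class, patch=32, stride=16, fill=0.5):
    """Slide a gray patch across the image and record the probability of the
    target class at each position; low values mark the most important regions."""
    _, height, width = image.shape
    rows = (height - patch) // stride + 1
    cols = (width - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    with torch.no_grad():
        for i, y in enumerate(range(0, height - patch + 1, stride)):
            for j, x in enumerate(range(0, width - patch + 1, stride)):
                occluded = image.clone()
                occluded[:, y:y + patch, x:x + patch] = fill   # the gray box
                probs = model(occluded.unsqueeze(0)).softmax(dim=1)
                heatmap[i, j] = probs[0, target_class]
    return heatmap
```

Positions where the probability collapses when occluded correspond to the blue (important) regions in the figures above.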

Approach 2: How does the model 'group' inputs together?

Suppose you are running an image classification model on lots of input images, to separate these images by their content. Right before the final classification step, before the model decides whether an image contains a dog or a cat or a house or a tree, we can grab the outputs of the last hidden layer and get a high-dimensional vector that encapsulates everything the model has computed about that input just before a decision is made.

High-dimensional data is hard to interpret on its own. So we can use a technique called 'tSNE', which stands for t-Distributed Stochastic Neighbor Embedding, to reduce these high-dimensional vectors to a two-dimensional space, effectively splatting each instance into a form that can be visualized. Items 'near' each other in this two-dimensional space are effectively seen by the deep learning model as 'close', and will therefore be classified similarly.
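
As a rough sketch of that pipeline (the pretrained network and its 512-dimensional penultimate layer are assumptions for illustration; `images` stands in for a batch of preprocessed inputs):

```python
# Sketch: grab the layer just before classification and embed it with tSNE.
# Assumptions: `images` is a tensor of preprocessed images, shape [N, 3, 224, 224].
import torch
from torchvision import models
from sklearn.manifold import TSNE

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Drop the final fully connected (classification) layer, keeping everything up
# to the global average pool, which yields one 512-dimensional vector per image.
feature_extractor = torch.nn.Sequential(*list(model.children())[:-1])

with torch.no_grad():
    features = feature_extractor(images).flatten(start_dim=1)   # [N, 512]

# Reduce the 512-d vectors to 2-d points that can be scattered on a plot.
embedding = TSNE(n_components=2).fit_transform(features.numpy())  # [N, 2]
```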

Suppose you took thousands of images, grabbed this final layer from the CNN, and used tSNE to map each image into 2d space. What kind of view would you get? Something like this:

This view comes from here: http://cs.stanford.edu/people/karpathy/cnnembed/cnn_embed_6k.jpg Click to zoom and see more detailed views of the images. Each input image is mapped to its location in 2d space based on that final layer right before classification. Zooming in, we can start to see sets of images coming together into categories.

From these examples, we can see pictures of fish, dolphins, and other ocean animals placed together. We can also see boats, windmills, and satellites grouped more closely together (often with a common blue sky). These groupings seem to reflect shared visual elements.

In some cases, an image ends up in an unexpected spot, like the panda shown here.

So, why are the pandas placed here?

Perhaps because the algorithm is picking up on coloration (dark eyes and dark patches of fur) more than the shape of the features (e.g. rounded ears in pandas vs. pointy ears in dogs).

Or perhaps because the training set contains many more dogs than pandas, the model hasn't learned to discriminate pandas very well.

This kind of observation (pandas are incorrectly grouped with dogs) could be helpful for understanding what the network is doing and/or how to fix errors in classification.

Below is another use of tSNE, this time looking at the classification of hand-drawn digits, where color is the 'correct' label. By hovering over a dot, we can see the original image. If a dot sits away from the cluster of its color, it means that digit was misclassified (e.g. the '0' is classified correctly below, but the '5' is not).
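
As a small, self-contained version of that kind of plot, here is a sketch using scikit-learn's bundled digits dataset. It embeds raw pixels rather than a network's last hidden layer, so it illustrates the visualization rather than reproducing the interactive figure above.

```python
# Sketch: tSNE of hand-drawn digits, with each point colored by its true label.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

digits = load_digits()                                 # 8x8 grayscale digit images
points = TSNE(n_components=2, random_state=0).fit_transform(digits.data)

plt.scatter(points[:, 0], points[:, 1], c=digits.target, cmap='tab10', s=5)
plt.colorbar(label='digit label')
plt.title('tSNE of hand-drawn digits')
plt.show()
```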

Approach 3: What portions of the input activate specific nodes most strongly?

Suppose we want to understand what role a particular node plays in the process of producing a result, or what nodes in the same layer have in common. In terms of image analysis, researchers have mapped the activation of a node back onto the input, to capture which portions of the input space 'activate' a particular node.

This gives interesting results like this:

What we see is that, given inputs with faces, the first hidden layer learns low-level features: rounded curves, straight lines, angled edges, white corners. At the second hidden layer, the neurons appear to combine these low-level features into components that almost look like human features (eyes, noses), but aren't quite complete. At the final hidden layer, these features come together into faces.

What can we learn from this? Since each layer depends on the previous one, it makes sense that there is a hierarchy of learned features. The question we might ask about deep learning on biological data is whether the model 'learns' a similar hierarchy of features in those data sets.
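
One simple way to probe this in code is to record a layer's activations with a forward hook and then ask which inputs excite a chosen unit most strongly. The pretrained model, the layer, and the channel index below are all assumptions for illustration, and `images` stands in for a collection of preprocessed inputs.

```python
# Sketch: which inputs most strongly activate one channel of an intermediate layer?
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

captured = {}

def hook(module, inputs, output):
    captured['acts'] = output.detach()        # feature maps, shape [N, C, H, W]

# Watch an arbitrary intermediate layer (chosen for illustration only).
model.layer2.register_forward_hook(hook)

channel = 7                                    # the 'node' we want to understand
scores = []
with torch.no_grad():
    for img in images:                         # images: list of [3, 224, 224] tensors
        model(img.unsqueeze(0))
        # Average activation of that channel over the spatial dimensions.
        scores.append(captured['acts'][0, channel].mean().item())

# The highest-scoring images are the ones that excite this node most strongly.
top_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:5]
```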

Approach 4: "Write back" into the input space, to amplify a particular output

Suppose you have an image that is fundamentally noise and you run it through a CNN. But instead of asking 'what is this image?', you say 'assume this image is a dumbbell, and write back into the image to make it more dumbbell-like, according to the trained CNN.' Researchers at Google did exactly that, and here is the image that was produced:

https://web.archive.org/web/20150708233542/http://googleresearch.blogspot.com/2015/06/inceptionism-going-deeper-into-neural.html

What you'll notice is that the CNN doesn't just associate the actual dumbbell with the 'dumbbell' classification; it also includes hands and arms, perhaps because most of the training instances of dumbbells include arms and hands.
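
A bare-bones sketch of that 'write back' idea is gradient ascent on the input pixels. The real Inceptionism/DeepDream work adds regularization, jitter, and multi-scale steps that are omitted here, and the pretrained model, the class index, and the skipped input normalization are all simplifying assumptions.

```python
# Sketch: nudge a noise image so a pretrained CNN scores it more strongly as the
# target class. (The usual ImageNet input normalization is skipped for brevity.)
import torch
from torchvision import models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

target_class = 543                                        # assumed ImageNet index for 'dumbbell'
image = torch.rand(1, 3, 224, 224, requires_grad=True)    # start from noise
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    score = model(image)[0, target_class]     # how 'dumbbell-like' does the CNN find it?
    (-score).backward()                       # ascend the class score
    optimizer.step()
    with torch.no_grad():
        image.clamp_(0, 1)                    # keep pixel values in a plausible range
```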

This technique of 'writing back' into images has produced some pretty interesting images, which you can check out here:

https://deepdreamgenerator.com/gallery