It's all pretty Sketchy
explorations of over 75,400 sketches of seemingly random stuff
Kiran Bhattacharyya
The Sketchy database is a collection of about 75,400 hand-drawn, digitized sketches of 125 different classes of objects like airplanes, bees, chickens, mice, and zebras, compiled by a group at Georgia Tech. You can download it here.
Over 75,400 sketches is a lot of sketches, far too many for one person to go through or even remember after going through them. These are the computational and informational limits of our brains. So how can we experience these sketches in a way that is meaningful? With this project, I explore different ways of representing these sketches and compressing the information within them.
These explorations fall into 3 main categories: 1) looking at the average sketches of the same object to see what they can tell us, 2) seeing if sketches can be sonified (turned into sound) to experience them through a different sensory modality, and 3) training an algorithm to make new sketches or compare different ones and learning from its findings. The code for the experiments here can be found in this Github repo.
To see what the data looks like, here are 8 sketches of 8 different objects (of the 125 total) found in the database. Each object had between 500 and 800 different sketches.
an airplane (709 total sketches)
a bee (651 total sketches)
a chicken (674 total sketches)
a mouse (670 total sketches)
a zebra (608 total sketches)
a tank (606 total sketches)
a pair of glasses (538 total sketches)
a cabin (548 total sketches)
Some of the 125 objects were represented similarly across all of the 500+ drawings of that object made by different people. This is easily visualized by taking the average of all of the sketches of each object and looking at the average image. If the average sketch across hundreds of sketches still resembles the object, then it was drawn similarly in different sketches.
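Computing these average images is straightforward. Here's a minimal sketch of the idea, assuming the sketches for one object have been gathered into a single folder of image files (the folder layout in the usage comment is hypothetical):

```python
import numpy as np
from glob import glob
from PIL import Image

def average_sketch(image_paths):
    """Average a set of sketch images of the same object into one mean image."""
    total = None
    for path in image_paths:
        img = np.asarray(Image.open(path).convert('L'), dtype=np.float64)
        total = img if total is None else total + img
    return total / len(image_paths)

# Hypothetical layout: one folder of PNGs per object class.
# mean_bell = average_sketch(glob('sketches/bell/*.png'))
```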
Here are the average sketches of some of the objects that were drawn similarly across different sketches.
bell (average of 583 sketches)
wine bottle (average of 603)
guitar (average of 528)
apple (average of 551)
Take note of the average guitar sketch: some "ghosts" of the guitar in different orientations can be seen in the average image, and the "ghost" guitars sit on the left side of the image rather than the right. This is a common orientation in which we experience guitars, since most guitar players are right-handed and someone holding or playing a guitar in front of us would hold it that way (unless they're left-handed, which about 10-12% of people are).
bicycle (average of 641)
car_sedan (average of )
chair (average of 669)
jack-o-lantern (average of 526)
Representations of some objects took many different orientations and forms, so when averaged together they became some combination of all of these orientations, like the ones below. Look hard and you may be able to make out some of the different superimposed orientations.
hammer (average of 641)
chicken (average of 674)
knife (average of 624)
deer (average of 568)
And some average sketches were virtually unrecognizable or, at best, an abstract representation of the object since representations of the same object in different sketches were very dissimilar.
beetle (average of 549)
bear (average of 722)
zebra (average of 608)
flower (average of 518)
I like the above visualizations because they bring nuance to the saying "the mean is not the message". It's always a message, even when it isn't one that's easy to understand.
Interestingly, some objects were very similar across different representations, others were not, and still other average sketches only vaguely resembled the object. Nonetheless, this suggests that some objects have strong "archetypes" shared across the people who made these sketches. These archetypes also seem to be influenced by how the object is experienced in everyday life with respect to gravity (the bottom of the image), the object's lines of symmetry, how its appearance changes with perspective, and its salient features. There are also strong socio-cultural differences in the mental representations of objects, but those are hard to study with a database of this size and its skewed sampling. We would need even more data from more people.
Sonification is the effort to perceptualize data through sound. Images can be sonified in many different ways. I used one specific way here where 1) the y-axis of the image denotes frequencies of sound, 2) the x-axis denotes time, and 3) the pixel value at position (x1, y1) is the amplitude (magnitude or volume) of the frequency y1 at time x1. Therefore, each column of pixels is a slice of time where the value of each pixel contributes to the volume of that frequency of sound for that time slice.
For instance, the images in the Sketchy database are 256 by 256 binary pixels in size. Let's say we want to sonify each image into a sound waveform that lasts about 25 seconds. Each column would then represent 0.1 seconds of time (256 * 0.1 = 25.6 seconds), and the value of each pixel in that column (0 or 1) would determine whether a certain frequency of sound was on or off during that time slice.
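Here's a rough sketch of that mapping in code. The frequency range, column duration, and sample rate below are illustrative choices of mine, not necessarily the ones used for the clips that follow:

```python
import numpy as np

def sonify(image, col_duration=0.1, sample_rate=22050, f_min=200.0, f_max=4000.0):
    """Map an image to sound: rows -> frequencies, columns -> time slices,
    pixel value -> amplitude of that frequency during that slice."""
    n_rows, n_cols = image.shape
    freqs = np.linspace(f_max, f_min, n_rows)   # top row = highest frequency
    t = np.arange(int(col_duration * sample_rate)) / sample_rate
    slices = []
    for c in range(n_cols):
        amps = image[:, c].astype(float)        # amplitudes for this time slice
        # sum of sinusoids, one per row/frequency, weighted by the pixel value
        slices.append((amps[:, None] * np.sin(2 * np.pi * freqs[:, None] * t)).sum(axis=0))
    wave = np.concatenate(slices)
    peak = np.max(np.abs(wave))
    if peak > 0:
        wave = wave / peak                      # normalize to [-1, 1] to avoid clipping
    return wave, sample_rate

# To listen, write the waveform to a .wav file:
# from scipy.io import wavfile
# wave, sr = sonify(chicken_sketch)             # chicken_sketch: 256x256 binary array
# wavfile.write('chicken.wav', sr, wave.astype(np.float32))
```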
Here are some sonifications of sketches of chickens.
Since we hear the image from left to right, each sonification is like a drawing of the sketch in sound. We can also sonify the mean images that we saw earlier. Sonifications of the mean images are like hearing all of the different sketches of the same object at once. Objects that were drawn similarly in different sketches create louder and clearer sonic patterns where their strokes make sweeping frequency changes in the sound. Even if these strokes don't appear in the same location across all sketches, they'll still generate sweeping sonic patterns. This way we can hear the 500+ sketches of each object at once, differently than when we saw the mean images.
Sonifications of these sketches and the average sketches are interesting since they allow us to experience the sketch through another sensory modality. We may be better able to understand certain qualities of these sketches through sound than through vision. Moreover, it's also a way of computationally creating synesthesia.
We have now experienced these sketches in different ways. We saw a few of the individual sketches, the average sketches of objects, and explored sonification as a way of hearing these sketches. Now let's see if there is another way to compress the information in these sketches.
We already know that these are sketches of 125 objects so the obvious method of compression is just a list of the 125 objects. This would be a valid way to compress these 75,400+ sketches but maybe an unsatisfying way to experience them. How well does a list of objects encompass the forms and representations of those objects?
It may do so very well for those objects that had consistent representations across sketches but describe very poorly those objects with many forms and orientations. Here I will train different models that attempt to encapsulate the statistics of the forms seen in the sketches and then try to recreate the sketches using them.
My first attempt was the simplest: try to predict the value of the next pixel in the drawing based on the values of the last n pixels. I used the sketches to learn their statistics in 2 ways. 1) I flattened the 256 by 256 pixel images into 65,536-pixel-long vectors and learned the 1D n-gram statistics, with n equal to 12, going from left to right along the vector. 2) I used the 256 by 256 pixel images to learn the probability of a pixel's value given the 12 pixels before it in its row and the 12 pixels above it in its column, preserving some of the 2D information. After learning these statistics, I could generate images from them. Here are the images below from the 1D and 2D n-gram methods. I've also sonified them to see if we can notice any patterns in the sound.
Clearly, the n-gram methods cannot recreate sketches that look anything like the ones in the data set. However, they do seem to learn very local statistics of the drawings. The 1D n-gram and the 2D n-gram both learn different kinds of strokes that compose the sketches, but these strokes are not organized in any way that creates objects we can recognize. Also, the sonifications of both sound like the beginnings of the sonifications of real sketches, suggesting that these n-grams are learning statistics over a very small spatial scale. This scale could potentially be increased by increasing n, the number of pixels predicting the value of the next one, but this comes with severe computational and data storage limitations. It's also hard to tell how large n should be a priori. It may be a very large number, since the strokes that compose the sketches are coherent on the spatial scale of the entire image.
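For reference, the 1D version of this pixel n-gram can be sketched roughly as follows. It assumes binary 0/1 pixel values and falls back to a coin flip for unseen contexts, which are my own simplifications; the actual implementation is in the Github repo:

```python
import numpy as np
from collections import defaultdict

N = 12  # number of previous pixels used as context

def learn_1d_ngram(images):
    """Count next-pixel values conditioned on the previous N pixels of the
    flattened (row-major) binary image."""
    counts = defaultdict(lambda: np.zeros(2))
    for img in images:                          # img: 2D array of 0s and 1s
        flat = img.flatten()
        for i in range(N, len(flat)):
            counts[tuple(flat[i - N:i])][flat[i]] += 1
    return counts

def generate_1d(counts, shape=(256, 256), seed=None):
    """Sample a new image pixel by pixel from the learned statistics."""
    rng = np.random.default_rng(seed)
    length = shape[0] * shape[1]
    pixels = list(rng.integers(0, 2, size=N))   # random starting context
    for _ in range(length - N):
        c = counts.get(tuple(pixels[-N:]))
        p = c / c.sum() if c is not None and c.sum() > 0 else np.array([0.5, 0.5])
        pixels.append(rng.choice(2, p=p))
    return np.array(pixels).reshape(shape)
```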
So I pursued other methods which attempt to learn the structure in the images in different ways while still being generative - meaning that they can recreate the sketches from the learned structure.
Neural networks can learn patterns from high-dimensional data sets. So I explored using neural nets to find hidden patterns and structure in the sketches that could help compress the information in them. However, to do so I had to shrink the sketches down from 256 by 256 to 32 by 32 pixels (curse of dimensionality).
This is because a 256 by 256 pixel image has 65,536 pixels total and the Sketchy dataset only has about 75,400 images. Learning the structure in 65,536 dimensions with 75,400 samples would be like fitting a line to 3 points in 2 dimensions. We can do it... but our fitted line would be very sensitive to noise in the dataset. Therefore, I resized the images to 32 by 32 so there were 1,024 pixels total. A sample size of over 75,400 for 1,024 dimensions is much more acceptable.
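The downsampling itself is a one-liner with Pillow; the exact resampling filter I used may differ:

```python
import numpy as np
from PIL import Image

def load_small(path, size=32):
    """Load a sketch, downsample it to size x size, and scale pixels to [0, 1]."""
    img = Image.open(path).convert('L').resize((size, size))
    return np.asarray(img, dtype=np.float32) / 255.0
```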
First, I trained some simple autoencoders that attempted to compress the sketches into 32, 64, 128, or 256 latent dimensions. Below are the results of compressing sketches down to these latent dimensions and then attempting to reconstruct some sketches that the network wasn't trained on.
Simple autoencoders performed increasingly well with more latent dimensions. This makes sense, since more latent dimensions allow the network to store more information about the data. They started with very blurry reconstructions of objects and then slowly improved. However, since I had 32 by 32 images, there were 1,024 total pixels or dimensions, so a compression to 256 latent dimensions wasn't an especially impressive compression. And even with a latent space of that dimensionality, the reconstructions lack detail and are quite poor, as is visible above.
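For concreteness, a simple autoencoder of this kind can be sketched as below. The activations and training settings are illustrative assumptions (the exact setup is in the Github repo), and x_train/x_test stand in for the flattened 32 by 32 sketches:

```python
from tensorflow import keras
from tensorflow.keras import layers

def simple_autoencoder(latent_dim, input_dim=32 * 32):
    """One dense layer down to latent_dim, one dense layer back up to the image."""
    inputs = keras.Input(shape=(input_dim,))
    latent = layers.Dense(latent_dim, activation='relu', name='latent')(inputs)
    outputs = layers.Dense(input_dim, activation='sigmoid')(latent)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model

# One model per latent size, trained to reproduce its own input:
# for d in (32, 64, 128, 256):
#     simple_autoencoder(d).fit(x_train, x_train, epochs=50, batch_size=256,
#                               validation_data=(x_test, x_test))
```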
I also tried compression with deep autoencoders, which have many more layers before the compressed latent layer. They generally perform better than simple autoencoders. Here are the results below. For more info about the design of the deep autoencoder, please refer to the Github repo.
The deep autoencoder with 256 latent dimensions starts creating objects that are almost recognizable. However, it still has difficulty recreating sketches with a lot of detail, like the tiger's face (3rd from the left), the shoe (3rd from the right), and the details of the fan blades (4th from the right). I pursued deeper autoencoders with more layers and/or more neurons per layer with similar results. This suggests that deep autoencoders may perform better than simple autoencoders at compressing information but that they reach some upper limit of their capability.
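A deep autoencoder along these lines might look like the following; the layer widths here are my own guesses rather than the exact ones I trained (those are in the Github repo):

```python
from tensorflow import keras
from tensorflow.keras import layers

def deep_autoencoder(latent_dim, input_dim=32 * 32):
    """Several dense layers on either side of the latent bottleneck."""
    inputs = keras.Input(shape=(input_dim,))
    x = layers.Dense(512, activation='relu')(inputs)
    x = layers.Dense(256, activation='relu')(x)
    latent = layers.Dense(latent_dim, activation='relu', name='latent')(x)
    x = layers.Dense(256, activation='relu')(latent)
    x = layers.Dense(512, activation='relu')(x)
    outputs = layers.Dense(input_dim, activation='sigmoid')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```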
Another approach is to use deep convolutional autoencoders, which first learn convolutional filters that successively compress the image into many smaller feature maps through several layers of 2D convolution. Below are the results of training a deep convolutional autoencoder with the Sketchy data set. Please refer to the Github repo for more details on the architecture of the network. Results can be very sensitive to the network architecture.
The deep convolutional autoencoder with 32 latent dimensions is a clear improvement over its deep and simple counterparts with the same number of latent dimensions. Interestingly, it seems to capture the sketch boundary really well but is unable to reconstruct the details within the sketch. Let's jump to a deep convolutional autoencoder with 256 latent dimensions and a similar architecture to see how well it can reconstruct the sketches with the added latent dimensions.
Increasing the number of latent dimensions by 8 times did not seem to dramatically improve the reconstructions. The convolutional autoencoder is still missing a lot of the detail in the drawings. It seems that finding a latent space that can completely reconstruct these drawings is not a straightforward task, so it may be hard to teach a neural net to make these sketches. A generative adversarial network may be better suited to this task.
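For reference, a deep convolutional autoencoder of roughly this shape can be sketched as follows. The filter counts and layer arrangement are illustrative; the real architecture is in the Github repo:

```python
from tensorflow import keras
from tensorflow.keras import layers

def conv_autoencoder(latent_dim):
    """Convolutional encoder -> dense latent bottleneck -> convolutional decoder
    for 32 x 32 x 1 sketches."""
    inputs = keras.Input(shape=(32, 32, 1))
    x = layers.Conv2D(16, 3, activation='relu', padding='same')(inputs)
    x = layers.MaxPooling2D(2)(x)                              # 16 x 16
    x = layers.Conv2D(32, 3, activation='relu', padding='same')(x)
    x = layers.MaxPooling2D(2)(x)                              # 8 x 8
    x = layers.Flatten()(x)
    latent = layers.Dense(latent_dim, activation='relu', name='latent')(x)

    x = layers.Dense(8 * 8 * 32, activation='relu')(latent)
    x = layers.Reshape((8, 8, 32))(x)
    x = layers.UpSampling2D(2)(x)                              # back to 16 x 16
    x = layers.Conv2D(16, 3, activation='relu', padding='same')(x)
    x = layers.UpSampling2D(2)(x)                              # back to 32 x 32
    outputs = layers.Conv2D(1, 3, activation='sigmoid', padding='same')(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='binary_crossentropy')
    return model
```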
However, the deep convolutional autoencoder with 256 latent dimensions does a fairly good job at capturing the outline of the sketch and some brightness patterns of the details within the sketch. Since this represents at least some of the form of the drawing, could this at least be used to classify the drawings? Let's train an autoencoder to reconstruct drawings and then use the encoder portion as the bottom of a neural network that is trained to classify the sketches.
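A rough sketch of that idea, assuming the autoencoder's latent layer was named 'latent' as in the sketches above, and with x_train/y_train standing in for the sketches and their hypothetical class labels:

```python
from tensorflow import keras
from tensorflow.keras import layers

def classifier_from_encoder(autoencoder, n_classes=125, latent_layer_name='latent'):
    """Reuse a trained autoencoder's encoder (up to its latent layer) as the
    bottom of a sketch classifier."""
    latent = autoencoder.get_layer(latent_layer_name).output
    x = layers.Dense(128, activation='relu')(latent)
    outputs = layers.Dense(n_classes, activation='softmax')(x)
    model = keras.Model(autoencoder.input, outputs)
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# clf = classifier_from_encoder(trained_conv_autoencoder)
# clf.fit(x_train, y_train, epochs=20, batch_size=256,
#         validation_data=(x_test, y_test))
```

Whether to freeze the encoder's weights or let the classifier fine-tune them is a design choice; in this sketch they stay trainable.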