A Brief Intro to Convolutional Neural Networks


Convolutional Neural Networks (CNN) have proven very good at processing data that is closely knitted together. A CNN uses a three-dimensional structure, with three specialized neural networks analyzing the red, green and blue layers of a color image. CNN scans an image one area at a time, identifies and extracts important features, and uses them to classify the image.

CNNs are primarily used for computer vision, powering tasks like image classification, face recognition, identifying and classifying everyday objects, and image processing in robots and autonomous vehicles. They are also used for video analysis and classification, semantic parsing, automatic caption generation, search query retrieval, sentence classification, and more.

While the typical CNN uses 2-dimensional or 3-dimensional neural layers to analyze images with 2 or 3 color channels, CNNs with 1-dimensional layers are also very useful. A 1D CNN can derive important features from short segments of an overall dataset when the position of each segment is not so important. For example, sensor data, audio signals, and natural language processing.


CNN Architecture


A plain vanilla neural network, in which all neurons in one layer communicate with all the neurons in the next layer (this is called “fully connected”), is inefficient when it comes to analyzing large images and video. For an average size image with hundreds of pixels and three color channels (red, green, blue), the number of parameters using a traditional neural network will be in the millions, which can lead to overfitting.

To limit the number of parameters and focus the neural network on important parts of the image, a CNN uses a three-dimensional structure in which each set of neurons analyzes a small region or “feature” of the image. Instead of having all neurons pass their decisions to the next neural layer, each group of neurons specializes on identifying one part of the image, for example, the nose, left ear, mouth or hair. The final output is a vector of probability scores, representing how likely each of the features is to be part of a class.

How CNNs Works


A CNN operates in three stages. The first is a convolution, in which the image is “scanned” a few pixels at a time, and a feature map is created with probabilities that each feature belongs to the required class (in a simple classification example). The second stage is pooling (also called downsampling), which reduces the dimensionality of each feature while maintaining its most important information. The pooling stage creates a “summary” of the most important features in the image.





Most CNNs use “max pooling”, in which the highest value is taken from each pixel area scanned by the CNN, as illustrated below.





A CNN can perform several rounds of convolution then pooling. For example, in the first round, an image can be broken down into objects, such as a boat, a person, a plane of grass. In the second round, the CNN can identify features within each object, for example, a face, torso, hands, legs. In a third round, the CNN could go deeper and analyze features within the face, etc.

Finally, when the features are at the right level of granularity, the CNN enters the third stage, which is a fully-connected neural network that analyzes the final probabilities, and decides which class the image belongs to. The final step can also be used for other tasks, such as generating text—a common use of convolutional networks is to automatically generate captions for images.

We deliberately kept the theoretical part short, to get to the punch of how to build a CNN yourself.

  • To understand convolutional neural networks in more detail, see this in-depth guide from Andrej Karpathy

Enough talk—let’s see how it’s done.