Let’s take a deep dive into Convolutional Neural Networks (CNNs), their mathematical workings, and the fascinating journey of their creation. Before we dive in, let me tell a story.
The Plot
It’s time to travel back to the 1960s and 70s, when brilliant minds like David Marr and Hans Moravec asked a bold question: Can we teach machines to see? With clever algorithms and brute-force methods, they took the first steps toward giving computers the ability to understand images. But progress was slow, relying heavily on human-crafted rules and manual feature extraction.
Then came the game-changer: deep learning. It was as if computer vision gained superpowers. Machines now could learn directly from raw data, no longer limited by manual intervention. This breakthrough sparked a renaissance, turning computer vision into a dynamic field where adaptive models interpret the world with incredible precision.
Today, computer vision is a driving force behind innovation, powered by a global community and open-source collaboration. From early experiments to cutting-edge tools, computer vision has become the hero, forever changing how we interact with technology and the world.
Just think, it all started with a simple problem: How can machines recognize image patterns as efficiently as the human brain?
Researchers had developed some algorithms for pattern recognition, but they hit a wall—regular neural networks just didn’t cut it for visual data. The challenge? Images are large, complex, and full of nuances.
A standard image has thousands, if not millions, of pixels. Treating each pixel as an input node in a traditional neural network would result in enormous models with many parameters. Training such models was like trying to catch a fly with a forklift: slow, cumbersome, and inefficient.
Then came a godfather of the field: Yann LeCun. Along with Geoffrey Hinton and Yoshua Bengio, LeCun is often referred to as one of the “three musketeers” of deep learning.
LeCun, a computer scientist, proposed a radically new approach with his team in their work on LeNet. Instead of handling every pixel individually, they thought: What if we could break down an image into smaller, meaningful chunks and find patterns within those chunks?
This would mimic the way humans recognize patterns—by focusing on key features (like edges, corners, textures), rather than processing every tiny pixel.
And that was the why behind CNNs: to reduce the complexity of processing images, while still being able to learn abstract patterns effectively.
Yann LeCun’s insight was simple yet profound: just like our eyes don’t process every tiny detail of an image all at once, a machine could do the same by focusing on local regions and learning hierarchies of patterns.
This hierarchical approach wasn’t entirely new; our visual cortex works the same way. Hubel and Wiesel’s work in the 1960s found that certain neurons in the visual cortex respond only to specific orientations and shapes. CNNs mimic this by creating layers that detect low-level features in early layers and higher-level ones as the network deepens.
The Intuition
Think of CNNs as pattern finders. Let’s say you’re looking at a picture of a cup of coffee. Your brain doesn’t focus on each individual coffee bean or analyze every tiny bubble in the foam, right? Instead, you notice the larger, more meaningful patterns: the shape of the cup, the steam rising from it, and the contrast between the dark coffee and the light surface. CNNs do the same, but mathematically.
The first core idea of CNNs is the convolution operation. The name may sound fancy, but convolution is simply a method of scanning an image using small filters (also called kernels).
Imagine sliding a small window (a filter) over a large image, looking at just one patch at a time. This filter scans through the image, performing a simple dot product between the pixels in that patch and the filter's weights.
The result is a number representing how much that filter’s pattern appears in that patch of the image. Now, let’s jump into the mathematics that makes this magic happen.
Convolution Operation
Convolution is essentially a dot product between the input matrix (image) and the filter (or kernel) matrix. Say we have a 5x5 image and a 3x3 filter. As we slide this filter across the image, we perform element-wise multiplication between the filter and the current portion of the image it covers.
We then sum up these products and place the result in a new matrix, called the feature map or activation map. Mathematically, if the image is I and the filter is K, then at any position (i, j) the convolution output is

S(i, j) = Σ_{m=0}^{M−1} Σ_{n=0}^{N−1} I(i + m, j + n) · K(m, n)

where:
- M and N are the dimensions of the filter.
- (i,j) is the position in the image.
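To make the formula concrete, here is a minimal NumPy sketch of the operation above, using a toy 5x5 image and a hypothetical 3x3 vertical-edge filter (stride 1, no padding):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide the kernel over the image and take a dot product at each position."""
    M, N = kernel.shape
    out_h = image.shape[0] - M + 1
    out_w = image.shape[1] - N + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the current patch with the kernel, then sum
            feature_map[i, j] = np.sum(image[i:i+M, j:j+N] * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5)      # toy 5x5 "image"
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])          # a classic vertical-edge detector
print(convolve2d(image, kernel).shape)   # (3, 3)
```

A 5x5 input convolved with a 3x3 filter yields a 3x3 feature map, exactly as the formula predicts.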
Some more technical terms involved in this step are:
a. Depth - The depth of the feature map directly corresponds to the number of filters used.
We just learnt that applying a filter to an image generates a feature map. Here's where CNNs get fascinating: they don't use just one filter—they employ multiple filters at each layer.
Picture this: one filter might detect vertical edges, while another identifies textures.
With 3 filters, for instance, the feature map's depth becomes 3—each slice representing a single filter's output. As the network deepens, it evolves from detecting simple features like edges to recognizing complex shapes and objects. This increasing depth allows the network to learn increasingly abstract patterns.
b. Stride - Stride determines how the filter moves across the image—controlling the speed of information scanning. It's the step size—how many pixels you slide the filter over the image at each step.
Stride of 1: The filter moves one pixel at a time, covering every possible part of the image. This results in a larger feature map.
Stride of 2 or more: Increasing the stride causes the filter to skip some pixels as it moves across the image. This results in a smaller feature map but makes the network more computationally efficient—like taking larger steps while walking across a room.
Stride controls the balance between computational complexity and how much detail the network can capture. A smaller stride captures more granular details from the image but increases the computational load. A larger stride reduces the computational cost but could lead to a loss in fine-grained information.
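This trade-off follows directly from the standard output-size formula, floor((W − F + 2P) / S) + 1, where W is the input width, F the filter size, S the stride, and P the padding. A small sketch (the 224x224 input is just a hypothetical example):

```python
def output_size(w, f, s, p=0):
    """Spatial size of the feature map: floor((W - F + 2P) / S) + 1."""
    return (w - f + 2 * p) // s + 1

# A hypothetical 224x224 input with a 3x3 filter:
print(output_size(224, 3, 1))  # stride 1 -> 222
print(output_size(224, 3, 2))  # stride 2 -> 111, roughly half the size
```

Doubling the stride roughly halves each spatial dimension of the feature map, which is why larger strides are cheaper but coarser.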
c. Padding: When performing convolutions, a challenge arises at the image edges: the filter can't fully cover the area. This is where padding comes into play.
- Valid padding (no padding): The filter skips the edges, resulting in a smaller output feature map.
- Same padding (zero padding): Extra rows and columns of zeros are added around the image borders. This allows the filter to slide over the entire image, including edges, producing an output feature map with the same size as the input.
Padding is crucial for preserving edge information in an image. Without it, important features—like object corners—could be lost. By adding zeros, CNNs can capture edge patterns without excessively shrinking the image after each convolution. This becomes particularly important in deeper networks, where significant information loss at each layer could hamper performance.
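Zero padding can be sketched directly with NumPy's `np.pad` (using a toy 5x5 all-ones image):

```python
import numpy as np

image = np.ones((5, 5))
# "Same" padding for a 3x3 filter: one ring of zeros on every side
padded = np.pad(image, pad_width=1, mode='constant', constant_values=0)
print(padded.shape)  # (7, 7) -- a 3x3 filter now produces a 5x5 output, same as the input
```

The padded 7x7 input lets the 3x3 filter visit every original pixel, including the corners, so the output stays 5x5.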
Before we move to the next layer, one important step is to add non-linearity.
Activation Function
Convolution is a linear operation (element-wise multiplication followed by addition), so to account for non-linearity we introduce a non-linear activation function, typically ReLU (Rectified Linear Unit). The function is simply:
f(x) = max(0, x)
This introduces non-linearity to the model, enabling it to learn complex patterns and features. While other non-linear functions like tanh or sigmoid can be used instead of ReLU, ReLU has proven more effective in most scenarios.
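A minimal sketch of ReLU applied to a toy feature map:

```python
import numpy as np

def relu(x):
    """f(x) = max(0, x), applied element-wise."""
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 3.0],
                        [ 0.5, -1.0]])
print(relu(feature_map))  # negatives become 0, positives pass through unchanged
```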
The Pooling Layer
Once CNNs detect patterns, they simplify their findings using pooling (usually max pooling) to reduce the size of the feature map (also called downsampling or subsampling).
This step reduces the spatial size of the image, which helps in two ways: it reduces the computational cost and ensures that the network doesn’t get bogged down by tiny variations, like a cup of coffee being slightly tilted or shifted.
In simple terms, pooling is a way of simplifying without losing key information.
Pooling helps the network focus on the most important parts of an image, ensuring that small changes—like an object being shifted slightly—don't drastically affect its ability to recognize it. The most common types of pooling are:
1. Max Pooling: For each region in the feature map, max pooling selects the highest value. Imagine a 2x2 grid—max pooling simply picks the largest value from that grid and discards the rest.
2. Average Pooling: Instead of selecting the maximum value, average pooling calculates the mean of all values in the grid. This approach can smooth out the information, which may be useful in some tasks but is generally less effective than max pooling.
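Both variants can be sketched in a few lines of NumPy, using a toy 4x4 feature map and non-overlapping 2x2 windows:

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Downsample by taking the max (or mean) of each non-overlapping size x size block."""
    h, w = x.shape[0] // size, x.shape[1] // size
    blocks = x[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

x = np.array([[1, 3, 2, 1],
              [4, 6, 5, 0],
              [7, 2, 9, 8],
              [1, 0, 3, 4]])
print(pool2d(x))               # max pooling:     [[6 5] [7 9]]
print(pool2d(x, mode="mean"))  # average pooling: [[3.5 2. ] [2.5 6. ]]
```

Note how max pooling keeps only the strongest response in each window, while average pooling blends all four values.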
How does it help?
Pooling helps in decreasing computational complexity, memory usage, and the risk of overfitting. It's like zooming out on an image—letting the model see the bigger picture.
Importantly, max pooling retains stronger signals (like edges and shapes) while discarding weaker ones, making it the go-to choice for most CNNs. This technique makes the network invariant to small transformations, distortions, and translations in the input image.
It helps achieve an approximately translation-invariant representation of the image (convolution itself, by contrast, is translation-equivariant: shifting the input shifts the feature map correspondingly). This capability is powerful, as it allows the network to detect objects in an image regardless of their exact location.
The Fully Connected Layer
After condensing and extracting high-level features from the image, these features are passed through traditional fully connected layers (just like in a regular neural network). This is where the final decision is made—does this collection of features represent a cup of coffee or something else?
In a fully connected layer, every neuron is connected to every neuron in the previous layer. Essentially, the FC layer is tasked with combining all the features detected by the convolutional and pooling layers and assigning weights to them to figure out what the image most likely represents.
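At its core, a fully connected layer is just a matrix-vector product plus a bias. A minimal sketch, with hypothetical sizes (128 flattened features in, 3 classes out) and random weights for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

features = rng.standard_normal(128)        # flattened output of the conv/pool layers
W = rng.standard_normal((3, 128)) * 0.01   # one weight per (class, feature) pair
b = np.zeros(3)                            # one bias per class

logits = W @ features + b                  # every output connects to every input
print(logits.shape)  # (3,) -- one raw score (logit) per class
```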
Softmax - At the very last layer, a CNN typically applies a softmax function. This is crucial for multi-class classification problems (like recognizing different objects in an image). The softmax function transforms the raw outputs (called logits) from the final fully connected layer into probabilities that sum up to 1, which makes it easier to interpret.
How does it work?
Suppose you’re training a CNN to classify between three objects: a cup of coffee, a book, and a smartphone. The final fully connected layer will output three numbers (logits), each representing a class. These numbers can be positive, negative, or zero.
The softmax function takes these logits and transforms them into probabilities. For instance, if the logits were [2.0, 1.0, -1.0], softmax would convert them into probabilities of roughly [0.71, 0.26, 0.04], meaning there’s about a 71% chance the object is a cup of coffee. The class with the highest probability is chosen as the model’s final prediction, which in this case is the cup of coffee.
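A minimal NumPy sketch of softmax, run on the same toy logits:

```python
import numpy as np

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # subtract the max for numerical stability
    return exps / exps.sum()

probs = softmax(np.array([2.0, 1.0, -1.0]))
print(probs.round(2))  # roughly [0.71 0.26 0.04]
print(probs.argmax())  # 0 -> the first class ("cup of coffee") wins
```

Subtracting the maximum logit before exponentiating does not change the result but prevents overflow for large scores.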
In this way, the fully connected layers and the softmax function work together to classify the image, translating the patterns detected by the CNN into meaningful categories.
A flowchart of the steps involved in a CNN
Loss Function
Now that we’ve explored how CNNs work layer by layer, let’s talk about how CNNs learn. The magic happens through a process called gradient descent. This is the core algorithm that trains the network, allowing it to improve its predictions over time.
To train CNNs, we need to minimize a loss function. For image classification tasks, the most common one is categorical cross-entropy: L = −Σ y_i · log(p_i), where y_i is the one-hot true label and p_i is the predicted probability for class i. During training, the network adjusts its filters to reduce this loss using backpropagation and gradient descent.
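A minimal sketch of categorical cross-entropy for a single example (the one-hot label and the two probability vectors below are hypothetical):

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """L = -sum(y_true * log(y_pred)); y_true is one-hot, y_pred are softmax outputs."""
    return -np.sum(y_true * np.log(y_pred + eps))  # eps guards against log(0)

y_true = np.array([1.0, 0.0, 0.0])  # true class: "cup of coffee"
confident = cross_entropy(y_true, np.array([0.7, 0.2, 0.1]))
wrong     = cross_entropy(y_true, np.array([0.1, 0.2, 0.7]))
print(confident < wrong)  # True -- the loss is lower when the model favors the right class
```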
Gradient Descent: The Heartbeat of Learning
At each iteration, gradient descent nudges every weight in the direction that reduces the loss: w ← w − η · ∂L/∂w, where η is the learning rate. Each iteration pushes the CNN towards better and better predictions, refining its ability to extract meaningful patterns from data.
The key parameters in gradient descent are the learning rate and number of iterations. The learning rate determines how large each step is during the update. If it’s too large, the network may overshoot the optimal point, and if it’s too small, learning will be painfully slow. That’s why we spend time fine-tuning this parameter.
There are also variations of gradient descent like Stochastic Gradient Descent (SGD), where only a random subset (batch) of the data is used to update the weights in each iteration. This speeds up learning and introduces a bit of randomness, which can help the network escape local minima.
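To make the update rule concrete, here is a minimal SGD sketch on a toy one-parameter problem (not a CNN; the synthetic data, learning rate, and step count are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy problem: learn w such that x * w approximates y = 3x, with squared-error loss
X = rng.standard_normal(100)
y = 3.0 * X
w = 0.0
learning_rate = 0.05

for step in range(200):
    i = rng.integers(len(X))             # SGD: one randomly chosen sample per update
    grad = 2 * (X[i] * w - y[i]) * X[i]  # d/dw of (x*w - y)^2
    w -= learning_rate * grad            # step against the gradient

print(round(w, 2))  # converges close to the true value 3.0
```

Each update uses only one sample's gradient, which is noisy but cheap; over many steps the noise averages out and w settles near the true value.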
Why CNNs Changed Everything
CNNs revolutionized computer vision by enabling machines to process images more like humans—through layers of abstraction. Rather than analyzing each pixel, CNNs efficiently capture the essence of an image by focusing on meaningful patterns.
From recognizing handwritten digits in LeCun’s early work to powering today’s self-driving cars, facial recognition, and medical imaging, CNNs have become the foundation of image-based AI.
The brilliance of CNNs lies in their balance of complexity and efficiency, mimicking how the human brain processes visuals. From detecting diseases to guiding autonomous vehicles, CNNs have transformed how machines understand the visual world, all starting from a simple question: How can machines see like us?