In our second week, we train and apply an image recognition AI on several datasets.
MNIST consists of 60,000 training images and 10,000 testing images of handwritten digits (0-9), each a 28x28 grayscale image.
CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class: 50,000 training images and 10,000 test images. Examples of the classes are airplane, deer, and frog.
import torchvision
import torchvision.transforms as transforms

transform = transforms.ToTensor()  # minimal transform (its definition is not shown in the original snippet)
trainset = torchvision.datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)
Here we download the MNIST dataset and load it to train our AI.
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)
Here we download the CIFAR-10 dataset and load it to train our AI. The code may look different from the MNIST version, but its function is the same.
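In a typical training setup, the dataset is then wrapped in a DataLoader so we can iterate over it in mini-batches. A minimal sketch (the batch size and shuffling are illustrative assumptions, not taken from our actual code):

import torch

# Iterate over the dataset in shuffled mini-batches during training.
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

images, labels = next(iter(trainloader))
print(images.shape)  # e.g. torch.Size([64, 3, 32, 32]) for CIFAR-10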
One of the earliest convolutional neural networks: a pioneering architecture in the field of deep learning, designed by Yann LeCun for recognizing handwritten digits (the MNIST dataset). It's a relatively simple architecture compared to modern CNNs, typically consisting of a few convolutional and pooling layers followed by fully connected layers.
VGG (Visual Geometry Group): a family of CNN architectures known for their deep structures with small (3x3) convolutional filters. VGG architectures have achieved excellent performance in image classification tasks, particularly on ImageNet.
VGG-16, used for illustration purposes, includes a series of convolutional blocks followed by a number of fully connected layers.
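To illustrate the idea of stacked 3x3 convolutions, here is a minimal sketch of a single VGG-style block; the channel counts (64 in, 128 out) are assumptions for illustration, not the full VGG-16:

import torch.nn as nn

# One VGG-style block: two 3x3 convolutions (padding=1 keeps the spatial size),
# each followed by ReLU, then a 2x2 max pooling that halves height and width.
vgg_block = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2, 2),
)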
LeNet-5, a pioneering convolutional neural network, is structured as a sequence of layers, each performing a specific operation on the input data (a code sketch follows this list):
Convolutional layer with 6 feature maps and a 5×5 kernel.
Average pooling layer.
Convolutional layer with 16 feature maps and a 5×5 kernel.
Another average pooling layer.
Convolutional layer with 120 feature maps and a 5×5 kernel.
Fully connected layer with 84 units.
Fully connected layer with 10 units (for the 10 digits).
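Put together, the sequence above can be sketched in PyTorch as follows; this assumes the original 32x32 grayscale input and tanh activations, as in LeNet's original design:

import torch.nn as nn

lenet5 = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # 6 feature maps, 5x5 kernel: 32x32 -> 28x28
    nn.Tanh(),
    nn.AvgPool2d(2, 2),                 # average pooling: 28x28 -> 14x14
    nn.Conv2d(6, 16, kernel_size=5),    # 16 feature maps, 5x5 kernel: 14x14 -> 10x10
    nn.Tanh(),
    nn.AvgPool2d(2, 2),                 # average pooling: 10x10 -> 5x5
    nn.Conv2d(16, 120, kernel_size=5),  # 120 feature maps, 5x5 kernel: 5x5 -> 1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(120, 84),                 # fully connected, 84 units
    nn.Tanh(),
    nn.Linear(84, 10),                  # 10 outputs, one per digit
)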
Convolutional Neural Network (CNN): forms the basis of computer vision and image processing.
A convolutional layer typically employs multiple filters. Each filter is applied to the input image independently, generating a separate feature map. The number of filters used in the layer is a hyperparameter that you can adjust. The key idea is that each filter will learn to recognize distinct patterns or features within the input image during the training process.
Image:
[[1, 1, 1, 0, 0],
[0, 1, 1, 1, 0],
[0, 0, 1, 1, 1],
[0, 0, 1, 1, 0],
[0, 1, 1, 0, 0]]
Filter:
[[1, 0, 1],
[0, 1, 0],
[1, 0, 1]]
(1 * 1) + (1 * 0) + (1 * 1) + (0 * 0) + (1 * 1) + (1 * 0) + (0 * 1) + (0 * 0) + (1 * 1) = 4
This calculation results in a single value (4 in this case).
We repeat this process, sliding the filter across the entire input image, calculating a new value for each position. The collection of these calculated values forms a new matrix called an output feature map.
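We can verify the hand calculation with PyTorch, whose F.conv2d performs exactly this sliding multiply-and-sum:

import torch
import torch.nn.functional as F

image = torch.tensor([[1, 1, 1, 0, 0],
                      [0, 1, 1, 1, 0],
                      [0, 0, 1, 1, 1],
                      [0, 0, 1, 1, 0],
                      [0, 1, 1, 0, 0]], dtype=torch.float32)
kernel = torch.tensor([[1, 0, 1],
                       [0, 1, 0],
                       [1, 0, 1]], dtype=torch.float32)

# conv2d expects (batch, channels, height, width) tensors.
out = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3))
print(out[0, 0])  # the top-left value is 4., matching the hand calculation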
self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=1)
Creates the first convolutional layer with 1 input channel (grayscale), 6 output channels, and a 5x5 kernel.
self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=1)
Creates the second convolutional layer with 6 input channels (from the previous layer), 16 output channels, and a 5x5 kernel.
Two important concepts in the configuration of convolutional layers:
Padding is the process of adding layers of zeros to the input matrix. It allows the size of the input to be adjusted so that the filter fits neatly over the input data. Padding can help preserve the spatial dimensions of the input into the output, maintaining more information at the borders of the input matrix.
Stride is the number of pixels by which we slide our filter over the input matrix. When the stride is 1, we move the filters one pixel at a time. When the stride is 2, we move the filters two pixels at a time and so on. Increasing the stride makes the output feature map smaller and helps to reduce the computational cost.
Image:
[[1, 1, 1, 0, 0],
[0, 1, 1, 1, 0],
[0, 0, 1, 1, 1],
[0, 0, 1, 1, 0],
[0, 1, 1, 0, 0]]
Filter:
[[1, 0, 1],
[0, 1, 0],
[1, 0, 1]]
(0 * 1) + (0 * 0) + (0 * 1) + (0 * 0) + (1 * 1) + (1 * 0) + (0 * 1) + (0 * 0) + (1 * 1) = 2
Convolution Operation:
Initial Position: The filter is placed over the top-left 3x3 region of the padded image.
Element-wise Multiplication and Summation: We perform element-wise multiplication between the filter and the corresponding image pixels, then sum the results.
Moving the Filter: We move the filter 2 pixels to the right (stride = 2) and repeat the process.
Repeating: We continue this process, sliding the filter across the entire image (both horizontally and vertically), to calculate all the elements of the output feature map.
To determine the output size of the feature map, we use this formula:
Output size = [(Input size - Filter size + 2 * Padding) / Stride] + 1
For our example: Output size = [(5 - 3 + 2 * 1) / 2] + 1 = 3
The output feature map will be a 3x3 matrix.
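We can check the formula in code, reusing the image and kernel tensors from the earlier snippet: with padding=1 and stride=2, PyTorch indeed returns a 3x3 feature map whose top-left value is the 2 computed above.

out = F.conv2d(image.view(1, 1, 5, 5), kernel.view(1, 1, 3, 3),
               stride=2, padding=1)
print(out.shape)        # torch.Size([1, 1, 3, 3])
print(out[0, 0, 0, 0])  # tensor(2.), matching the hand calculation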
self.avgpool1 = nn.AvgPool2d(2, 2)
self.avgpool2 = nn.AvgPool2d(2, 2)
Creates the first and second average pooling layers, each with a 2x2 window and a stride of 2.
Purpose: pooling layers are introduced within CNN architectures primarily to downsample the feature maps generated by convolutional layers. This downsampling serves two main purposes (a short example follows the list):
Reducing Computational Complexity: By decreasing the spatial dimensions of the feature maps, pooling layers reduce the number of parameters and computations required in subsequent layers of the network, making the overall processing more efficient.
Increasing Robustness to Input Variations: Pooling makes the network less sensitive to small changes in the position or shape of features in the input image. This is because pooling summarizes features within a region, so even if a feature is slightly shifted, its summarized representation is likely to remain similar.
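As a minimal demonstration of the downsampling (the input values are arbitrary), a 2x2 average pooling with stride 2 halves each spatial dimension:

import torch
import torch.nn as nn

x = torch.tensor([[1., 2., 3., 4.],
                  [5., 6., 7., 8.],
                  [1., 3., 5., 7.],
                  [2., 4., 6., 8.]]).view(1, 1, 4, 4)

pool = nn.AvgPool2d(2, 2)  # 2x2 window, stride 2
print(pool(x))
# tensor([[[[3.5000, 5.5000],
#           [2.5000, 6.5000]]]])  -> the 4x4 input is reduced to 2x2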
Fully connected layers: every neuron is connected to every neuron in the preceding layer. After convolutional and pooling layers have extracted features from the input data, fully connected layers take these features and perform high-level reasoning, learning complex relationships and patterns within the extracted features.
self.fc1 = nn.Linear(16*5*5, 120)
Creates the first fully connected layer. The input size 16 * 5 * 5 is calculated from the output size of the previous convolutional and average-pooling operations.
self.fc2 = nn.Linear(120, 84)
Creates the second fully connected layer.
self.fc3 = nn.Linear(84, 10)
Creates the final fully connected layer, producing 10 outputs (for the 10 digit classes).
x = self.avgpool1(F.relu(self.conv1(x)))
x = self.avgpool2(F.relu(self.conv2(x)))
Applies the first and second convolutional layers, each followed by a ReLU activation and average pooling.
x = x.view(x.size(0), -1)
Flattens the output from the convolutional layers into a vector to be fed into the fully connected layers.
x = F.relu(self.fc1(x))
x = F.relu(self.fc2(x))
Applies the first and second fully connected layers, each followed by a ReLU activation.
x = self.fc3(x)
Applies the final fully connected layer to produce the output.
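Assembling the pieces walked through above, the whole model reads as follows; the class name LeNet and the module boilerplate are filled in by us, while the layers are exactly those shown above:

import torch.nn as nn
import torch.nn.functional as F

class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 6, kernel_size=5, stride=1, padding=1)
        self.conv2 = nn.Conv2d(6, 16, kernel_size=5, stride=1, padding=1)
        self.avgpool1 = nn.AvgPool2d(2, 2)
        self.avgpool2 = nn.AvgPool2d(2, 2)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.avgpool1(F.relu(self.conv1(x)))
        x = self.avgpool2(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x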
Forward pass: the process of feeding input data through a neural network to obtain an output prediction. It is used to generate predictions from new input data and is a fundamental step in both training and inference (using the network for prediction).
After the forward pass, the network's prediction is compared to the actual target value. The difference between the prediction and the target is used to calculate the loss, which is then used to update the network's weights using the backpropagation algorithm (another key concept).
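A minimal sketch of one such training pass in PyTorch; the loss function, optimizer, and learning rate are common defaults we assume for illustration, and trainloader is assumed to be a DataLoader over the MNIST trainset, like the one sketched earlier:

import torch.nn as nn
import torch.optim as optim

net = LeNet()                          # the model assembled above
criterion = nn.CrossEntropyLoss()      # compares predictions to target labels
optimizer = optim.SGD(net.parameters(), lr=0.01)

for images, labels in trainloader:
    optimizer.zero_grad()              # clear gradients from the previous step
    outputs = net(images)              # forward pass
    loss = criterion(outputs, labels)  # how far predictions are from targets
    loss.backward()                    # backpropagation: compute gradients
    optimizer.step()                   # update the network's weights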
class ImprovedLeNet(nn.Module):
    def __init__(self):
        super(ImprovedLeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, 1)
        self.conv2 = nn.Conv2d(16, 32, 5, 1)
Increasing the number of channels allows the network to learn more complex features from the input images, enhancing the model's ability to capture finer details and patterns.
An initial smaller kernel (3x3) is more effective at capturing fine-grained features in the early layers, while a larger kernel (5x5) is appropriate for the later layers, aggregating information from a wider receptive field.
self.pool = nn.MaxPool2d(2, 2)
Max pooling: used instead of the average pooling in our original LeNet.
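The fully connected part of ImprovedLeNet is not shown above; a completed class might look like this. The 32 * 4 * 4 input size follows from 28x28 MNIST images passing through the layers above, while the hidden sizes 120 and 84 are assumptions carried over from LeNet:

import torch.nn as nn
import torch.nn.functional as F

class ImprovedLeNet(nn.Module):
    def __init__(self):
        super(ImprovedLeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, 1)    # more channels, smaller 3x3 kernel
        self.conv2 = nn.Conv2d(16, 32, 5, 1)   # larger 5x5 kernel in the later layer
        self.pool = nn.MaxPool2d(2, 2)         # max pooling instead of average
        # 28x28 -> conv1 -> 26x26 -> pool -> 13x13 -> conv2 -> 9x9 -> pool -> 4x4
        self.fc1 = nn.Linear(32 * 4 * 4, 120)  # hidden sizes assumed, as in LeNet
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)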
MNIST Accuracy
Total mean class accuracy (original LeNet): 95.81 %
Total mean class accuracy (ImprovedLeNet): 98.21 %
The increase in accuracy shows that our improved model performs better.
CIFAR-10 Accuracy
Accuracy of plane : 72 %
Accuracy of car : 89 %
Accuracy of bird : 78 %
Accuracy of cat : 50 %
Accuracy of deer : 48 %
Accuracy of dog : 60 %
Accuracy of frog : 72 %
Accuracy of horse : 68 %
Accuracy of ship : 78 %
Accuracy of truck : 71 %
Total mean class accuracy: 68.94 %
As we did not try to improve our LeNet model for CIFAR-10, its accuracy is not high.
Useful for my daily life: this week we learned the fundamentals of image recognition. As an international student, Google Translate is a daily tool for me, and I'd like to thank the AI behind it that allows the app to recognize Chinese characters.