In this three-part blog series ("A Historical but Still Technically Basic Look at CNNs") we will briefly discuss the historical development (with some basic level of technicality) of Convolutional Neural Networks (CNNs), the most widely used deep learning networks for image-related AI tasks, starting from LeNet and branching out to popular established models such as AlexNet, VGG16/VGG19, InceptionNet, GoogLeNet, and ResNet. We will not discuss models utilizing encoder and decoder blocks, delegating those to a future blog post series. In this first blog post of the series we delve into LeNet and AlexNet.
The content of this blog post is as follows:
The Birth of CNNs
1.1 What is a Convolutional Layer?
1.2 What is a Pooling Layer?
The LeNet-5
The AlexNet
3.1 What is Weight Regularization?
3.2 What is Dropout?
3.3 What is Data Augmentation?
3.4 What is Batch Normalization?
Summary
1. The Birth of CNNs
Images (as well as videos) are 2d signals (with an additional time dimension for videos), and a computer usually 'sees' an image as a 2d numerical matrix, with the matrix size depending on the input image size. For example, a digit '3' from the MNIST dataset [1] is interpreted by a computer as a 28 × 28 matrix, with each of its pixel values ranging from 0 to 255 (0 represents black, 255 represents white, and the values in between represent the gray-level intensity).
A vanilla Neural Network (NN), one composed of a multi-layer perceptron (MLP) structure, only takes a 1d column vector as input. It cannot interpret the 2d matrix directly, so the matrix has to be converted into a long 1d vector containing all the pixel values of the image, a process known as image flattening. For example, the MNIST image '3' will be flattened into a 28 × 28 = 784 element vector, specifically of dimension 1 × 784. Based on Fig.1, the input vector will look something like
x = [0,0,0,..., 55,87,157,156,187,215,81,...,5,80,155,155,156,111,58,58,36,....,0,0,0]
where the non-zero values depict the pixel intensities from the 6th and 25th rows of Fig.1 respectively. As the vector is passed through the hidden layers/nodes and eventually the output layer/nodes, an output composed of 10 nodes is produced (corresponding to the 10 classes in MNIST). The input alone has 784 nodes, so we can already anticipate that a huge number of weights and biases would need to be learned and adjusted during the training stage.
To be more precise, let's suppose we are training a NN with 2 hidden layers, each with 512 nodes (see Fig.2). Along with the input and output nodes, there are a total of 1818 nodes. We also recall the formula WX + b, where X denotes the activations of the nodes in the previous layer, W denotes the weights on the edges between the previous and next layer, and b denotes the corresponding biases.
Since the learnable weights and biases correspond to the edges of the neural network, there are a total of (784*512 + 512) = 401,920 parameters in layer 1, (512*512 + 512) = 262,656 parameters in layer 2, and (512*10 + 10) = 5,130 parameters in the output layer. In total we have 401,920 + 262,656 + 5,130 = 669,706 trainable parameters, in agreement with the Keras model summary if we were to implement it (see Fig.3). 669,706 is a lot of parameters for a simple 28 × 28 image and for such a small network, and the number grows rapidly if we add more nodes and layers or increase the size of the image, which quickly puts a strain on computational resources.
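If you would like to verify the count yourself, here is a minimal Keras sketch of the same network (the ReLU/softmax activations are my own assumptions and do not affect the parameter count):
from keras.models import Sequential
from keras.layers import Dense
mlp = Sequential()
mlp.add(Dense(units=512, activation='relu', input_shape=(784,)))  # 784*512 + 512 = 401,920
mlp.add(Dense(units=512, activation='relu'))                      # 512*512 + 512 = 262,656
mlp.add(Dense(units=10, activation='softmax'))                    # 512*10  + 10  =   5,130
mlp.summary()                                                     # Total params: 669,706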
Furthermore, MLPs have no knowledge of how the visual features in the image are spatially related within the 2d array, which is essential for recognizing patterns in the image. Flattening the 2d array into a 1d vector therefore leads to a loss of spatial information.
To address these two issues, Convolutional Neural Networks (CNNs) were invented. A CNN, unlike a vanilla NN, can accept the 2d image matrix as input and extract the patterns inherent in the features of the image, allowing spatial relationships between the pixels to be taken into account and hence enabling effective learning in image-based tasks. The high-level architecture of a CNN comprises mainly 4 components:
The input layer
The convolutional layers for feature extraction
Fully Connected layers (FCN) for classification
The output prediction layer
These are illustrated in Fig.4. In the feature extraction stage involving the convolutional layers, note that each layer produces feature maps containing specific features extracted by that layer. The image dimensions decrease while the number of feature maps increases as more layers are passed through, and eventually a long array of extracted features is obtained at the last layer of the feature extraction component. The deeper the convolutional layers go, the more abstract the learned features become: you can think of the first layer as extracting basic contours and line features from the input image, with subsequent layers extracting the more complex higher-order patterns formed by specific configurations of those lines and contours ("patterns within patterns"). The flattened feature vector is then fed to the fully connected layers to classify the extracted features of the image. Lastly, in the prediction stage, the output node that represents the correct class is fired by the neural network.
Apart from the convolution and the fully connected layers, there is another important layer commonly used in CNN: the pooling layer. We will cover a bit about the convolution and the pooling layer in the next subsections.
Fig.1: The MNIST digit '3' is interpreted by the computer as a matrix of 28 rows and 28 columns, with each matrix element corresponding to a pixel intensity in the range 0-255. 0 means the pixel is black, while non-zero values denote increasing gray-level intensity, with 255 representing white. The image is taken from Fig. 3.2 of chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems" [2].
Fig.2: A simple 4-layer neural network comprising one input layer, 2 hidden layers, and 1 output layer. The image is taken from Fig. 3.3 of chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Fig.3: Keras model summary for a 4-layer simple neural network with two hidden layers. The image is taken from Fig. 3.4 from chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Fig.4: The main components of a simple CNN model. The image is taken from Fig. 3.10 from chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
1.1. What is a Convolutional Layer?
A convolutional layer utilizes a feature-finding window (the convolution filter window) that slides over the image pixel by pixel to extract meaningful features that identify the objects in the image. By sliding the filter over the input image, the network breaks the image into little chunks and processes those chunks individually to assemble the modified image, which is the feature map. For a 3 × 3 convolutional filter, the entire convolution operation is succinctly demonstrated in Fig.5 below.
The convolutional filter is also known as a kernel, and in many research papers on CNNs the two terms are used interchangeably. The matrix elements of the convolution filter serve as the weights of the CNN model. The filter slides over the whole image, and each time the corresponding pixel values are multiplied element-wise with the corresponding kernel values before being summed together, culminating in a new (convolved) image with new pixel values (see Fig.5). Hence the convolution operation (*) is nothing but the weighted sum of the filter components and the components of the receptive field of the input image, plus a bias:
X * W + b = x1·w1 + x2·w2 + x3·w3 + ... + xn·wn + b (1)
where xi are the components of the receptive field of the input image, wi are the corresponding components of the filter (weight) matrix, and b is the bias. The weight values are randomly initialized at the beginning of training and the optimal values are learned throughout.
Fig.5: A simple illustration of a convolution operation using a 3 × 3 filter or kernel. The image is taken from Fig. 3.13 from chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
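To make equation (1) and Fig.5 concrete, here is a tiny NumPy sketch (the pixel and filter values are made-up numbers, not taken from the figure) that computes a single output pixel of the feature map:
import numpy as np
receptive_field = np.array([[10, 10,  0],     # a 3 x 3 patch of input pixel values
                            [10,  0,  0],
                            [10, 10, 10]])
conv_filter = np.array([[1, 0, -1],           # a 3 x 3 filter; its entries are the learnable weights
                        [1, 0, -1],
                        [1, 0, -1]])
bias = 0
# Element-wise multiply the patch with the filter, sum everything up, then add the bias (equation 1)
output_pixel = np.sum(receptive_field * conv_filter) + bias
print(output_pixel)   # 20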
Apart from the weights and biases of the CNN, there are also hyperparameters associated with the convolution layer that have to be decided manually before the training process, as they cannot be learned by the model. The four hyperparameters are filters, kernel size, stride, and padding.
Filters: The number of convolutional filters in each layer.
Kernel size: The size or dimensionality of the convolution filter.
Stride: The amount by which the filter slides over the image. A stride value of 1 or 2 means that the convolution filter is slid over the image one or two pixels at a time respectively. Strides of 3 or more are rare in practice.
Padding: Allows us to preserve the spatial size of the input, so that the input and output width and height are the same, by adding zeros around the border of the image (zero-padding). In this way, deeper convolutional layers can be built more easily since the height and width of the inputs/outputs do not shrink rapidly.
Ultimately, the role of both strides and padding is to retain important details of the image features and transfer them to the next layer. They can also selectively neglect some of the image's spatial information for a more computationally affordable model. Still, as more sophisticated CNN models are proposed to effectively extract more essential features of the image, the number of parameters required would eventually get very large and put a huge strain on the available computational resources. This is where the pooling layer comes in handy.
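As a side note (this formula is not given in the text above, but it is the standard relation), the output width/height of a convolution is ⌊(n + 2p − f)/s⌋ + 1 for input size n, kernel size f, stride s, and padding p; a tiny Python helper makes this concrete:
def conv_output_size(n, f, s, p):
    # floor((n + 2p - f) / s) + 1: standard output-size formula for a convolution
    return (n + 2 * p - f) // s + 1
print(conv_output_size(n=28, f=3, s=1, p=1))   # 28: 'same'-style padding with stride 1 keeps the size
print(conv_output_size(n=28, f=3, s=2, p=0))   # 13: stride 2 with no padding roughly halves the size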
The convolutional layer and the relevant hyperparameters as described above can all be implemented within one line of Keras code:
from keras.layers import Conv2D
model.add(Conv2D(filters=16, kernel_size=2, strides=1, padding='same', activation='relu'))
where the ReLU activation function is utilized. The hyperparameter values shown are for illustration only; feel free to use different filters, kernel sizes, strides, and padding types as appropriate for your CNN model.
1.2. What is a Pooling Layer?
A pooling layer (or subsampling layer) serves to reduce the spatial dimensionality of the feature maps, and hence the number of values passed to the subsequent convolution layers. This reduces the computational complexity of the model. In many proposed CNN architectures it is common to see a pooling layer added after every one or two convolutional layers. In general there are two main types of pooling operations:
Max Pooling: Utilizes a kernel window similar to the convolution kernel that slides over the pixel matrix and selects the max pixel value in that window to be passed to the next convolution layer (see the top diagram of Fig.6).
Average Pooling: Computes the average of the pixel values in the pooling window; in its global form (shown in the bottom diagram of Fig.6), the average is taken over the entire feature map, so the reduction in dimensionality is more drastic than that of max pooling since no sliding window or stride is involved.
We can think of a pooling operation as an image compression scheme: Although pooling reduces the dimensionality of the feature maps, the essential important features are retained while the image resolutions are reduced.
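As a rough NumPy sketch (the feature-map values are made up), max pooling with a 2 × 2 window and stride 2 collapses each non-overlapping 2 × 2 block of the feature map to its maximum value:
import numpy as np
feature_map = np.array([[1, 3, 2, 1],
                        [4, 6, 5, 0],
                        [2, 1, 9, 8],
                        [0, 3, 4, 7]])
# Group the 4 x 4 map into 2 x 2 blocks and keep only the maximum of each block
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)   # [[6 5]
                #  [3 9]]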
However, there is interesting work that suggests doing away with pooling altogether, proposing instead careful tuning of the strides and padding in the model. See "Striving for Simplicity: The All Convolutional Net" by Springenberg et al. [3].
Similar to the convolutional layer, the pooling layer can also be implemented using a line of Keras code (remember to import MaxPooling2D from keras.layers):
from keras.layers import MaxPooling2D
model.add(MaxPooling2D(pool_size=(2, 2), strides = 2))
Fig.6: The mechanism of max pooling (top) and average pooling (bottom). The images are taken from Fig. 3.21 and Fig.3.23 from chapter 3 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
2. The LeNet-5
Now that we know what convolution and pooling layers are, we are ready to discuss one of the earliest CNNs, introduced by LeCun et al. [4]: LeNet-5.
The architecture is depicted in the top diagram of Fig.7 and is composed of 5 weight layers (3 convolutional layers denoted by C and 2 fully connected layers denoted by F) and 2 pooling layers denoted by S (which are not weight layers since they contain no weights). Note that the tanh activation function was utilized instead of the ReLU that is more common today, as the latter had not yet come into widespread use in the late 1990s.
Here is a quick breakdown of the architecture for its subsequent implementation in Keras (see bottom diagram of Fig.7):
The first convolutional layer has 6 filters, the second has 16 filters, and the third has 120 filters. A stride of 1 is utilized in all three convolutional layers. The original paper specified a kernel size of 5 × 5.
An average pooling layer was inserted after each of the first two convolutional layers. The pooling size (or receptive field f in bottom diagram of Fig.7) is 2 × 2.
The tanh activation function was utilized as mentioned earlier.
LeNet-5 is actually a small network by today's standards (requiring only 61,706 parameters, see Fig.8), whereas modern networks often require millions to billions of parameters. LeNet-5 was also one of the first networks utilized for handwritten digit and Optical Character Recognition (OCR).
The main body of the LeNet-5 code in Keras is as shown below:
from keras.models import Sequential
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense
model = Sequential()
# The C1 Convolutional Layer
model.add(Conv2D(filters = 6, kernel_size = 5, strides = 1, activation = 'tanh', input_shape = (28,28,1), padding = 'same'))
# S2 Pooling Layer
model.add(AveragePooling2D(pool_size = 2, strides = 2, padding = 'valid'))
# C3 Convolutional Layer
model.add(Conv2D(filters = 16, kernel_size = 5, strides = 1,activation = 'tanh', padding = 'valid'))
# S4 Pooling Layer
model.add(AveragePooling2D(pool_size = 2, strides = 2, padding = 'valid'))
# C5 Convolutional Layer
model.add(Conv2D(filters = 120, kernel_size = 5, strides = 1,activation = 'tanh', padding = 'valid'))
model.add(Flatten())
# FC6 Fully Connected Layer
model.add(Dense(units = 84, activation = 'tanh'))
# FC7 Output layer with softmax activation
model.add(Dense(units = 10, activation = 'softmax'))
model.summary() # For printing the summary shown in Fig.8.
Fig.7: The overall architecture of LeNet-5 represented in two illustration variations. The images are taken from Fig. 5.3 and 5.4 of chapter 5 of Mohamed Elgendy's book "Deep Learning for Vision Systems" respectively.
Fig.8: Keras model summary for the LeNet-5 model. The image is taken from Fig. 5.5 from chapter 5 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Additional Information
The authors trained the LeNet-5 model for 20 epochs, using a learning rate of 0.0005 for the first two epochs, 0.0002 for the next three epochs, 0.00005 for the next four, and 0.00001 thereafter. This is known as scheduled learning rate decay; it helps the algorithm converge faster and to a more optimal solution. The schedule can be written as a simple Python function:
def lr_schedule(epoch):
    if epoch <= 2:
        lr = 5e-4
    elif epoch > 2 and epoch <= 5:
        lr = 2e-4
    elif epoch > 5 and epoch <= 9:
        lr = 5e-5
    else:
        lr = 1e-5
    return lr
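On its own, lr_schedule only defines the schedule. One way to apply it during training (this hookup is my own addition and is not shown in the book) is to wrap it in Keras' LearningRateScheduler callback and pass it to model.fit:
from keras.callbacks import LearningRateScheduler
lr_scheduler = LearningRateScheduler(lr_schedule)   # calls lr_schedule(epoch) at the start of each epoch
# then pass callbacks=[lr_scheduler] to the model.fit call below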
The batch size utilized is 32, the optimizer is Stochastic Gradient Descent (SGD), and the categorical cross-entropy loss function was selected since the MNIST dataset comprises more than 2 classes. I may devote a future blog post to explaining the relevant concepts used during training.
model.compile(loss='categorical_crossentropy', optimizer='sgd', metrics=['accuracy'])
hist = model.fit(X_train, y_train, batch_size=32, epochs=20, validation_data=(X_test, y_test), verbose=2, shuffle=True)
####### To perform inference on the trained model ############
pred = model.predict(X_test)
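Note that model.predict returns a vector of 10 softmax probabilities for each test image; to obtain the predicted digit labels, one can take the index of the largest probability. A small usage sketch (assuming pred from the line above):
import numpy as np
predicted_labels = np.argmax(pred, axis=1)   # index of the most probable class for each test image
print(predicted_labels[:10])                 # predicted digits for the first 10 test images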
3. The AlexNet
Although LeNet-5 performed very well on the MNIST dataset, the latter is a grayscale dataset containing only 10 classes (the digits 0-9). AlexNet was designed by Krizhevsky et al. [5] and won the ILSVRC image classification competition in 2012. It is widely considered the first deep CNN, and it opened the path for the computer vision community to take seriously the power and usefulness of convolutional networks. The aim of AlexNet was to build a bigger (and deeper) network that could learn more complex functions.
AlexNet and LeNet-5 are similar in terms of building-block design, but the former has more filters per layer and more hidden (deep) layers (see Fig.9). In fact it has 5 convolutional layers, 3 pooling layers, and 3 fully connected layers. Unlike LeNet-5, AlexNet requires about 60 million parameters, giving the model a larger learning capacity to extract and analyze more complex features.
Fig.9: The overall architecture of the AlexNet. The image is taken from Fig.5.6 from chapter 5 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Here is a quick breakdown of the AlexNet architecture for its subsequent implementation in Keras:
The first convolutional layer utilized a kernel size of 11 × 11, the second convolutional layer utilized a kernel size of 5 × 5, and the subsequent convolutional layers utilized a kernel size of 3 × 3. A stride of 4 is utilized in the first convolutional layer, while the remaining ones utilize a stride of 1.
The first and second convolutional layers utilize 96 and 256 filters respectively, the third and fourth convolutional layers both utilize 384 filters, and the fifth convolutional layer utilizes 256 filters.
Each pooling window is of size 3 × 3 (with stride 2), with the first pooling downsizing the feature maps from 55 × 55 to 27 × 27, the second from 27 × 27 to 13 × 13, and the last from 13 × 13 to 6 × 6.
The first two fully connected layers have 4096 neurons each, with the last layer having 1000 neurons (since the model was trained on the 1000 classes of the ImageNet dataset).
The ReLU activation function was utilized throughout, with the softmax function implemented for the final layer. Using the ReLU function helps mitigate the vanishing gradient problem.
Additionally, AlexNet utilized dropout layers, weight regularization, data augmentation, and batch normalization. We go through each of these briefly before showcasing the main code for the AlexNet model.
3.1. What is Weight Regularization?
Simply put, weight regularization is one of the regularization techniques in deep/machine learning used to reduce overfitting, a situation whereby the model performs very well on training data (seen by the model) but generalizes poorly to testing data (not seen by the model and needing to be inferred). Overfitting usually implies that the model is overly complex and requires simplification. Weight regularization achieves this by adding an extra term to the original training loss function that penalizes large weights, hence reducing the weight values of the hidden layers and in turn simplifying the model. Mathematically,
new error function = old error function + regularization term (2)
with the exact mathematical form of the regularization term not being that significant for the current discussion. Still, we can see how adding the regularization term helps reduce overfitting: the derivative of the new error function with respect to the weights picks up an extra contribution from the regularization term, so the weight update
W_new = W_old − α ∂(Error)/∂W (3)
pulls the weights toward smaller values, which in turn reduces the magnitude of the weights and simplifies the model.
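For concreteness (this specific form is my addition here, though it is the standard choice and is what the l2 regularizer in the Keras snippet below corresponds to), the most common regularization term is the L2 penalty, the sum of squared weights scaled by a hyperparameter λ:
new error function = old error function + λ Σi wi²
Its derivative with respect to each weight is 2λwi, which is exactly the extra pull toward zero described above: the larger a weight grows, the more strongly it is pushed back down.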
Despite its slight mathematical intricacies, weight regularization (and most of the other regularization techniques below) can be implemented with one line of Keras code, either inside the Conv2D layers or the Dense layers:
from keras import regularizers
lam = 0.0005   # the regularization strength λ
model.add(Dense(units=16, kernel_regularizer=regularizers.l2(lam), activation='relu'))
model.add(Conv2D(32, (3,3), kernel_regularizer=regularizers.l2(lam)))
Here lam denotes the regularization hyperparameter λ to be tuned. The original authors of AlexNet found that a value of 0.0005 suffices for model learning.
3.2. What is Dropout?
Dropout is another regularization technique. It involves temporarily ignoring (deactivating) neurons with a probability p at each training iteration. This probability is called the dropout rate, and it is yet another hyperparameter that needs to be set beforehand (in the literature it is typically set in the range 0.3 to 0.5). Dropout leads to a simpler and more robust model, as the neural network needs to learn new paths of information propagation at each training step; more specifically, the network that the model 'sees' at each training step is not the same due to the deactivation of some neurons (see Fig.10).
Dropout layers can be implemented using one line of Keras code as follows (remember to import Dropout from keras.layers):
from keras.layers import Dropout
model.add(Dropout(0.3))
Fig.10: Comparison of a neural network with (left) and without (right) dropout layers. The NN with dropout layers leads to a more robust model, as the model is exposed to a new configuration of the network architecture at every training epoch. The image is adapted from https://vitalitylearning.medium.com/understanding-dropout-a-key-to-preventing-overfitting-in-neural-networks-21b28dd7c9b1.
3.3. What is Data Augmentation?
Another method to prevent overfitting is to obtain more data, which may not always be feasible due to the difficulty of obtaining certain domain-specific data, such as images of disaster aftermaths [6] or rare diseases [7]. We can instead generate new instances of the same images with some transformation techniques, a process known as data augmentation. Some augmentation techniques include flipping, rotation, scaling, Gaussian blur, zooming, brightness adjustment, colour jitter, etc. See Fig.11 for the effect of some augmentation techniques on the digit '6' in the MNIST dataset.
Data augmentation can be considered a regularization technique because the model's exposure to many variants of the image reduces its dependence on the original form of the image, making it more robust and hence able to generalize better to unseen test data.
Some useful data augmentations are implemented below in Keras:
import keras
from keras import layers
data_augmentation = keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.4),
    layers.RandomZoom(0.3)
])
In the above we have implemented a random flip that flips the input images horizontally and/or vertically at random, a random rotation of up to a factor of 0.4 of a full circle (in either direction), and a random zoom of up to 30%. Note that the factors 0.4 and 0.3 are the ranges of the transformations, not probabilities. Most of the data augmentation operations can be found in the keras.layers package.
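One way to use data_augmentation (a usage sketch of my own, assuming the Keras preprocessing-layer API above) is to place it as the first block of a model, so the random transforms are applied on the fly to each training batch and are automatically switched off at inference time:
augmented_model = keras.Sequential([
    data_augmentation,                                        # random flip / rotation / zoom on each training batch
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation='softmax')
])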
Fig.11: Some illustrations of image augmentation techniques as applied to the digit '6' in MNIST. The image is extracted from Fig.4.27 from Chapter 4 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
3.4. What is Batch Normalization?
For a majority of deep learning training, data normalization (normalizing input pixel values to lie between 0 and 1) serves as an important preprocessing step that ensures faster convergence and enhanced learning performance. It turns out that an analogous type of normalization can also be applied to the extracted features in the hidden layers themselves (see Fig.12). This is known as Batch Normalization, first introduced by Ioffe and Szegedy in their 2015 paper "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" [8]. Simply put, it was designed to reduce the internal covariate shift in the neural network, which is the change in the distribution of the activations of the shallower hidden layers, as seen from the perspective of the deeper hidden layers, while the parameters adjust themselves during learning. Batch Normalization performs the following three operations before the activation function of each layer:
Zero-center the inputs,
Normalize the zero-centered inputs,
Scale and shift the results.
The last step introduces two additional parameters γ and β, which are learned during training. In the first step, the mean of the mini-batch inputs is calculated along with the corresponding variance; in the second step, the inputs are zero-centered and normalized using these statistics; and in the third step, each normalized output x̂i is multiplied by the scaling factor γ and then shifted by adding β.
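To get a feel for the mechanics, here is a rough NumPy sketch of the three steps applied to a single mini-batch of activations (the γ, β, and ε values are illustrative assumptions; in the actual layer γ and β are learned and a running mean/variance is kept for inference):
import numpy as np
x = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                  # mini-batch of 3 samples with 2 features each
mu = x.mean(axis=0)                         # mini-batch mean, used to zero-center the inputs (step 1)
var = x.var(axis=0)                         # mini-batch variance, used to normalize (step 2)
x_hat = (x - mu) / np.sqrt(var + 1e-5)      # steps 1 and 2: zero-center, then normalize (epsilon avoids division by zero)
gamma, beta = 1.0, 0.0                      # step 3: scale by gamma and shift by beta
y = gamma * x_hat + beta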
Fig.13 depicts the exact mathematical steps of batch normalization. The good news is that despite the slightly more complicated mechanism and mathematical intuition behind it, batch normalization can be executed using one line of Keras code; just remember to import BatchNormalization from keras.layers:
from keras.layers import BatchNormalization
model.add(BatchNormalization())
Fig.12: Thinking of Batch normalization as "normalization of the extracted features in the intermediate hidden layers of the neural network". The image is extracted from Fig.4.28 from Chapter 4 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Fig.13: The exact mathematical machinery of batch normalization. It is extracted from Chapter 4 of Mohamed Elgendy's book "Deep Learning for Vision Systems".
Finally, the main code body of the AlexNet is depicted below:
from keras.models import Sequential
from keras.regularizers import l2
from keras.layers import Conv2D, AveragePooling2D, Flatten, Dense, Activation,MaxPool2D, BatchNormalization, Dropout
###################### Start of model code ##############################
model = Sequential()
# ########### 1st layer (CONV + pool + batchnorm) ########################
model.add(Conv2D(filters= 96, kernel_size= (11,11), strides=(4,4), padding='valid', input_shape = (227,227,3)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2)))
model.add(BatchNormalization())
################# 2nd layer (CONV + pool + batchnorm) ##################
model.add(Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2), padding='valid'))
model.add(BatchNormalization())
############## 3rd layer (CONV + batchnorm) #####################
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
################## 4th layer (CONV + batchnorm) ####################
model.add(Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
############### 5th layer (CONV + batchnorm) #######################
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), padding='same', kernel_regularizer=l2(0.0005)))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(MaxPool2D(pool_size=(3,3), strides=(2,2), padding='valid'))
model.add(Flatten())
################## 6th layer (Dense layer + dropout) ####################
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
######################## 7th layer (Dense layers) ####################
model.add(Dense(units = 4096, activation = 'relu'))
model.add(Dropout(0.5))
################## 8th layer (softmax output layer) ####################
model.add(Dense(units = 1000, activation = 'softmax'))
model.summary() # Print summary of model.
Additional Information
AlexNet was trained for 90 epochs, with a batch size of 128, and the initial learning rate was set to 0.01 (with a momentum of 0.9). When the validation error stopped improving, the learning rate was divided by 10 (i.e. reduced to 0.001). Once again Stochastic Gradient Descent (SGD) was utilized as the optimizer and the categorical cross-entropy loss was used.
from keras.callbacks import ReduceLROnPlateau
from keras.optimizers import SGD
import numpy as np
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=np.sqrt(0.1))
optimizer = SGD(learning_rate=0.01, momentum=0.9)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.fit(X_train, y_train, batch_size=128, epochs=90, validation_data=(X_test, y_test), verbose=2, callbacks=[reduce_lr])
Finally, to perform inference on the trained AlexNet on the test dataset:
predicted = model.predict(X_test)
4. Summary
In this first blog post of the series we introduced some problems with flattening an image into a 1d vector and passing it as input to a vanilla neural network, motivating the introduction of Convolutional Neural Networks (CNNs). We provided a quick introduction to the two main building blocks of CNNs, the 2D convolutional layer and the pooling layer, before delving into the LeNet-5 architecture and showcasing how it can be implemented in Keras. Finally, we visited AlexNet as a deeper CNN that can extract more complex features and has an enhanced learning capacity requiring millions of parameters, and introduced some regularization techniques it utilizes, namely weight regularization, dropout, data augmentation, and batch normalization. All of the concepts introduced so far have a simple Keras implementation.
References
[1] Y. LeCun, “The MNIST database of handwritten digits,” http://yann.lecun.com/exdb/mnist/, 1998.
[2] M. Elgendy, Deep learning for vision systems. Manning, 2020.
[3] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, vol. 25, 2012.
[6] F. Alam, F. Ofli, M. Imran, T. Alam, and U. Qazi, “Deep learning benchmarks and datasets for social media image classification for disaster response,” in 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM). IEEE, 2020, pp. 151–158.
[7] Y. Chen, X. Guo, Y. Pan, Y. Xia, and Y. Yuan, “Dynamic feature splicing for few-shot rare disease diagnosis,” Medical Image Analysis, vol. 90, p. 102959, 2023.
[8] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in International conference on machine learning. pmlr, 2015, pp. 448–456.