In this three-part blog series ("A Historical but Still Technically Basic Look at CNNs") we briefly discuss the historical development (with some basic level of technicality) of Convolutional Neural Networks (CNNs), the most widely used deep learning architecture for image-related AI tasks, starting from LeNet and branching out to popular established models such as AlexNet, VGG16/VGG19, InceptionNet, GoogLeNet and ResNet. We will not discuss models built from encoder and decoder blocks, leaving those to a future blog post series. In this second blog post of the series we delve into VGG16/19, InceptionNet and GoogLeNet.
The contents of this blog are as follows:
1) VGG16 and VGG19
2) InceptionNet and GoogLeNet
3) Summary
1) VGG16 and VGG19
The VGGNet series of CNNs was developed by the Visual Geometry Group at Oxford University in 2014 [1]. It introduced no new components, just more convolutional, pooling and dense (fully connected) layers than AlexNet (see the previous blog post). VGG16 has a total of 16 weight layers (13 convolutional layers and 3 fully connected layers). However, unlike AlexNet, it has a simpler and more uniform architecture, with the convolutional and pooling layers sharing the same hyperparameters. Specifically:
All convolutional layers have a kernel size of 3 x 3, a stride of 1, and 'same' padding throughout.
All pooling layers have a size of 2 x 2 and a stride of 2.
Using the 3 x 3 kernel consistently allowed the model to extract finer details of the input image's features compared to AlexNet, which used larger kernels of 11 x 11 and 5 x 5. Stacking small-kernel convolutional layers repeatedly increases the depth of the network while using fewer trainable parameters, allowing more complex features to be learned at a lower computational cost. In fact, placing more 3 x 3 convolutional layers in sequence inserts more non-linear activations (ReLU) along the way and makes the decision function more discriminative. A quick back-of-the-envelope comparison is shown below.
The VGGNet architecture consists of stacked 3 x 3 convolutional layers with 2 x 2 pooling layers inserted after every few convolutional layers. Three fully connected layers follow, with the last one using a softmax activation function. This is shown in the top figure of Fig.1.
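To see why this matters for the parameter count, here is a minimal sketch (the channel count C = 64 is purely an illustrative assumption, not a value from the VGG paper): two stacked 3 x 3 convolutions cover the same 5 x 5 receptive field as a single 5 x 5 convolution while needing fewer weights.
# Rough weight counts (ignoring biases) for layers mapping C channels to C channels.
C = 64  # hypothetical channel count, chosen only for illustration

# A single 5 x 5 convolution covering a 5 x 5 receptive field.
params_single_5x5 = 5 * 5 * C * C       # 25 * C^2 = 102,400 for C = 64

# Two stacked 3 x 3 convolutions covering the same 5 x 5 receptive field.
params_two_3x3 = 2 * (3 * 3 * C * C)    # 18 * C^2 = 73,728 for C = 64

print(params_single_5x5, params_two_3x3)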
Fig.1: (Top) The architecture for the VGG16 network, indicating the kernel size for each convolution and the number of neurons for the fully connected layers (4096) and the softmax layer (1000). (Bottom) The same architecture, but with the feature maps at each stage shown more explicitly. The top figure is from Fig 5.8 of "Deep Learning for Vision Systems" by Mohamed Elgendy [2], and the bottom figure is from https://lekhuyen.medium.com/an-overview-of-vgg16-and-nin-models-96e4bf398484.
VGG16 has a total of roughly 138 million parameters (138,357,544 to be exact, see Fig.2). In addition, two regularization techniques were used: L2 weight regularization and dropout regularization (please see the first blog post for more detail). Specifically, a weight decay value of 5 x 10^{-4} was used for the weight regularization, and the dropout rate was set at 0.5 in the first two fully connected layers.
The model was trained using a mini-batch gradient descent optimizer with a momentum of 0.9, with the learning rate set to 0.01 and decreased by a factor of 10 whenever the validation accuracy stopped improving. The loss function was categorical cross-entropy, with a batch size of 256, and training ran for 74 epochs.
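As a minimal sketch of how these settings map onto Keras (an illustration, not the authors' original training script), the weight decay can be attached to individual layers via kernel_regularizer, and the dropout placed after the first two fully connected layers:
from tensorflow.keras.layers import Conv2D, Dense, Dropout
from tensorflow.keras.regularizers import l2

weight_decay = 5e-4  # the L2 weight decay value reported for VGG

# Example layers with L2 regularization attached to their weights.
conv_regularized = Conv2D(64, (3, 3), padding='same', activation='relu',
                          kernel_regularizer=l2(weight_decay))
dense_regularized = Dense(4096, activation='relu',
                          kernel_regularizer=l2(weight_decay))
drop_after_dense = Dropout(0.5)  # applied after each of the first two dense layers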
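A hedged sketch of this training configuration in Keras is shown below; here model refers to the VGG16 network built in the listing that follows, X_train/y_train and X_val/y_val are placeholders for the dataset, and ReduceLROnPlateau (with an assumed patience of 5 epochs) stands in for the original manual learning-rate schedule.
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Mini-batch gradient descent with momentum, as described above.
sgd = SGD(learning_rate=0.01, momentum=0.9)

# Divide the learning rate by 10 when validation accuracy stops improving.
reduce_lr = ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=5)

# model.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
# model.fit(X_train, y_train, batch_size=256, epochs=74,
#           validation_data=(X_val, y_val), callbacks=[reduce_lr])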
The Keras implementation of the main VGG16 network architecture (cross-check the code with Fig.1) is shown below.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.layers import Dropout, Dense, Flatten
############################ The VGG16 architecture ###############################
model = Sequential()
# block #1
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same', input_shape=(224,224, 3)))
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #2
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #3
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #4
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #5
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #6 (classifier)
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
### Summary of VGG16 ###
model.summary() # See Fig.2.
Fig.2: The Keras summary table for the VGG16 model. The total number of parameters is 138,357,544.
VGG19 has a similar architecture to that of VGG16, but consists of 19 weight layers (16 convolutional layers and 3 fully connected layers). The total number of trainable parameters is hence larger, specifically 143,667,240 (see Fig.3). The comments below mark the modifications to the VGG16 code, which consist of three additional convolutional layers: one with 256 filters and two with 512 filters.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D
from tensorflow.keras.layers import Dropout, Dense, Flatten
############################ The VGG19 architecture ###############################
model = Sequential()
# block #1
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same', input_shape=(224,224, 3)))
model.add(Conv2D(filters=64, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #2
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #3
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
# The additional convolutional layer 1
model.add(Conv2D(filters=256, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #4
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
# The additional convolutional layer 2
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #5
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
# The additional convolutional layer 3
model.add(Conv2D(filters=512, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same'))
model.add(MaxPool2D((2,2), strides=(2,2)))
# block #6 (classifier)
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(4096, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1000, activation='softmax'))
### Summary of VGG19 ###
model.summary() # See Fig.3
Fig.3: The Keras summary table for the VGG19 model. The total number of parameters is 143,667,240.
2) InceptionNet and GoogLeNet
The Inception Network (InceptionNet) was proposed in 2014 by a Google research group ("Going Deeper with Convolutions") [3], with the aim of building an even deeper CNN while using fewer computational resources. GoogLeNet is the version of the Inception Network that was entered in ILSVRC 2014. It used 22 layers, deeper than VGG16/19, while using about 12 times fewer parameters (from ~138 million in VGGNet to ~13 million) and achieving significantly more accurate results. The proposed network introduced a new element known as the inception module.
The (naive) inception module is composed of several convolutional layers with different kernel sizes. Specifically, it is a concatenation of four parallel branches (1 x 1, 3 x 3 and 5 x 5 convolutional layers plus one 3 x 3 max pooling layer). This is illustrated in Fig.4. The 1 x 1 convolutional layer has 64 filters, the 3 x 3 convolutional layer has 128 filters, the 5 x 5 convolutional layer has 32 filters, and the max pooling layer has a stride of 1. All the convolutional layers and the max pooling layer use 'same' padding.
The concatenated output (with a total of 64 + 128 + 32 + 32 = 256 filters) is passed to the next layer for processing. Specifically, the authors designed the inception network to closely mimic the classical CNN feature-extractor architecture, but replaced some convolutional layers with inception modules, as shown in Fig.5.
The naive inception module can be implemented in Keras as follows (the snippet assumes a 224 x 224 x 3 image as input, but remember that within the network the inception module actually takes intermediate feature maps as its input):
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Concatenate
inputs_prev = Input((224,224,3))
# 1 x 1 convolution layer
conv1 = Conv2D(filters=64, kernel_size=(1,1), strides=(1,1), activation='relu', padding='same')(inputs_prev)
# 3 x 3 convolution layer
conv2 = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same')(inputs_prev)
# 5 x 5 convolution layer
conv3 = Conv2D(filters=32, kernel_size=(5,5), strides=(1,1), activation='relu', padding='same')(inputs_prev)
# Max Pooling layer
MP1 = MaxPool2D((3,3), strides=(1,1), padding='same')(inputs_prev)
# Concatenate
concat = Concatenate(axis=-1)([conv1, conv2, conv3, MP1])
naive_inception_module = Model(inputs=[inputs_prev], outputs=[concat])
naive_inception_module.summary() # See Fig.6
As you might have observed, the inception module described above is "naive". Why is that? Because the module incurs a large computational cost when used within the full inception network. To address this, a dimensionality reduction technique can be applied: a set of 1 x 1 convolutional layers is added alongside the other components, as shown in Fig.7. This layer is placed before the convolutional layers with larger kernels (the 3 x 3 and 5 x 5 layers) so as to reduce the number of computational operations. With this modification, we can see even more clearly that the inception module handles multi-scale feature extraction, processing and aggregation, which allows the next layer to abstract features from the different scales simultaneously. Many computer vision papers have used multi-scale feature aggregation for their respective tasks (see examples in detection [4], classification [5] and denoising [6]).
More specifically, the 1 x 1 convolution layer is also known as the bottleneck layer (see Fig.8).
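To make the computational saving concrete, here is a rough multiply-count comparison for a 5 x 5 convolution with and without the 1 x 1 bottleneck (the 28 x 28 x 192 input size and the 16/32 filter counts are illustrative assumptions in the spirit of the example in Elgendy's book, not values taken from the GoogLeNet paper):
# Multiply counts for producing a 28 x 28 x 32 output from a 28 x 28 x 192 input.
H, W, C_in = 28, 28, 192

# Direct 5 x 5 convolution with 32 filters.
direct_ops = (H * W * 32) * (5 * 5 * C_in)        # ~120.4 million multiplies

# 1 x 1 bottleneck down to 16 channels, followed by the 5 x 5 convolution.
bottleneck_ops = (H * W * 16) * (1 * 1 * C_in)    # ~2.4 million multiplies
conv_ops = (H * W * 32) * (5 * 5 * 16)            # ~10.0 million multiplies

print(direct_ops, bottleneck_ops + conv_ops)      # roughly a 10x reduction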
The actual inception module can now be implemented in Keras as follows (the comments indicate the additions relative to the naive inception module):
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, Concatenate
inputs_prev = Input((224,224,3))
# The 1 x 1 convolution layer (No change)
conv1 = Conv2D(filters=64, kernel_size=(1,1), strides=(1,1), activation='relu', padding='same')(inputs_prev)
# The 3 x 3 convolution layer (1 x 1 convolution added a priori)
conv2_add = Conv2D(filters=16, kernel_size=(1,1), strides=(1,1), activation='relu', padding='same')(inputs_prev)
conv2 = Conv2D(filters=128, kernel_size=(3,3), strides=(1,1), activation='relu', padding='same')(conv2_add)
# The 5 x 5 convolution layer (1 x 1 convolution added a priori)
conv3_add = Conv2D(filters=16, kernel_size=(1,1), strides=(1,1), activation='relu', padding='same')(inputs_prev)
conv3 = Conv2D(filters=32, kernel_size=(5,5), strides=(1,1), activation='relu', padding='same')(conv3_add)
# Max Pooling layer (1 x 1 convolution added after it)
MP1 = MaxPool2D((3,3), strides=(1,1), padding='same')(inputs_prev)
MP1_add = Conv2D(filters=16, kernel_size=(1,1), strides=(1,1), activation='relu', padding='same')(MP1)
# Concatenate
concat = Concatenate(axis=-1)([conv1, conv2, conv3, MP1_add])
inception_module = Model(inputs=[inputs_prev], outputs=[concat])
inception_module.summary() # See Fig.9
One might think that shrinking the dimensionality of the feature maps would adversely affect the model's performance. Szegedy et al. ran multiple experiments with their proposed model and found that, as long as the reduction layers are used in moderation, the dimensionality can be reduced without drastically affecting performance while improving computational efficiency.
As mentioned earlier, GoogLeNet is simply a specific configuration of the inception network, and its overall architecture is depicted in Fig.10.
Fig.4: The (naive) inception module components. The image is from Fig 5.11 of "Deep Learning for Vision Systems" by Mohamed Elgendy.
Fig.5: A pictorial comparison of a classical CNN architecture with the InceptionNet, which incorporates inception modules. The image is from Fig 5.10 of "Deep Learning for Vision Systems" by Mohamed Elgendy.
Fig.6: Keras summary and parameter usage for the naive inception module.
Fig.7: The inception module with the dimensionality reduction technique. The image is from Fig 5.13 of "Deep Learning for Vision Systems" by Mohamed Elgendy.
Fig.8: A more detailed description of why the 1 x 1 convolutional layer is also known as the bottleneck layer. The image is extracted from Chapter 5 of "Deep Learning for Vision Systems" by Mohamed Elgendy.
Fig.9: Keras summary and parameter usage for the actual inception module.
Fig.10: The GoogLeNet main architecture, which includes the inception modules. Two such modules are used after the second main max pooling layer, five after the third main max pooling layer, and two more after the fourth main max pooling layer. The diagram is from https://d2l.ai/chapter_convolutional-modern/googlenet.html.
In summary, GoogLeNet comprises three parts:
The first part is similar to LeNet and AlexNet in that it contains a series of convolutional and max pooling layers. Specifically, it comprises 7 x 7 conv -> 3 x 3 max pooling -> 1 x 1 conv -> 3 x 3 conv -> 3 x 3 max pooling.
The second part comprises 9 inception modules in the order: 2 inception modules -> 3 x 3 max pooling -> 5 inception modules -> 3 x 3 max pooling -> 2 inception modules.
The third part is the classifier part of the network, containing the global average pooling layer, the fully connected layer and the softmax output.
First we define the inception module function:
from tensorflow.keras.layers import Conv2D, MaxPool2D, concatenate

# kernel_init and bias_init are global initializers defined below, before this function is called.
def inception_module(x, filters_1x1, filters_3x3_reduce, filters_3x3, filters_5x5_reduce, filters_5x5, filters_pool_proj, name=None):
    ### 1 × 1 route ###
    conv_1x1 = Conv2D(filters_1x1, kernel_size=(1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    ### 3 × 3 route = 1 × 1 CONV + 3 × 3 CONV ###
    pre_conv_3x3 = Conv2D(filters_3x3_reduce, kernel_size=(1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_3x3 = Conv2D(filters_3x3, kernel_size=(3, 3), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(pre_conv_3x3)
    ### 5 × 5 route = 1 × 1 CONV + 5 × 5 CONV ###
    pre_conv_5x5 = Conv2D(filters_5x5_reduce, kernel_size=(1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(x)
    conv_5x5 = Conv2D(filters_5x5, kernel_size=(5, 5), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(pre_conv_5x5)
    ### pool route = POOL + 1 × 1 CONV ###
    pool_proj = MaxPool2D((3, 3), strides=(1, 1), padding='same')(x)
    pool_proj = Conv2D(filters_pool_proj, (1, 1), padding='same', activation='relu', kernel_initializer=kernel_init, bias_initializer=bias_init)(pool_proj)
    # Concatenate the four routes along the channel axis.
    output = concatenate([conv_1x1, conv_3x3, conv_5x5, pool_proj], axis=3, name=name)
    return output
The first part of the architecture can then be implemented in Keras as follows:
# Import the relevant library packages.
from tensorflow import keras
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv2D, MaxPool2D, BatchNormalization
from tensorflow.keras.layers import AveragePooling2D, Flatten, Dropout, Dense
input_layer = Input(shape=(224, 224, 3))
kernel_init = keras.initializers.glorot_uniform()
bias_init = keras.initializers.Constant(value=0.2)
x = Conv2D(64, (7, 7), padding='same', strides=(2, 2), activation='relu', name='conv_1_7x7divide2', kernel_initializer=kernel_init, bias_initializer=bias_init)(input_layer)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_1_3x3divide2')(x)
x = BatchNormalization()(x)
x = Conv2D(64, (1, 1), padding='same', strides=(1, 1), activation='relu')(x)
x = Conv2D(192, (3, 3), padding='same', strides=(1, 1), activation='relu')(x)
x = BatchNormalization()(x)
x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)
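As an optional sanity check (not part of the original code), we can wrap the stem built so far in a temporary Model and confirm that a 224 x 224 x 3 input has been reduced to a 28 x 28 x 192 feature map before the inception modules begin:
# Temporary model covering only the stem, used purely to inspect shapes.
stem_check = Model(inputs=input_layer, outputs=x)
print(stem_check.output_shape)  # expected: (None, 28, 28, 192)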
The second part of the architecture then follows:
#### The first 2 inception modules with 1 max pool ###
x = inception_module(x, filters_1x1=64, filters_3x3_reduce=96, filters_3x3=128, filters_5x5_reduce=16, filters_5x5=32, filters_pool_proj=32, name='inception_3a')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=192, filters_5x5_reduce=32, filters_5x5=96, filters_pool_proj=64, name='inception_3b')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2))(x)
#### The next 5 inception modules with 1 max pool ###
x = inception_module(x, filters_1x1=192, filters_3x3_reduce=96, filters_3x3=208, filters_5x5_reduce=16, filters_5x5=48, filters_pool_proj=64, name='inception_4a')
x = inception_module(x, filters_1x1=160, filters_3x3_reduce=112, filters_3x3=224, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4b')
x = inception_module(x, filters_1x1=128, filters_3x3_reduce=128, filters_3x3=256, filters_5x5_reduce=24, filters_5x5=64, filters_pool_proj=64, name='inception_4c')
x = inception_module(x, filters_1x1=112, filters_3x3_reduce=144, filters_3x3=288, filters_5x5_reduce=32, filters_5x5=64, filters_pool_proj=64, name='inception_4d')
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_4e')
x = MaxPool2D((3, 3), padding='same', strides=(2, 2), name='max_pool_4_3x3divide2')(x)
#### The last 2 inception modules with 1 max pool ###
x = inception_module(x, filters_1x1=256, filters_3x3_reduce=160, filters_3x3=320, filters_5x5_reduce=32, filters_5x5=128, filters_pool_proj=128, name='inception_5a')
x = inception_module(x, filters_1x1=384, filters_3x3_reduce=192, filters_3x3=384, filters_5x5_reduce=48, filters_5x5=128, filters_pool_proj=128, name='inception_5b')
Finally, for the last part of the architecture:
x = AveragePooling2D(pool_size=(7,7), strides=1, padding='valid')(x)
x = Flatten()(x)  # flatten the 1 x 1 x 1024 pooled feature map into a vector
x = Dropout(0.4)(x)
x_output = Dense(10, activation='softmax', name='output')(x) # 10 classes, assuming we are training on the CIFAR-10 dataset [7].
googLenet = Model(input_layer, x_output)
The authors found that using a 7 x 7 (global) average pooling layer in place of fully connected layers improved the top-1 accuracy by about 0.6%. A dropout layer with a rate of 0.4 also helps reduce overfitting.
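If you prefer, Keras also offers GlobalAveragePooling2D, which collapses the 7 x 7 x 1024 feature map straight into a 1024-length vector; it is an equivalent alternative (shown commented out so it does not alter the graph built above) to the AveragePooling2D plus Flatten combination used earlier:
from tensorflow.keras.layers import GlobalAveragePooling2D

# x = GlobalAveragePooling2D()(x)  # would replace AveragePooling2D((7,7)) + Flatten()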
Additional Information
The number of epochs is set to 25, with a batch size of 256. The initial learning rate is set to 0.01 and decreased by 4% every 8 epochs. Stochastic gradient descent with a momentum of 0.9 is used as the optimizer, together with the categorical cross-entropy loss function.
import math
from tensorflow import keras
epochs = 25
initial_lrate = 0.01
def decay(epoch, steps=100): # The decaying learning rate schedule.
    initial_lrate = 0.01
    drop = 0.96
    epochs_drop = 8
    lrate = initial_lrate * math.pow(drop, math.floor((1+epoch)/epochs_drop))
    return lrate
lr_schedule = keras.callbacks.LearningRateScheduler(decay, verbose=1)
sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.9, nesterov=False)
googLenet.compile(loss='categorical_crossentropy', optimizer=sgd, metrics=['accuracy'])
googLenet.fit(X_train, y_train, batch_size=256, epochs=epochs, validation_data=(X_val, y_val), callbacks=[lr_schedule], verbose=2, shuffle=True) # Training.
y_pred = googLenet.predict(X_test) # Inference
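Since y_pred contains softmax probability vectors, a quick way to turn them into class labels and measure accuracy (assuming y_test holds one-hot encoded test labels, a hypothetical variable not defined above) is:
import numpy as np

# Convert probability vectors to predicted class indices.
pred_labels = np.argmax(y_pred, axis=1)
true_labels = np.argmax(y_test, axis=1)  # y_test is assumed to be one-hot encoded

accuracy = np.mean(pred_labels == true_labels)
print(f"Test accuracy: {accuracy:.4f}")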
3) Summary
In this blog post we covered some more sophisticated CNN models that aim to improve classification accuracy while reducing computational cost. Specifically, two families of models were discussed: the VGG model with its two variants, and the InceptionNet, of which GoogLeNet is a particular instance. VGG uses stacks of 3 x 3 convolutions and a few pooling layers, which simplifies the CNN architecture relative to AlexNet while demanding less computation. The InceptionNet is a family of CNN architectures built around the inception module, which applies convolutions and pooling in parallel and concatenates the resulting feature maps, thus enabling multi-scale feature extraction, processing and aggregation. The InceptionNet requires even fewer trainable parameters than the VGG network while being deeper.
References
[1] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[2] M. Elgendy, Deep learning for vision systems. Manning, 2020.
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[4] Q. Chen, X. Meng, W. Li, X. Fu, X. Deng, and J. Wang, “A multi-scale fusion convolutional neural network for face detection,” in 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 2017, pp. 1013–1018.
[5] X. Huo, G. Sun, S. Tian, Y. Wang, L. Yu, J. Long, W. Zhang, and A. Li, “Hifuse: Hierarchical multi-scale feature fusion network for medical image classification,” Biomedical Signal Processing and Control, vol. 87, p. 105534, 2024.
[6] S. Li, Y. Chen, R. Jiang, and X. Tian, “Image denoising via multi-scale gated fusion network,” IEEE Access, vol. 7, pp. 49392–49402, 2019.
[7] A. Krizhevsky, G. Hinton et al., “Learning multiple layers of features from tiny images,” 2009.