ResNet 18 is a convolutional neural network trained on the ImageNet dataset. It was introduced in the paper 'Deep Residual Learning for Image Recognition', published in 2015 by Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. ResNet 18 is a smaller version of the ResNet network, which won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015.
The key innovation of the ResNet architecture is the use of "skip connections," which let the network learn residual functions instead of trying to learn the desired input-to-output mapping directly. This makes the network easier to train, improves its accuracy, and allows much deeper architectures to be used without suffering from the vanishing gradient problem.
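To make this concrete, here is a minimal sketch of a residual block in Keras. The filter count and kernel size are illustrative assumptions rather than the exact ResNet 18 configuration, and the input is assumed to already have `filters` channels so the identity shortcut lines up:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters=64):
    """Minimal residual block: output = ReLU(F(x) + x)."""
    shortcut = x                                     # identity skip connection
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([y, shortcut])                  # the block only has to learn the residual F(x)
    return layers.ReLU()(y)
```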
ResNet 18 has 18 weight layers (convolutional and fully connected), interspersed with pooling layers. It is commonly used as a starting point for other, more complex computer vision tasks. In our application, we utilize the features in the high-resolution whole-slide images of the blood clots to train the model and perform binary classification.
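The report does not state which library this model was built with; as a hedged sketch, torchvision provides an ImageNet-pretrained ResNet 18 whose final fully connected layer can be replaced with a two-class head:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained ResNet 18 (torchvision >= 0.13 weights API)
model = models.resnet18(weights="IMAGENET1K_V1")

# Replace the 1000-class ImageNet head with a two-class head for clot classification
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # learning rate is an assumption
```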
The table in the figure on the left is adapted from 'Deep Residual Learning for Image Recognition' (He et al., 2015).
The confusion matrix shows the model's performance on the clot image dataset after 20 epochs: an accuracy of 73.75%, an AUC of 0.7331, and an F1 score of 0.7658.
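For reference, these metrics can be computed from model outputs with scikit-learn; the `y_true` and `y_prob` arrays below are hypothetical stand-ins for the real labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 1])                   # hypothetical ground-truth labels
y_prob = np.array([0.2, 0.8, 0.6, 0.4, 0.9, 0.3, 0.4, 0.7])   # hypothetical predicted probabilities
y_pred = (y_prob >= 0.5).astype(int)                          # threshold probabilities at 0.5

print(confusion_matrix(y_true, y_pred))
print("Accuracy:", accuracy_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_prob))                  # AUC uses the raw probabilities
print("F1:", f1_score(y_true, y_pred))
```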
DenseNet is a convolutional neural network architecture described in the paper 'Densely Connected Convolutional Networks', published in 2016 by Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. It is a deep learning architecture particularly well suited for image classification tasks. Since our application is a binary classification task, we chose to investigate the DenseNet model's performance.
The dense blocks, transition layers, and skip connections in the DenseNet architecture make it a powerful and effective deep learning architecture. DenseNet concatenates the feature maps produced by each convolutional layer, which allows feature maps from earlier layers to be reused and reduces the total number of parameters needed to fit the model. This also gives information a smoother path through the network and prevents bottlenecks.
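A minimal sketch of a dense block in Keras illustrates this concatenation pattern; the number of layers and the growth rate here are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    """Minimal dense block: each layer sees the concatenation of all earlier feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, 3, padding="same")(y)  # each layer adds only growth_rate channels
        x = layers.Concatenate()([x, y])                      # reuse all earlier feature maps
    return x
```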
From training the DenseNet model on the blood clot image dataset, we obtained the following results:
F1: 0.701
Accuracy: 72.4%
AUC: 0.730
VGG16 is a convolutional neural network created in 2014 for the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). It is a popular architecture in computer vision for tasks such as object detection. VGG16 consists of 16 weight layers, shown in the image on the left: a 224 x 224 pixel input, followed by a series of convolutional and max-pooling layers, and finally three fully connected layers.
For this project, the architecture was built using the TensorFlow and Keras libraries in Python. The input was not altered; however, the output layer was resized so that the model adapts to the binary classification of the blood clot images at hand. The images were reshaped to 224 x 224 pixels and passed to the model with random data augmentation to avoid overfitting.
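A minimal sketch of this setup, assuming Keras's pretrained VGG16 and a simple augmentation pipeline (the augmentation transforms and the size of the dense layer are assumptions, not the report's exact configuration):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# On-the-fly random augmentation to mitigate overfitting (assumed transforms)
augment = keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.1),
])

inputs = keras.Input(shape=(224, 224, 3))
x = augment(inputs)
x = keras.applications.vgg16.preprocess_input(x)
x = keras.applications.VGG16(include_top=False, weights="imagenet")(x)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # single sigmoid unit for the binary task

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="binary_crossentropy",            # binary cross-entropy, as in the report
              metrics=["accuracy", keras.metrics.AUC(name="auc")])
```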
Three experiments were performed with different numbers of epochs: 15, 30, and 50. A comparative chart with the performance metrics for each experiment is shown in the second image on the left. As the number of epochs increases, the AUC score increases; the remaining metrics, however, decrease from 15 to 30 epochs and then increase again for the last experiment. Of the three runs, the model trained for 50 epochs performs best: it distinguishes between the classes to a reasonable degree and correctly classifies around 66.7% of the test images.
Finally, the training accuracy and loss for the last experiment are shown in the third picture. The binary cross-entropy loss of the VGG16 model drops sharply over the first ten epochs, and the downward trend persists for the rest of training. The training accuracy, in turn, rises fastest during the first 20 epochs and changes little afterwards, stabilizing around 0.75 over the last ten epochs.
Convolutional neural networks (CNNs), first developed in the late 1980s, revolutionized the field of image classification by introducing an architecture specifically designed to work with two-dimensional data such as images. Before CNNs, most image classification methods relied on hand-crafted features, which took considerable time and effort to create and often did not produce good results.
CNNs, on the other hand, are able to automatically learn features from the input data, which makes them much more effective for image classification.
This particular model was implemented with Adam optimization, dropout regularization, and L2 regularization; it achieves an accuracy of 66% with a loss of 0.23. The picture on the left shows the variation of loss and accuracy over 100 epochs.
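The report does not give the exact layer configuration; the sketch below shows one way a small Keras CNN could combine the three techniques named above (all layer sizes and regularization strengths are assumptions):

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, regularizers

l2 = regularizers.l2(1e-4)   # assumed L2 penalty; the report does not state the strength

model = keras.Sequential([
    keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(32, 3, activation="relu", kernel_regularizer=l2),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu", kernel_regularizer=l2),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu", kernel_regularizer=l2),
    layers.Dropout(0.5),                        # dropout regularization
    layers.Dense(1, activation="sigmoid"),      # binary output
])

model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),  # Adam optimization
              loss="binary_crossentropy",
              metrics=["accuracy"])
```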
The Vision Transformer (ViT) is inspired by the transformer architecture originally developed for natural language processing, and it was designed specifically for computer vision tasks such as object recognition and classification.
In this project, we implement ViT to perform binary classification on an image dataset. The core mechanisms of the architecture are positional encoding, attention, and self-attention. Here is an intuitive way to think about these concepts:
Attention: What part of the image should I focus on?
Self-Attention: How important is a specific part in the image with respect to other parts in the image?
The transformer learns these attention scores to associate the different parts and features of the image as it trains on the dataset. Developing relationships between neighboring features in an image is important for medical diagnosis and, in our case, for image classification. The ViT implementation was done using the TensorFlow and Keras libraries in Python; the sizes of the input and output layers were modified so that the model adapts to the binary classification of the blood clot images at hand.
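As a minimal sketch of these core mechanisms in Keras, the block below embeds image patches, adds a learned positional encoding, and applies one round of multi-head self-attention; the patch size, embedding dimension, and head count are assumptions, and a full ViT would stack several such blocks with MLP sublayers:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

image_size, patch_size, dim = 100, 10, 64            # sized for the 100 x 100 px experiments
num_patches = (image_size // patch_size) ** 2        # 10 x 10 = 100 patches

inputs = keras.Input(shape=(image_size, image_size, 3))
# Split the image into non-overlapping patches and linearly project each one
x = layers.Conv2D(dim, patch_size, strides=patch_size)(inputs)
x = layers.Reshape((num_patches, dim))(x)
# Learned positional encoding tells the model where each patch sits in the image
x = x + layers.Embedding(num_patches, dim)(tf.range(num_patches))
# Self-attention: every patch scores its relevance to every other patch
attn = layers.MultiHeadAttention(num_heads=4, key_dim=dim)(x, x)
x = layers.LayerNormalization()(x + attn)            # residual connection + normalization
x = layers.GlobalAveragePooling1D()(x)
outputs = layers.Dense(1, activation="sigmoid")(x)   # binary classification head

model = keras.Model(inputs, outputs)
```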
To generate a proof of concept, we experimented with images compressed to 100 x 100 px. Three experiments were performed with different numbers of epochs: 20, 50, and 100. The model performed poorly on the low-resolution images, but we believe that with the original high-resolution images and better computational capacity, we could run training on full-size images and obtain better results.
This implementation was inspired by:
https://keras.io/examples/vision/image_classification_with_vision_transformer/