CS539 Machine Learning - Prof. Kyumin Lee
Edith Gomez, Yash Garje, Riddhi Thakkar, Jinesh Rajasekhar, Hitesh Bhojwani
Github Repository: https://github.com/riddhithakkarr/ML-Image-Classification
What is an ischemic stroke?
The Mayo Clinic defines ischemic stroke as a condition in which the blood supply to the brain is reduced, preventing brain tissue from getting oxygen and nutrients.[1] Brain cells begin to die within minutes, which can lead to permanent brain damage or even death. A stroke is a medical emergency, and prompt treatment is crucial: early action can reduce brain damage and other complications. The following statistics from the Centers for Disease Control and Prevention (CDC) highlight the gravity of the problem[2]:
In 2020, 1 in 6 deaths from cardiovascular disease was due to stroke.
Every 40 seconds, someone in the United States has a stroke.
Every 3.5 minutes, someone dies of a stroke.
About 185,000 strokes—nearly 1 in 4—are in people who have had a previous stroke.
About 87% of all strokes are ischemic strokes, in which blood flow to the brain is blocked.
Why is it essential to know the origins of the blood clot?
Etiology refers to the study of the cause or origin of a disease. Classifying the etiology enables clinicians to confidently assign a course of treatment and prevent a recurrent stroke. Roughly 23% of ischemic strokes are recurrent, so a reliable classification method could help treat and prevent these cases proactively. Preventing recurrence also significantly improves the patient's chance of survival, which further adds to the need for this tool. Clinicians could use this method in histopathologic laboratories and medical facilities to determine etiologies and examine clot composition.
Brinjikji et al. [3] propose a stacked ensemble framework that uses quantified histological characteristics of the blood (number of RBC, WBC, fibrin, and platelet density) as the feature input. They implemented it with the H2O.ai R package, and their model achieved a five-fold cross-validated AUC of 0.55 and an area under the precision-recall curve of 0.33.
Our approach aims to overcome the shortcomings of their method by changing the inputs. Instead of feeding extracted quantitative blood histology characteristics to the machine learning model, we apply multiple image classification algorithms directly to the whole-slide images to investigate how well an image analysis/computer vision approach can classify the etiology.
Our objective is to develop an artificial-intelligence-based etiology classification algorithm. This tool uses machine learning and deep learning algorithms to analyze high-resolution whole-slide digital pathology images of blood clots and classify the etiology as either CE (Cardioembolic, i.e., originating from the heart) or LAA (Large Artery Atherosclerosis, i.e., originating from the plaque in the inner lining of an artery).
The dataset under consideration comes from Kaggle, where it was published by the Mayo Clinic under the competition name "Mayo Clinic - STRIP AI: Image Classification of Stroke Blood Clot Origin." The dataset contains 1,158 files (images) totaling over 390 GB of high-resolution whole-slide digital pathology images. Each slide depicts a blood clot from a patient who had experienced an acute ischemic stroke.
To gain more insight into the dataset and the behavior of its variables, we performed an Exploratory Data Analysis (EDA). For this purpose, the team used a CSV file that Kaggle provides with the dataset and added some further information about the images to it. Graphs and their explanations can be found in the EDA section of this webpage. Please click on the EDA tab on the top right of the webpage to explore further.
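As a minimal sketch of one such EDA check, the snippet below counts the images per class label in a CSV. The column names (`image_id`, `label`) and the sample rows are assumptions made for illustration; they are not taken from the team's actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real CSV's columns and values may differ.
SAMPLE_CSV = """image_id,label
img_0,CE
img_1,CE
img_2,LAA
img_3,CE
"""

def count_labels(csv_text, label_column="label"):
    """Return a Counter mapping each class label to its number of images."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[label_column] for row in reader)

label_counts = count_labels(SAMPLE_CSV)
total = sum(label_counts.values())
for label, n in sorted(label_counts.items()):
    # Print the absolute count and the share of the dataset for each class.
    print(f"{label}: {n} image(s), {n / total:.0%} of the data")
```

A skewed count like the one in this toy sample is the kind of signal that would motivate balancing the classes before training.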
As a team, we decided to implement the following algorithms:
3-layer CNN
ResNet
VGGNet
DenseNet
Vision Transformers
We investigated the performance of these five algorithms and present a comparative analysis. For more detail about each implementation and its results, visit the Methodologies tab or click the title of this section.
The classification relies solely on determining the histological signature of the retrieved blood clot (thrombus). The challenge lies in identifying the right machine-learning pipeline for biomedical image analysis of this kind. Careful data handling and efficient pipeline design are where this work can bring novelty. Learning to handle data formats specific to the medical domain, large image file sizes and resolutions, and the number of available pathology slides is a valuable part of the project.
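To give a feel for what a 3-layer CNN does to the 400x400 images used in this project, here is a small shape calculation. The layer choices (3x3 convolutions with padding 1, each followed by 2x2 max-pooling) are assumptions for illustration only and are not the team's documented architecture.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

def feature_map_sizes(input_size, num_blocks=3):
    """Side length after each block of 3x3 conv (padding 1) + 2x2 max-pool.

    Kernel sizes, padding, and pooling are hypothetical choices.
    """
    sizes = []
    size = input_size
    for _ in range(num_blocks):
        size = conv_output_size(size, kernel=3, stride=1, padding=1)  # keeps size
        size = conv_output_size(size, kernel=2, stride=2)             # halves size
        sizes.append(size)
    return sizes

sizes = feature_map_sizes(400)
print(sizes)  # [200, 100, 50]
```

Under these assumptions the final feature map is 50x50 per channel, which is what a classifier head placed after the three blocks would receive.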
Imbalanced Dataset
The class imbalance was handled using data augmentation.
Images were randomly rotated and flipped across the horizontal or vertical axis.
The final dataset was balanced, with a 1239:1368 ratio between the two classes.
Image size and memory space constraints
The images were resized to 400x400 pixels to reduce the memory requirement.
The original dataset of ~395 GB was reduced to ~62 MB.
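The flip-and-rotate balancing described above can be sketched as follows. This is a toy version under stated assumptions: images are small nested lists, a 90-degree rotation stands in for the random rotation mentioned in the text, and all function names and data are hypothetical rather than the team's actual pipeline.

```python
import random

def flip_horizontal(img):
    """Mirror the image left-to-right (img is a list of pixel rows)."""
    return [row[::-1] for row in img]

def flip_vertical(img):
    """Mirror the image top-to-bottom."""
    return [list(row) for row in img[::-1]]

def rotate_90(img):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

AUGMENTATIONS = [flip_horizontal, flip_vertical, rotate_90]

def oversample(images, target_count, rng):
    """Add randomly augmented copies until the class has target_count images."""
    augmented = list(images)
    while len(augmented) < target_count:
        source = rng.choice(images)            # pick an original image
        transform = rng.choice(AUGMENTATIONS)  # pick a random transform
        augmented.append(transform(source))
    return augmented

rng = random.Random(0)  # fixed seed so the sketch is reproducible
minority = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # two toy 2x2 "images"
balanced_minority = oversample(minority, target_count=5, rng=rng)
print(len(balanced_minority))  # 5
```

Augmenting only the under-represented class until its count approaches the other class is one way to arrive at a near-even split such as the 1239:1368 ratio reported above.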
The following table shows the comparative performance of all five models. ResNet18 performs best, with 73.75% accuracy, an AUC of 0.7331, and an F1 score of 0.7652. Its AUC also exceeds that of the ensemble model described in [3], which reported a five-fold cross-validated AUC of 0.55.
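For reference, the three metrics reported here can be computed as in the sketch below. It is a plain-Python illustration on made-up labels and scores, not the team's evaluation code; treating label 1 as the positive class and using a 0.5 decision threshold are assumptions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def roc_auc(y_true, scores, positive=1):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up ground truth and model scores for six slides.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred), f1_score(y_true, y_pred), roc_auc(y_true, scores))
```

Accuracy and F1 depend on the chosen threshold, while AUC is computed from the raw scores, which is why the three numbers for a model can differ.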
We learned to navigate machine learning resources like Kaggle and familiarized ourselves with libraries like scikit-learn, TensorFlow, Keras, PyTorch, and OpenCV.
We implemented data augmentation techniques to balance the dataset.
The original dataset was ~390 GB, so to make it easier to handle we resized the images to 400x400 pixels, bringing the dataset down to ~62 MB.
This step reduced the image resolution but allowed us to run the machine learning algorithms on our local machines and experiment with them.
We applied five different models to the dataset and analyzed their performance using accuracy, AUC, and F1 score.
We observed that ResNet performed the best of all, with an accuracy of 73.75%.
The transformer implementation worked but performed poorly.
We attribute this to the memory limitation: we had to use compressed, low-resolution images, which retained very few of the crucial features. As a result, the transformer could not perform well. Given more memory, transformers might perform better if trained on the full-size, high-resolution images.
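The memory argument can be made concrete with a rough count. A Vision Transformer splits an image into patches, and each self-attention layer builds a score matrix whose size grows with the square of the number of patches. The sketch below assumes a 16x16 patch size and ignores the extra class token; the larger image sizes are hypothetical and only illustrate the trend.

```python
def num_patches(height, width, patch_size):
    """Number of non-overlapping square patches the image is split into."""
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)

def attention_matrix_entries(tokens):
    """Entries in one tokens-by-tokens attention score matrix (per head, per layer)."""
    return tokens * tokens

PATCH = 16  # assumed patch size
for side in (400, 1600, 6400):  # 400 is from the text; the others are hypothetical
    n = num_patches(side, side, PATCH)
    print(f"{side}x{side}: {n} patches, {attention_matrix_entries(n)} attention entries")
```

At 400x400 this gives 625 patches and 390,625 attention entries, while a 16-fold increase in side length multiplies the attention matrix by 16^4, which is one reason full-size slides are hard to feed to a transformer directly.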
Train the models on the high-resolution images
Train the models on more data
Test Vision Transformer performance on the full-size images
Integrate image segmentation models to quantify the size and shape of the clots
Test the versatility of the model on other medical cell/specimen classification tasks
“Stroke - Symptoms and causes,” Mayo Clinic. https://www.mayoclinic.org/diseases-conditions/stroke/symptoms-causes/syc-20350113 (accessed Nov. 21, 2022).
CDC, “Stroke Facts | cdc.gov,” Centers for Disease Control and Prevention, Oct. 14, 2022. https://www.cdc.gov/stroke/facts.htm (accessed Nov. 21, 2022).
W. Brinjikji et al., “Association between clot composition and stroke origin in mechanical thrombectomy patients: analysis of the Stroke Thromboembolism Registry of Imaging and Pathology,” Journal of NeuroInterventional Surgery, vol. 13, no. 7, pp. 594–598, Jul. 2021, doi: 10.1136/neurintsurg-2020-017167.
C. Tchito Tchapga et al., “Biomedical Image Classification in a Big Data Architecture Using Machine Learning Algorithms,” J Healthc Eng, vol. 2021, p. 9998819, May 2021, doi: 10.1155/2021/9998819.
M. J. van der Laan, E. C. Polley, and A. E. Hubbard, “Super Learner,” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, Sep. 2007, doi: 10.2202/1544-6115.1309.
S. Fitzgerald et al., “Orbit image analysis machine learning software can be used for the histological quantification of acute ischemic stroke blood clots,” PLOS ONE, vol. 14, no. 12, p. e0225841, Dec. 2019, doi: 10.1371/journal.pone.0225841.
A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” arXiv, Jun. 03, 2021. doi: 10.48550/arXiv.2010.11929.
A. Vaswani et al., “Attention Is All You Need.” arXiv, Dec. 05, 2017. doi: 10.48550/arXiv.1706.03762.
R. J. Ramteke and K. Y. Monali, “Automatic Medical Image Classification and Abnormality Detection Using K-Nearest Neighbour,” International Journal of Advanced Computer Research, vol. 2, no. 4, pp. 190–196, Dec. 2012.
“Deep kNN for Medical Image Classification | SpringerLink.” https://link.springer.com/chapter/10.1007/978-3-030-59710-8_13 (accessed Oct. 13, 2022).
G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, "Densely Connected Convolutional Networks," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261-2269, doi: 10.1109/CVPR.2017.243.
Jiang, Z.-P.; Liu, Y.-Y.; Shao, Z.-E.; Huang, K.-W. An Improved VGG16 Model for Pneumonia Image Classification. Appl. Sci. 2021, 11, 11185. https://doi.org/10.3390/app112311185
Sarvamangala, D.R., Kulkarni, R.V. Convolutional neural networks in medical image understanding: a survey. Evol. Intel. 15, 1–22 (2022). https://doi.org/10.1007/s12065-020-00540-3
Image used on the webpage courtesy of: https://www.prevention.com/health/a32264828/coronavirus-blood-clots/