Brain Imaging Group
Jason Kennedy, Kenza Slaoui, Simeon Kolev, Suryadyuti Baral, and Tanay Bali
Description
The dataset for our research combines three datasets from three different sources. Images across these datasets are not homogeneous in size, angle, or general positioning, so they will need to be pre-processed to remove unusable images, crop excess margins, and standardize the images to unit scale. The combined dataset includes 7023 grayscale brain MRIs, each classified into one of four classes: No Tumor, Glioma, Meningioma, and Pituitary. The dataset can be found at Brain Tumor MRI Dataset | Kaggle.
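The pre-processing steps described above could be sketched roughly as follows. This is a minimal illustration, not our final pipeline: it assumes each image is already loaded as an 8-bit grayscale NumPy array, and the function names, the background threshold of 10, and the 128-pixel target size are all hypothetical choices made for the example. Filtering out unusable images is not covered here.

```python
import numpy as np


def crop_margins(img, threshold=10):
    """Crop dark borders by keeping the bounding box of pixels above `threshold`.

    `threshold` is an assumed cutoff for "background"; it would need tuning.
    """
    mask = img > threshold
    if not mask.any():
        # Nothing brighter than the threshold: return the image unchanged.
        return img
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]


def resize_nearest(img, size):
    """Resize a 2D image to (size, size) with nearest-neighbor sampling."""
    h, w = img.shape
    row_idx = np.arange(size) * h // size
    col_idx = np.arange(size) * w // size
    return img[np.ix_(row_idx, col_idx)]


def preprocess(img, size=128, threshold=10):
    """Crop margins, resize to a common shape, and scale intensities to [0, 1]."""
    cropped = crop_margins(img, threshold)
    resized = resize_nearest(cropped, size)
    # Assumes 8-bit input, so dividing by 255 maps intensities to unit scale.
    return resized.astype(np.float32) / 255.0


# Synthetic stand-in for an MRI: a bright block surrounded by a black margin.
demo = np.zeros((10, 10), dtype=np.uint8)
demo[3:7, 2:8] = 200
processed = preprocess(demo, size=8)
print(processed.shape, processed.min(), processed.max())
```

Nearest-neighbor resizing keeps the sketch dependency-free; an image library with proper interpolation would likely be preferable in practice.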
Objective
This research aims to improve brain tumor detection and diagnosis. We will use a single model to identify the presence and type of tumorous growths in the brain across the four subsets of the data corresponding to the four classes: No Tumor, Glioma, Meningioma, and Pituitary.
Our first goal for this project is to organize and rescale the brain MRIs in our datasets so that we can better visualize the data and gain some intuition regarding the methods and applications of neuroimaging. Similar machine learning research projects have been done in the past, and we intend to replicate and improve on previous results using the material taught in STOR 565.
EDA
Exploratory data analysis suggests that PCA will likely be effective at improving model efficiency and reducing computational cost while still explaining a large proportion of the variance in the data. The plot of PC1 against PC2 shows that certain (x, y) coordinate combinations are negatively correlated with PC1 and with tumor presence. The pixel intensity distribution shows that our images contain a large share of light pixels, approximately normally distributed gray pixels, and a small cluster of black pixels. Intuitively, a large portion of our data is simply neural tissue (light), bone and brain folds of various densities (light gray to dark gray), and tumorous growths (black), which supports our initial summary results.
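The two summaries discussed above, the proportion of variance explained by each principal component and the pixel intensity distribution, could be computed along these lines. This is only a sketch: the random matrix stands in for our flattened, pre-processed images (one row per image, intensities in [0, 1]), and the 90% variance cutoff and 16 histogram bins are illustrative assumptions rather than values from our analysis.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of total variance captured by each principal component of X."""
    Xc = X - X.mean(axis=0)                      # center each pixel column
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    var = s ** 2                                 # proportional to PC variances
    return var / var.sum()


# Placeholder data: 50 "images" of 64 pixels each (NOT the real dataset).
rng = np.random.default_rng(0)
X = rng.random((50, 64))

ratio = explained_variance_ratio(X)
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 90% (assumed cutoff).
k = int(np.searchsorted(cumulative, 0.90) + 1)

# Pixel intensity distribution pooled over all images.
hist, edges = np.histogram(X, bins=16, range=(0.0, 1.0))

print("components for 90% variance:", k)
print("pixels per intensity bin:", hist)
```

Plotting `ratio` against the component index gives a scree plot, and plotting `hist` against `edges` gives the intensity distribution.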
Techniques
Our main classification methods rely heavily on a Convolutional Neural Network (CNN) model, along with experimentation with classification trees fit on dimension-reduced data. We will use PCA and various machine learning techniques to better understand our data and explain the findings from our initial exploratory data analysis. These results will be used to corroborate our research plan and to guide the application of complex machine learning models to our data, with the goal of exceeding 90% classification accuracy.
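As a rough sketch of the dimension-reduced classification tree idea, the example below chains PCA and a decision tree using scikit-learn. It runs on synthetic data standing in for flattened images with four class labels; the number of components (20), the tree depth (8), and the train/test split are hypothetical settings chosen for illustration, and the accuracy it prints says nothing about performance on the real MRIs. The CNN is not shown here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in: 4 classes, 60 "images" each, 256 pixels per image.
rng = np.random.default_rng(0)
n_per_class, n_pixels, n_classes = 60, 256, 4
class_means = rng.normal(0.0, 1.0, size=(n_classes, n_pixels))
X = np.vstack(
    [mean + rng.normal(0.0, 1.0, size=(n_per_class, n_pixels)) for mean in class_means]
)
y = np.repeat(np.arange(n_classes), n_per_class)

# Hold out a stratified test set so each class is represented.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# PCA is fit on the training data only, then the tree is fit on the PC scores.
model = make_pipeline(
    PCA(n_components=20),                       # assumed number of components
    DecisionTreeClassifier(max_depth=8, random_state=0),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Wrapping both steps in one pipeline keeps the PCA projection from seeing the test images, which matters when comparing against an accuracy target.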