The exploratory data analysis (EDA) was performed on the CSV metadata provided by Kaggle, combined with additional information extracted from resized copies of the images, since the original images were too large to visualize without substantial computational resources.
To see how the classes are represented, we created a bar plot of the number of images per class. The dataset contains 754 images in two classes: 547 in the CE class and 207 in the LAA class. The dataset is therefore imbalanced, with roughly 2.6 CE images for every LAA image, and needs to be balanced before classification models are trained on it.
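The degree of imbalance can be checked with a few lines of arithmetic. The sketch below uses only the class counts reported above; the variable names are illustrative and not taken from the project code.

```python
# Class counts as reported in the text above.
class_counts = {"CE": 547, "LAA": 207}

total = sum(class_counts.values())
# Fraction of the dataset that belongs to each class.
shares = {label: count / total for label, count in class_counts.items()}
# How many majority-class images exist per minority-class image.
imbalance_ratio = max(class_counts.values()) / min(class_counts.values())

for label, count in class_counts.items():
    print(f"{label}: {count} images ({shares[label]:.1%})")
print(f"total: {total}, majority/minority ratio: {imbalance_ratio:.2f}")
```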
To address the imbalance, the team used data augmentation to create new images by applying random alterations (rotation and flipping) to the original images. The resulting dataset comprises 2,607 images: 1,239 from the CE class and 1,368 from the LAA class.
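A minimal sketch of this kind of augmentation is shown below, using NumPy. It is not the team's actual pipeline: the text only states that rotation and flipping were used, so restricting rotations to multiples of 90 degrees, the 50% flip probabilities, and the function name `random_augment` are all assumptions made for illustration. The last lines derive how many generated images per class the reported totals imply, assuming every original image was kept in the final dataset.

```python
import numpy as np

def random_augment(image: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Return a randomly rotated and flipped copy of an H x W x C image."""
    k = int(rng.integers(0, 4))  # number of 90-degree turns, 0 to 3 (assumed)
    augmented = np.rot90(image, k=k, axes=(0, 1))
    if rng.random() < 0.5:
        augmented = np.flip(augmented, axis=1)  # horizontal flip
    if rng.random() < 0.5:
        augmented = np.flip(augmented, axis=0)  # vertical flip
    return augmented.copy()

rng = np.random.default_rng(seed=0)
# Random pixels standing in for one resized 400x400 RGB image.
image = rng.integers(0, 256, size=(400, 400, 3), dtype=np.uint8)
augmented = random_augment(image, rng)

# Generated images per class implied by the reported totals
# (assumes all original images remain in the final dataset).
original = {"CE": 547, "LAA": 207}
final = {"CE": 1239, "LAA": 1368}
generated = {label: final[label] - original[label] for label in original}
print(generated)
```

Rotations by right angles and flips only rearrange pixels, so they add no interpolation artifacts to the augmented images.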
Resizing all the images to 400x400 pixels significantly reduced the memory usage of the complete dataset. The original set of images occupies a total of 395 GB, which made the images difficult to work with at their original size.
With the resized images, memory usage per image is approximately normally distributed, with the majority of the images between 30 and 40 KB. The adjacent histogram shows the count of images against image size.
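The counts behind a histogram like this one can be reproduced roughly as follows. This is a sketch under assumptions: the helper names, the `*.png` file pattern, and the 10 KB bin width are hypothetical, and the sizes in the example are made-up values rather than measurements from the dataset.

```python
from pathlib import Path

import numpy as np

def file_sizes_kb(image_dir: Path, pattern: str = "*.png") -> np.ndarray:
    """Return the on-disk size of every matching file, in kilobytes."""
    sizes = [p.stat().st_size / 1024 for p in sorted(image_dir.glob(pattern))]
    return np.array(sizes, dtype=float)

def size_histogram(sizes_kb: np.ndarray, bin_width_kb: float = 10.0):
    """Count images per fixed-width size bin, starting at 0 KB."""
    upper = np.ceil(sizes_kb.max() / bin_width_kb) * bin_width_kb
    edges = np.arange(0.0, upper + bin_width_kb, bin_width_kb)
    counts, edges = np.histogram(sizes_kb, bins=edges)
    return counts, edges

# Made-up sizes in KB, used only to show the shape of the output.
sizes = np.array([12.0, 31.5, 33.0, 38.2, 36.9, 44.1, 58.7])
counts, edges = size_histogram(sizes)
for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low:.0f}-{high:.0f} KB: {count}")
```

In practice, `file_sizes_kb` would be pointed at the folder of resized images and its result passed to `size_histogram`.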
The graph on the left shows that the majority of patients have only one image in the dataset. Most images therefore come from distinct patients, so the dataset covers a wide variety of blood clot samples and diagnoses. Some patients have more than one image, and a few have as many as four. However, the proportion of patients with more than one image is very small compared with those who have a single image.
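The images-per-patient distribution can be tabulated with pandas, as sketched below. The column names `image_id` and `patient_id` are assumed rather than confirmed by the text, and the rows are a toy stand-in for the Kaggle CSV, not real data.

```python
import pandas as pd

# Toy stand-in for the Kaggle CSV; column names and rows are assumed.
df = pd.DataFrame({
    "image_id": ["a_0", "b_0", "b_1", "c_0", "d_0", "d_1", "d_2", "d_3", "e_0"],
    "patient_id": ["a", "b", "b", "c", "d", "d", "d", "d", "e"],
})

# Number of images contributed by each patient.
images_per_patient = df.groupby("patient_id").size()

# Index: images per patient; value: number of patients with that many images.
patients_by_image_count = images_per_patient.value_counts().sort_index()

# Fraction of patients with more than one image.
share_multi = (images_per_patient > 1).mean()

print(patients_by_image_count)
print(f"patients with more than one image: {share_multi:.0%}")
```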
According to the data description, the center ID identifies the medical center where the blood clot slide was obtained. Counting the images per center shows that medical center 11 provides the largest share, with more than 175 images. The next most common is center 4, with over 75 images. There is a large gap between the most common medical center and the rest. We did not find the reason why center 11 is so heavily represented.
After visualizing the distribution of image sizes per medical center, we see a relationship between the number of images per center and the size of the images provided. Centers with fewer images tend to provide images that occupy less memory, and vice versa. However, the center with the greatest number of images is not the one with the highest median memory usage; rather, it is the one with the widest range of memory usage across its images.
It is also worth noting that medical centers 3, 6, 8, 9, and 11 have outliers: images that occupy considerably more memory than the rest of their images.
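One common way to identify such outliers numerically is the box-plot convention of flagging values above the third quartile plus 1.5 times the interquartile range, computed separately for each center. The text does not state which rule was used, so the 1.5 factor, the function name, and the column names `center_id` and `size_kb` are assumptions, and the data below is made up.

```python
import pandas as pd

def flag_upper_outliers(df, group_col="center_id", value_col="size_kb", k=1.5):
    """Flag rows whose value exceeds Q3 + k * IQR within their own group."""
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    upper_fence = q3 + k * (q3 - q1)
    return df[value_col] > upper_fence

# Made-up image sizes for two hypothetical centers.
df = pd.DataFrame({
    "center_id": [1, 1, 1, 1, 1, 2, 2, 2, 2],
    "size_kb": [30.0, 32.0, 34.0, 36.0, 90.0, 20.0, 22.0, 24.0, 26.0],
})
is_outlier = flag_upper_outliers(df)
print(df[is_outlier])
```

Computing the fence per center matters here because a size that is typical for one center can be unusually large for another.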
In addition, we visualized the number of images per class and center ID. As seen in the first graph, the CE class is dominant. For almost all medical centers, the number of images classified as CE is greater than the number classified as LAA. The only exception is medical center 3, which has more LAA images than CE images; however, the total number of images for this center is low compared to the rest.
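A class-by-center table of this kind can be built with `pandas.crosstab`, as in the sketch below. The column names `center_id` and `label` are assumed, and the rows are toy values chosen only to mirror the pattern described above; they are not the real counts.

```python
import pandas as pd

# Toy rows; the real counts per center are not reproduced here.
df = pd.DataFrame({
    "center_id": [3, 3, 3, 4, 4, 4, 11, 11, 11, 11],
    "label": ["LAA", "LAA", "CE", "CE", "CE", "LAA", "CE", "CE", "CE", "LAA"],
})

# Rows: centers; columns: classes; cells: number of images.
counts = pd.crosstab(df["center_id"], df["label"])

# Centers where LAA images outnumber CE images.
laa_majority_centers = counts.index[counts["LAA"] > counts["CE"]].tolist()

# Total images per center, largest first.
per_center_totals = counts.sum(axis=1).sort_values(ascending=False)

print(counts)
print("centers with more LAA than CE:", laa_majority_centers)
```

The row sums of the same table give the images-per-center counts, so one table supports both the per-center and the per-class comparison.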