Due to the limited size of the dataset, we apply typical data augmentation operations that are widely used in the computer vision community, using imgaug to perform six operations on the images.
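A minimal sketch of such an imgaug pipeline is given below. The specific operators shown are assumptions for illustration, picked only because each one preserves the image size and the relative positions of pixels, so the bounding-box labels stay valid without cropping:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# Hypothetical six-operation pipeline; every operator keeps the image size
# and relative pixel positions, and the 90°/270° rotation is the final step.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),                           # horizontal flip
    iaa.Flipud(0.5),                           # vertical flip
    iaa.AdditiveGaussianNoise(scale=(0, 10)),  # per-pixel noise
    iaa.GaussianBlur(sigma=(0.0, 1.0)),        # mild blur
    iaa.LinearContrast((0.9, 1.1)),            # contrast jitter
    iaa.Rot90([1, 3]),                         # rotate 90° or 270°, picked at random
])

image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
bbs = BoundingBoxesOnImage([BoundingBox(x1=100, y1=120, x2=200, y2=220)],
                           shape=image.shape)
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```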
We choose these six operations deliberately because they keep the original size and relative positions of the pixels. Furthermore, there is no need to crop the images, since the final augmentation operation only rotates the image by 90° or 270°, with one of the two chosen at random. One example of data augmentation is shown in Figure 1. After augmentation there are 3605 training images; the 33 testing images are left unaugmented, so we have 3638 images in total.
In our project, we mainly use three object detection models: Faster R-CNN, SSD, and YOLO. We discuss each model below. After data augmentation we have 3605 training images, so training takes more time to finish; we will evaluate the performance of each model once training is complete.
The Single Shot Detector (SSD) is a lightweight and fast model for object detection. It uses the feature pyramid formed by the CNN's progressive down-sampling of feature maps to detect objects at different scales. Five modules generate the feature maps on which anchor boxes are placed, their categories predicted, and their offsets fine-tuned. The following figure shows the basic structure of SSD.
The first module is the base network, whose output keeps the same height and width as the input image; the second through fourth modules are down-sampling blocks, each halving the height and width of its input, which enables multiscale detection; the fifth module is a global maximum pooling layer that reduces the height and width to 1. To generate anchor boxes at each pixel position, we use 5 scales and 3 aspect ratios (ratios of height to width). Instead of using all 5 × 3 = 15 scale-ratio combinations, we use 5 + 3 − 1 = 7 anchor boxes per position to reduce training time.
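To make the anchor-box count concrete, the sketch below uses MXNet's MultiBoxPrior operator, which implements exactly this scheme: every scale is paired with the first ratio, plus the first scale with each remaining ratio. The scale and ratio values here are assumptions for illustration, not our exact configuration.

```python
from mxnet import nd

# MultiBoxPrior yields len(sizes) + len(ratios) - 1 anchors per pixel:
# each size paired with ratios[0], plus sizes[0] paired with each of the
# remaining ratios -- here 5 + 3 - 1 = 7 boxes per position.
feature_map = nd.zeros((1, 3, 32, 32))         # dummy feature map
sizes = [0.2, 0.37, 0.54, 0.71, 0.88]          # 5 scales (assumed values)
ratios = [1, 2, 0.5]                           # 3 aspect ratios (assumed values)
anchors = nd.contrib.MultiBoxPrior(feature_map, sizes=sizes, ratios=ratios)
print(anchors.shape)                           # (1, 32*32*7, 4)
```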
We use the SSD module provided by GluonCV, a deep learning toolkit for computer vision built on top of MXNet. For the input dataset, we use the .rec format to pack the images and labels together, which decreases memory usage during training, and the accompanying .idx file to enable random batch access to the images.
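A minimal sketch of this setup is shown below; the file names, class list, and backbone choice are assumptions, not our exact configuration.

```python
import gluoncv as gcv
from gluoncv.data import RecordFileDetection

# RecordFileDetection reads a .rec file (with its .idx index alongside),
# giving random access to the packed image/label records.
train_dataset = RecordFileDetection('train.rec')  # assumes train.idx sits next to it

# A GluonCV SSD network with a custom class list (backbone is an assumption).
net = gcv.model_zoo.get_model('ssd_512_resnet50_v1_custom',
                              classes=['bleeding'], pretrained_base=True)
```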
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection model that takes images as input and outputs predicted bounding boxes, class probabilities, and confidence values indicating how likely each box is to contain an object. We use the latest version, YOLO V3. Specifically, YOLO V3 divides the input image into an S×S grid; if the center of an object falls within a grid cell, YOLO V3 first predicts a bounding box and a confidence value for that box, and then outputs the probability that the object contained in the box belongs to each class. The figure below shows the structure of YOLO V3. We use the YOLO V3 implementation provided by Joseph Chet Redmon, which can process images at 30 FPS on a Pascal Titan X GPU, to detect bleeding sites in the images.
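The toy computation below illustrates how these outputs combine into a per-box, per-class score (the shapes and threshold are made-up values, not YOLO V3's real configuration): the class probability is scaled by the box confidence before non-maximum suppression.

```python
import numpy as np

S, B, C = 13, 3, 1                         # grid size, boxes per cell, classes (assumed)
box_conf = np.random.rand(S, S, B)         # Pr(object) * IoU, one value per box
class_prob = np.random.rand(S, S, B, C)    # Pr(class_i | object), per box

scores = class_prob * box_conf[..., None]  # class-specific confidence per box
keep = scores > 0.5                        # confidence threshold applied before NMS
```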
While waiting for the models to finish training, we explored more papers to get a better sense of our proposed solution. It turns out that a recently published paper by scientists from Johns Hopkins University uses the same idea we proposed in the project proposal to improve their model performance; they call this method model fusion and show that it is a promising solution to object detection problems in medical images.
The paper gives a clearer theoretical grounding for our method, which we had been unsure would work. The part their approach shares with ours is the 2D detector, where they combine SSD, Faster R-CNN, and R-FCN to perform vertebrae localization, as shown in the upper figure below. Their reason for using three object detection models is to generate more detected objects, which helps the nearest-neighbor clustering method better refine the final predictions. This overcomes detection errors and random fluctuations in individual images.
After carefully reading their paper, we believe our voting method can work for the following two reasons:
1) Each model performs differently on images from different patients; by weighting the models within the voting mechanism, the combined model can achieve higher performance than any single one.
2) Since each medical image provides only limited information about the objects, there is a high probability that a bleeding site is missed when only a single model is used. Combining the detection results of the three models therefore yields a more convincing and reliable detection, as the sketch below illustrates.
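The following is a minimal sketch of such a weighted voting step. It is our own illustrative implementation, not code from the paper: detections from the three models are pooled with per-model weights, greedily clustered by IoU, and each cluster is merged into one score-weighted box. The weights and the IoU threshold are assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def vote(detections, weights, iou_thr=0.5):
    """Fuse detections from several models by weighted box voting.

    detections: one (N_i, 5) array per model, rows [x1, y1, x2, y2, score];
    weights: one scalar weight per model (assumed, e.g. from validation).
    """
    # Scale each model's scores by its weight and pool all boxes together.
    pooled = np.vstack([d * np.array([1, 1, 1, 1, w])
                        for d, w in zip(detections, weights)])
    pooled = pooled[pooled[:, 4].argsort()[::-1]]        # strongest boxes first
    used = np.zeros(len(pooled), dtype=bool)
    fused = []
    for i in range(len(pooled)):
        if used[i]:
            continue
        cluster = [pooled[i]]
        for j in range(i + 1, len(pooled)):              # gather overlapping boxes
            if not used[j] and iou(pooled[i, :4], pooled[j, :4]) > iou_thr:
                used[j] = True
                cluster.append(pooled[j])
        cluster = np.array(cluster)
        w = cluster[:, 4:5]                              # scores act as voting weights
        box = (cluster[:, :4] * w).sum(axis=0) / w.sum() # score-weighted merged box
        fused.append(np.append(box, cluster[:, 4].mean()))
    return np.array(fused)

# Example usage (detection arrays and weights are hypothetical):
# fused = vote([frcnn_dets, ssd_dets, yolo_dets], weights=[1.0, 0.8, 0.9])
```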
After conducting this literature review, we are confident that our proposed method should work better than a single detection model. Since training and hyperparameter search are vital to the success of deep learning methods, we will try to obtain more computing resources to train the models and to form the final voting mechanism.