Due to the limited size of the dataset, we apply typical data augmentation operations that are widely used in the computer vision community, using imgaug to perform six operations on the images.
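A minimal sketch of such an imgaug pipeline is given below. The specific operators shown are assumptions for illustration, picked only because each one preserves the image size and the relative positions of pixels, so the bounding-box labels stay valid without cropping:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.bbs import BoundingBox, BoundingBoxesOnImage

# Hypothetical six-operation pipeline; every operator keeps the image size
# and relative pixel positions, and the 90°/270° rotation is the final step.
seq = iaa.Sequential([
    iaa.Fliplr(0.5),                           # horizontal flip
    iaa.Flipud(0.5),                           # vertical flip
    iaa.AdditiveGaussianNoise(scale=(0, 10)),  # per-pixel noise
    iaa.GaussianBlur(sigma=(0.0, 1.0)),        # mild blur
    iaa.LinearContrast((0.9, 1.1)),            # contrast jitter
    iaa.Rot90([1, 3]),                         # rotate 90° or 270°, picked at random
])

image = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
bbs = BoundingBoxesOnImage([BoundingBox(x1=100, y1=120, x2=200, y2=220)],
                           shape=image.shape)
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```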
We choose these six operations deliberately because they keep the original size and relative positions of the pixels. Furthermore, there is no need to crop the images, since the final augmentation operation only rotates the image by 90° or 270°, with one of the two chosen at random. One example of data augmentation is shown in Figure 1. After augmentation there are 3605 training images; the 33 testing images are left unaugmented, so we have 3638 images in total.
In our project, we mainly use three object detection models: Faster R-CNN, SSD, and YOLO. We discuss each model below. After data augmentation we have 3605 training images, so training takes more time to finish; we will evaluate the performance of each model once training is complete.
The Single Shot Detector (SSD) is a lightweight and fast model for object detection. It uses the feature pyramid formed by the CNN's progressive down-sampling of feature maps to detect objects at different scales. Five modules generate the feature maps on which anchor boxes are placed, their categories predicted, and their offsets fine-tuned. The following figure shows the basic structure of SSD.
The first module is the base network, whose output keeps the same height and width as the input image; the second through fourth modules are down-sampling blocks, each halving the height and width of its input, which enables multiscale detection; the fifth module is a global maximum pooling layer that reduces the height and width to 1. To generate anchor boxes at each pixel position, we use 5 scales and 3 aspect ratios (ratios of height to width). Instead of using all 5 × 3 = 15 scale-ratio combinations, we use 5 + 3 − 1 = 7 anchor boxes per position to reduce training time.
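To make the anchor-box count concrete, the sketch below uses MXNet's MultiBoxPrior operator, which implements exactly this scheme: every scale is paired with the first ratio, plus the first scale with each remaining ratio. The scale and ratio values here are assumptions for illustration, not our exact configuration.

```python
from mxnet import nd

# MultiBoxPrior yields len(sizes) + len(ratios) - 1 anchors per pixel:
# each size paired with ratios[0], plus sizes[0] paired with each of the
# remaining ratios -- here 5 + 3 - 1 = 7 boxes per position.
feature_map = nd.zeros((1, 3, 32, 32))         # dummy feature map
sizes = [0.2, 0.37, 0.54, 0.71, 0.88]          # 5 scales (assumed values)
ratios = [1, 2, 0.5]                           # 3 aspect ratios (assumed values)
anchors = nd.contrib.MultiBoxPrior(feature_map, sizes=sizes, ratios=ratios)
print(anchors.shape)                           # (1, 32*32*7, 4)
```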
We use the SSD module provided by GluonCV, a deep learning toolkit for computer vision built on top of MXNet. For the input dataset, we use the .rec format to pack the images and labels together, which decreases memory usage during training, and the accompanying .idx file to enable random batch access to the images.
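A minimal sketch of this setup is shown below; the file names, class list, and backbone choice are assumptions, not our exact configuration.

```python
import gluoncv as gcv
from gluoncv.data import RecordFileDetection

# RecordFileDetection reads a .rec file (with its .idx index alongside),
# giving random access to the packed image/label records.
train_dataset = RecordFileDetection('train.rec')  # assumes train.idx sits next to it

# A GluonCV SSD network with a custom class list (backbone is an assumption).
net = gcv.model_zoo.get_model('ssd_512_resnet50_v1_custom',
                              classes=['bleeding'], pretrained_base=True)
```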
You Only Look Once (YOLO) is a state-of-the-art, real-time object detection model that takes images as input and outputs predicted bounding boxes, class probabilities, and confidence values indicating how likely each box is to contain an object. We use the latest version, YOLO V3. Specifically, YOLO V3 divides the input image into an S×S grid; if the center of an object falls within a grid cell, YOLO V3 first predicts a bounding box and a confidence value for that box, and then outputs the probability that the object contained in the box belongs to each class. The figure below shows the structure of YOLO V3. We use the YOLO V3 implementation provided by Joseph Chet Redmon, which can process images at 30 FPS on a Pascal Titan X GPU, to detect bleeding sites in the images.
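The toy computation below illustrates how these outputs combine into a per-box, per-class score (the shapes and threshold are made-up values, not YOLO V3's real configuration): the class probability is scaled by the box confidence before non-maximum suppression.

```python
import numpy as np

S, B, C = 13, 3, 1                         # grid size, boxes per cell, classes (assumed)
box_conf = np.random.rand(S, S, B)         # Pr(object) * IoU, one value per box
class_prob = np.random.rand(S, S, B, C)    # Pr(class_i | object), per box

scores = class_prob * box_conf[..., None]  # class-specific confidence per box
keep = scores > 0.5                        # confidence threshold applied before NMS
```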
While waiting for the models to finish training, we explored more papers to get a better sense of our proposed solution. It turns out that a recently published paper by scientists from Johns Hopkins University uses the same idea we proposed in the project proposal to improve their model performance; they call this method model fusion and show that it is a promising solution to object detection problems in medical images.
The paper gives a clearer theoretical grounding for our method, which we had been unsure would work. The part their approach shares with ours is the 2D detector, where they combine SSD, Faster R-CNN, and R-FCN to perform vertebrae localization, as shown in the upper figure below. Their reason for using three object detection models is to generate more detected objects, which helps the nearest-neighbor clustering method better refine the final predictions. This overcomes detection errors and random fluctuations in individual images.
After carefully reading their paper, we believe our voting method can work for the following two reasons:
1) Each model performs differently on images from different patients; by weighting the models within the voting mechanism, the combined model can achieve higher performance than any single one.
2) Since each medical image provides only limited information about the objects, there is a high probability that a bleeding site is missed when only a single model is used. Combining the detection results of the three models therefore yields a more convincing and reliable detection, as the sketch below illustrates.
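The following is a minimal sketch of such a weighted voting step. It is our own illustrative implementation, not code from the paper: detections from the three models are pooled with per-model weights, greedily clustered by IoU, and each cluster is merged into one score-weighted box. The weights and the IoU threshold are assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def vote(detections, weights, iou_thr=0.5):
    """Fuse detections from several models by weighted box voting.

    detections: one (N_i, 5) array per model, rows [x1, y1, x2, y2, score];
    weights: one scalar weight per model (assumed, e.g. from validation).
    """
    # Scale each model's scores by its weight and pool all boxes together.
    pooled = np.vstack([d * np.array([1, 1, 1, 1, w])
                        for d, w in zip(detections, weights)])
    pooled = pooled[pooled[:, 4].argsort()[::-1]]        # strongest boxes first
    used = np.zeros(len(pooled), dtype=bool)
    fused = []
    for i in range(len(pooled)):
        if used[i]:
            continue
        cluster = [pooled[i]]
        for j in range(i + 1, len(pooled)):              # gather overlapping boxes
            if not used[j] and iou(pooled[i, :4], pooled[j, :4]) > iou_thr:
                used[j] = True
                cluster.append(pooled[j])
        cluster = np.array(cluster)
        w = cluster[:, 4:5]                              # scores act as voting weights
        box = (cluster[:, :4] * w).sum(axis=0) / w.sum() # score-weighted merged box
        fused.append(np.append(box, cluster[:, 4].mean()))
    return np.array(fused)

# Example usage (detection arrays and weights are hypothetical):
# fused = vote([frcnn_dets, ssd_dets, yolo_dets], weights=[1.0, 0.8, 0.9])
```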
After conducting this literature review, we are confident that our proposed method should work better than a single detection model. Since training and hyperparameter search are vital to the success of deep learning methods, we will try to obtain more computing resources to train the models and to form the final voting mechanism.