Object Detection: Faster R-CNN with Region Proposal Networks

1. Objective

In our final project, we study Faster R-CNN. We want to understand how Faster R-CNN reduces the running time of detection networks. We build a model that detects dice with the Faster R-CNN algorithm. The model recognizes the face-up value of each die and displays that information around the object.

2. Procedure

What Is R-CNN?

In 2012, Convolutional Neural Networks (CNNs) achieved great success in image classification. People then started to explore whether CNNs could also be applied to object detection. R-CNN was one of the first publications to apply a CNN to object detection successfully. After R-CNN was released, researchers kept improving it and produced two revisions, Fast R-CNN and Faster R-CNN. Faster R-CNN brings object detection close to real time.

R-CNN relies on a technique called selective search. With this technique, an image is first divided into many small regions, and these small regions are then recursively merged into larger regions. The resulting regions are called region proposals. Each region proposal is fed into a CNN for feature extraction, and an SVM (Support Vector Machine) classifier determines its class. Once the region proposals are classified, their bounding boxes are passed to a linear regressor that makes them more precise.

Fast R-CNN and Faster R-CNN Comparison

Because region proposals often overlap, passing every proposal through the CNN for feature extraction repeats the same computation many times and is slow. Fast R-CNN resolves this by running the CNN once on the whole image and applying a technique called Region of Interest (RoI) Pooling, which extracts a fixed-size feature for each proposal from the shared feature map instead of passing every proposal through the CNN. It also replaces the SVM classifier with a softmax layer and folds the bounding-box regressor into the same network, so a single network is trained and run instead of three separate models.
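To make the idea of Region of Interest Pooling concrete, here is a minimal sketch in plain Python. It is not the Fast R-CNN implementation: it assumes a single-channel feature map stored as a list of lists, a region given as (x1, y1, x2, y2) in feature-map coordinates with x2 and y2 exclusive, and a hypothetical function name roi_max_pool. The point it illustrates is that regions of different sizes are all reduced to the same fixed output size.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of a 2D feature map down to a fixed output size.

    feature_map: list of rows (single channel, for illustration only).
    roi: (x1, y1, x2, y2) in feature-map coordinates, x2/y2 exclusive.
    """
    x1, y1, x2, y2 = roi
    region = [row[x1:x2] for row in feature_map[y1:y2]]
    height, width = len(region), len(region[0])
    out_h, out_w = output_size

    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one row.
        r0 = i * height // out_h
        r1 = max((i + 1) * height // out_h, r0 + 1)
        pooled_row = []
        for j in range(out_w):
            # Column range of bin j; every bin covers at least one column.
            c0 = j * width // out_w
            c1 = max((j + 1) * width // out_w, c0 + 1)
            pooled_row.append(
                max(value for row in region[r0:r1] for value in row[c0:c1])
            )
        pooled.append(pooled_row)
    return pooled


# A toy 6x6 feature map whose value at (row, col) is 6 * row + col.
feature_map = [[6 * r + c for c in range(6)] for r in range(6)]

print(roi_max_pool(feature_map, (0, 0, 4, 4)))  # [[7, 9], [19, 21]]
print(roi_max_pool(feature_map, (1, 2, 6, 5)))  # [[14, 17], [26, 29]]
```

Both regions, one 4x4 and one 3x5, come out as 2x2 grids, so the layers that follow can have a fixed input size no matter how large each proposal is.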

Fast R-CNN greatly improved processing speed. However, it still cannot reach real-time object detection: it needs selective search to generate the region proposals, and that is the slowest step in the entire pipeline. The next revision, Faster R-CNN, replaces selective search with a Region Proposal Network (RPN) that reuses the CNN features to generate the region proposals, which brings detection close to real time.
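The text above does not describe how proposals are generated from CNN features. In the Faster R-CNN paper, the Region Proposal Network scores a fixed set of reference boxes, called anchors, centered at every position of the feature map. The sketch below only enumerates such anchors. The stride of 16 pixels, the three scales and the three aspect ratios follow the defaults reported in the paper, and the function name generate_anchors is our own; the settings used in this project may differ.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) in image pixels.

    One anchor is created per (scale, ratio) pair at every feature-map cell.
    ratio is height / width, and each anchor keeps an area of scale ** 2.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of this feature-map cell, mapped back to image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 54 = 2 * 3 cells * 9 anchors per cell
```

The network then predicts, for each anchor, an objectness score and a box refinement. Because this runs on features the detector has already computed, proposals cost little extra time compared with a separate selective search pass.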

Implementation

Collecting Images

Since we could not find enough suitable data online, we decided to collect the data ourselves. We took 243 images of dice, with roughly the same number of each face value from one to six. The dataset was then split into two sets: 188 images for training and 55 images for testing. The images contain different numbers of dice, showing different face-up values, on a variety of backgrounds.
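A split like the one above can be reproduced with a few lines of Python. This is only a sketch of one way to do it: the file names are made up, and the fixed random seed is an assumption added so that the split is repeatable. The report does not say how the 55 test images were chosen.

```python
import random


def split_dataset(image_paths, num_test, seed=0):
    """Shuffle the paths reproducibly and return (train, test) lists."""
    paths = sorted(image_paths)
    random.Random(seed).shuffle(paths)
    return paths[num_test:], paths[:num_test]


# Hypothetical file names standing in for the 243 collected images.
all_images = [f"dice_{i:03d}.jpg" for i in range(243)]
train_images, test_images = split_dataset(all_images, num_test=55)
print(len(train_images), len(test_images))  # 188 55
```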

Labeling Images

We labeled all the images ourselves using a labeling tool called LabelImg. We used only six-sided dice to reduce both the number of images required for our dataset and the complexity of the labeling process. Every die in each image was therefore labeled with one of six classes (1 through 6) according to its face-up side.
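By default, LabelImg saves each image's annotations as a Pascal VOC XML file. Assuming that format was used here, the sketch below reads the class name and bounding box of every labeled die. The sample annotation, the file name and the class names ("four", "two") are invented for illustration and are not taken from our dataset.

```python
import xml.etree.ElementTree as ET

# A made-up annotation in the Pascal VOC layout that LabelImg writes.
SAMPLE_ANNOTATION = """
<annotation>
  <filename>dice_001.jpg</filename>
  <object>
    <name>four</name>
    <bndbox><xmin>120</xmin><ymin>85</ymin><xmax>210</xmax><ymax>170</ymax></bndbox>
  </object>
  <object>
    <name>two</name>
    <bndbox><xmin>300</xmin><ymin>140</ymin><xmax>385</xmax><ymax>228</ymax></bndbox>
  </object>
</annotation>
"""


def parse_voc_annotation(xml_text):
    """Return (filename, [(class_name, (xmin, ymin, xmax, ymax)), ...])."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        bndbox = obj.find("bndbox")
        coords = tuple(int(bndbox.findtext(tag))
                       for tag in ("xmin", "ymin", "xmax", "ymax"))
        boxes.append((obj.findtext("name"), coords))
    return root.findtext("filename"), boxes


filename, boxes = parse_voc_annotation(SAMPLE_ANNOTATION)
print(filename, boxes)
```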

training data example


labeled data


Training the Model

We then used the TensorFlow Object Detection API to train the Faster R-CNN object detection model. We trained the model for about 3 hours (30,000 steps).
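The TensorFlow Object Detection API expects a label map file that assigns an integer id to each class, starting at 1 because 0 is reserved for the background. The sketch below generates the text of such a file for our six classes. The class names are an assumption for illustration; the names actually used during labeling may have been different.

```python
# Assumed class names, one per face-up value.
CLASS_NAMES = ["one", "two", "three", "four", "five", "six"]


def build_label_map(class_names):
    """Return label map text in the 'item { id: ... name: ... }' format."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(f"item {{\n  id: {class_id}\n  name: '{name}'\n}}\n")
    return "\n".join(items)


label_map_text = build_label_map(CLASS_NAMES)
print(label_map_text)
```

The printed text can be saved to a .pbtxt file and referenced from the training configuration.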

3. Result

Successful Detection


Error Detection


In the end, our model can detect six-sided dice in both images and videos, and the overall detection results are satisfactory. However, because a die has six sides, it is sometimes hard for the network to tell which side is facing upward. For example, if the angle at which the image is taken reveals both side two and side three of a die, the network might report the wrong face (three, say), because it sees five dots in total. If our purpose is purely to detect dice and recognize the numbers on them, the model fulfills that purpose. If instead our purpose is to detect which side is facing upward, we need to place the camera directly above the dice to minimize the visibility of the sides that face horizontally. As the image above shows, the model works quite well when the camera faces straight down. Because of our limited resources, the model sometimes produces wrong detections, like the one shown above.

4. Analysis

In this project, only a little over 200 images were used to train the Faster R-CNN network; training took about three hours and the accuracy reached about 80 percent. This suggests that the Faster R-CNN method is fast and can be trained with a reasonable amount of computational resources. Because of time limits, we could not train other networks for comparison. In the future, we may train other networks so that we can compare them and see whether Faster R-CNN with a region proposal network is indeed faster and more efficient.


5. References

    1. The use of CNN models in the subcortical visual pathway. https://ieeexplore.ieee.org/abstract/document/222799/metrics#metrics
    2. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. https://papers.nips.cc/paper/5638-faster-r-cnn-towards-real-time-object-detection-with-region-proposal-networks.pdf
    3. ImageNet Classification with Deep Convolutional Neural Networks. https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf