In this project, we adopted the Faster R-CNN framework provided in MATLAB. We tested several CNN core structures on top of this framework for the object detection task and compared the results to see whether we could find an affordable way to achieve good precision using only the CPU of a personal computer. We built a 19-layer CNN and compared it with AlexNet and VGG-16.
The training input consists of the original images and a table containing the ground-truth labels of the objects (cars, trucks, and pedestrians). The first column of the table holds the file path and file name of each image; the second, third, and fourth columns hold the bounding box coordinates of cars, trucks, and pedestrians, respectively. Each bounding box is an array of four numbers, [xmin, ymin, width, height], giving the position and size of the object.
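As an illustration of the table layout described above, the following is a minimal Python sketch of one ground-truth row together with a basic sanity check on its boxes. It is not the MATLAB code used in the project: the names `GroundTruthRow`, `is_valid_box`, and `invalid_boxes`, the example file path, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# A bounding box in the format used by the table: [xmin, ymin, width, height].
Box = List[float]


@dataclass
class GroundTruthRow:
    """One row of the ground-truth table (hypothetical Python mirror)."""
    image_path: str                                       # column 1: file path + file name
    cars: List[Box] = field(default_factory=list)         # column 2
    trucks: List[Box] = field(default_factory=list)       # column 3
    pedestrians: List[Box] = field(default_factory=list)  # column 4


def is_valid_box(box: Box) -> bool:
    """A box needs exactly four numbers and a positive width and height."""
    return len(box) == 4 and box[2] > 0 and box[3] > 0


def invalid_boxes(row: GroundTruthRow) -> List[Tuple[str, Box]]:
    """Return (label, box) pairs for every malformed box in the row."""
    bad = []
    for label in ("cars", "trucks", "pedestrians"):
        for box in getattr(row, label):
            if not is_valid_box(box):
                bad.append((label, box))
    return bad


# Made-up example row: one car, no trucks, two pedestrians (one with zero width).
row = GroundTruthRow(
    image_path="images/0001.jpg",
    cars=[[120, 80, 60, 40]],
    pedestrians=[[300, 90, 20, 55], [10, 10, 0, 5]],
)
problems = invalid_boxes(row)
```

Running a check of this kind over the whole table before training can catch degenerate boxes early, which is cheaper than discovering them partway through a long CPU training run.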
The output of the model for each image contains three pieces of information: the bounding box coordinates, the label, and the confidence score of each prediction.
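The three output fields can be pictured as one record per detected object. The sketch below is a hypothetical Python representation, not the MATLAB detector's actual return type. The `Detection` class, the `keep_confident` helper, the 0.5 threshold, and the example values are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One predicted object (hypothetical container for the three outputs)."""
    box: List[float]  # [xmin, ymin, width, height]
    label: str        # "car", "truck", or "pedestrian"
    score: float      # confidence score of the prediction


def keep_confident(detections: List[Detection], threshold: float = 0.5) -> List[Detection]:
    """Keep detections scoring at or above the (assumed) threshold, best first."""
    kept = [d for d in detections if d.score >= threshold]
    return sorted(kept, key=lambda d: d.score, reverse=True)


# Made-up predictions for a single image.
predictions = [
    Detection([120, 80, 60, 40], "car", 0.91),
    Detection([300, 90, 20, 55], "pedestrian", 0.34),
    Detection([40, 60, 110, 70], "truck", 0.77),
]
confident = keep_confident(predictions)
```

Filtering by score in this way is a common post-processing step when comparing detectors, since the chosen threshold trades off precision against recall.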
At first, we wanted to train each model on all 4000 images at once, but our computer could not handle the computation. In the end, we trained on 300 images at a time, saved the trained network, and continued training from the saved network until we had gone through all the training samples, which is similar to transfer learning. The difference between the two models is that our model was trained only on our training samples, whereas the model using the pretrained AlexNet also carries information acquired from previous training on many more samples for feature extraction and classification.
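The chunked training procedure described above can be sketched as a simple loop. This is a Python illustration of the control flow only, not the MATLAB training code. `chunked`, `train_incrementally`, and the stand-in `count_seen` are hypothetical names, and saving and reloading the network between chunks is reduced here to passing the model object along.

```python
from typing import Callable, Iterator, Optional, Sequence, TypeVar

T = TypeVar("T")
M = TypeVar("M")


def chunked(samples: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most `size` samples."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(samples), size):
        yield samples[start:start + size]


def train_incrementally(
    samples: Sequence[T],
    chunk_size: int,
    train_fn: Callable[[Optional[M], Sequence[T]], M],
    model: Optional[M] = None,
) -> Optional[M]:
    """Train on one chunk at a time, feeding each result into the next round."""
    for chunk in chunked(samples, chunk_size):
        # In the real workflow the network would be saved to disk here
        # and reloaded before the next chunk.
        model = train_fn(model, chunk)
    return model


def count_seen(model: Optional[int], chunk: Sequence[int]) -> int:
    """Stand-in for a real training step: counts the samples seen so far."""
    return (model or 0) + len(chunk)


sample_ids = list(range(4000))  # 4000 training images, as in the text
chunk_sizes = [len(c) for c in chunked(sample_ids, 300)]
total_seen = train_incrementally(sample_ids, 300, count_seen)
```

With 4000 samples and chunks of 300, this yields 13 full chunks and a final chunk of 100 images, so every sample is visited exactly once per pass.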
Our model has 19 layers in its CNN, and AlexNet has 25 layers in its CNN part. The architectures of the two networks are shown below:
CNN Architecture of AlexNet
CNN Architecture of our model
There are four main steps for the training: