INTRODUCTION
I.A. Overview
Computers today have the power to do far more than humans could once imagine. It is fascinating and motivating to understand how they can look at a photograph and tell what is in it [1]. Only a few years ago, this kind of image recognition was science fiction, but thanks to many technological advancements and innovations it has become a part of many software applications. For example, if we pass in a picture of a cat, the computer should generate a label saying 'cat' because that is the main object appearing in the picture. In the past few years, researchers have made huge breakthroughs in image recognition thanks to neural networks. Because of neural networks, it is now possible to recognize objects in photographs with very high accuracy.
Fig 1. General representation of neural networks
A neural network consists of separate nodes called neurons, which are arranged into a series of groups called layers. Nodes in each layer are connected to the nodes in the following layer, and data flows from the input to the output along these connections, as shown in Figure 1. Each individual node is trained to perform a simple mathematical calculation and feeds its result to all the nodes it is connected to [2]. Each node tweaks the value it receives slightly and passes the result on to the next node. For example, we could use a neural network to do simple addition: we pass the two values we want to add, say 2 and 3, into the input layer, and the network returns 5 at the output layer. However, neural networks are not limited to simple operations like addition [3]. With data flowing across an entire network comprising many layers, a neural network is able to model much more complex operations.
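As a minimal illustration (separate from the Keras model described later), the following Python sketch shows what a single node computes; the weights and bias here are chosen by hand purely for illustration, whereas in a real network they are learned during training.

```python
import numpy as np

def node_output(inputs, weights, bias):
    """One node: a weighted sum of its inputs plus a bias."""
    return np.dot(inputs, weights) + bias

# With both weights set to 1.0 and no bias, the node simply adds its two
# inputs, mirroring the 2 + 3 = 5 example above.
print(node_output(np.array([2.0, 3.0]), np.array([1.0, 1.0]), bias=0.0))  # 5.0
```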
The main goal of this project is to develop a custom image recognition system using a neural network. First, the project introduces how to design a neural network architecture capable of recognizing objects appearing in a photograph. The developed model is then trained with thousands of images so it can tell the difference between different kinds of objects, such as dogs and airplanes. Further efforts will also show how to use transfer learning to leverage pre-trained neural networks to build object recognition systems more quickly and with less training data.
I.B. Model specifications
This project uses a software framework called Keras to code the neural networks. It is a high-level library for building neural networks in Python with only a few lines of code. Each node of the neural network is connected to every node in the following layer; such networks are called densely connected networks [4]. Densely connected layers are the most basic kind of layer in a neural network. Using Keras, we can also customize how each layer works by assigning it a corresponding activation function. Before values flow from the nodes in one layer to the next, they pass through an activation function, which decides which inputs from the previous layer are important enough to feed to the next layer. The project uses a combination of Softmax and ReLU.
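A minimal sketch of such a densely connected network, assuming the tf.keras API; the layer sizes here are illustrative placeholders rather than the project's final configuration:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A small densely connected network: every node feeds every node in the next layer.
model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(32 * 32 * 3,)))  # hidden layer with ReLU
model.add(Dense(10, activation='softmax'))                            # output layer with Softmax
```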
The final step of defining a neural network is to compile it by calling model.compile. This tells Keras that the model definition is complete and that it can now be built in memory. While compiling, it is also essential to pass in the optimizer algorithm and the loss function: the optimizer algorithm is used to train the neural network, and the loss function measures how correct the network's predictions are during training. This creates a complete neural network that can be trained to solve very simple classification problems. To recognize objects in images, however, there is a need to create much larger neural networks with much larger input layers and more complex layer types.
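Continuing the sketch above, compilation only requires naming the optimizer and the loss function; the specific choices actually used by the model are given in the Implementation section:

```python
# Compiling marks the model definition as complete and specifies how to train it:
# the optimizer updates the weights, the loss function scores how wrong each prediction is.
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)
```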
I.C. Significance
Images are stored on a computer as a series of individual color pixels. Each color pixel is made up of a mix of three colors: red, green, and blue.
Fig 2. Matrix representation of image
The color intensities are stored in the form of a matrix. Each pixel, or element of the matrix, is just a number between 0 and 1 that represents how intense the color should be at that point, with bright points being closer to 1 and darker points being closer to 0. That means each color channel is really just a two-dimensional array of numbers, one per pixel in the image. If we layer the three color channels on top of each other, we get an image that can be considered a three-dimensional array. A more common way to represent images is with 8 bits per channel, in which case the values vary from 0 to 255 rather than from 0 to 1.
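A minimal sketch of this representation, using a randomly generated 32x32 RGB image for illustration:

```python
import numpy as np

# A 32x32 RGB image is a 3D array: height x width x 3 color channels,
# here filled with random 8-bit values for illustration.
image_8bit = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# Convert from the 0-255 range to the 0-1 range described above.
image_normalized = image_8bit.astype('float32') / 255.0
print(image_8bit.shape)                                # (32, 32, 3)
print(image_normalized.min(), image_normalized.max())  # values between 0.0 and 1.0
```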
Let us assume that we want to recognize a 256x256 pixel image [5]. Even with this small image, we need 256 x 256 x 3 input nodes in the neural network, which comes out to 196,608 input nodes. Further, each layer of the neural network will use even more nodes, and the number of nodes in the entire neural network quickly grows into the millions [6]. It is therefore evident that using neural networks for image processing is very computationally intensive, as processing an image requires sending it through a neural network of millions of nodes. The proposed project tackles this issue and looks into techniques like max pooling and dropout to reduce the computation.
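The input-node count follows directly from the image dimensions, as the following one-line computation illustrates:

```python
# 256x256 RGB image: one input node per color value.
height, width, channels = 256, 256, 3
print(height * width * channels)  # 196608 input nodes
```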
DATA
In order for neural networks to perform accurately, a large amount of training data must be supplied. Hence, the project uses the CIFAR-10 dataset [7]. This dataset includes 60,000 32x32 color images in 10 classes, with 6,000 images per class, as shown in Figure 3. The entire dataset is subdivided into 50,000 training images and 10,000 testing images. Each image in the dataset also includes a matching label, which represents the ground truth.
Fig 3. CIFAR-10 data set
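A minimal sketch of loading and preparing the dataset, assuming the copy bundled with tf.keras; the normalization and one-hot encoding follow the image representation discussed earlier:

```python
from tensorflow.keras.datasets import cifar10
from tensorflow.keras.utils import to_categorical

# Load CIFAR-10: 50,000 training and 10,000 testing images, 32x32x3 each.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Scale pixel values to the 0-1 range and one-hot encode the 10 class labels.
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 10)
```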
The network considered in the project is built using dense layers, convolution layers, max pooling, and dropout layers. The implementation of each layer is described in the next section.
IMPLEMENTATION
Fig 4. Algorithm flowchart
Fig 5. Simplistic network without any additional layers
The neural network should be able to identify the object in an image even if it is not centered, i.e. it should recognize objects in any position. This property is called translational invariance. The simplistic model shown in Figure 5 is not able to achieve this, and the solution is to add a new type of layer to the neural network called the convolutional layer. Unlike a normal dense layer, where every node is connected to every node in the next layer, this layer breaks apart the image in a special way so that it can recognize the same object in different positions. It breaks the image into small, overlapping tiles by passing a small window over the image, producing a 3D array in which each element records where a certain pattern occurs. Because every tile of the original image is checked, it does not matter where in the image a pattern appears. This 3D array is fed into the next layer of the neural network, which uses this information to decide which patterns are most important in determining the final output. Adding a convolutional layer therefore makes it possible for the neural network to find a pattern no matter where it appears in an image. The schematic is represented in Fig 6.
Fig 6. Neural network after addition of convolution layer
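A minimal sketch of adding such a convolutional layer with Keras; the filter count (32) and the 3x3 window size are illustrative assumptions rather than the project's exact settings:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D

model = Sequential()
# Slide 3x3 windows over the 32x32x3 image; each of the 32 filters produces a
# feature map recording where its pattern occurs.
model.add(Conv2D(32, (3, 3), padding='same', activation='relu',
                 input_shape=(32, 32, 3)))
```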
After the convolutional layer comes max pooling. Max pooling is a layer that scales down the output of the convolutional layers by keeping only the largest values and throwing away the smaller ones. This makes the neural network more efficient by discarding the least useful data and keeping the most useful data. The only parameter that needs to be passed is the size of the area to pool together. For instance, using a two-pixel by two-pixel pool size divides the image into two-by-two squares and takes only the largest value from each two-by-two region. This reduces the size of the image while keeping the most important values. The idea is that we still capture roughly where each pattern was found in the image, but with a quarter as much data. The end result is nearly the same, but with a lot less work for the computer in the following layer of the neural network. The schematic is represented in Fig 7.
Fig 7. Neural network after addition of max pooling
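Continuing the sketch above (so `model` is the Sequential object defined there), a 2x2 max pooling layer is added as follows:

```python
from tensorflow.keras.layers import MaxPooling2D

# Keep only the largest value in each 2x2 region, shrinking every feature map
# to a quarter of its size while preserving roughly where each pattern was found.
model.add(MaxPooling2D(pool_size=(2, 2)))
```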
Before going to the output, it is of great advantage to add a dropout layer. One of the problems with neural networks is that they tend to memorize the input data instead of actually learning how to tell different objects apart. We can force the neural network to try harder to learn without memorizing the input data by randomly throwing away some of the data, i.e. by cutting some of the connections between the layers. This is called dropout. The only parameter that needs to be passed in is the percentage of neural network connections to randomly cut; the proposed model uses 25%. By randomly cutting different connections with each training image, the neural network is forced to try harder to learn. The method is called dropout because it simply lets some of the data drop out of the network at random. The schematic is represented in Fig 8.
Fig 8. Neural network after addition of dropout
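Continuing the same sketch, the 25% dropout described above corresponds to a single additional layer:

```python
from tensorflow.keras.layers import Dropout

# Randomly cut 25% of the connections on each training pass so the network
# cannot simply memorize the training images.
model.add(Dropout(0.25))
```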
With all the layers in position, Keras proceeds to compile the model. Since there are 10 different possible categories in the CIFAR-10 data set, the model uses categorical crossentropy as the loss function. Apart from that, Adam (Adaptive Moment Estimation) is used as the optimization algorithm for training the neural network.
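Putting the pieces together, the following self-contained sketch shows one possible arrangement of the layers and the compile step described above; the filter and unit counts are illustrative, while the loss function and optimizer match the choices stated in the text:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

# One possible arrangement of the layers described above.
model = Sequential([
    Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.25),
    Flatten(),                        # flatten feature maps before the dense layers
    Dense(512, activation='relu'),
    Dense(10, activation='softmax'),  # one output per CIFAR-10 class
])

model.compile(
    loss='categorical_crossentropy',  # 10 mutually exclusive categories
    optimizer='adam',                 # Adaptive Moment Estimation
    metrics=['accuracy']
)
```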
TRAINING SPECIFICATIONS
To train the network, it is important to define an optimal batch size, which tells how many images are fed into the network at once during training. Too small a batch size makes training take a very long time, whereas too large a batch size may run out of memory. The proposed model uses a batch size of 60. One full pass through the entire training data set is called an epoch; for this project, 30 passes through the training data set are used. More passes through the data give the neural network more chances to learn, but also take longer to train. With all layers set up and training performed, the model can predict labels for arbitrary test images. The model is developed such that, for an input image, the output is a label along with the likelihood that the prediction is correct.
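A minimal sketch of the corresponding training and prediction calls, assuming the model and the preprocessed CIFAR-10 arrays from the earlier sketches; the class-name list is the standard CIFAR-10 ordering:

```python
import numpy as np

# Train with the batch size and number of epochs described above.
model.fit(x_train, y_train,
          batch_size=60,
          epochs=30,
          validation_data=(x_test, y_test),
          shuffle=True)

# Predict a single image: the output is one likelihood per class, and the
# class with the highest likelihood becomes the predicted label.
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
probabilities = model.predict(x_test[:1])[0]
best = int(np.argmax(probabilities))
print(class_names[best], probabilities[best])
```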
RESULTS AND DISCUSSION
Fig 9. Working model
Fig 10. Layer details
Figure 9 represents the entire working model. As discussed previously, the model is a combination of various layers such as convolution, max pooling, and dropout. Figure 10 shows the layer-by-layer analysis and details of the model. We can see that the dropout and max pooling layers did not add any parameters to the system.
The model was trained from scratch based on the specifications listed in the previous section. Ten images belonging to the respective classes were chosen at random from the internet. Fig 11 shows the likelihood achieved for each image as a result of prediction using the trained neural network. It can be seen that the model does an excellent job of predicting all of the images. The likelihoods of all 10 images are above 83%, with 7 of them above 99%; for all the images, the likelihood lies between 83.7% and 100%.
Fig 11. Likelihood of randomly supplied images
The results were compared with pre-existing models such as VGG16, VGG19, and ResNet50. Table 1 and Table 2 correspond to the possible predictions made by each of the networks, with the probability of the highest likelihood placed at the top of each column of the table. All three networks are pre-trained and are very detailed in predicting the contents of the images; for instance, Table 2 shows the possible breed of the dog. An interesting thing to note here is that the likelihood for the airplane image is higher than that for the dog image. This trend is also observed in the manually developed neural network, as shown in Fig 11. The details of the likelihood of each category are specified in the supplementary material.
Table 1. Pre-trained model comparison for airplane image
Table 2. Pre-trained model comparison for dog image
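For reference, such a pre-trained comparison can be reproduced with keras.applications, as in the following sketch; the file name is a placeholder for one of the downloaded test images, and VGG19 or ResNet50 can be substituted in the same way:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input, decode_predictions
from tensorflow.keras.preprocessing import image

# Load VGG16 with ImageNet weights; 'dog.jpg' is a placeholder path for one
# of the downloaded test images.
model = VGG16(weights='imagenet')
img = image.load_img('dog.jpg', target_size=(224, 224))  # VGG16 expects 224x224 input
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

# Print the top-3 predicted classes with their likelihoods, comparable to
# the columns of Tables 1 and 2.
for _, label, prob in decode_predictions(model.predict(x), top=3)[0]:
    print(label, prob)
```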
CONCLUSION
The proposed model works efficiently in predicting and labeling the images. Since the model was developed and trained from scratch, it took a long time to complete training. However, this can be improved by further adding dropout layers; this is left for future exploration. The model also performs in line with the pre-trained models, and better in some cases. Overall, the developed model is suitable for an image recognition system and can be trained with a more exhaustive data set to predict images that do not lie in the CIFAR-10 data set.
REFERENCES
[1] M. Koziarski and B. Cyganek, “Image recognition with deep neural networks in presence of noise - Dealing with and taking advantage of distortions,” Integr. Comput. Aided. Eng., vol. 24, no. 4, pp. 337–349, Jan. 2017, doi: 10.3233/ICA-170551.
[2] “What Is a CNN?” Accessed: Dec. 13, 2020. [Online]. Available: www.cadence.com.
[3] M. I. Quraishi, J. P. Choudhury, and M. De, “Image recognition and processing using artificial neural network,” in 2012 1st International Conference on Recent Advances in Information Technology, RAIT-2012, 2012, pp. 95–100, doi: 10.1109/RAIT.2012.6194487.
[4] B. B. Traore, B. Kamsu-Foguem, and F. Tangara, “Deep convolution neural network for image recognition,” Ecol. Inform., vol. 48, pp. 257–268, Nov. 2018, doi: 10.1016/j.ecoinf.2018.10.002.
[5] J. Fu, H. Zheng, and T. Mei, “Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition,” 2017.
[6] S. Xie, A. Kirillov, R. Girshick, and K. He, “Exploring Randomly Wired Neural Networks for Image Recognition,” 2019. Accessed: Dec. 13, 2020. [Online]. Available: https://github.com/facebookresearch/RandWire.
[7] “CIFAR-10 and CIFAR-100 datasets.” https://www.cs.toronto.edu/~kriz/cifar.html (accessed Dec. 13, 2020).
APPENDIX