Weights and Biases
In a CNN, each neuron has weights and biases that are adjusted during training. The trained model can then be used to classify the test data.
In a CNN, the weights and biases of a filter are shared across every position in the image, so each filter in a hidden layer detects one feature wherever it appears. This is really helpful, because translation of an image won't affect detection. For example, when we resized and recolored the images in our project, this did not affect the CNN: it was still able to detect specific features of both cats and dogs.
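A small sketch can make the weight-sharing idea concrete. Below, a single 3 x 3 filter (a hypothetical vertical-edge detector, not one from our trained model) slides over two images that contain the same feature in different places; the strongest response moves with the feature, which is exactly the translation property described above.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same kernel (shared weights) is applied at every position."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical filter weights: a vertical-edge detector
kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])

# A bright vertical bar at column 3, then the same bar shifted to column 4
img = np.zeros((7, 7)); img[:, 3] = 1.0
shifted = np.zeros((7, 7)); shifted[:, 4] = 1.0

resp = conv2d(img, kernel)
resp_shifted = conv2d(shifted, kernel)

# The peak response moves with the feature: same detection, new location
print(np.unravel_index(resp.argmax(), resp.shape))          # peak at column 3
print(np.unravel_index(resp_shifted.argmax(), resp_shifted.shape))  # peak at column 4
```

Because one set of weights covers the whole image, the layer needs far fewer parameters than a fully connected layer of the same size.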
Activation and Pooling
Activation applies a nonlinear transformation to each neuron's output.
Pooling reduces the spatial dimensions of the output layers by summarizing nearby values, for example taking the maximum over each 2 x 2 window.
Both of these help simplify the model: pooling shrinks the output so later layers need fewer parameters to make a decision, while activation lets the network learn patterns that aren't just straight lines.
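These two steps can be sketched in a few lines of numpy. The feature map values below are made up for illustration; the point is that ReLU (one common activation choice) zeroes out negative responses, and 2 x 2 max pooling halves the height and width.

```python
import numpy as np

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool_2x2(x):
    """Pooling: max of each non-overlapping 2 x 2 window, halving height and width."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A made-up 4 x 4 feature map
feature_map = np.array([[ 1., -2.,  3.,  0.],
                        [-1.,  4., -3.,  2.],
                        [ 0., -1.,  1., -2.],
                        [ 2.,  3., -4.,  1.]])

pooled = max_pool_2x2(relu(feature_map))
print(pooled)  # a 2 x 2 summary of the 4 x 4 map
```

After pooling, the next layer sees a quarter as many values, which is where the parameter savings come from.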
Together these three ideas (convolution with shared weights, activation, and pooling) allow us to create all the layers (input, hidden, and output) of the CNN.
Application
CNNs are a very important technique that we are using to classify cats and dogs. They not only cover concepts from class (convolution, weights, and biases) but also go further to introduce new topics such as hidden layers and neural network architectures. We are using CNNs to increase the accuracy of our model, since they can detect multiple features even when an image has been translated, resized, or filtered.
One layer CNN Model
We started exploring CNNs with a simple single-layer CNN: 32 filters followed by a max pooling layer, modeled after the VGG (Visual Geometry Group) architecture. We used a batch size of 64 images and 20 epochs. As you can see below, our one-layer CNN model is built from 32 3 x 3 convolution filters followed by a 2 x 2 max pooling layer. We then have a flattening layer and two "dense" (fully connected) layers.
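A minimal sketch of this architecture in Keras is shown below. The 3 x 3 filter count, pooling size, batch size, and epochs come from the description above; the input size (200 x 200 RGB), the 128-unit first dense layer, and the optimizer are assumptions for illustration, not our exact settings.

```python
# Sketch of the one-layer CNN described above, assuming a Keras-style setup.
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(200, 200, 3)),             # assumed image size (RGB)
    layers.Conv2D(32, (3, 3), activation="relu"),  # 32 filters of 3 x 3
    layers.MaxPooling2D((2, 2)),                   # 2 x 2 max pooling
    layers.Flatten(),                              # flattening layer
    layers.Dense(128, activation="relu"),          # first "dense" layer (size assumed)
    layers.Dense(1, activation="sigmoid"),         # cat vs. dog output
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then use something like:
#   model.fit(train_data, batch_size=64, epochs=20)
```

The single sigmoid output suits the two-class (cat vs. dog) setup; a softmax over two units would be an equivalent choice.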