Data
Database
Our initial dataset was sourced from Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data). It is divided into two folders, one for training data and one for test data. The training data includes 12,500 images of dogs and 12,500 images of cats, with the correct class given in each image's file name (e.g. dog-8837.jpg). The test data includes 12,500 images of dogs and cats without their correct classifications (e.g. 1.jpg). Because the test data is unlabeled, and because we want to avoid manually classifying such a large number of files just to measure the accuracy of our prediction models, we modified the dataset to suit our needs.
Our Training and Testing Data
Our training data will include 80% of the labeled training data provided by Kaggle (10,000 cats and 10,000 dogs). Our test data will include the remaining 20% (2,500 cats and 2,500 dogs). We will fit each model on the training data and then check whether it correctly predicts the known classes of the test data.
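The 80/20 split described above can be sketched as follows. This is a minimal illustration, not necessarily the procedure we ran: the function name split_by_class, the fixed random seed, and the choice to shuffle before cutting are all assumptions, since the split only needs to yield 10,000 training and 2,500 test images per class.

```python
import random
import re
from collections import defaultdict


def split_by_class(filenames, train_fraction=0.8, seed=0):
    """Split labeled file names into train and test lists, class by class.

    Hypothetical helper: the seed and the shuffle are illustrative choices.
    """
    by_label = defaultdict(list)
    for name in filenames:
        # The label is the leading run of letters, e.g. "dog" in "dog-8837.jpg".
        match = re.match(r"[A-Za-z]+", name)
        if match is None:
            raise ValueError(f"no class label in file name: {name!r}")
        by_label[match.group(0).lower()].append(name)

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        names = sorted(by_label[label])  # sort first so the shuffle is reproducible
        rng.shuffle(names)
        cut = round(len(names) * train_fraction)
        train.extend(names[:cut])
        test.extend(names[cut:])
    return train, test
```

Splitting each class separately keeps the two classes balanced in both subsets, which matches the equal cat and dog counts given above.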
Resizing and Filtering
The images in our dataset do not all share the same dimensions, but our model creation process requires inputs of a uniform size. Therefore, we resized every image to 200 x 200 pixels with 3 color channels (200 x 200 x 3); the choice of 200 was arbitrary. This can introduce errors: resizing a rectangular picture of a dog to a square changes its proportions and could, for example, make it look more like a cat. However, for the sake of our project we assumed that resizing would not distort the images significantly.
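To make the resizing step and its distortion concrete, here is a minimal pure-Python sketch. It assumes nearest-neighbor sampling on an image stored as a list of rows of (R, G, B) tuples; the resampling method and the helper names resize_nearest and aspect_distortion are illustrative assumptions, and in practice an image library would normally do this work.

```python
def resize_nearest(pixels, new_width=200, new_height=200):
    """Resize a row-major grid of pixels by nearest-neighbor sampling.

    Each output pixel copies the source pixel at the proportional position.
    """
    old_height = len(pixels)
    old_width = len(pixels[0])
    return [
        [
            pixels[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]


def aspect_distortion(width, height, new_width=200, new_height=200):
    """Return the horizontal scale divided by the vertical scale.

    A value of 1.0 means the proportions are preserved; values below 1.0
    mean the image is squeezed horizontally relative to its height.
    """
    return (new_width / width) / (new_height / height)
```

For a hypothetical 500 x 375 source image, aspect_distortion(500, 375) comes out at roughly 0.75, meaning the subject ends up about a quarter narrower relative to its height. This is the kind of distortion we assumed to be insignificant.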
Each image is also in full color. However, since cats and dogs have similar coloring, we do not expect color to be a prominent feature in classification. Therefore, to save memory, we applied a greyscale filter to our data for some computations. We will specify which models use the full-color data and which use the greyscale data.
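The greyscale step and its memory saving can be sketched as below. The conversion weights are an assumption: the sketch uses the common ITU-R BT.601 luma weights (0.299, 0.587, 0.114), which may differ from the filter actually applied, and the helper names to_greyscale and values_per_image are hypothetical.

```python
def to_greyscale(pixels):
    """Convert a grid of (R, G, B) pixels to a grid of single luminance values.

    Assumes the ITU-R BT.601 luma weights; other filters use other weights.
    """
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in pixels
    ]


def values_per_image(width=200, height=200, channels=3):
    """Count the numeric values needed to store one image."""
    return width * height * channels
```

With the 200 x 200 size used here, a full-color image holds 200 x 200 x 3 = 120,000 values while its greyscale version holds 200 x 200 = 40,000, so the filter cuts the per-image storage to one third.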
The images below show a cat image before and after resizing and filtering.