Data

Database

Our initial database was sourced from Kaggle (https://www.kaggle.com/c/dogs-vs-cats/data). The dataset includes both test and train data, divided into two folders. The training data includes 12,500 images of dogs and 12,500 images of cats, with the correct class given in each file name (e.g. dog-8837.jpg). The test data includes 12,500 images of dogs and cats without their correct classifications (e.g. 1.jpg). Because the test data is not labeled, and because we want to avoid manually classifying such a large number of files to test the accuracy of our prediction models, we have modified the database to suit our needs.

Our Training and Testing Data

Our training data will include 80% of the training data provided by Kaggle (10,000 cats and 10,000 dogs). Our test data will include the other 20% (2,500 cats and 2,500 dogs). We will train each model on our training data and then measure how well it predicts the known classes of our test data.
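The split described above can be sketched in a few lines of Python. This is only an illustration, not our MATLAB code: the function name, the fixed seed, the shuffling step, and the generated file names are all assumptions made for the example, and the file names simply follow the dog-8837.jpg pattern mentioned earlier.

```python
import random

def split_by_class(filenames, train_fraction=0.8, seed=0):
    """Split labeled file names into train and test lists, class by class."""
    by_class = {}
    for name in filenames:
        label = name.split("-")[0]  # "dog-8837.jpg" -> "dog" (assumed naming)
        by_class.setdefault(label, []).append(name)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        names = sorted(by_class[label])
        rng.shuffle(names)  # assumption: a random split, not a sequential one
        cut = round(len(names) * train_fraction)
        train.extend(names[:cut])
        test.extend(names[cut:])
    return train, test

# Hypothetical file names standing in for the 25,000 labeled Kaggle images.
files = [f"{label}-{i}.jpg" for label in ("cat", "dog") for i in range(12500)]
train_files, test_files = split_by_class(files)
print(len(train_files), len(test_files))
```

Splitting each class separately keeps both sets balanced, which matches the 10,000/2,500 counts per class given above.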

Resizing and Filtering

The images in our dataset do not all share the same dimensions, but our model creation process requires uniformly sized inputs. We therefore resized every image to 200 x 200 pixels with 3 color channels (200 x 200 x 3), a size we chose arbitrarily. This can introduce errors: resizing a rectangular picture of a dog to a square distorts the image and could, for example, make it look more like a cat. For the purposes of our project, we assumed that resizing would not distort the images significantly.
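To get a feel for how much a forced square resize stretches an image, the short Python sketch below compares the horizontal and vertical scale factors. The helper names and the 500 x 375 example size are hypothetical and chosen only for illustration.

```python
def resize_scale_factors(width, height, target=200):
    """Return the (horizontal, vertical) scale factors of a forced square resize."""
    return target / width, target / height

def stretch_ratio(width, height, target=200):
    """Ratio of the larger scale factor to the smaller (1.0 means no distortion)."""
    sx, sy = resize_scale_factors(width, height, target)
    return max(sx, sy) / min(sx, sy)

# Hypothetical 500 x 375 landscape photo resized to 200 x 200.
sx, sy = resize_scale_factors(500, 375)
print(sx, sy, stretch_ratio(500, 375))
```

The stretch ratio equals the aspect ratio of the original image, so a 4:3 photo is compressed about 1.33 times more along its width than along its height, while an image that is already square is not distorted at all.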

Each image is also in full color. However, since cats and dogs have similar coloring, we do not expect color to be a prominent feature in classification. Therefore, to save memory, we applied a greyscale filter to our data for some computations. We will specify which models use the full-color data and which use the greyscale data.
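The memory saving can be estimated with simple arithmetic, sketched below in Python. The conversion weights are the commonly used luminance weights (the ones documented for MATLAB's rgb2gray); whether our filter used exactly these weights is an assumption, as are the function names and the one-byte-per-value storage.

```python
def to_greyscale(pixel):
    """Collapse one (R, G, B) pixel into a single intensity value."""
    r, g, b = pixel
    # Assumed luminance weights; the exact filter used is not stated.
    return round(0.2989 * r + 0.5870 * g + 0.1140 * b)

def values_per_image(side=200, channels=3):
    """Number of stored values for one side x side image."""
    return side * side * channels

color_values = values_per_image(channels=3)  # full-color image
grey_values = values_per_image(channels=1)   # greyscale image

# Assuming one byte per value for all 25,000 labeled images.
print(color_values * 25000, grey_values * 25000)
```

Each full-color image holds 120,000 values and each greyscale image 40,000, so the greyscale data needs one third of the memory.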

The images below show a cat image before and after resizing and filtering.

Before

After

Images from dataset: https://www.kaggle.com/c/dogs-vs-cats/data

Preprocessing

We preprocessed our data using MATLAB and saved it as "80_20_DATA.mat" so that we could load the data faster when working. During preprocessing, we loaded all of the train data and sorted the images into cat and dog matrices, where each image is 200 x 200 x 3. Download our "80_20_DATA.mat" file below to view it.

80_20_DATA.mat

Additionally, as the project progressed, we found it necessary to save the scaled images as .jpeg files in folders to make it easier to create our CNN model. We saved our images into train and validation folders, each with its own cat and dog subfolders. View our data folder below.
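The folder layout described above can be illustrated with a small Python sketch. The root folder name "data", the exact subfolder names, and the helper function are hypothetical; only the train/validation and cat/dog structure comes from our setup.

```python
from pathlib import PurePosixPath

def destination(root, split, filename):
    """Build the target path <root>/<split>/<class>/<filename> for one image."""
    label = filename.split("-")[0]  # "cat-12.jpg" -> "cat" (assumed naming)
    return PurePosixPath(root) / split / label / filename

# Hypothetical examples: one training cat and one validation dog.
train_path = destination("data", "train", "cat-12.jpg")
val_path = destination("data", "validation", "dog-8837.jpg")
print(train_path)
print(val_path)
```

A one-folder-per-class layout like this is convenient because many CNN tools can infer the label of each image from the name of the folder it sits in.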

Our MATLAB preprocessing code is located here: preprocessing code