In general, building any supervised learning ML model involves two phases:
The Training Phase: This is where the model is given training data to learn from. (What you just implemented.)
The Testing Phase: While the ML model may do well at classifying the data it trained on, what we really want to know is: how well will it do at classifying data it did not train on?
It is sort of like testing humans - sure, you can directly repeat the examples your teacher shows you in class, but can you apply the concept to a new problem? In the real world, we're not training an ML model so that it can identify the things we have already identified (the training data already has labels); we're building the model so that it can receive new examples that don't have a label yet, and make a classification on those.
So, while the ML model may look like it is doing a very good job at its task as it is training - maybe it correctly classifies 90-100% of its training examples - what we then need to do is test the trained model on examples it did not train on and see if it can now classify those correctly.
A simple way to do this is to hold back some of your data as the test set. We can pull some of the records out of the training set, set them aside, and not let the algorithm train on them. Then, once the model is finished training on the provided training set, we can test it using the held-out records.
We will pass the test records into our model, get a predicted classification for each of them, and then because our records are labeled, we can compare the predictions to the actual labels and see how many of the test set our model classified correctly.
Test accuracy is a much better metric than training accuracy for evaluating how well the model is working, because it tells us how well our model will perform on data it did not train on.
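As a rough sketch of the holdout method (the train and predict calls below stand in for whatever your perceptron code actually provides, and the class label is assumed to be the last value in each record - adjust the names to match your implementation):

import random

def holdout_split(dataset, test_fraction=0.2):
    # Randomly set aside a fraction of the records as the test set.
    data = list(dataset)
    random.shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]   # (training set, test set)

def accuracy(actual, predicted):
    # Percentage of records whose predicted label matches the actual label.
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    return correct / len(actual) * 100.0

# train_set, test_set = holdout_split(dataset)
# weights = train(train_set, n_epoch)                          # assumed training function
# predictions = [predict(row, weights) for row in test_set]    # assumed predict helper
# print(accuracy([row[-1] for row in test_set], predictions))  # label assumed to be row[-1]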
What issues could we possibly have if we test our ML model using the holdout method?
Think about this for a minute and discuss it with your partner, before looking at the issues identified below.
Less data for training (and less data for testing)
If we set aside part of our data for testing, we now have less data to train on. What if our dataset isn't particularly large? Now we've just reduced our training set size even further.
And conversely, we'd ideally like to be able to test our model on as much data as we can, but needing to partition out a training set means that we have less data to test on.
Some records never get trained on (and some never get tested on)
By holding out records for the test set, those records are never used for training. But what if those records hold information that we want our model to learn?
And conversely, the records that are in the training set are never used for testing. We want to make sure we are evaluating our model as well as we can, on as many examples as we can.
Data representation could be skewed
A class or feature overrepresented in one set may be underrepresented in the other set. For example, if the task is to classify dogs vs cats, then we could happen to pull out a test set that contains mostly dogs. This also means that, because we pulled so many dog records out for the test set, the training set now has many more cats in it than dogs. We don't want to train on a bunch of cats and then test on a bunch of dogs.
A similar thing could happen with a particular feature (rather than class). For example, maybe the dataset is medical records and the classes are people with the disease and people without the disease. It could happen that we pull out a test set that contains mostly females, reducing the number of females in the training set. While male/female is not the class label, we still don't want to train on mostly male records and then test on mostly female records.
Varying performance & evaluation of the model
If we are using test accuracy - how many of the test set the classifier classified correctly - as our measure of how well this classifier works, then everything hinges on which records got selected for the test set vs the training set.
If it just so happens that most of the 'dog' records got put in the test set, so the model trained on mostly cats, then it is going to get many of the test set wrong and it will look like this is a bad solution to our dogs vs cats problem. However, if it just so happened that an equal number of dogs and cats got put in the test set, and the model trained on more equal numbers of cats and dogs, then it could get more of the test set correct and it would look like this is a very good solution to our dogs vs cats problem.
And we don't want our conclusion of whether or not this is working to rely so heavily on which records got selected for which set.
In fact, we could easily imagine an unethical scenario where someone cherry-picks their test set to include records that are easy to classify, making their ML model appear to perform very well on unseen data.
So, again, we do not want our assessment to rely on how this train/test split goes down.
So then what do we do about it?
Instead of testing with a single held-out test set, we can perform what is called a K-fold cross validation.
The way this works is, we split the dataset into K partitions, where K is some number (3, 5, 10, whatever you want).
In fold 1, we will hold out partition 1 as the test set and the other K-1 partitions will make up the training set. We will train on the training partitions, then test on the test partition, and record how many of the test records our model classified correctly.
We will then repeat, but in fold 2, we hold out partition 2 as the test set and the other K-1 partitions will make up the training set. We will again train on the training partitions, then test on the test partition, and record how many of the test records our model classified correctly.
We will repeat this for all K folds, until each partition has been held out as the test set once.
Picture the K partitions laid out side by side, with the held-out test partition shifting over by one position in each successive fold.
Once every fold is complete, we average the results from all folds to give us our final evaluation.
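To make the rotation concrete, here is a tiny runnable sketch using 10 stand-in records and K = 5 (the numbers are just placeholders for real dataset rows); it prints which records each fold holds out for testing and which it trains on:

records = list(range(10))   # stand-ins for 10 dataset records
n_folds = 5
fold_size = len(records) // n_folds
folds = [records[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

for i, test_fold in enumerate(folds):
    # Everything not in the held-out partition forms the training set for this fold.
    train_folds = [r for j, fold in enumerate(folds) if j != i for r in fold]
    print("Fold %d: test on %s, train on %s" % (i + 1, test_fold, train_folds))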
By doing this, we have addressed all of our issues with the holdout method above!
By systematically shifting the test set each time, we have ensured that every record has been tested on at some point. And also that every record has been trained on. (Issues #1 and #2)
By repeating the train/test process over K folds, we are essentially repeating the holdout method K times. So maybe fold 1 happened to have a bad train/test split - maybe a skewed data representation with too many dogs in the test set - so those results weren't great. Ok, well, let's try it again with a different test set. So we move on to fold 2. Now this train/test split might have its own issues or coincidences going on. Ok, well, then let's try it again with yet another test set. And we repeat K times, with K different test sets, so that we are not basing our evaluation on one train/test split which could have some sort of coincidental issue going on; instead, we are basing our evaluation on an average of many trials over many different train/test splits. (Issues #3 and #4)
But wait...
In each fold, we ended up with a different model. Because each fold's training set is different, our model ends up with different weights, depending on which training records it was given. So we've trained K different models that all have different weights from each other!
Which one do we use?!
None of them!
What we did with the K-fold cross validation was to evaluate how well this is going to work for this problem.
Over K different trials, when I train a perceptron to learn rocks from mines, on average, it looks like it will correctly identify rocks and mines ~95% of the time. No matter what specifically was in the training set, or what specifically was in the testing set, because I changed that up for each trial, it looks like using this technique (perceptron) on this problem (rocks vs mines) will give me a 95% accuracy, on average.
Now that we have evaluated, or validated, the process - we have shown that this is going to work - we can make the final model.
To make the final model, we run the training one final time, using the entire dataset to train on. We do not need to hold out a test set because we already tested, via cross validation. We proved this was going to work well, over numerous trials. So now let's do it, and let's train it on everything we've got. And this is the model that will be used in real life.
Implement K-fold cross validation to properly evaluate your perceptron on unseen data. We will do this for the binary perceptron (with the sonar dataset).
In the file perceptron_binary.py, add another function called cross_validate with the following function header:
def cross_validate(dataset, n_folds, n_epoch):
In this function, you should split the dataset into K partitions (K = n_folds). Then for each fold, train a perceptron classifier on the training partitions and test it on the test partition. Hold on to how many of the test records it got correct. Then repeat for all K folds.
This function should display the accuracy for each of the individual folds, and then the average of those accuracies. For example, with a 5-fold cross validation, it could look like this:
Folds: [87.8, 70.7, 78.0, 68.2, 68.2]
Mean Accuracy: 74.63%
In run_perceptron_binary.py, comment out the last line of the file, train(dataset, n_epoch), and instead call cross_validate(dataset, n_folds, n_epoch). You can set n_folds to 5, or whatever you want.
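One possible shape for cross_validate, as a sketch rather than a finished solution: it assumes the training and prediction pieces from your earlier work are available as a train function that returns the learned weights and a predict(row, weights) helper, and that the class label is the last value in each record - rename these to match your own code.

from random import shuffle

def cross_validate(dataset, n_folds, n_epoch):
    # Split the dataset into n_folds partitions of (roughly) equal size.
    # (Any leftover records that don't divide evenly are simply not used.)
    data = list(dataset)
    shuffle(data)
    fold_size = len(data) // n_folds
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(n_folds)]

    scores = []
    for i in range(n_folds):
        # Hold out partition i as the test set; train on the other K-1 partitions.
        test_set = folds[i]
        train_set = [row for j in range(n_folds) if j != i for row in folds[j]]

        weights = train(train_set, n_epoch)                        # assumed training function
        predictions = [predict(row, weights) for row in test_set]  # assumed predict helper

        correct = sum(1 for row, p in zip(test_set, predictions) if row[-1] == p)
        scores.append(correct / len(test_set) * 100.0)

    print("Folds:", [round(score, 1) for score in scores])
    print("Mean Accuracy: %.2f%%" % (sum(scores) / len(scores)))
    return scores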