K-Nearest Neighbors (or KNN) is another classification algorithm. The idea behind KNN is that to predict a class label for a new record, we find the records in our training data that are most similar to it - these most similar records are its nearest neighbors. We then look at the labels of those nearest neighbors, and if they are mostly class A, we predict that the new record is class A.
The "K" in K-Nearest Neighbors refers to how many neighbors we want to consider. Do we want to find just the 1 most similar record to this one and use its class label as our prediction? Or do we want to take a look at the 15 most similar records to this one and see what the majority of them are labeled? K is a variable that you set, based on how many neighbors you want it to use.
To find the nearest neighbors, the algorithm uses a distance metric, typically Euclidean distance, to determine which training data points are most similar to the point it is classifying.
This model is considered a lazy learner: a machine-learning algorithm that does no up-front training to learn a model. The perceptron had to learn a weight vector by training on the training data before it could make predictions for new data. With KNN, however, the algorithm needs to find the training records that are most similar to the new record. Since we do not know what the new record will be ahead of time, we cannot prepare by training in advance; all of the calculations happen at the moment we receive the new record.
You will implement the KNN algorithm to classify the sonar rock/mine dataset.
To implement a KNN classifier, you will code the following methods:
In this method, you will calculate the distance from point to each of the other points in the data set. Add all of the distances to a list, and return that list.
You will use Euclidean distance, which is given by the following formula:

d(A, B) = sqrt((a_1 - b_1)^2 + (a_2 - b_2)^2 + ... + (a_n - b_n)^2)

where a_i and b_i are the i-th feature values of records A and B.
If we had two records with three features:
Record A: [2000, 30, 100]
Record B: [2200, 35, 120]
The Euclidean distance between these two records would be computed as follows:

d(A, B) = sqrt((2000 - 2200)^2 + (30 - 35)^2 + (100 - 120)^2)
        = sqrt(40000 + 25 + 400)
        = sqrt(40425)
        ≈ 201.06
Implement get_distances(point, data), which computes the Euclidean distance from point to all of the other records in data and returns a list of all of these distances.
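If it helps to see the idea in code, here is a minimal sketch of that computation. It is only an illustration, not the required implementation, and it assumes point and every record in data are plain lists of numbers with the label already removed:

import math

def get_distances(point, data):
    """Return a list of Euclidean distances from point to every record in data."""
    distances = []
    for other in data:
        # sum the squared differences across all features, then take the square root
        squared_diffs = sum((a - b) ** 2 for a, b in zip(point, other))
        distances.append(math.sqrt(squared_diffs))
    return distances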
You can check whether you have implemented this correctly by running:
python3 run_knn.py
You will know you have done it correctly if you see this printout:
Distances: [1.4142135623730951, 2.8284271247461903, 4.242640687119285]
One issue with a K-Nearest Neighbors model is that the features may not be on the same scale as each other. For example, say we have two features of a city - the first feature is the average summer temperature and the second is the population. Suppose we have the following two records: [80, 50000] and [75, 60000], and we compute the Euclidean distance between them.
When we compute the Euclidean distance between these two records, the population feature dominates the calculation because the magnitude of the population values is so much larger than that of the temperature values: the distance is sqrt((80 - 75)^2 + (50000 - 60000)^2) = sqrt(25 + 100000000) ≈ 10000.0, which is essentially the population difference alone. This can effectively make temperature irrelevant to the algorithm, and we don't want that to happen.
So we need to scale our data so that every feature is on the same scale and will be used equally.
We will implement a min-max scaler. The min-max scaling formula is:

z = (x - min(x)) / (max(x) - min(x))

where:
x is the original feature value
min(x) is the minimum value for this feature over all data records
max(x) is the maximum value for this feature over all data records
z is the scaled value
You will scale each feature/column in the dataset, one at a time, replacing all values in that column with their scaled values.
Example
Let's consider one feature/column from our dataset and see how we would scale that:
First, identify the min and max values in the column. Suppose for this column:
min(x) = 50, max(x) = 300
Then apply the scaling formula to each value in the column. The minimum value 50 becomes (50 - 50) / (300 - 50) = 0.0, the maximum value 300 becomes (300 - 50) / (300 - 50) = 1.0, a value of, say, 175 becomes (175 - 50) / (300 - 50) = 0.5, and so on.
So after scaling, every value in the column lies between 0.0 (the original minimum) and 1.0 (the original maximum).
Then repeat this for every feature/column. The min and max will be different for each column.
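As a reference, here is a small sketch of scaling a single column, assuming the column is just a list of numbers (the values below are illustrative, not from the sonar dataset):

def scale_column(values):
    """Min-max scale one feature column into the [0, 1] range."""
    col_min = min(values)
    col_max = max(values)
    # assumes col_max != col_min, so we do not divide by zero
    return [(x - col_min) / (col_max - col_min) for x in values]

# illustrative column whose min is 50 and max is 300
print(scale_column([50, 175, 300]))  # prints [0.0, 0.5, 1.0]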
Training set vs Testing set
You need to scale both the training set and the testing set, but this is important: compute the min and max for each feature/column from the training set numbers only. Then use that training-set min and max to scale both the training set values and the test set values.
Return both the scaled training set and the scaled testing set. You can return two things from a function by just listing them both: return a, b
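One possible shape of this function is sketched below. It is only a sketch: it assumes each record is a list of numbers, scales every column in place, and notes where you would skip the label column if your records carry the label in the last position:

def normalize_data(train_set, test_set):
    """Min-max scale each feature column; the min and max come from the training set only."""
    num_cols = len(train_set[0])
    for col in range(num_cols):  # if the label is stored in the last column, skip it here
        col_min = min(row[col] for row in train_set)
        col_max = max(row[col] for row in train_set)
        col_range = col_max - col_min  # assumed non-zero for this sketch
        for row in train_set:
            row[col] = (row[col] - col_min) / col_range
        for row in test_set:
            row[col] = (row[col] - col_min) / col_range
    return train_set, test_set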
You can check if you have implemented this correctly by running:
python3 run_knn.py
And you should get the following output:
Normalized train data: [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
Normalized test data: [[0.5, 0.0], [1.5, 1.5]]
This method is where you will implement the main logic of the KNN algorithm.
You are given a train_set, a test_set and a k to use. What you want to do is predict a label for every record in the test set and return a list of all of the predicted labels, as well as a list of all the actual labels.
So, for each record in the test set, find its k nearest neighbors in the training set, see what the neighbors' labels are, and the majority label becomes the prediction.
Let's walk through it in pseudocode:
run_knn(train_set, test_set, k):
    # first, scale the data
    normalize_data()

    # start a list of predictions
    preds = []

    # go through each test record and get its k nearest neighbors
    for record in test_set:
        # compute distances from record to all the training records
        # careful here! remember the label is in the last column
        # do NOT pass the label into the distance calculation!
        get_distances()

        # figure out which training records are the k closest ones
        # of the k nearest neighbors, which class label is the majority?
        # this is the predicted label - add it to the preds list

    return preds, actual_labels
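Translating that pseudocode into one possible Python shape might look roughly like this. It is a sketch under a few assumptions: the label sits in the last column of every record, normalize_data scales only the feature columns, and get_distances behaves as described earlier:

from collections import Counter

def run_knn(train_set, test_set, k):
    """Predict a label for every test record; return the predictions and the actual labels."""
    train_set, test_set = normalize_data(train_set, test_set)
    preds, actual_labels = [], []
    for record in test_set:
        # distances are computed on features only - the label in the last column is excluded
        distances = get_distances(record[:-1], [row[:-1] for row in train_set])
        # pair each distance with its training label, then keep the k closest pairs
        neighbors = sorted(zip(distances, [row[-1] for row in train_set]))[:k]
        # the majority label among the k nearest neighbors becomes the prediction
        majority_label = Counter(label for _, label in neighbors).most_common(1)[0][0]
        preds.append(majority_label)
        actual_labels.append(record[-1])
    return preds, actual_labels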
You can run your code with python3 run_knn.py
A correct implementation should get the following output:
Predicted classes: [0, 1]
True classes: [0, 1]
And lastly, in run_CV(), you will implement the cross-validation loop as you did before. Within each iteration/fold, call run_knn() with the train_set partitions and the test_set partition. Compute the accuracy of each fold by comparing the returned predictions to the returned actual labels, then average the accuracy over all of the folds.
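A rough sketch of that loop is below. The helper partition_data and the exact signature are illustrative assumptions, not part of the starter code; use whatever fold-splitting approach you used before:

def run_CV(data, n_folds, k):
    """Average KNN accuracy over n_folds cross-validation folds."""
    folds = partition_data(data, n_folds)  # hypothetical helper that splits data into n_folds pieces
    accuracies = []
    for i in range(n_folds):
        test_set = folds[i]
        # every fold except the held-out one goes into the training set
        train_set = [row for j, fold in enumerate(folds) if j != i for row in fold]
        preds, actual = run_knn(train_set, test_set, k)
        correct = sum(1 for p, a in zip(preds, actual) if p == a)
        accuracies.append(correct / len(actual))
    return sum(accuracies) / len(accuracies)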
Run python3 run_knn.py and correct code will print out:
Cross-validated accuracy: 0.5
Then, the actual sonar data file will be passed in to your code, and you will see the accuracy of your KNN classifier on that dataset.