K-Nearest Neighbors (KNN) works on the basis of similarity. It performs best when the data forms distinct groups and involves relatively few variables, because the algorithm is also sensitive to the curse of dimensionality.
For an example showing how to use KNN, you can start with the digits dataset again. KNN is quite sensitive to outliers. Moreover, you have to rescale your variables and remove redundant information; in this example, you use PCA for that task.
To see this task in action, you reserve some cases in tX and later present them as new observations that KNN won't look up when searching for neighbors.
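The following is a minimal sketch of this setup, assuming the digits dataset from scikit-learn; apart from tX, which the text mentions, the variable names, the number of PCA components, and the test split are assumptions for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

digits = load_digits()
# Rescale the pixel values so that no variable dominates the distance measure.
scaled = MinMaxScaler().fit_transform(digits.data)
# Remove redundant information with PCA (25 components is an assumption).
pca = PCA(n_components=25)
reduced = pca.fit_transform(scaled)
# Reserve cases in tX/ty that KNN won't look up when searching for neighbors.
X, tX, y, ty = train_test_split(reduced, digits.target,
                                test_size=0.25, random_state=1)
```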
KNN uses a distance measure to determine which observations to consider as possible neighbors for the target case. Setting p=2 in the Minkowski metric selects the Euclidean distance, which is the scikit-learn default.
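Continuing from the preceding sketch (so X, y, tX, and ty are assumed to exist), a classifier that uses the Euclidean distance might look like this:

```python
from sklearn.neighbors import KNeighborsClassifier

# metric='minkowski' with p=2 is the Euclidean distance; n_neighbors
# defaults to 5.
knn = KNeighborsClassifier(p=2, metric='minkowski')
knn.fit(X, y)
print('Accuracy on the reserved cases: %.3f' % knn.score(tX, ty))
```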
A critical parameter that you have to define in KNN is k, the number of neighbors. As k increases, KNN considers more points when making its predictions. You can experiment with different k values, as shown in the following example.
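One way to run this experiment, again assuming the X and y arrays from the earlier sketch, is to compare cross-validated accuracy over a range of k values (the particular grid shown here is an assumption):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Compare 10-fold cross-validated accuracy for several values of k.
for k in [1, 3, 5, 7, 10, 50, 100]:
    knn = KNeighborsClassifier(n_neighbors=k, p=2)
    scores = cross_val_score(knn, X, y, cv=10)
    print('k=%3d accuracy: %.3f' % (k, np.mean(scores)))
```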
Through experimentation, you find that setting n_neighbors (the parameter representing k) to 5 is the optimum choice, resulting in the highest accuracy.
SVD on Homes Database
Using homes.csv, try to do the following (a sketch covering all three steps appears after the list):
Set the matrix A to all the columns in homes. (You can use .values to turn the DataFrame into a NumPy array.) Then print it.
Perform SVD on matrix A. Then print out U, the singular values s, and Vh.
Delete the last three columns of matrix U and adjust s and Vh accordingly. Then multiply the three matrices together and compare the result with the original homes table.
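The following is a minimal sketch of the three steps, assuming that homes.csv sits in the working directory and contains only numeric columns:

```python
import numpy as np
import pandas as pd

homes = pd.read_csv('homes.csv')

# Step 1: turn the DataFrame into a NumPy matrix A and print it.
A = homes.values
print(A)

# Step 2: perform SVD and print U, s, and Vh.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
print(U)
print(s)
print(Vh)

# Step 3: drop the last three columns of U, slice s and Vh to match,
# rebuild the matrix, and compare it with the original homes table.
k = U.shape[1] - 3
A_approx = U[:, :k] @ np.diag(s[:k]) @ Vh[:k, :]
print(np.round(A_approx - A, 2))
```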