kNN

k nearest neighbours

Given a new record x to predict, a kNN algorithm looks at the k nearest neighbours of x in the training data (using a distance such as Euclidean or L1) and assigns x the most frequent class (label) among those k neighbours. The idea is simple majority voting: if most of the k neighbours are class A, then x is also class A.
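
As a rough sketch of this (pure numpy, with hypothetical X_train / y_train arrays holding the training features and labels), the whole algorithm fits in a few lines:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5):
    # Euclidean distance from x to every training point
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest neighbours
    nearest = np.argsort(dists)[:k]
    # majority vote among the k neighbours' labels
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]
```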

Once a training data set is given, the model is determined (for a specific k). The boundaries between classes are already fixed by the existing training data, so making a new prediction is simply checking which class region a new x falls into.

The selection of k depends on the data. A larger k looks at a bigger neighbourhood, so the boundary between classes tends to blur; a small k focuses only on very local data points and the boundary is sharp. When k = 1, the prediction is just the label of the single nearest neighbour.
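
In practice k is usually chosen by cross-validation rather than by eye. A sketch with scikit-learn, assuming X and y hold the training features and labels (candidate values of k are just illustrative):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def pick_k(X, y, candidates=(1, 3, 5, 7, 11, 15)):
    scores = {}
    for k in candidates:
        model = KNeighborsClassifier(n_neighbors=k)
        # mean accuracy over 5 folds for this value of k
        scores[k] = cross_val_score(model, X, y, cv=5).mean()
    # return the k with the best cross-validated accuracy
    return max(scores, key=scores.get)
```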

Obviously the algorithm is very prone to noise. A few noisy points in the training data can totally screw up predictions for new points near that noisy area, especially when k is small.

When the class distribution is skewed, i.e. one class has far more training points than the other, the dominant class tends to win the vote for a new x wherever the two classes sit close together, and the minority class rarely has a chance. One solution is to weight each neighbour's vote by the inverse of its distance to x, so closer neighbours count more.
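
A sketch of that distance-weighted voting, reusing the same hypothetical X_train / y_train arrays as above (the eps term is just to avoid division by zero when a neighbour coincides with x):

```python
import numpy as np
from collections import defaultdict

def knn_predict_weighted(X_train, y_train, x, k=5, eps=1e-9):
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        # each neighbour's vote counts 1/distance, so close points dominate
        votes[y_train[i]] += 1.0 / (dists[i] + eps)
    return max(votes, key=votes.get)
```

scikit-learn exposes the same idea via KNeighborsClassifier(weights='distance').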

When there are irrelevant features, the model can be totally screwed. The irrelevant features enter the distance calculation and distort which points count as "near", which hurts the predictions.
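
A toy illustration of the effect (made-up numbers): one irrelevant, large-scale feature can swamp the distance, making points that agree on the feature that matters look far apart. Scaling features to comparable ranges and dropping known-irrelevant ones before computing distances mitigates this.

```python
import numpy as np

a = np.array([1.0, 500.0])   # [relevant feature, irrelevant noise]
b = np.array([1.1, 900.0])   # matches a on the relevant feature
c = np.array([5.0, 510.0])   # differs from a on the relevant feature

print(np.linalg.norm(a - b))  # ~400: looks "far" purely because of the noise column
print(np.linalg.norm(a - c))  # ~10.8: looks "near" despite the real difference
```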