The k-nearest neighbors (KNN) algorithm is a simple, supervised machine learning algorithm that can be used to solve both classification and regression problems. It is easy to implement and understand. This document looks at its role in current data science.
The KNN algorithm assumes that similar things exist in close proximity; in other words, similar points lie near each other in feature space.
It may happen that, for a given query point, K-nn finds that more than one class has the same number of points among the k neighbours. Which class should be selected then?
Since the number of points in each class grows as the total number of points grows, a simple approach is to increase K by 1 whenever a tie is encountered and vote again; a minimal sketch of this idea is shown below.
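The following is a rough sketch (not from the original article) of that tie-breaking idea: classify by majority vote among k neighbours, and retry with k + 1 whenever two or more classes tie.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x by majority vote among its k nearest training
    points, increasing k by 1 whenever the vote is tied."""
    y_train = np.asarray(y_train)
    distances = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    order = np.argsort(distances)                     # neighbours, nearest first
    while k <= len(y_train):
        votes = Counter(y_train[order[:k]])
        top_two = votes.most_common(2)
        if len(top_two) == 1 or top_two[0][1] > top_two[1][1]:
            return top_two[0][0]                      # clear majority found
        k += 1                                        # tie: look one neighbour further
    return votes.most_common(1)[0][0]                 # fall back to the last vote
```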
The optimal K value depends on the dataset. The normally used approach for choosing it is cross-validation. It is also better to use an odd value of K, given the tie problem mentioned above.
An analysis based on the Bayes decision rule helps to understand this problem in detail. Using the k-NN density estimate, the class-conditional probability density (PDF) of the population can be approximated, and the class prior probabilities can be estimated from the n training points; with both of these, the Bayes decision rule can be applied to k-NN.
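A rough sketch of that argument, in standard k-NN density-estimation notation (not taken verbatim from the linked lecture notes): of the k nearest neighbours of x, contained in a ball of volume V, suppose k_c belong to class c, and the training set has n points with n_c in class c.

```latex
p(x \mid c) \approx \frac{k_c}{n_c V}, \qquad
P(c) \approx \frac{n_c}{n}, \qquad
p(x) \approx \frac{k}{n V}
% Bayes' rule then gives the posterior, and the Bayes decision rule picks the
% class with the largest posterior -- which is exactly the k-NN majority vote:
P(c \mid x) = \frac{p(x \mid c)\, P(c)}{p(x)} \approx \frac{k_c}{k},
\qquad \hat{c} = \arg\max_c k_c .
```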
The value of k is a hyperparameter and should be tuned using grid search or other search approaches, for example as in the sketch below.
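A hedged sketch of tuning k with cross-validated grid search in scikit-learn; the dataset (load_iris) and the candidate k values are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scale features first: k-NN is distance based, so unscaled features dominate.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Odd values of k only, to reduce the chance of ties (see above).
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 30, 2))}

search = GridSearchCV(pipe, param_grid, cv=5)   # 5-fold cross-validation
search.fit(X, y)
print("best k:", search.best_params_, "CV accuracy:", search.best_score_)
```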
The error rate at K=1 is always zero for the training sample. This is because the closest point to any training data point is itself.
However, kNN with k=1 generally leads to over-fitting. Note that you estimate the class probability from a single sample: your closest neighbor. This is very sensitive to all sorts of distortions such as noise, outliers, mislabelled data, and so on. By using a higher value of k, you tend to be more robust against those distortions, as the small experiment below illustrates.
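A small illustration (the synthetic dataset and the compared k values are assumptions, not from the cited discussion): with k=1 the training accuracy is always 1.0, but held-out accuracy is typically lower, especially on noisy data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data with deliberate label noise (flip_y) to make over-fitting visible.
X, y = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for k in (1, 15):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    print(f"k={k:2d}  train acc={clf.score(X_tr, y_tr):.2f}  test acc={clf.score(X_te, y_te):.2f}")
```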
Use the right distance metric. Mahalanobis distance is a good choice; refer to the linked article. A sketch of using it with scikit-learn follows.
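A hedged sketch of using the Mahalanobis distance with scikit-learn's k-NN; the dataset is an illustrative choice, and the inverse covariance matrix is estimated directly from the data here (in practice it may need regularisation).

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier

X, y = load_wine(return_X_y=True)
VI = np.linalg.inv(np.cov(X, rowvar=False))      # inverse covariance matrix of the features

clf = KNeighborsClassifier(
    n_neighbors=5,
    metric="mahalanobis",
    metric_params={"VI": VI},
    algorithm="brute",                           # brute-force search supports this metric
)
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```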
Neural networks have achieved state-of-the-art results in more domains than k-NN, but in some cases k-NN still gives better accuracy.
It is popular for running ML on edge devices (IoT), since it is simple and has good predictive power.
https://youtu.be/DlQli0OCkf8
https://youtu.be/DlQli0OCkf8?t=1970
http://faculty.washington.edu/yenchic/18W_425/Lec7_knn_basis.pdf
https://discuss.analyticsvidhya.com/t/how-to-choose-the-value-of-k-in-knn-algorithm/2606/7
https://stats.stackexchange.com/questions/107870/does-k-nn-with-k-1-always-implies-overfitting
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-normalisation-in-machine-learning#TOC-Case-when-feature-scaling-is-not-needed
https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761