The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. KNN classifies new data points based on a similarity measure (e.g., a distance function); classification is decided by a majority vote of a point's neighbors.
KNN is one of the simplest classification algorithms and one of the most widely used learning algorithms. It is a non-parametric, lazy learning algorithm: its purpose is to use a database in which the data points are separated into several classes to predict the classification of a new sample point.
KNN works by computing the distances between a query point and all the examples in the data, selecting the specified number of examples (K) closest to the query, and then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).
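To make the procedure concrete, here is a minimal from-scratch sketch in base R (the toy data and the names knn_predict, train_x, train_y, and query are illustrative, not from the source):

# Minimal from-scratch KNN classifier (majority vote among the K nearest)
knn_predict <- function(train_x, train_y, query, k = 3) {
  # Euclidean distance from the query to every training example
  dists <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  # Labels of the k closest training examples
  nearest <- train_y[order(dists)[1:k]]
  # Majority vote; for regression one would return mean(nearest) instead
  names(which.max(table(nearest)))
}

train_x <- matrix(c(1, 1, 1, 2, 5, 5, 6, 5), ncol = 2, byrow = TRUE)
train_y <- c("a", "a", "b", "b")
knn_predict(train_x, train_y, query = c(1.5, 1.5), k = 3)  # "a"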
Despite its simplicity, KNN can outperform more powerful classifiers and is used in a variety of applications such as economic forecasting, data compression, and genetics. For example, KNN was leveraged in a 2006 study of functional genomics to assign genes based on their expression profiles.
Source Code: GRE Data
R Code:
# Libraries
library(caret)
library(pROC)
library(mlbench)
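Since the GRE data itself is not included above, the sketch below illustrates a typical caret KNN workflow with these libraries using mlbench's PimaIndiansDiabetes dataset as a stand-in; the same steps (cross-validation, tuning K, an ROC curve via pROC) would apply to the GRE data.

data(PimaIndiansDiabetes)  # stand-in dataset from mlbench

set.seed(42)

# 5-fold cross-validation, keeping class probabilities for ROC-based tuning
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# KNN is distance-based, so center and scale predictors; tune K over 10 values
fit <- train(diabetes ~ ., data = PimaIndiansDiabetes,
             method = "knn",
             preProcess = c("center", "scale"),
             tuneLength = 10,
             metric = "ROC",
             trControl = ctrl)

print(fit)  # cross-validated ROC for each candidate K

# ROC curve on the full data for the selected K (illustration only;
# a held-out test set would give an honest estimate in practice)
probs <- predict(fit, PimaIndiansDiabetes, type = "prob")
roc_obj <- roc(PimaIndiansDiabetes$diabetes, probs$pos)
plot(roc_obj)
auc(roc_obj)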