GOAL: Learn about KNN, RNN, ANN, and how to improve their performance.
Learning experience: This homework provided a comprehensive learning experience on K-Nearest Neighbors (KNN) within scikit-learn. From finding nearest neighbors to optimizing KNN with GridSearchCV, and exploring Radius-based and Approximate Nearest Neighbors, we covered key concepts. Evaluating ANN with recall and applying these techniques to a different dataset enhanced our understanding of scikit-learn's capabilities in machine learning tasks. Overall, it was an insightful journey into effectively utilizing KNN and scikit-learn.
Working environment:
OS: Windows 11 Home
CPU: Intel i9-13900K
GPU: NVIDIA RTX 4090
Python version: 3.12.2
Development environment: Jupyter Notebook
15.0 Introduction
The k-nearest neighbors (KNN) classifier is one of the simplest yet most commonly used classifiers in supervised machine learning. KNN is often considered a lazy learner; it doesn’t technically train a model to make predictions. Instead, an observation is predicted to be the same class as the largest proportion of its k nearest observations.
For example, if an observation with an unknown class is surrounded by observations of class 1, then the observation is classified as class 1. In this chapter we will explore how to use scikit-learn to create and use a KNN classifier.
While the KNN algorithm can be used for either regression or classification problems, it is typically used as a classification algorithm, working off the assumption that similar points can be found near one another.
For classification problems, a class label is assigned on the basis of a majority vote—i.e. the label that is most frequently represented around a given data point is used.
Regression problems use a similar concept to classification problems, but in this case the average of the values of the k nearest neighbors is taken to make a prediction. The main distinction is that classification is used for discrete values, whereas regression is used with continuous ones. However, before a classification can be made, the distance must be defined.
The goal of the k-nearest neighbor algorithm is to identify the nearest neighbors of a given query point, so that we can assign a class label to that point. In order to do this, KNN has a few requirements:
Determine your distance metrics
There are several distance measures that you can choose from:
Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited to real-valued vectors. It measures the straight-line distance between the query point and the other point being measured, as shown in Figure 1.
Manhattan distance (p=1): This is another popular distance metric, which measures the sum of the absolute differences between the coordinates of two points. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets, as shown in Figure 2.
Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula allows for the creation of other distance metrics: Euclidean distance is obtained when p is equal to two, and Manhattan distance when p is equal to one, as shown in Figure 3.
Hamming distance: This technique is typically used with Boolean or string vectors, counting the positions where the vectors do not match. As a result, it has also been referred to as the overlap metric, as shown in Figure 4.
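For reference, the formulas shown in Figures 1 through 4 can be written compactly as follows (standard definitions, where x and y are vectors with n components):

Euclidean (p = 2):   d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
Manhattan (p = 1):   d(x, y) = \sum_{i=1}^{n} |x_i - y_i|
Minkowski (general): d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}
Hamming:             d(x, y) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]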
Defining k
The k value in the k-NN algorithm defines how many neighbors will be checked to determine the classification of a specific query point. For example, if k=1, the instance will be assigned to the same class as its single nearest neighbor. Defining k can be a balancing act as different values can lead to overfitting or underfitting. Lower values of k can have high variance, but low bias, and larger values of k may lead to high bias and lower variance. The choice of k will largely depend on the input data as data with more outliers or noise will likely perform better with higher values of k. Overall, it is recommended to have an odd number for k to avoid ties in classification, and cross-validation tactics can help you choose the optimal k for your dataset. [2]
Scikit-learn provides the corresponding class:
sklearn.neighbors.KNeighborsClassifier: Classifier implementing the k-nearest neighbors vote. See [3] for details.
Figure 5 shows a flowchart for the KNN algorithm.
Figure 1 Euclidean distance formula
Figure 2 Manhattan distance formula
Figure 3 Minkowski distance formula
Figure 4 Hamming distance formula
Figure 5 A simple flowchart for the k-nearest neighbor modeling. [4]
15.1 Finding an Observation’s Nearest Neighbors
This section shows how to find an observation’s k nearest observations (neighbors).
class sklearn.neighbors.NearestNeighbors(*, n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=None), Unsupervised learner for implementing neighbor searches. [5]
In 1 primarily employs the k-nearest neighbors (kNN) algorithm to find the nearest neighbors of a given observation.
First load the necessary libraries, including those for dataset manipulation and k-nearest neighbors.
Then, we load the Iris dataset, which comprises four feature variables describing characteristics of iris flowers.
Next, we create a standardizer to normalize the features into a standard normal distribution with a mean of 0 and a standard deviation of 1. The features are then standardized accordingly to ensure comparability in distance calculations.
We proceed to instantiate a k-nearest neighbors model with k=2 and fit it to the standardized feature data. This means the model will identify the two nearest neighbors for each observation.
Subsequently, we create a new observation represented by the list new_observation, which contains four feature values. We aim to determine the nearest neighbors of this new observation.
Finally, we use the nearest_neighbors.kneighbors() method to find the distances and indices of the nearest neighbors of new_observation. The distances are stored in the distances variable, while the indices of the nearest neighbors are stored in the indices variable. With these indices, we can retrieve the actual feature values of these nearest neighbors from the standardized feature data.
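A minimal sketch of the steps described above (the value of new_observation is illustrative; the notebook cell may use different numbers):

# Load libraries
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load the Iris feature data
iris = datasets.load_iris()
features = iris.data

# Standardize features to mean 0 and standard deviation 1
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

# Fit an unsupervised nearest-neighbors model that finds the two closest observations
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# A new observation with four feature values
new_observation = [1, 1, 1, 1]

# Distances to, and indices of, the two nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])

# Retrieve the feature values of those two nearest neighbors
features_standardized[indices]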
Intuitively, distance can be thought of as a measure of similarity, so the two closest observations are the two flowers most similar to the flower we created.
How do we measure distance? scikit-learn offers a wide variety of distance metrics, including Euclidean and Manhattan distance. By default, NearestNeighbors uses Minkowski distance, which includes a hyperparameter p, where p = 1 is Manhattan distance, p = 2 is Euclidean distance, and so on. By default in scikit-learn, p = 2.
In 2 we can set the distance metric using the metric parameter. In 3, the distances variable we created contains the actual distance measurements to each of the two nearest neighbors.
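For example (continuing the sketch above), the metric can be set explicitly and the returned distances inspected; this mirrors what the text describes for In 2 and In 3:

# Find the two nearest neighbors using Euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=2, metric='euclidean').fit(features_standardized)

# kneighbors returns the actual distance measurements alongside the indices
distances, indices = nearestneighbors_euclidean.kneighbors([new_observation])
distances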
In addition, In 4 we can use kneighbors_graph to create a matrix indicating each observation’s nearest neighbors.
Firstly, we instantiate a k-nearest neighbors model with a k value of 3, using Euclidean distance as the metric. This model is fitted to the standardized feature data to identify the three nearest neighbors for each observation, including itself.
Next, we generate a list representing the three nearest neighbors (including itself) for each observation using the kneighbors_graph() method. Each sublist in this list represents an observation and its three nearest neighbors, including itself.
Then, we clean up the list by removing the 1s that indicate an observation being its own nearest neighbor, ensuring that each observation's nearest neighbors do not include itself.
Finally, we view the two nearest neighbors (excluding itself) of the first observation.
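A minimal sketch of these steps (continuing with the standardized Iris features from above):

# Find each observation's three nearest neighbors, itself included,
# based on Euclidean distance
nearestneighbors_euclidean = NearestNeighbors(
    n_neighbors=3, metric='euclidean').fit(features_standardized)

# Matrix where entry [i, j] is 1 if observation j is one of i's nearest neighbors
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(
    features_standardized).toarray()

# Remove each observation's mark as its own nearest neighbor
for i, row in enumerate(nearest_neighbors_with_self):
    row[i] = 0

# View the first observation's two nearest neighbors (excluding itself)
nearest_neighbors_with_self[0]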
When we are finding nearest neighbors or using any learning algorithm based on distance, it is important to transform features so that they are on the same scale. This is because the distance metrics treat all features as if they were on the same scale, but if one feature is in millions of dollars and a second feature is in percentages, the distance calculated will be biased toward the former. In our solution we addressed this potential issue by standardizing the features using StandardScaler.
15.2 Creating a K-Nearest Neighbors Classifier
This section covers the case where, given an observation of unknown class, you need to predict its class based on the classes of its neighbors.
In 19 primarily utilizes a k-nearest neighbors (kNN) classifier for prediction. Here's the explanation:
Firstly, the necessary libraries are loaded, including the k-nearest neighbors classifier and libraries for dataset manipulation.
Then, the Iris dataset is loaded, and the feature data (X) and target variable (y) are stored in respective variables.
Next, a standardizer is created to standardize the features into a standard normal distribution with a mean of 0 and a standard deviation of 1.
Subsequently, the feature data is standardized to ensure comparability in importance during training of the classifier.
Following that, a k-nearest neighbors classifier is instantiated with a k value of 5, and the standardized feature data along with the target variable are used for training.
Finally, two new observations, each containing four feature values, are created in new_observations. The classifier is then employed to predict the class of these two observations based on the trained model.
The first observation is predicted to be class 2 and the second class 3.
In KNN, given an observation xu, with an unknown target class, the algorithm first identifies the k closest observations (sometimes called xu’s neighborhood) based on some distance metric (e.g., Euclidean distance), then these k observations “vote” based on their class, and the class that wins the vote is xu’s predicted class.
In scikit-learn we can see these probabilities using predict_proba; In 20 shows the probabilities for the three classes. The class with the highest probability becomes the predicted class.
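A minimal sketch of the classifier and its predicted probabilities (the two new observations are illustrative values):

# Load libraries
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load the Iris feature data and target classes
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Standardize the features
X_std = StandardScaler().fit_transform(X)

# Train a KNN classifier with five neighbors, using all CPU cores
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

# Two new observations with four feature values each
new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# Predicted classes and the per-class vote proportions
knn.predict(new_observations)
knn.predict_proba(new_observations)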
KNeighborsClassifier contains a number of important parameters to consider.
First, metric sets the distance metric used. Second, n_jobs determines how many of the computer’s cores to use. Because making a prediction requires calculating the distance from a point to every single point in the data, using multiple cores is highly recommended. Third, algorithm sets the method used to calculate the nearest neighbors. While there are real differences in the algorithms, by default KNeighborsClassifier attempts to auto-select the best algorithm, so you often don’t need to worry about this parameter. Fourth, by default KNeighborsClassifier works how we described previously, with each observation in the neighborhood getting one vote; however, if we set the weights parameter to distance, the closer observations’ votes are weighted more than observations farther away. Intuitively this makes sense, since more similar neighbors might tell us more about an observation’s class than others.
Finally, because distance calculations treat all features as if they are on the same scale, it is important to standardize the features prior to using a KNN classifier.
15.3 Identifying the Best Neighborhood Size
This section covers selecting the best value for k in a k-nearest neighbors classifier.
class sklearn.model_selection.GridSearchCV(estimator, param_grid, *, scoring=None, n_jobs=None, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score=nan, return_train_score=False), Exhaustive search over specified parameter values for an estimator. [6]
In 23 primarily utilizes grid search cross-validation (GridSearchCV) to find the optimal number of neighbors (k value) for the k-nearest neighbors classifier. Here's the explanation:
Firstly, the necessary libraries are loaded, including those for k-nearest neighbors classifier, dataset, standardizer, pipeline, and grid search cross-validation.
Then, the Iris dataset is loaded, and the feature data (features) and target variable (target) are stored in respective variables.
Next, a standardizer is created to standardize the features.
Then, a k-nearest neighbors classifier (knn) is created with the number of neighbors (n_neighbors) set to 5.
Subsequently, a pipeline (pipe) is created, consisting of the standardizer and k-nearest neighbors classifier, enabling automatic standardization during training and testing.
Next, a search space of candidate parameter values (search_space) is defined, including different candidate values for the number of neighbors.
Following that, grid search cross-validation (GridSearchCV) is applied to the pipeline to train and evaluate the model with 5-fold cross-validation (cv=5) to find the best number of neighbors.
Finally, the best number of neighbors (k value) is extracted from the best estimator of the classifier.
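A minimal sketch of this grid search, assuming candidate k values from 1 to 10 (the notebook's search space may differ):

# Load libraries
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Load the Iris data
iris = datasets.load_iris()
features, target = iris.data, iris.target

# Pipeline: standardize, then classify
standardizer = StandardScaler()
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Candidate values for the number of neighbors
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Exhaustive search with five-fold cross-validation
classifier = GridSearchCV(pipe, search_space, cv=5).fit(features, target)

# Best number of neighbors found by the search
classifier.best_estimator_.get_params()["knn__n_neighbors"]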
The size of k has real implications in KNN classifiers. In machine learning we are trying to find a balance between bias and variance, and in few places is that as explicit as the value of k. If k = n, where n is the number of observations, then we have high bias but low variance. If k = 1, we will have low bias but high variance. The best model will come from finding the value of k that balances this bias-variance trade-off. In our solution, we used GridSearchCV to conduct five-fold cross-validation on KNN classifiers with different values of k. When that is completed, we can see the k that produces the best model. In 23 we can see the output k is 6.
15.4 Creating a Radius-Based Nearest Neighbors Classifier
This section covers the case where, given an observation of unknown class, you need to predict its class based on the classes of all observations within a certain distance.
class sklearn.neighbors.RadiusNeighborsClassifier(radius=1.0, *, weights='uniform', algorithm='auto', leaf_size=30, p=2, metric='minkowski', outlier_label=None, metric_params=None, n_jobs=None), Classifier implementing a vote among neighbors within a given radius. [7]
In 24 primarily utilizes a Radius Neighbors classifier to predict the class of new observations. Here's the explanation:
Firstly, the necessary libraries are loaded, including those for the Radius Neighbors classifier, dataset, and standardizer.
Then, the Iris dataset is loaded, and the feature data (features) and target variable (target) are stored in respective variables.
Next, a standardizer is created to standardize the features.
Then, the feature data is standardized to ensure comparability in importance during training of the classifier.
Following that, a Radius Neighbors classifier (RadiusNeighborsClassifier) is instantiated with a radius of 0.5. This classifier predicts the class of new observations based on neighbors within the specified radius.
Finally, the classifier is used to predict the class of the new observation (new_observation).
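A minimal sketch of these steps (the new observation values are illustrative):

# Load libraries
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load and standardize the Iris data
iris = datasets.load_iris()
features, target = iris.data, iris.target
features_standardized = StandardScaler().fit_transform(features)

# Train a radius-based classifier: every observation within radius 0.5 gets a vote
rnn = RadiusNeighborsClassifier(radius=0.5, n_jobs=-1).fit(
    features_standardized, target)

# Predict the class of a new observation
new_observation = [[1, 1, 1, 1]]
rnn.predict(new_observation)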
The output shows that the new observation is predicted to be class 3, the same as in Section 15.2 (In 19).
In KNN classification, an observation’s class is predicted from the classes of its k neighbors. A less common technique is classification in a radius-based nearest neighbor (RNN) classifier, where an observation’s class is predicted from the classes of all observations within a given radius r.
In scikit-learn, RadiusNeighborsClassifier is very similar to KNeighborsClassifier, with the exception of two parameters. First, in RadiusNeighborsClassifier we need to specify the radius of the fixed area used to determine if an observation is a neighbor using radius. Unless there is some substantive reason for setting radius to some value, it’s best to treat it like any other hyperparameter and tune it during model selection. The second useful parameter is outlier_label, which indicates what label to give an observation that has no observations within the radius—which itself can be a useful tool for identifying outliers.
15.5 Finding Approximate Nearest Neighbors
This section covers fetching nearest neighbors from big data at low latency.
In 9 utilizes the Faiss library for nearest neighbor search. Here's the explanation:
Firstly, the necessary libraries are imported, including Faiss, NumPy, and Scikit-learn for dataset manipulation.
Then, the Iris dataset is loaded, and the feature data (features) is stored in a variable.
Next, a standardizer is created to standardize the features.
Then, the feature data is standardized for compatibility with Faiss.
Following that, Faiss parameters are set, including the dimension of features (n_features), the number of lists for partitioning (nlist), and the number of nearest neighbors per query (k).
Then, an Inverted File (IVF) index is created, consisting of a quantizer and an index for flat search.
Next, the index is trained, and feature vectors are added to the index.
Then, a new observation, new_observation, is created.
Following that, the index is searched for the two nearest neighbors of the new observation, and their distances and indices are returned.
Finally, the feature vectors of these two nearest neighbors are displayed.
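A minimal sketch of these steps, assuming a flat L2 quantizer (Faiss expects float32 arrays, so the standardized features are cast explicitly; the notebook's exact setup may differ):

# Load libraries
import faiss
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

# Load and standardize the Iris data; cast to float32 for Faiss
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data).astype(np.float32)

# Faiss parameters: feature dimension, number of Voronoi cells, neighbors per query
n_features = features_standardized.shape[1]
nlist = 3
k = 2

# Build an inverted file (IVF) index on top of a flat quantizer
quantizer = faiss.IndexFlatL2(n_features)
index = faiss.IndexIVFFlat(quantizer, n_features, nlist)

# Train the index (learns the cell centroids), then add the vectors
index.train(features_standardized)
index.add(features_standardized)

# Search for the two nearest neighbors of a new observation
new_observation = np.array([[1, 1, 1, 1]], dtype=np.float32)
distances, indices = index.search(new_observation, k)

# Feature vectors of the two nearest neighbors
features_standardized[indices[0]]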
KNN is a great approach to finding the most similar observations in a set of small data. However, as the size of our data increases, so does the time it takes to compute the distance between any one observation and all other points in our dataset. Large scale ML systems such as search or recommendation engines often use some form of vector similarity measure to retrieve similar observations. But at scale in real time, where we need results in less than 100 ms, KNN becomes infeasible to run.
ANN helps us overcome this problem by sacrificing some of the quality of the exact nearest neighbors search in favor of speed. This is to say that although the order and items in the first 10 nearest neighbors of an ANN search may not match the first 10 results from an exact KNN search, we get those first 10 nearest neighbors much faster.
Notice this example returns the exact same output as the first recipe in this chapter. This is because we are working with very small data and using only three clusters, which makes it unlikely our ANN results will differ significantly from our KNN results.
In this example, we use an ANN approach called inverted file index (IVF). This approach works by using clustering to limit the scope of the search space for our nearest neighbors search. IVF uses Voronoi tessellations to partition our search space into a number of distinct areas (or clusters). And when we go to find nearest neighbors, we visit a limited number of clusters to find similar observations, as opposed to conducting a comparison across every point in our dataset.
How Voronoi tessellations are created from data is best visualized using simple data. Take a scatter plot of random data visualized in two dimensions, as shown in Figure 6-1.
Using Voronoi tessellations, we can create a number of subspaces, each of which contains only a small subset of the total observations we want to search, as shown in Figure 6-2.
The nlist parameter in the Faiss library lets us define the number of clusters we want to create. An additional parameter, nprobe, can be used at query time to define the number of clusters we want to search to retrieve nearest neighbors for a given observation. Increasing both nlist and nprobe can result in higher quality neighbors at the cost of larger computational effort and thus a longer runtime for IVF indices. Decreasing each of these parameters will have the inverse effect, and your code will run faster but at the risk of returning lower quality results.
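As a small illustration (continuing the index built above), nprobe is simply set on the index before querying; the value 2 here is a hypothetical choice:

# Search two Voronoi cells per query instead of the default one:
# slower, but the neighbors are more likely to match exact KNN
index.nprobe = 2
distances, indices = index.search(new_observation, k)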
Figure 6-1. A scatter plot of randomly generated two-dimensional data
Figure 6-2. Randomly generated two-dimensional data separated into a number of different subspaces
15.6 Evaluating Approximate Nearest Neighbors
This section covers evaluating how your ANN compares to exact nearest neighbors (KNN).
In 10 evaluates how accurately the Faiss index finds the nearest neighbors of a given observation and assesses the result in terms of recall. Here's the explanation:
Firstly, the necessary libraries are imported, including Faiss, NumPy, and Scikit-learn for dataset manipulation.
Then, a variable k is set to represent the number of nearest neighbors to find, which is set to 10 here.
Next, the Iris dataset is loaded, and the feature data (features) is stored in a variable.
Then, a standardizer is created to standardize the features.
The feature data is standardized to ensure comparability in importance during training of the Faiss index.
Then, a k-nearest neighbors model with 10 nearest neighbors is created.
Following that, Faiss parameters are set, including the dimension of features (n_features) and the number of lists for partitioning (nlist).
Then, an Inverted File (IVF) index is created with specified parameters, including a quantizer and an index for flat search.
Next, the index is trained, and feature vectors are added to the index.
The parameter nprobe of the index is set to 1 to specify the number of probes for each query.
Then, a new observation new_observation is created.
The nearest neighbors of this new observation are then found using both the k-nearest neighbors model and the Faiss index.
Finally, the overlap between the nearest neighbors found by the k-nearest neighbors model and the Faiss index is calculated, and the recall is computed.
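A minimal sketch of this comparison (the new observation is illustrative; float32 casting is added because Faiss requires it):

# Load libraries
import faiss
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Number of nearest neighbors to compare at
k = 10

# Load and standardize the Iris data; cast to float32 for Faiss
iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data).astype(np.float32)

# Exact nearest neighbors model
nearest_neighbors = NearestNeighbors(n_neighbors=k).fit(features_standardized)

# IVF index over the same data
n_features = features_standardized.shape[1]
nlist = 3
quantizer = faiss.IndexFlatL2(n_features)
index = faiss.IndexIVFFlat(quantizer, n_features, nlist)
index.train(features_standardized)
index.add(features_standardized)
index.nprobe = 1  # search a single Voronoi cell per query

# New observation to query
new_observation = np.array([[1, 1, 1, 1]], dtype=np.float32)

# Exact and approximate nearest neighbors of the observation
knn_distances, knn_indices = nearest_neighbors.kneighbors(new_observation)
ivf_distances, ivf_indices = index.search(new_observation, k)

# Recall @k: overlap between the two result sets, divided by k
recalled_items = set(knn_indices[0]) & set(ivf_indices[0])
print(f"Recall @k={k}: {len(recalled_items) / k * 100}%")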
Recall @k is most simply defined as the number of items returned by the ANN at some k nearest neighbors that also appear in the exact nearest neighbors at the same k, divided by k. In this example, at 10 nearest neighbors we have 100% recall, which means that our ANN is returning the same indices as our KNN at k=10 (though not necessarily in the same order).
Recall is a common metric to use when evaluating ANNs against exact nearest neighbors.
Applying these techniques to a different dataset: In 2 implements a K-Nearest Neighbors (KNN) classifier to classify handwritten digits using the Digits dataset.
Firstly, the Digits dataset is loaded from sklearn.datasets.
Then, the dataset is split into training and testing sets, with 25% of the data allocated to the testing set, using the train_test_split function.
Next, the features are standardized to ensure that they have similar scales using the StandardScaler from sklearn.preprocessing.
After that, a KNN classifier is instantiated with 5 neighbors (n_neighbors=5).
Subsequently, the training set features and labels are used to train the KNN classifier.
Then, the trained model is used to predict labels for the test set features.
Finally, the accuracy of the model is calculated using the accuracy_score function, and it's printed out.
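A minimal sketch of these steps (random_state=0 is an assumption; the notebook's split and exact accuracy may differ):

# Load libraries
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the Digits dataset
digits = load_digits()
X, y = digits.data, digits.target

# Hold out 25% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Standardize using statistics computed on the training set only
scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

# Train a KNN classifier with five neighbors and predict the test labels
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train_std, y_train)
y_pred = knn.predict(X_test_std)

# Report test-set accuracy
print(accuracy_score(y_test, y_pred))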
The accuracy is 0.9755.
We then tried GridSearchCV to find the best k, as in Section 15.3, but In 50 shows that k = 3 is chosen as the best, and its accuracy is 0.9667. That is lower than before, even though the search covered k = 3 to 10, so we tried to analyze the grid search. In 74 is the code for that analysis, and the dataframe below shows that row 0 of rank_test_score (i.e., k = 3) is ranked first; if we focus on row 2 (k = 5), we can see that its mean_test_score is lower than row 0's, so we are confused and haven't found the answer yet 😅.
But in In 80 we change the split to test_size=0.3, random_state=2 and get the dataframe below; now row 2 (k = 5) is ranked first and its mean_test_score is 0.976, yet when we use accuracy_score to measure the accuracy it is 0.957, which also confuses us 😣.
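One thing worth noting: mean_test_score is the average accuracy over the cross-validation folds of the training data, so it does not have to match the accuracy measured with accuracy_score on the held-out test set. To inspect the grid search the way In 74 does, the cv_results_ attribute can be loaded into a dataframe; the sketch below assumes a fitted GridSearchCV object named grid_search with a pipeline step named knn, as in Section 15.3 (both names are hypothetical):

import pandas as pd

# Hypothetical: grid_search is a fitted GridSearchCV over knn__n_neighbors
results = pd.DataFrame(grid_search.cv_results_)

# Compare candidate k values by mean cross-validation score and rank
results[["param_knn__n_neighbors", "mean_test_score",
         "std_test_score", "rank_test_score"]].sort_values("rank_test_score")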
Next we used RadiusNeighborsClassifier to classify, but with radius = 0.5 we got terrible accuracy, so in In 89 we used GridSearchCV to find the best radius over the candidates [0.5, 1.0, 1.5, 2.0, 5.0, 10.0, 20.0, 50.0, 100.0]. The best radius is 5.0, with an accuracy of 0.906, so we think RadiusNeighborsClassifier is not a good fit for this dataset.
Furthermore, in In 98 we used an approximate nearest neighbors (ANN) search with Facebook's Faiss library, as in Section 15.5, and got an accuracy of 0.975.
Finally, in In 107 we computed the recall @k of the ANN compared to the KNN for k = 5 and k = 1, obtaining recalls of 0.9955 and 0.9755, respectively.
[1] Chapter 15. K-Nearest Neighbors, Machine Learning with Python - Theory and Implementation
[2] What is the KNN algorithm?, IBM
[3] sklearn.neighbors.KNeighborsClassifier, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[5] sklearn.neighbors.NearestNeighbors, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[6] sklearn.model_selection.GridSearchCV, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.
[7] sklearn.neighbors.RadiusNeighborsClassifier, Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011.