After completing this learning module, students will be able to:
Describe user behavior anomalies.
Explain when the K-Nearest Neighbors (KNN) classification algorithm should be applied for anomaly detection.
Apply unsupervised K-Nearest Neighbors (KNN) classification for user behavior anomaly detection.
Please take the pretest survey before moving on! The survey link is posted in D2L.
What are user behavior anomalies?
An anomaly, also known as an outlier, is an event that differs significantly from normal instances. In order to find anomalies, algorithms must work under the assumption that these outliers are rare and that their features are significantly different from those of normal instances.
Thus, a user behavior anomaly is an action by an individual that falls outside their typical behavior. These anomalies are found by comparing a user's current activities to their past activities. Over a long enough period of time, a user generates enough activity to establish their own behavioral pattern. Once this pattern is learned, it is possible to evaluate whether or not a user's current actions are in line with their typical behavior.
Why is it important to detect user-based anomalies?
There are many reasons why it is important to detect these anomalies. Useful applications of anomaly detection include intrusion detection, fraud detection, anomaly-based malware detection, and anomaly detection in video surveillance.
An example of fraud detection using anomalies comes from examining credit card histories. Every transaction on a person's credit card is recorded with certain attributes, one of which is location.
Suppose that every transaction a user made in the last year came from within Texas, and suddenly a transaction comes from within China. This should be marked as an anomaly because it does not follow the user's previously established behavior pattern. Thus, this transaction should be flagged and denied by the credit card company until further action can be taken by the account holder.
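As a rough illustration of this idea (separate from the KNN approach covered below), a minimal sketch might simply compare a new transaction's location against the locations seen in the user's history; the function name and data here are hypothetical.
past_locations = {"Texas"}  # locations seen in this user's transaction history (hypothetical data)
def is_location_anomaly(transaction_location, history):
    # Flag the transaction if its location has never appeared in the user's history
    return transaction_location not in history
print(is_location_anomaly("China", past_locations))  # True: flag and hold for review
print(is_location_anomaly("Texas", past_locations))  # False: matches established behavior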
K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used to solve both classification and regression problems. This means that it relies on labeled input data to learn and produce an output. KNN operates on the assumption that similar things exist in close proximity to each other. The algorithm quantifies this by using a distance metric, commonly the Euclidean distance, to measure how far apart points are.
In KNN, the value K is chosen by the programmer; it is the number of nearest neighbors that take part in the majority voting process. A good K for a dataset is typically the square root of N, where N is the number of data points in the set. As K decreases, the predictions become less stable (more sensitive to noise); as K increases, the predictions become more stable, although a K that is too large can blur the boundaries between classes.
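As a quick illustration of the square-root rule of thumb, a minimal sketch (with a hypothetical dataset size) might choose K like this:
import math
n = 45                           # suppose the dataset has N = 45 examples (hypothetical)
k = max(1, round(math.sqrt(n)))  # rule of thumb: K is approximately the square root of N
print(k)                         # 7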
KNN algorithm
The first step of the algorithm is for the user to declare the K they would like to use for the dataset.
After this, for every entry, the distance between the query example and the current example is calculated, and the distance together with the index of the example is added to an ordered collection.
Next, the ordered collection of distances and indices is sorted from smallest to largest by distance.
After that, the first K entries are selected from the sorted collection, and their labels are retrieved.
Lastly, for classification, the most commonly occurring label among the K labels is returned.
Note: If you are not familiar with Google Colab, please complete the M0 module to get started.
Basic KNN algorithm for classification
Copy and paste the following code into a new Colab notebook.
from collections import Counter
import math
import numpy as np
import matplotlib.pyplot as plt
def knn(data, query, k):
    neighbor_distances_and_indices = []
    # For each example in the data, calculate the distance between the query and the current example
    for index, example in enumerate(data):
        distance = euclidean_distance(example[:-1], query)
        # Add the distance and the index of the example to an ordered collection
        neighbor_distances_and_indices.append((distance, index))
    # Sort the ordered collection of distances and indices from smallest to largest by the distances
    sorted_neighbor_distances_and_indices = sorted(neighbor_distances_and_indices)
    # Pick the first K entries from the sorted collection
    k_nearest_distances_and_indices = sorted_neighbor_distances_and_indices[:k]
    # Get the labels of the selected K entries
    k_nearest_labels = [data[i][-1] for distance, i in k_nearest_distances_and_indices]
    # For classification, return the mode of the K labels
    return k_nearest_distances_and_indices, mode(k_nearest_labels)
# Used to find the most commonly occurring label in a set
def mode(labels):
    return Counter(labels).most_common(1)[0][0]
# Finds the distance between two points
def euclidean_distance(point1, point2):
    sum_squared_distance = 0
    for i in range(len(point1)):
        sum_squared_distance += math.pow(point1[i] - point2[i], 2)
    return math.sqrt(sum_squared_distance)
def plot(data):
    x = []
    y = []
    for pt in data:
        x.append(pt[0])
        y.append(pt[1])
    plt.scatter(x, y)
    plt.show()
def main():
    # Labeled classification data
    # Column 0: attribute
    # Column 1: label
    data = [
        [1, 1],
        [3, 1],
        [5, 1],
        [6, 1],
        [10, 1],
        [13, 1],
        [17, 1],
        [22, 1],
        [26, 1],
        [27, 1],
        [30, 1],
        [32, 1],
        [37, 1],
        [38, 1],
        [41, 1],
        [42, 1],
        [43, 1],
        [43, 1],
        [46, 1],
        [50, 0],
        [51, 0],
        [52, 0],
        [53, 0],
        [53, 0],
        [53, 0],
        [60, 0],
        [63, 0],
        [65, 0],
        [65, 0],
        [66, 0],
        [68, 0],
        [70, 0],
        [75, 0],
        [100, 1],
        [103, 1],
        [107, 1],
        [110, 1],
        [112, 1],
        [117, 1],
        [120, 1],
        [121, 1],
        [122, 1],
        [122, 1],
        [124, 1],
        [125, 1]
    ]
    # Question:
    # Given the attribute's value, what is the label of the query?
    query = [48]
    k_nearest_neighbors, prediction = knn(
        data, query, k=5
    )
    print(prediction)
    print(k_nearest_neighbors)
    plot(data)

if __name__ == '__main__':
    main()
The example above has three distinct clusters: the first between 1 and 46, the next between 50 and 75, and the last between 100 and 125. To classify the query, its k nearest neighbors are found using the Euclidean distance. The most frequent label among those k nearest neighbors is then chosen as the label for the new point.
The code prints the predicted label along with the k nearest distances and their indices. In this example, the farthest of the five nearest neighbors is 5 units from the query, which is an acceptable range. The code also produces a 2D scatter plot of the data (attribute versus label), in which the clusters are visible.
KNN for anomaly detection
When using KNN for anomaly detection, the approach differs from the traditional supervised approach. To find the outliers, you treat the dataset as unsupervised, meaning the labels are not passed to the algorithm. However, the labels should be kept so that accuracy can be evaluated later.
After the nearest neighbors of every point have been computed, the distances and indices are retrieved. Then a graph is plotted with the point indices on the X axis and the mean of each point's k-distances on the Y axis. Once the graph is plotted, it is up to the data scientist to visually determine a cutoff that constitutes an outlier and to build an index containing all of the outliers. This outlier index can then be compared to the held-out labels in the dataset.
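A minimal sketch of this unsupervised workflow is given below. It uses scikit-learn's NearestNeighbors instead of the hand-written function above, the feature values are hypothetical, and the cutoff of 20.0 is an assumed threshold that a real analysis would choose by inspecting the plot.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

# Hypothetical, unlabeled user-activity features: mostly normal points plus two outliers
X = np.array([[1], [3], [5], [6], [10], [13], [17], [22], [26], [27], [95], [120]])

k = 3
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point is its own nearest neighbor
distances, indices = nbrs.kneighbors(X)
mean_k_distance = distances[:, 1:].mean(axis=1)    # drop the self-distance, average the k real neighbors

# Plot index (X axis) against mean k-distance (Y axis) and visually pick a cutoff
plt.plot(range(len(X)), mean_k_distance, 'o')
plt.xlabel('index')
plt.ylabel('mean k-distance')
plt.show()

cutoff = 20.0                                      # assumed threshold, chosen by inspecting the plot
outlier_index = np.where(mean_k_distance > cutoff)[0]
print(outlier_index)                               # compare these indices against the kept labels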
Pros and Cons of KNN Algorithm
Pros
The algorithm is simple and easy to implement
There is no explicit training phase: no model needs to be built ahead of time, and there are few parameters to tune (only K and the distance metric)
It is a versatile algorithm that can be used for classification, regression, or search
Cons
The algorithm becomes slower as the number of examples increases, because distances to every stored example must be computed at query time