After completing this learning module, students will be able to:
Describe network traffic
Explain the K-Means Clustering algorithm
Apply unsupervised K-means clustering to detect ransomware
What is Ransomware?
Ransomware is a type of malware that uses encryption to hold a victim's personal data hostage for ransom. When a user or organization falls victim to ransomware, their important data is encrypted so that they can no longer access it. This data can include files, databases, or applications. Ransomware is often designed to cause maximum damage: it can spread across an entire network and target databases and file servers, which makes it possible to paralyze an entire organization.
Who is the target of ransomware?
The largest target of ransomware is businesses, because a single download can cripple a large number of users. The other primary reason businesses are targeted is that they are able to pay larger ransoms than an individual person. Cybercriminals have already inflicted significant damage and expense on businesses and governmental organizations, and as a result have generated billions of dollars for themselves.
The secondary targets of ransomware are individual users. However, these users are not preferred because they tend to be less lucrative. It is important to never pay for access to the encrypted data, because paying can permanently mark the victim as a target for future attacks.
How does ransomware work?
Ransomware typically uses asymmetric encryption to lock a user out of their files. This type of encryption uses a pair of keys: a public key to encrypt a file and a private key to decrypt it. The attacker generates the public and private keys uniquely for the victim on the attacker's server. The cybercriminal makes the private key available to the victim after a ransom is paid, or at least that is what is promised. Without access to this private key, it is nearly impossible to decrypt and access the files.
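The idea of a key pair can be illustrated with a toy version of RSA, one well-known asymmetric scheme. This is only a sketch for intuition: the module does not say which algorithm any particular ransomware uses, the primes and the message value below are made up for the example, and keys this small offer no security at all.

```python
# Toy RSA key pair with tiny primes (illustration only, not secure).
p, q = 61, 53                 # assumed example primes
n = p * q                     # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                        # public exponent
d = pow(e, -1, phi)           # private exponent (modular inverse, Python 3.8+)

public_key = (e, n)           # can be shared: used to encrypt
private_key = (d, n)          # kept secret: needed to decrypt


def encrypt(message, key):
    """Encrypt an integer smaller than the modulus with the public key."""
    exponent, modulus = key
    return pow(message, exponent, modulus)


def decrypt(ciphertext, key):
    """Recover the original integer with the private key."""
    exponent, modulus = key
    return pow(ciphertext, exponent, modulus)


message = 65                  # stand-in for a piece of data
ciphertext = encrypt(message, public_key)
recovered = decrypt(ciphertext, private_key)
print(ciphertext, recovered)
```

Anyone holding only the public key can produce the ciphertext, but turning it back into the original value requires the private key, which is why the victim cannot recover files without it.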
Most of the time, ransomware is distributed through email or through other targeted attacks. Like any malware, it needs an attack vector in order to establish its presence on an endpoint. After its presence is established, it stays on the endpoint until its task is accomplished.
How to defend against ransomware
The best defense against ransomware is to create and secure backups. If your device is infected, you can then completely wipe it and restore all your files from the backup. These backups should not be accessible from the system itself; they should be created either in the cloud or on an external hard drive. Note that creating backups will not prevent ransomware. It only mitigates the risk.
In order to avoid ransomware and malware in general, it is important to practice safe internet use. This includes not responding to messages from anyone you don't know and only downloading applications from trusted sources. In addition, only secure networks should be used. If using a secure network is not an option, a VPN should be used to provide a secure connection.
K-Means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. This algorithm is a clustering algorithm that aims to partition n observations into k clusters where every observation belongs to the cluster with the nearest mean. The main goal of this algorithm is to group similar data points together and to try and discover the underlying patterns in the dataset.
The number of clusters, K, is a user-defined number of centroids to be found in the dataset. A good starting value for K is the number of different classes contained in the data, when that number is known. A centroid is simply the location that represents the center of a cluster, and each data point is assigned to a cluster based on its distance from the centroids.
The steps of K-Means Clustering
Provide the number of clusters, K, to be generated by the algorithm.
Randomly choose K data points as the initial cluster centers, then assign every other data point to the cluster whose center is nearest.
Compute the cluster centroids.
Find the ideal centroids by repeating the assignment and centroid computation until the centroids no longer change.
To find the optimal centroids, the algorithm works to minimize the sum of squared distances between data points and their centroids. Each data point is allocated to the cluster whose centroid it is closest to. Each centroid's location is then found by averaging all of the data points in its cluster.
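Written out, the sum of squared distances and the centroid average described above can be expressed as follows. The symbols are notation assumed here rather than taken from the module: x_i is a data point, C_j is the set of points currently assigned to cluster j, and \mu_j is that cluster's centroid.

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i
```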
This algorithm uses an Expectation-Maximization approach to build clusters. This method has an Expectation step and a Maximization step. In the Expectation step, each data point is assigned to its nearest cluster. In the Maximization step, the centroid of each cluster is recalculated.
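The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a production implementation: the function names, the fixed random seed, the iteration cap, and the six sample points are all made up for the example, and real projects would normally use a library implementation.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, max_iters=100, seed=0):
    """Return (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # Randomly choose K data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Expectation step: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # If no assignment changed, the centroids would not change either.
        if new_labels == labels:
            break
        labels = new_labels
        # Maximization step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean_point(members)
    return centroids, labels


# Made-up 2-D data with two visibly separate groups.
sample_points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
                 (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(sample_points, k=2)
print(centroids)
print(labels)
```

On this sample, the first three points end up sharing one label and the last three share the other. Which group receives label 0 depends on the random starting centroids.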
Pros and Cons of K-Means
Pros
Easy to implement
Runs quickly with a high number of variables
An instance's cluster can change when the centroids are recomputed
Cons
Due to the iterative nature of the algorithm and its random starting centroids, it may get stuck at a local optimum instead of the global optimum
Data must be normalized in order to compare distances
Sensitive to rescaling of the variables
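The last two points can be seen with a small example. The sketch below applies min-max scaling, one common normalization choice that is assumed here rather than prescribed by the module, to three made-up network flows described by hypothetical features (bytes sent, duration in seconds).

```python
import math


def min_max_scale(rows):
    """Rescale every column of a list of numeric tuples to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(
            (value - lo) / (hi - lo) if hi != lo else 0.0  # constant column -> 0
            for value, lo, hi in zip(row, lows, highs)
        )
        for row in rows
    ]


# Hypothetical flows: (bytes_sent, duration_seconds)
flows = [(500000.0, 1.0), (520000.0, 9.0), (100000.0, 1.5)]
scaled = min_max_scale(flows)

# Compare flow 0 with the other two flows before and after scaling.
print(math.dist(flows[0], flows[1]), math.dist(flows[0], flows[2]))
print(math.dist(scaled[0], scaled[1]), math.dist(scaled[0], scaled[2]))
```

Before scaling, the byte counts dominate the distance, so the first flow looks closest to the second. After scaling, the duration carries comparable weight and the first flow is closest to the third. Because K-means assigns points by distance, a change like this can change the resulting clusters.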