Clustering is a form of unsupervised¹ machine learning that can reveal patterns, relationships, and underlying structure in a dataset.
At its foundation, clustering is based on similarity within the data, and the idea is almost self-explanatory from the name: it is the process of grouping the data into “clusters” of points that are considered similar.
This concept is easiest to picture with points in 2-dimensional space: a group of points forms a “cluster” if the distances between the points are relatively small. The measurement of distance is therefore the foundation of clustering models.
We can then think of our data as a collection of points in n-dimensional space, where n is the number of columns and each row is an individual “point.”
With this picture of data in mind, note that in data science we often work with data of more than two or three dimensions. Although our intuitive ability to cluster points breaks down beyond three dimensions, distance formulas in mathematics are not limited to 2D or 3D space (the formula simply gains more terms). So while there is no mathematical obstacle to calculating “distances” in spaces larger than three dimensions, the remaining challenge is scale: computing and comparing distances across thousands and thousands of data points.
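To make the “the formula just gains more terms” point concrete, here is a minimal sketch (not from the project code) of the Euclidean distance formula written so that it works for points of any dimension:

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two points of equal (arbitrary) dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# The same function handles 2D, 3D, or any higher-dimensional points:
print(euclidean_distance((0, 0), (3, 4)))              # 2D -> 5.0
print(euclidean_distance((1, 2, 3, 4), (1, 2, 3, 4)))  # 4D -> 0.0
```

Each extra column in the data simply adds one more squared-difference term under the square root.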
The purpose of clustering machine learning models, then, is to find potential “similarities” in our data: they do the work of sifting through possibly thousands of points and comparing their distances to determine whether the points form any clusters, often in a space of more than three dimensions.
---
Footnotes:
¹ There are two types of clustering methods evaluated in this project. One of the methods is K-means clustering, and the other is hierarchical clustering.
K-means clustering is a clustering method built around a “centroid,” the point the model considers to be the center of a cluster in n-dimensional space. K-means requires the user to give the model the (“expected”) number of clusters.
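As a rough sketch of these ideas (assuming scikit-learn is available, and using two synthetic 2-D “blobs” in place of the project's NOAA data), K-means takes the expected number of clusters from the user and returns the learned centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic 2-D "blobs" standing in for the real dataset
data = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                  rng.normal(5, 0.5, size=(50, 2))])

# The user supplies the expected number of clusters (k = 2 here)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(kmeans.cluster_centers_)  # the centroids: one per cluster
print(kmeans.labels_[:5])       # cluster assignments of the first 5 points
```

The centroids land near the centers of the two blobs, which is exactly the “center of a cluster in n-dimensional space” described above.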
Hierarchical clustering utilizes distances to …
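For a comparable sketch of hierarchical (agglomerative) clustering — assuming SciPy is available, and again on synthetic data rather than the project's NOAA data — each point starts as its own cluster and the closest clusters are merged repeatedly, building a tree that can be cut into a chosen number of flat clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.3, size=(20, 2)),
                    rng.normal(4, 0.3, size=(20, 2))])

# Agglomerative clustering: repeatedly merge the two closest
# clusters, recording each merge in the linkage matrix Z
Z = linkage(points, method="ward")

# Cut the resulting tree into 2 flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

Unlike K-means, the number of clusters does not have to be fixed up front: the full merge tree is built first, and the cut (here `t=2`) can be chosen afterward.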
Data that was used: NOAA API Data
Page explaining data prep and code
Documentation to explain the attribute labels in the GSOY data: GSOY attribute documentation (also written out in data prep page)
Data Cleaning: https://github.com/Rokkaan5/5622-PublishedCode/blob/main/data/API-NOAA/NOAA-clust-data-cleaning.py
Hierarchical Clustering in R: https://github.com/Rokkaan5/5622-PublishedCode/blob/main/Clustering/hier-clust-inR.qmd
KMeans Clustering in Python: https://github.com/Rokkaan5/5622-PublishedCode/blob/main/Clustering/noaa-cluster-kmeans.py