Both K-means and agglomerative clustering tend to produce cohesive groups that spread roughly equally in all directions. Real data can be messier: groups may take irregular shapes far from the canonical bubble.
DBSCAN is another clustering algorithm, built on an intuition that copes well with such irregular shapes. It relies on the idea that clusters are dense regions, so it is enough to explore the data space in every direction and mark a cluster boundary where the density drops. It determines the number of clusters automatically and flags unusual data that doesn't fit easily into any cluster.
Using DBSCAN, you won't have to set a K number of expected clusters; the algorithm finds them by itself. It does require you to set two essential parameters:
eps: The maximum distance between two observations that allows them to be part of the same neighborhood.
min_samples: The minimum number of observations a neighborhood must contain for its central observation to count as a core point.
No matter what the shape of the cluster, DBSCAN links all the neighborhoods together if they are near enough (within the distance eps). The data points that aren't associated with any group are treated as noise.
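To make the role of the two parameters concrete, here is a minimal sketch using scikit-learn's DBSCAN. The two-moons toy dataset and the values eps=0.2 and min_samples=5 are assumptions chosen for illustration, not values from the example discussed in this text.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy data (an assumption): two interleaved half-moons, a shape that
# is far from the "canonical bubble".
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are illustrative values picked for this toy data.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

# The label -1 marks noise, so it is excluded from the cluster count.
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int(np.sum(labels == -1))
print("clusters found:", n_clusters)
print("noise points:", n_noise)
```

Note that the number of clusters is never passed in; it falls out of the density settings.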
Returning to the example, some data exploration lets you view the results from the right perspective. First, using the collections module, count how many observations fall into each cluster.
Almost half the observations are assigned to the cluster labeled -1, which represents the noise (noise is defined as examples that are too unusual to group). Given the number of dimensions in the data (30 uncorrelated variables from a PCA) and its high variability, many cases do not naturally fall together into the same group.
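The counting step might look like the following sketch. The dataset here is a synthetic stand-in (an assumption, since the example's own data and fitted model are not shown in this section), so the counts it prints will differ from the ones discussed next.

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in data (an assumption): three compact synthetic blobs.
X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=1)

# Illustrative parameter values, not the ones used in the text's example.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Counter maps each cluster label to its number of observations;
# the label -1, if present, collects the noise points.
counts = Counter(labels.tolist())
print(counts)
```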
Clustering on 32-dimensional data
Using data.txt and ground_truth.txt, try to do the following.
As you did in the previous exercise, perform PCA and scaling on the data, then print the explained variance ratio. Decide on the optimal value of n_components. Save the transformed data in a variable.
Run DBSCAN on that data and adjust eps and min_samples so that the number of noise points (the cluster labeled -1) is minimal.
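One possible shape for this step is sketched below. It rests on several assumptions: the random 32-column array is only a stand-in for the contents of data.txt (whose format is not described here), scaling is applied before PCA, and the 95% cumulative-variance threshold is just one common rule of thumb for picking n_components.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Stand-in for the real data (an assumption); in the exercise this
# array would be loaded from data.txt instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))

# Scale each feature to zero mean and unit variance, then fit a full PCA.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)
print(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance
# reaches the (assumed) 95% threshold.
cum = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cum, 0.95) + 1)

# Save the reduced data in a variable for the clustering step.
X_reduced = PCA(n_components=n_components).fit_transform(X_scaled)
print(n_components, X_reduced.shape)
```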
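A simple way to explore the two parameters is a small grid search that records the number of clusters and noise points for each setting, as in the sketch below. The synthetic data and the candidate values are assumptions for illustration. Keep in mind that a large enough eps drives the noise count to zero by merging everything into a single cluster, so the cluster count is worth watching alongside the noise count.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Stand-in for the PCA-reduced data (an assumption).
X, _ = make_blobs(n_samples=300, centers=4, n_features=5,
                  cluster_std=1.0, random_state=0)

results = []
for eps in (0.5, 1.0, 1.5, 2.0):          # illustrative candidates
    for min_samples in (3, 5, 10):        # illustrative candidates
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        n_noise = int(np.sum(labels == -1))            # points labeled -1
        n_clusters = len(set(labels.tolist()) - {-1})  # clusters, noise excluded
        results.append((eps, min_samples, n_clusters, n_noise))
        print(f"eps={eps} min_samples={min_samples} "
              f"clusters={n_clusters} noise={n_noise}")

# Setting with the fewest noise points among those tried.
best = min(results, key=lambda r: r[3])
print("fewest noise points:", best)
```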