Euclidean distance: dist()
Continuous variables
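A minimal sketch of the Euclidean case. The data frame `players` and its values are made up for illustration; dist() uses method = "euclidean" by default.

```r
# Two hypothetical observations measured on two continuous variables
players <- data.frame(x = c(0, 3), y = c(0, 4))

# dist() defaults to method = "euclidean"
d <- dist(players)
print(d)

# Same distance by hand: sqrt((0 - 3)^2 + (0 - 4)^2)
by_hand <- sqrt(sum((players[1, ] - players[2, ])^2))
print(by_hand)
```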
Jaccard distance: dist(data frame, method = "binary")
Categorical variables (e.g., true and false)
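A small sketch of the binary case, assuming answers are coded 1 (true) and 0 (false). The `survey` data frame is hypothetical. With method = "binary", the distance between two rows is the share of variables where only one row is "on", among the variables where at least one row is "on".

```r
# Hypothetical true/false answers coded as 1/0
survey <- data.frame(
  wine   = c(1, 1, 0),
  beer   = c(0, 1, 0),
  coffee = c(1, 1, 1)
)

# Pairwise binary (Jaccard) distances between the three respondents
d_binary <- dist(survey, method = "binary")
print(d_binary)
```

Respondents 1 and 2 differ on one of the three items that at least one of them answered "true", so their distance is 1/3.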
Categorical variables with more than two categories (convert to dummy variables first):
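library(dummies)  # provides dummy.data.frame()
# Expand each categorical column into 0/1 dummy columns
x <- dummy.data.frame(spring2023)
# Binary (Jaccard) distance on the dummy-coded data
dist(x, method = "binary")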
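If the dummies package is not installed, the same idea can be sketched with base R's model.matrix(). This is an alternative, not the method used in these notes; the `answers` data frame and its columns are hypothetical stand-ins.

```r
# Hypothetical stand-in for a data frame of categorical answers
answers <- data.frame(
  color = factor(c("red", "blue", "green", "red")),
  size  = factor(c("S", "M", "S", "L"))
)

# One 0/1 column per level of each factor ("- 1" drops the intercept
# so that no level is left out)
dummy_color <- model.matrix(~ color - 1, data = answers)
dummy_size  <- model.matrix(~ size - 1, data = answers)
x <- cbind(dummy_color, dummy_size)

# Binary (Jaccard) distance on the dummy-coded matrix
d <- dist(x, method = "binary")
print(d)
```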
Scale - standardization (put variables on a comparable scale before computing distances)
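A sketch of why scaling matters, assuming base R's scale() is the intended tool. The `customers` data are made up. Without scaling, the variable with the largest units (income here) dominates the distances.

```r
# Hypothetical data: income in dollars, age in years
customers <- data.frame(
  income = c(30000, 52000, 75000, 110000),
  age    = c(25, 40, 33, 58)
)

# scale() centers each column to mean 0 and rescales it to sd 1
customers_scaled <- scale(customers)
print(colMeans(customers_scaled))
print(apply(customers_scaled, 2, sd))

# Distances before and after standardization
d_raw    <- dist(customers)
d_scaled <- dist(customers_scaled)
```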
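Types of Linkages in Clustering: complete (largest distance between members of two clusters), single (smallest distance), average (mean of all pairwise distances).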
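The three linkages can be compared on a toy example through the `method` argument of hclust(). The points below are hypothetical and chosen so that no two merges tie.

```r
# Five hypothetical points on a line
pts <- data.frame(x = c(0, 1, 3, 7, 12))
d <- dist(pts)

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")
hc_average  <- hclust(d, method = "average")

# Heights at which clusters are merged under each linkage
print(hc_complete$height)
print(hc_single$height)
print(hc_average$height)
```

Single linkage merges at the gaps between neighbouring points, while complete linkage merges at the full span of the combined cluster, so its final merge height is much larger.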
Hierarchical Clustering in R (use the hierarchical clustering method to group observations)
Visualizing the dendrogram (a tree diagram)
Cutting the tree
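The three steps above (cluster, draw the dendrogram, cut the tree) can be sketched as follows. The data frame `pts` is hypothetical; hclust(), plot() and cutree() are base R.

```r
# Hypothetical 2-D data with two visible groups
pts <- data.frame(
  x = c(1, 1.5, 2, 8, 8.5, 9),
  y = c(1, 2, 1.5, 8, 9, 8.5)
)

# 1. Hierarchical clustering on the Euclidean distance matrix
hc <- hclust(dist(pts), method = "complete")

# 2. Visualize the dendrogram
plot(hc, main = "Dendrogram (complete linkage)")

# 3. Cut the tree, either into k groups or at a given height h
clusters_k <- cutree(hc, k = 2)
clusters_h <- cutree(hc, h = 3)

# Attach the cluster assignment to the original observations
pts$cluster <- clusters_k
print(pts)
```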
Application (more than 2 dimensions, continuous variables): market segmentation (i.e. use consumer characteristics to group them into subgroups).
What if you don't know the right value of k in advance? Use plots to choose it.
Elbow plot: total within-cluster sum of squares plotted against k (look for the "elbow" where adding clusters stops helping much)
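A sketch of an elbow plot with kmeans(), assuming simulated data with three groups. The object names (`segments`, `tot_withinss`) and the range of k are illustrative choices.

```r
set.seed(42)

# Hypothetical data: three well-separated groups in two dimensions
segments <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)

# Total within-cluster sum of squares for k = 1, ..., 8
k_values <- 1:8
tot_withinss <- sapply(k_values, function(k) {
  kmeans(segments, centers = k, nstart = 20)$tot.withinss
})

# Elbow plot: the bend suggests a reasonable k
plot(k_values, tot_withinss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```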
Silhouette analysis method
Silhouette width S(i) of observation i is built from two parts:
Within Cluster Distance: C(i)
Closest Neighbour Distance: N(i)
S(i) values: close to 1 means the observation is well matched to its cluster, 0 means it lies on the border between two clusters, and close to -1 means it would fit better in the neighboring cluster.
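For reference, the silhouette width of an observation is (closest neighbor distance - within cluster distance) divided by the larger of the two. A sketch of choosing k by average silhouette width, assuming the cluster package (a recommended package bundled with standard R installations) and simulated data:

```r
library(cluster)  # provides silhouette()

set.seed(42)

# Hypothetical data: three well-separated groups in two dimensions
segments <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 8, sd = 0.5), ncol = 2)
)
d <- dist(segments)

# Average silhouette width for k = 2, ..., 6 (not defined for k = 1)
k_values <- 2:6
avg_width <- sapply(k_values, function(k) {
  km  <- kmeans(segments, centers = k, nstart = 20)
  sil <- silhouette(km$cluster, d)
  mean(sil[, "sil_width"])
})

# The k with the highest average silhouette width
best_k <- k_values[which.max(avg_width)]
print(best_k)
```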
Making sense of the k-means clusters
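One common way to interpret the clusters is to compare the average of each original variable across clusters. A sketch with made-up customer data (`spend`, `visits` are hypothetical variables):

```r
# Hypothetical customer data
customers <- data.frame(
  spend  = c(20, 25, 22, 90, 95, 88),
  visits = c(2, 3, 2, 10, 12, 11)
)

# k-means on the standardized variables
set.seed(1)
km <- kmeans(scale(customers), centers = 2, nstart = 20)

# Attach the cluster labels to the original (unscaled) data
customers$cluster <- km$cluster

# Size of each cluster
print(table(customers$cluster))

# Mean of each variable within each cluster, in the original units
cluster_means <- aggregate(cbind(spend, visits) ~ cluster,
                           data = customers, FUN = mean)
print(cluster_means)
```

Reading the means in the original units (rather than the scaled ones) makes it easier to describe each segment, for example "low spend, few visits" versus "high spend, frequent visits".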