To explore whether the sound features form clusters, we first draw three bubble scatter plots.
In the cluster analysis, we ask whether sub-categories exist within the given categories, using “pop” as an illustration. To look for clusters within “pop”, we applied k-means clustering, hierarchical clustering, and DBSCAN. The k-means distance plot and the silhouette coefficients both show that clustering quality decreases as the number of clusters increases.
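The idea behind the k-means distance plot can be sketched as follows. This is a minimal illustration in plain Python, not the code used for this analysis: the function `kmeans`, the synthetic `points`, and the choice of within-cluster sum of squared distances as the "distance" are all assumptions made for the example.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm. Returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, inertia


# Synthetic stand-in for the song features (hypothetical data, two blobs).
rng = random.Random(42)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
points += [(rng.gauss(6, 1), rng.gauss(6, 1)) for _ in range(50)]

# One value per k; plotting these against k gives the distance (elbow) plot.
for k in range(1, 6):
    _, _, inertia = kmeans(points, k)
    print(k, round(inertia, 1))
```

Because the total distance almost always shrinks as k grows, the plot is read for the point where the improvement levels off rather than for its minimum, which is why it is paired with the silhouette coefficient here.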
Judged by silhouette score, hierarchical clustering with 2 final clusters produces the single best-quality result. Overall, however, k-means clustering has a higher average silhouette score and a better fit. The highest silhouette score is only around 0.33, which indicates weak cluster structure, so there may not be any well-defined subcategory.
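To make the silhouette score concrete, the sketch below computes it from its definition: for each point, a is the mean distance to the other points in its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The function names and the toy points are assumptions for illustration only, and Euclidean distance is assumed.

```python
import math


def silhouette_samples(points, labels):
    """Per-point silhouette values; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance to the other members of the point's own cluster
        # (the point's zero distance to itself is in the sum, hence len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def silhouette_score(points, labels):
    """Mean silhouette value over all points."""
    values = silhouette_samples(points, labels)
    return sum(values) / len(values)


# Toy data (hypothetical): two tight groups that are far apart.
toy = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(silhouette_score(toy, [0, 0, 1, 1]))  # labels follow the groups
print(silhouette_score(toy, [0, 1, 0, 1]))  # labels cut across the groups
```

Values close to 1 mean points sit much nearer to their own cluster than to any other, values near 0 mean the clusters overlap, and negative values mean points are on average closer to another cluster.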
The following cluster plot shows that when k = 2, the two clusters are clearly distinct. As the number of clusters increases, the shape and boundary of each cluster become blurred, indicating lower clustering quality.
The hierarchical clustering plots show results similar to those of k-means. When n = 2, the two clusters have clear boundaries. When n = 3, the clusters begin to overlap. When n = 5 and n = 20, the points are thoroughly mixed and there is no clear distinction among clusters.
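For reference, the way a hierarchical (agglomerative) method arrives at n final clusters can be sketched as below. The linkage used in this analysis is not stated, so average linkage is an assumption here, as are the function name `agglomerative` and the toy points. The implementation is deliberately naive and only suitable for small inputs.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering with average linkage (assumed for this sketch)."""
    # Start with every point in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(math.dist(points[i], points[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        # Merge cluster b into cluster a (b > a, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy data (hypothetical): two compact groups of three points each.
toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(agglomerative(toy, 2))
print(agglomerative(toy, 3))
```

Cutting the same merge sequence at a larger n simply stops merging earlier, so a high n such as 20 splits groups that the data do not clearly separate.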
Using DBSCAN, we obtain only one cluster.
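A single DBSCAN cluster is what one would expect if the points form one connected dense region at the chosen neighbourhood radius. The sketch below illustrates this behaviour on synthetic data. It is a minimal plain-Python version of the algorithm, not the code used here; the parameter names `eps` and `min_samples` and the grid data are assumptions for the example, and the values actually used in the analysis are not stated.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Neighbourhood of each point within radius eps (includes the point itself).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        # i is a core point: grow a new cluster from it.
        labels[i] = cluster
        frontier = list(neighbours[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                frontier.extend(neighbours[j])  # j is also core: keep expanding
        cluster += 1
    return labels


# Synthetic data (hypothetical): two 5x5 grids separated by a gap of 0.6.
grid_a = [(x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
grid_b = [(1.0 + x * 0.1, y * 0.1) for x in range(5) for y in range(5)]
pts = grid_a + grid_b

print(sorted(set(dbscan(pts, eps=0.15, min_samples=3))))  # radius smaller than the gap
print(sorted(set(dbscan(pts, eps=0.7, min_samples=3))))   # radius bridges the gap
```

With the small radius the two grids stay separate, while the larger radius links them into one cluster, so a one-cluster result can reflect either genuinely continuous data or a radius that is large relative to the gaps between groups.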
Result
Overall, the “pop” data show only weak subcategories. From the 3-D clustering plots, we observe two potential subcategories: songs with higher energy, danceability, and acousticness, and their counterparts with lower values.