Divides data into k clusters by minimizing the within-cluster variance (a brief code sketch follows the list below).
PROS
Simple and widely used
Efficient on large datasets (with small k)
CONS
Requires specifying k in advance
Sensitive to outliers
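As a minimal, illustrative sketch (the toy data and names here are not from this project), scikit-learn's KMeans exposes the within-cluster variance it minimizes as inertia_:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D data: two blobs around (0, 0) and (5, 5)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(5, 0.5, (50, 2))])

# K-Means assigns each point to the nearest of k centroids and iterates
# until the within-cluster sum of squares (inertia) stops improving
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # learned centroids
print(km.inertia_)          # the variance-based objective being minimized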
Builds a dendrogram by successively merging clusters or splitting them (a brief code sketch follows the list below).
PROS
No need to specify k in advance
Produces a hierarchy of clusters, offering multiple levels of granularity
CONS
Computationally expensive (O(n^3) in the naive case)
Does not scale well to large datasets
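A minimal sketch with scikit-learn's AgglomerativeClustering (the blob data is illustrative, not from this project):

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Illustrative data: three Gaussian blobs
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Agglomerative clustering starts with every point as its own cluster and
# repeatedly merges the two closest clusters; Ward linkage merges the pair
# that increases total within-cluster variance the least
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])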
Groups points that are closely packed together and marks points in low-density areas as outliers (a brief code sketch follows the list below).
PROS
No need to specify k in advance
Good at detecting outliers
CONS
Requires tuning eps and min_samples
Struggles when clusters have varying densities
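A minimal sketch on illustrative non-convex data, where DBSCAN labels sparse points as -1:

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data: two interleaving half-moons
X, _ = make_moons(n_samples=200, noise=0.08, random_state=0)

# A point is a core point if at least min_samples points lie within eps of
# it; clusters grow outward from core points, and anything not reachable
# from a core point is labeled -1 (noise/outlier)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print(set(db.labels_))  # e.g. two cluster labels plus -1 for noise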
Because this analysis uses the same dataset as the PCA section, the preparation is identical to the PCA preparation. The dataset contains patent-related information, including citation counts and other patent attributes. Although it is labeled data, the non-numeric labels were removed for unsupervised clustering, leaving only the citation-related numeric columns. StandardScaler from Sklearn was then used to normalize the citation data, giving each column a mean of 0 and a standard deviation of 1, which is crucial for ensuring that all features contribute equally to the clustering process. Finally, PCA was applied to reduce the data to three principal components, simplifying the dataset while retaining most of its variance; after PCA, the dataset retained 97.68% of its original variance.
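A sketch of that preparation pipeline, assuming the patent data is already loaded into a pandas DataFrame named df (a hypothetical name; the column selection is illustrative):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# df: DataFrame with the patent data, assumed loaded earlier;
# keep only the numeric, citation-related columns (drop text labels)
X = df.select_dtypes(include="number")

# Standardize every column to mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)

# Reduce to three principal components
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scaled)

# Fraction of variance the 3 components retain (~0.9768 as reported above)
print(pca.explained_variance_ratio_.sum())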
Silhouette scores were computed for four candidate values of k (k = 2 through k = 5), and they clearly show that k = 2 is the optimal number of clusters.
[Silhouette plots for k = 2, 3, 4, and 5]
The highest silhouette score is for k = 2 with a score of 0.9649, indicating that the model with k = 2 clusters has the best separation between clusters compared to the others. This suggests that the data is best grouped into two distinct clusters, as adding more clusters results in lower silhouette scores, indicating either overlapping clusters or less clearly defined boundaries.
In summary, k = 2 is the best model based on the silhouette scores.
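A sketch of that selection loop, assuming the X_pca array from the preparation step above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-Means for each candidate k and score the labels; silhouette scores
# lie in [-1, 1], and higher means tighter, better-separated clusters
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    print(k, silhouette_score(X_pca, labels))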
Short vertical lines between points or small clusters indicate that those points or clusters are similar to each other. Conversely, longer vertical lines suggest greater dissimilarity between clusters being merged.
In this case, the tree merges about four times at low heights, producing short vertical lines, but the tall final merge spanning the x-axis points to a recommendation of 2 clusters.
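A sketch of the dendrogram step with SciPy, again assuming X_pca:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Ward linkage builds the merge tree bottom-up; each row of Z records one
# merge and the distance (vertical height) at which it happened
Z = linkage(X_pca, method="ward")

dendrogram(Z)  # tall vertical lines mark dissimilar merges
plt.show()

# Cut the tree into 2 clusters, as the tall final merge suggests
labels = fcluster(Z, t=2, criterion="maxclust")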
Parameters: eps = 0.5, min_samples = 5
With these settings, DBSCAN found one main cluster in the 3D PCA space: a dense group of similar points.
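A sketch of that run, using the parameter values listed above and the same X_pca:

import numpy as np
from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.5, min_samples=5).fit(X_pca)

# Count clusters (excluding the noise label -1) and noise points
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
n_noise = int(np.sum(db.labels_ == -1))
print(n_clusters, "cluster(s) and", n_noise, "noise points")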
K-Means Clustering and Silhouette Method:
K-Means is an effective clustering algorithm, but it requires the number of clusters to be specified beforehand. By using the Silhouette Method, we were able to test different values of k and select the optimal number of clusters based on the highest silhouette score, which measures how well each data point fits within its cluster.
This approach is particularly useful in situations where we don’t know the appropriate number of clusters beforehand, making the silhouette score a valuable metric for evaluating cluster quality.
Hierarchical Clustering:
Hierarchical Clustering offers a different perspective by building clusters in a nested, tree-like structure. This method doesn't require specifying k upfront, which can be useful for exploratory analysis.
While memory-intensive on larger datasets, this method helps visualize the hierarchy and relationships between data points through dendrograms or by directly clustering the data with a specific number of clusters.
DBSCAN (Density-Based Spatial Clustering):
DBSCAN is advantageous when clusters have irregular shapes or when we want to detect outliers, as it doesn’t require specifying k and can handle noise. This contrasts with K-Means and hierarchical clustering, which are more sensitive to outliers.
DBSCAN can be especially useful in identifying dense regions in the data and marking sparse regions as noise or outliers.