The cleaned dataset contains patent-related information, including citation counts and other patent attributes. Although this dataset is labeled, clustering is unsupervised, so I removed the non-numeric label columns and kept only the citation-related numeric columns. I then standardized the citation data with StandardScaler from scikit-learn, so that each column has a mean of 0 and a standard deviation of 1. This puts all features on the same scale, which keeps columns with larger numeric ranges from dominating the clustering. PCA was applied twice, once with 2 components and once with 3 components, to reduce the dimensionality of the data for 2D and 3D visualizations while retaining most of the variance.
Explained variance (2D): 89.40%
Explained variance (3D): 97.68%
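The preprocessing and PCA steps described above can be sketched roughly as follows. This is a minimal illustration, not the original analysis code: the DataFrame is synthetic, and the column names (patent_class, forward_citations, and so on) are hypothetical stand-ins for the real patent columns, so the percentages it prints will differ from the ones reported here.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned patent dataset (assumption: the real
# data has one non-numeric label column and several citation count columns).
rng = np.random.default_rng(0)
n_rows = 200
base = rng.poisson(lam=5, size=n_rows).astype(float)
df = pd.DataFrame({
    "patent_class": rng.choice(["A", "B", "C"], size=n_rows),  # label column
    "forward_citations": base + rng.normal(0, 1, n_rows),
    "backward_citations": 2 * base + rng.normal(0, 2, n_rows),
    "self_citations": rng.poisson(lam=1, size=n_rows).astype(float),
    "citations_5yr": 0.5 * base + rng.normal(0, 1, n_rows),
})

# Drop non-numeric labels and keep only the numeric citation columns.
X = df.select_dtypes(include="number")

# Standardize: each column ends up with mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA twice, for the 2D and 3D projections.
pca_2d = PCA(n_components=2).fit(X_scaled)
pca_3d = PCA(n_components=3).fit(X_scaled)
scores_2d = pca_2d.transform(X_scaled)  # coordinates for a 2D scatter plot
scores_3d = pca_3d.transform(X_scaled)  # coordinates for a 3D scatter plot

var_2d = pca_2d.explained_variance_ratio_.sum()
var_3d = pca_3d.explained_variance_ratio_.sum()
print(f"Explained variance (2D): {var_2d:.2%}")
print(f"Explained variance (3D): {var_3d:.2%}")
```

The projected coordinates in scores_2d and scores_3d are what would be passed to a plotting library or to a clustering algorithm afterwards.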
To retain at least 95% of the variance, I determined that 3 principal components are sufficient: together they explain 97.68% of the variance, whereas 2 components explain only 89.40%. The top 3 eigenvalues were: [3.51671657 0.95403248 0.41399137]
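One way to see how the eigenvalues relate to the 95% threshold is to compute them directly from the covariance matrix of the standardized data: each eigenvalue divided by the sum of all eigenvalues is that component's share of the variance. The sketch below uses only NumPy and synthetic data (an assumption for illustration, not the patent dataset), so its numbers will not match the eigenvalues above. In scikit-learn, a fitted PCA exposes the same quantities as explained_variance_ and explained_variance_ratio_.

```python
import numpy as np

# Synthetic data (illustration only): four correlated columns plus one
# independent column, so that a few components carry most of the variance.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 1))
correlated = [latent + 0.3 * rng.normal(size=(300, 1)) for _ in range(4)]
X = np.hstack(correlated + [rng.normal(size=(300, 1))])

# Standardize each column to mean 0 and standard deviation 1.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the covariance matrix, sorted from largest to smallest.
cov = np.cov(X_scaled, rowvar=False)
eigenvalues = np.linalg.eigvalsh(cov)[::-1]

# Share of variance per component and the running total.
ratio = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(ratio)

# Smallest number of components whose cumulative share reaches 95%.
threshold = 0.95
n_components = int(np.argmax(cumulative >= threshold)) + 1

print("Eigenvalues:", eigenvalues)
print("Cumulative explained variance:", cumulative)
print("Components needed for 95%:", n_components)
```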