Clustering is a pivotal tool for beginning to explore the previously prepared text. It is an unsupervised learning technique that finds groupings and patterns within datasets, shedding light on underlying structures that may not be apparent otherwise. The goal here is to organize the text data into meaningful clusters that give more insight into the data and its characteristics.
Clustering involves several steps, beginning with the earlier preprocessing in which the news, Reddit, and Medium data underwent lemmatization and vectorization. These steps transformed the data from text to numbers for analysis with k-means and hierarchical clustering. For k-means, a crucial step is choosing the number of clusters; this was done with Silhouette analysis, which tries different numbers of clusters and measures whether each object is well matched to its own cluster and poorly matched to neighboring clusters.
As for expected discoveries, clustering can be anticipated to surface thematic groupings within online blockchain discussions. Examining the clusters across sources can also help identify unique characteristics of each platform’s discourse and reveal cross-platform trends, such as one platform showing more polarized clusters than another. Revealing these hidden structures improves comprehension of the data and lays the groundwork for further analysis.
As previously mentioned, the data must be prepared so that its format matches what the clustering algorithms expect. For the news, Reddit, and Medium data, the previously lemmatized and count-vectorized CSV files were used, since count vectorization had already converted the text into numeric data.
News, Reddit, and Medium Lemmatized and Count Vectorized Data to be Used for Clustering
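As a rough illustration of what count vectorization produces, the sketch below builds a document-term count matrix using only the Python standard library. The function name `count_vectorize` and the toy documents are hypothetical and are not taken from the project's files; the original analysis may well have used a library vectorizer instead.

```python
from collections import Counter

def count_vectorize(docs):
    """Turn tokenized (already lemmatized) documents into a document-term count matrix."""
    # Vocabulary: every distinct token, sorted so column order is stable.
    vocab = sorted({tok for doc in docs for tok in doc})
    index = {tok: i for i, tok in enumerate(vocab)}
    matrix = []
    for doc in docs:
        row = [0] * len(vocab)
        for tok, n in Counter(doc).items():
            row[index[tok]] = n
        matrix.append(row)
    return vocab, matrix

# Hypothetical toy documents (assumption: tokens are already lemmatized).
docs = [
    ["blockchain", "security", "risk"],
    ["blockchain", "ethic", "privacy", "privacy"],
]
vocab, matrix = count_vectorize(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary term, which is the numeric form that the clustering algorithms consume.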
Clustering Approaches
For the K-means clustering approach, a range of values for k was tested to identify the optimal number of clusters; the range starts at 2 and extends to a maximum that is reasonable for the dataset size, never exceeding 7. The Silhouette method was then used to assess the quality of the clusters formed by each value of k. This method calculates the Silhouette score, which measures how similar an object is to its own cluster compared to other clusters. The score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. The "best k" is the one that maximizes the average Silhouette score across all data points.
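The selection procedure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the toy 2-D points are invented, Euclidean distance is assumed, and the deterministic farthest-point seeding (`farthest_point_init`) is a simplification chosen so the example is reproducible. In practice a library such as scikit-learn would typically be used.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (assumption): start at points[0], then add the farthest point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(
                tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[c]
            )
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Average of s(i) = (b - a) / max(a, b) over all points."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the Silhouette score needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)

# Hypothetical toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]

# Try k = 2..7, mirroring the range described in the text.
scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, 8)}
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: mean silhouette = {score:.3f}")
print("best k:", best_k)
```

On real count-vectorized text the feature space is high-dimensional and sparse, so scores are usually much lower than on this toy data, but the selection rule (take the k with the highest average score) is the same.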
Hierarchical clustering (HCLUST) was conducted to offer a different perspective on the findings from the K-means approach. Unlike K-means, hierarchical clustering does not require the number of clusters to be specified in advance; instead, it builds a dendrogram, a tree diagram that shows how clusters are merged at every step of the clustering process. To determine the optimal number of clusters, the dendrogram was cut within the largest vertical distance that is not crossed by any horizontal merge line. This method is somewhat subjective but can be quantified by setting a threshold for the maximum distance or by looking for significant gaps in the linkage distances.
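The "largest gap" heuristic can be made concrete with a short sketch. The code below is illustrative only: it uses a naive average-linkage agglomeration with Euclidean distance on invented toy points, and the helper names (`average_linkage_heights`, `clusters_from_largest_gap`) are hypothetical. The linkage criterion and distance measure used in the original analysis are not stated, so both are assumptions here.

```python
import math
from itertools import combinations

def average_distance(points, a, b):
    """Mean pairwise distance between two clusters given as lists of point indices."""
    return sum(math.dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

def average_linkage_heights(points):
    """Naive agglomerative clustering; returns the height of each merge, in merge order."""
    clusters = [[i] for i in range(len(points))]
    heights = []
    while len(clusters) > 1:
        # Find the two closest clusters under average linkage.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: average_distance(points, clusters[pair[0]], clusters[pair[1]]),
        )
        heights.append(average_distance(points, clusters[x], clusters[y]))
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return heights

def clusters_from_largest_gap(heights):
    """Cut the tree inside the widest gap between consecutive merge heights."""
    if len(heights) < 2:
        raise ValueError("need at least two merges to compare gaps")
    n_points = len(heights) + 1
    gaps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    cut = max(range(len(gaps)), key=gaps.__getitem__)
    # Merges 0..cut happen below the cut, leaving n_points - (cut + 1) clusters.
    n_clusters = n_points - (cut + 1)
    cut_height = (heights[cut] + heights[cut + 1]) / 2
    return n_clusters, cut_height

# Hypothetical toy data: two nearby pairs and one distant pair.
points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0), (20, 1)]
heights = average_linkage_heights(points)
n_clusters, cut_height = clusters_from_largest_gap(heights)
print("merge heights:", [round(h, 2) for h in heights])
print("suggested clusters:", n_clusters, "cut height:", round(cut_height, 2))
```

Because the widest gap here lies between the last two merges, the heuristic suggests two clusters: the distant pair on its own and the two nearby pairs together.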
Comparing K-means and hierarchical clustering gives a fuller view of the data's underlying structure. Ideally, both methods should suggest a similar number of clusters, reinforcing the robustness of the analysis. However, discrepancies can occur because the algorithms rest on different assumptions and work differently. K-means assumes clusters are roughly spherical and tends to find clusters of similar sizes, whereas hierarchical clustering, depending on the linkage criterion, can accommodate a wider variety of cluster shapes and sizes.
Specific Outcomes
K-means clustering was applied to datasets from NewsAPI, Reddit, and Medium, exploring various values of k to identify the optimal number of clusters:
NewsAPI Dataset:
K-means clustering was performed with k values of 5, 6, and 7, yielding Silhouette scores of 0.137, 0.150, and 0.146, respectively. The k value of 6 is therefore the best for this data: it has the highest Silhouette score, meaning objects were most similar to their own cluster and least similar to neighboring clusters.
Hierarchical clustering was also performed with the same k values; the dendrogram suggested k = 5, at a height of about 0.7 in the hierarchical structure.
Reddit Dataset:
For the Reddit data’s K-means clustering analysis, the k values of 3, 4, and 5 were tested. The Silhouette scores, in the same order, were 0.459, 0.478, and 0.442; the k value of 4 is therefore optimal for the Reddit data, as it has the highest Silhouette score.
Hierarchical clustering was also performed on the same values of k, and the analysis recommended k = 3, at a dendrogram height of about 0.8.
Medium Dataset:
The Medium analysis included clustering with k = 2, k = 3, and k = 4, with Silhouette scores of 0.304, 0.290, and 0.164, respectively. These scores show that the optimal k value for the Medium data is 2, which gives the greatest within-cluster similarity and separation from neighboring clusters.
Hierarchical clustering was also run on the same k values, and k = 2 was optimal, at a dendrogram height of about 0.9.
The two techniques did not always agree. In the NewsAPI dataset, hierarchical clustering suggested a smaller optimal k value than K-means (5 versus 6), potentially indicating more generalized groupings or broader categories within the data. This might reflect the way hierarchical clustering represents finer relationships as nested sub-branches rather than as separate top-level clusters. For the Reddit dataset, hierarchical clustering likewise recommended a smaller k value than K-means (3 versus 4), suggesting that Reddit discussions might naturally form a few broad groups, with subtler distinctions captured lower in the hierarchy. Lastly, in the Medium dataset, both methods agreed on k = 2, indicating clear and distinct clustering that both methods could easily identify.
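The reported figures can be tabulated to make the comparison explicit. The sketch below uses only the Silhouette scores and dendrogram-based k values stated in this section; the variable names are illustrative.

```python
# Silhouette scores reported above, keyed by dataset and then by k.
silhouette_scores = {
    "NewsAPI": {5: 0.137, 6: 0.150, 7: 0.146},
    "Reddit": {3: 0.459, 4: 0.478, 5: 0.442},
    "Medium": {2: 0.304, 3: 0.290, 4: 0.164},
}
# k suggested by the dendrograms, as reported above.
hierarchical_k = {"NewsAPI": 5, "Reddit": 3, "Medium": 2}

# Best k per dataset = the k with the highest average Silhouette score.
kmeans_best_k = {
    source: max(scores, key=scores.get) for source, scores in silhouette_scores.items()
}
agreeing = [s for s in kmeans_best_k if kmeans_best_k[s] == hierarchical_k[s]]

for source, k in kmeans_best_k.items():
    print(f"{source}: K-means best k = {k}, hierarchical k = {hierarchical_k[source]}")
print("methods agree on:", agreeing)
```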
The clustering analyses of the NewsAPI, Reddit, and Medium datasets demonstrate the variability and complexity of public discourse on the security and ethical implications of blockchain technology. Medium articles group into the fewest, broadest categories (k = 2), Reddit discussions fall in between (k = 3 to 4), and NewsAPI articles warranted the greatest number of clusters (k = 5 to 6), pointing to a wider diversity of topics or viewpoints. These findings underline the complicated nature of blockchain discussions and also showcase the usefulness of clustering techniques in categorizing text data for deeper insights. The differing optimal cluster numbers indicate that discourse has different characteristics on different media platforms, reflecting platform-specific engagement with and perspectives on blockchain technology.