Results of K-means Clustering for K=3,4,5:
This graph shows the clustering obtained when K is set to 5. As we can see, there is noticeable overlap between the clusters, and in the Google Colab notebook attached with the code (here) there are clusters in which conjunctions and nouns are repeated many times. Clearly, 5 is not the right number of clusters for this data. Both the elbow and silhouette methods (as can be seen here) indicate that the right value for K is 3. Reducing K by 1 on each trial, the clusters form progressively better, and at K=3 the clustering is most accurate. The clusterings for the three values of K (3, 4, and 5) can be compared in the plots below, shown in the order K=5, K=4, K=3.
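The elbow criterion used above can be sketched in a few lines of NumPy. This is an illustrative toy example, not the project's actual notebook: the blobs stand in for the TF-IDF article vectors, and the deterministic farthest-first seeding is my simplification of the k-means++-style initialization a library such as scikit-learn would use.

```python
import numpy as np

def farthest_first(X, k):
    """Deterministic seeding: start from X[0], then repeatedly add the
    point farthest from all centroids chosen so far."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(X[:, None] - np.array(centroids)[None], axis=2).min(axis=1)
        centroids.append(X[d.argmax()])
    return np.array(centroids)

def kmeans_wcss(X, k, iters=100):
    """Lloyd's algorithm; returns the within-cluster sum of squares."""
    centroids = farthest_first(X, k)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Update step: each centroid moves to the mean of its points.
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((X - centroids[labels]) ** 2).sum())

# Toy 2-D data: three well-separated blobs standing in for article vectors.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in [(0, 0), (5, 5), (0, 5)]])

curve = {k: kmeans_wcss(X, k) for k in range(2, 7)}
# WCSS falls steeply up to k=3 and flattens afterwards -- the "elbow".
```

Plotting `curve` against K gives the elbow plot: the WCSS drops sharply until the true number of clusters is reached and only marginally beyond it, which is the same signal that pointed to K=3 for the news data.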
Results of Hierarchical Clustering:
The dendrogram shown here helps visualize the hierarchical relationships among news articles based on their textual content. First, textual features such as word frequencies or TF-IDF scores were extracted from the articles and used to compute the similarity (or dissimilarity) between pairs of articles. Hierarchical clustering was then applied to group similar articles together, producing a dendrogram in which each leaf represents an article and each internal node represents a cluster of articles. By inspecting the dendrogram, we can identify clusters of articles with similar content, potentially revealing groups of fake news articles that share common characteristics; it also provides a visual aid for understanding the structure of the clustered data. The cut height h is chosen to be approximately 0.92; drawing a horizontal line at this height gives the best result, selecting 3 clusters. This matches the result obtained with the K-means clustering algorithm.
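The "horizontal line at height h" idea can be sketched as follows. The project's dendrogram was built in R on document distances (where h ≈ 0.92); this is a simplified Python illustration using single linkage on toy Euclidean points, with an arbitrary cut height chosen to sit between the within-group and between-group distances.

```python
import numpy as np

def agglomerative_cut(X, h):
    """Single-linkage agglomerative clustering, stopped at cut height h.

    Mirrors drawing a horizontal line across the dendrogram at height h:
    merging continues only while the closest pair of clusters is nearer
    than h, and the clusters that survive the cut are returned.
    """
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > 1:
        # Closest pair of clusters (single linkage = minimum pairwise distance).
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > h:              # the horizontal line: stop merging here
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy points: three tight groups separated by gaps much larger than the cut.
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
              [5.1, 5.0], [0.0, 5.0], [0.1, 5.1]])
groups = agglomerative_cut(X, h=1.0)
# len(groups) == 3 -- the cut recovers three clusters, as on the real data.
```

In practice this is what `scipy.cluster.hierarchy.fcluster` (or R's `cutree` with a height argument) does far more efficiently; the sketch only makes the cut-at-h logic explicit.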
Comparison between K-Means and Hierarchical Clustering:
K-means clustering and hierarchical clustering are two prominent techniques in the field of unsupervised learning, each with distinct characteristics and applications. K-means clustering partitions a dataset into a predefined number of clusters, aiming to minimize the within-cluster sum of squares. It operates by iteratively assigning data points to the nearest cluster centroid and updating centroids based on the mean of the data points assigned to each cluster. K-means is computationally efficient and works well with large datasets, making it suitable for scenarios where the number of clusters is known in advance and computational efficiency is a priority. However, K-means is sensitive to the initial choice of centroids and is susceptible to local optima, which may result in suboptimal clustering solutions.
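The assign-then-update loop and its sensitivity to initialization can both be shown on a tiny hand-picked example. This is a didactic sketch, not the project's pipeline: four points at the corners of a wide rectangle, where one starting position converges to the natural grouping and another is a fixed point of the update rule and gets stuck in a local optimum.

```python
import numpy as np

def lloyd(X, centroids, iters=100):
    """Lloyd's iterations from a given set of starting centroids.
    Returns the final within-cluster sum of squares (WCSS)."""
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = np.linalg.norm(X[:, None] - centroids[None], axis=2).argmin(axis=1)
        # Update step: each centroid becomes the mean of its points.
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((X - centroids[labels]) ** 2).sum())

# Four points at the corners of a wide rectangle; the natural k=2
# grouping pairs the points by x-coordinate (left pair vs right pair).
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# A good init converges to the natural grouping (low WCSS) ...
good = lloyd(X, np.array([[0.0, 0.5], [4.0, 0.5]]))
# ... while an init along the long axis never moves: the update rule maps
# it to itself, so Lloyd's algorithm stops in a local optimum (high WCSS).
bad = lloyd(X, np.array([[2.0, 0.0], [2.0, 1.0]]))
# good < bad: same data, same k, different result -- initialization matters.
```

This is why practical implementations restart K-means from several random initializations (e.g. scikit-learn's `n_init`) or use smarter seeding such as k-means++, keeping the run with the lowest WCSS.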
On the other hand, hierarchical clustering organizes data into a hierarchical tree-like structure, known as a dendrogram, by recursively merging or dividing clusters based on their similarity or dissimilarity. Agglomerative and divisive are the two main approaches to hierarchical clustering, with the former starting with individual data points as clusters and merging them iteratively, while the latter begins with all data points in a single cluster and divides them recursively. Hierarchical clustering does not require a predefined number of clusters, allowing for a more flexible and adaptive clustering process. It is particularly useful for exploratory data analysis and visualization, as the dendrogram provides insights into the hierarchical relationships and structures within the data. However, hierarchical clustering can be computationally intensive, especially for large datasets, and its results may be sensitive to the choice of distance metric and linkage method.
The choice between K-means and hierarchical clustering depends on the specific characteristics of the dataset and the goals of the analysis. K-means clustering is preferred when the number of clusters is known in advance, and computational efficiency is crucial. It is suitable for large datasets and can handle high-dimensional data efficiently. In contrast, hierarchical clustering is advantageous when the number of clusters is unknown or when exploring hierarchical relationships within the data is desired. It is useful for gaining insights into the structure and organization of the data, particularly in cases where the underlying data distribution is complex or when interpretability is paramount. Overall, both K-means and hierarchical clustering are valuable tools in clustering analysis, each offering distinct advantages and applications depending on the specific requirements of the task at hand.
Conclusion:
To sum it all up, I have determined that three clusters, as indicated by both the elbow and silhouette methods, is optimal for the task at hand. This decision was further reinforced by the implementation of hierarchical clustering in R (despite some density in the dendrogram, likely due to the large volume of data). The convergence of the K-means and hierarchical clustering results not only validates the chosen approach but also underscores the robustness of the clustering methodology employed in this project.
Moving forward, these findings provide a solid foundation for the development of more refined fake news detection algorithms. By partitioning the dataset into three distinct clusters, I can now explore the distinguishing characteristics of each cluster and identify key features associated with fake news articles. Additionally, the hierarchical clustering results offer valuable insights into the hierarchical relationships and structures within the data, guiding further investigation and analysis. Overall, these clustering analyses pave the way for more targeted and effective strategies.