Clustering is a fundamental technique in machine learning and data analysis that aims to organize unlabeled data into groups, or clusters, based on the inherent similarities among data points. It is an unsupervised learning method, meaning that it does not require labeled data to train a model. Instead, clustering algorithms explore the structure of the data and group similar data points together while maximizing the dissimilarity between different clusters. The primary goal of clustering is to uncover hidden patterns or structures within the data, which can provide valuable insights into the underlying characteristics of the dataset.
At its core, clustering relies on the concept of distance or similarity measures to quantify the similarity between data points. Various distance metrics, such as Euclidean distance, Manhattan distance, or cosine similarity, can be used to measure the distance between data points in the feature space. These metrics tell a clustering algorithm how close or far apart data points are from one another, forming the basis for grouping them into clusters. By iteratively optimizing the assignment of data points to clusters, clustering algorithms aim to find a partitioning of the data that maximizes intra-cluster similarity and minimizes inter-cluster dissimilarity.
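As a concrete illustration of the three metrics named above, the following sketch computes each of them for two small hypothetical feature vectors using plain NumPy:

```python
import numpy as np

# Two illustrative feature vectors (hypothetical data points).
a = np.array([1.0, 0.0, 2.0])
b = np.array([0.0, 1.0, 2.0])

# Euclidean distance: straight-line distance in the feature space.
euclidean = np.sqrt(np.sum((a - b) ** 2))

# Manhattan distance: sum of absolute coordinate differences.
manhattan = np.sum(np.abs(a - b))

# Cosine similarity: angle-based similarity, insensitive to vector magnitude.
cosine_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine_sim)
```

Note the asymmetry in interpretation: the first two are distances (smaller means more similar), while cosine similarity is a similarity (larger means more similar), which matters when choosing a metric for a given algorithm.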
In the scope of this project, clustering can be a valuable tool for uncovering underlying themes, patterns, and sentiments within the data. By applying clustering algorithms to these datasets, the aim is to identify distinct clusters of articles or comments that share similar topics, opinions, or sentiments. For instance, clustering news articles can help identify common themes or narratives prevalent in media coverage of climate change, such as discussions on policy, scientific research, environmental activism, or climate-related events. Similarly, clustering Reddit comments can reveal different community perspectives, ranging from debates on climate science and policy to discussions on climate change mitigation and adaptation strategies.
Using clustering on the gathered datasets, the goal is to gain insights into the diversity of viewpoints, opinions, and narratives surrounding the topic. By clustering similar articles or comments together, the project work can identify common trends, emerging topics, and areas of consensus or contention within the discourse on climate change. Additionally, clustering can help in summarizing large volumes of text data, making it more manageable and interpretable for further analysis.
Clustering algorithms, unlike many other machine learning models, operate solely on unlabeled numeric data. This means that clustering algorithms do not require any predefined labels or target variables to train the model. Instead, they identify natural groupings, or clusters, based on the intrinsic structure of the data itself.
In clustering, the input data is usually represented as a set of numeric features, where each data point corresponds to a vector in a multidimensional space. These features can represent various attributes or characteristics of the data points, such as numerical measurements, frequencies, or other quantitative representations. By representing the data in this numeric format, clustering algorithms can calculate distances or similarities between data points, which are then used to group similar data points together into clusters.
The numeric nature of the data is crucial for clustering algorithms to compute distances or similarities between data points accurately. Common distance metrics used in clustering include Euclidean distance, Manhattan distance, and cosine similarity, among others. These distance metrics quantify the dissimilarity or similarity between data points based on their feature values. By comparing these distances or similarities, clustering algorithms can identify clusters of data points that are close to each other in the feature space, indicating similarity in their underlying characteristics. Therefore, the numeric representation of the data is fundamental for clustering algorithms to effectively uncover meaningful patterns and structures within the data.
Since text data is inherently non-numeric, it needs to be transformed into a numerical format in which each word or term is represented by a numeric value. This numeric representation enables clustering algorithms to measure the similarity or dissimilarity between text documents based on the frequency or occurrence of words. Figure 3 shows a sample of the data that was further processed and then used for clustering.
Also, converting text data to numeric format allows one to leverage the rich information contained within the text for clustering analysis. Various techniques can be used to convert text data into numeric vectors, such as bag-of-words, TF-IDF (Term Frequency-Inverse Document Frequency), word embeddings, or topic modeling. These techniques capture different aspects of the textual information, such as word frequencies, semantic meanings, or topic distributions, and represent them as numerical features. By encoding text data in this way, clustering algorithms can identify clusters based on the underlying themes, topics, or similarities in the content of the text documents.
K-means
First, we calculated the silhouette scores for different values of k and visualized them to help determine the optimal number of clusters. The silhouette score ranges from -1 to 1, where a higher score indicates better-defined clusters.
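The selection procedure described above can be sketched as follows. This is a minimal illustration on synthetic blob data (a stand-in for the project's document vectors), not the project's actual pipeline:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the document vectors: three well-separated blobs.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)

# Compute the silhouette score for each candidate k; it ranges from -1 to 1,
# and a higher score indicates better-defined clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Plotting `scores` against `k` gives the curve used to pick the optimal number of clusters; on this synthetic data the peak falls at the true number of blobs.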
Once we determined the optimal k value, we visualized the clustering results using t-SNE (t-distributed Stochastic Neighbor Embedding) for dimensionality reduction and scatter plots.
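The visualization step can be sketched as below: cluster in the original high-dimensional space, then project to two dimensions with t-SNE purely for plotting. Synthetic data stands in for the real document vectors, and k = 7 mirrors the value chosen later:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# High-dimensional synthetic data standing in for the document vectors.
X, _ = make_blobs(n_samples=200, centers=7, n_features=50, random_state=0)

# Cluster in the original space; t-SNE is used only for visualization.
labels = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)
embedding = TSNE(
    n_components=2, perplexity=30, init="random", random_state=0
).fit_transform(X)

print(embedding.shape)  # one 2-D point per document
```

A scatter plot of `embedding[:, 0]` against `embedding[:, 1]`, colored by `labels`, produces figures of the kind described below; note that t-SNE distorts global distances, so the plots should be read for neighborhood structure rather than absolute positions.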
As the number of clusters (k) increases, the silhouette score generally improves. This upward trend suggests that a larger number of clusters yields better-defined groupings. Based on this, we chose k = 7.
The scatter plots in Figures 5-9 depict the outcome of applying t-SNE to the high-dimensional data. Each dot represents a data point, and its position on the plot reflects its t-SNE features. Figures 5 and 6 show two plots side by side: one with three clusters (k = 3) and the other with four clusters (k = 4).
The different color dots correspond to distinct clusters. As one moves from left to right along the x-axis (t-SNE Feature 1), the data points transition from one cluster to another. The t-SNE algorithm has grouped similar data points together, emphasizing their intrinsic relationships. By increasing the number of clusters, we gain finer granularity in capturing data patterns.
The choice of the number of clusters (k) impacts the granularity of our analysis. Fewer clusters (k = 3) provide a broader view, while more clusters (k = 7) allow us to capture finer nuances.
Figure 7 below showcases the cluster distribution with the optimal value of k = 7.
Hierarchical Clustering
A dendrogram is a tree-like diagram that illustrates the arrangement of clusters produced by hierarchical clustering. In Figure 10, each article is represented by a blue line at the base of the dendrogram, and the red lines connect successive levels of merged clusters, showing how the articles are grouped based on similarity.
In Figure 11, data points are grouped based on their similarities. Each data point is represented by a vertical line at the base of the dendrogram; the labels are not shown due to their sheer volume. The height along the Y-axis represents the distance between clusters: lower heights indicate more similar clusters, while higher heights indicate dissimilar clusters. The red, blue, and green lines connect successive levels of merged data points, showing how they are grouped based on similarity, with different colors highlighting distinct primary clusters before they merge into larger ones.
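The tree structure behind such dendrograms can be sketched with SciPy. Here, synthetic blob data stands in for the document vectors; Ward linkage builds the merge tree, and cutting it at a chosen number of clusters yields flat cluster labels:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic stand-in for the document vectors (hypothetical data).
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.7, random_state=0)

# Ward linkage produces the merge tree that a dendrogram visualizes;
# scipy.cluster.hierarchy.dendrogram(Z) would draw the tree itself.
Z = linkage(X, method="ward")

# Cut the tree into a fixed number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(sorted(set(labels)))
```

The heights stored in `Z` correspond to the Y-axis of the dendrogram described above, so the cut level directly controls how many clusters survive.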
Both k-means and hierarchical clustering suggest an optimal value of k = 7.
Through application of clustering techniques, we were able to identify distinct clusters of articles and comments that share common themes, sentiments, and topics. These clusters revealed the multifaceted nature of discussions on climate change, encompassing a wide range of topics such as policy, scientific research, activism, and mitigation strategies. Moreover, the clustering analysis helped to summarize and organize large volumes of text data, facilitating a deeper understanding of the complex discourse surrounding climate change.
Furthermore, the clustering results shed light on the various viewpoints and stances prevalent within the discourse on climate change. By grouping similar articles and comments together, we were able to discern patterns of consensus and contention, as well as emerging topics and trends. This information is invaluable for policymakers, researchers, and stakeholders seeking to navigate the complex landscape of climate change discussions and formulate informed decisions and strategies.