Association rule mining is another important technique in data analysis, particularly in market basket analysis and recommendation systems. It aims to discover interesting relationships, or associations, between items in large datasets. Like clustering, association rule mining is a form of unsupervised learning, but rather than grouping similar records it focuses on identifying frequent itemsets and generating rules that capture the relationships between items. The best-known algorithm for association rule mining is Apriori, which searches for frequent itemsets efficiently by exploiting the downward closure property of support: every subset of a frequent itemset must itself be frequent, so any candidate containing an infrequent subset can be discarded without counting it.
At its core, association rule mining relies on measures such as support, confidence, and lift to evaluate the strength and significance of discovered rules.
- Support measures how frequently an itemset occurs in the dataset.
- Confidence measures the conditional probability that the consequent of a rule occurs given that its antecedent is present.
- Lift quantifies the degree of association between the two sides of a rule beyond what would be expected by chance; a lift of 1 indicates independence.
By analyzing these measures, association rule mining helps uncover meaningful patterns and dependencies between items, which can be leveraged for various applications such as market basket analysis, cross-selling, and recommendation systems.
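To make the three measures concrete, the following minimal sketch computes them over a handful of toy transactions. The transactions and the helper names `support`, `confidence`, and `lift` are illustrative assumptions and are not drawn from our dataset.

```python
# Toy transactions (hypothetical), each represented as a set of items.
transactions = [
    {"climate", "policy", "carbon"},
    {"climate", "policy"},
    {"climate", "carbon"},
    {"policy", "energy"},
    {"climate", "policy", "energy"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate of P(rhs | lhs): support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


def lift(lhs, rhs, transactions):
    """Confidence relative to the baseline support of rhs (1.0 = independence)."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)


print(support({"climate", "policy"}, transactions))
print(confidence({"climate"}, {"policy"}, transactions))
print(lift({"climate"}, {"policy"}, transactions))
```

In this toy data the rule climate -> policy has a confidence of 0.75 but a lift slightly below 1 (about 0.94), because "policy" is already present in 80% of the transactions. This is why confidence alone can overstate an association when the consequent is common.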
In leveraging Association Rule Mining (ARM) to glean insights from our dataset, the approach is to identify frequent itemsets and extract meaningful associations between items. First, the dataset is preprocessed to extract relevant textual features, which serve as the basis for itemset generation. This step transforms the textual data into a format suitable for ARM, such as representing each article or comment as a set of words or phrases. ARM algorithms such as Apriori or FP-Growth are then applied to the preprocessed dataset to discover patterns of co-occurrence among items. These patterns are the frequent itemsets: combinations of words or phrases that often occur together in the corpus of articles and comments.
By analyzing the association rules generated from the frequent itemsets, valuable insights into the relationships between various topics, themes, or sentiments related to climate change can be discovered. For instance, association rules may reveal interesting connections between different aspects of climate change discourse, such as the correlation between discussions on environmental activism and policy advocacy, or the co-occurrence of scientific research findings with public perception debates. Furthermore, the support, confidence, and lift metrics associated with each rule will provide quantitative measures of the strength and significance of the discovered associations, allowing for the prioritization of meaningful insights.
Unlike many machine learning models, which need labeled data for training, ARM operates exclusively on unlabeled transaction data. The same holds for our data: ARM requires only transactions, each of which is a set of items or elements. In this case, the transactions can be constructed from the textual content of the news articles and comments, with each item representing a word, phrase, or concept extracted from the text.
The essence of ARM lies in identifying frequent itemsets and generating association rules from them. A transaction in this context corresponds to a single news article or Reddit comment, represented as a set of words or phrases extracted from the text. These transactions are inherently unlabeled because they do not require any predefined categories or labels. Instead, ARM algorithms such as Apriori or FP-Growth analyze the transactional data to identify sets of items that frequently co-occur. These frequent itemsets represent combinations of words or concepts that tend to appear together across multiple articles or comments, indicating potential associations or patterns within the dataset.
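As a rough illustration of how an algorithm such as Apriori finds frequent itemsets, the sketch below implements a simplified level-wise search in plain Python. It is not the implementation used for our results; the function name `apriori`, the toy transactions, and the 0.4 support threshold are assumptions chosen for readability.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the threshold.
        kept = {}
        for c in candidates:
            s = sum(c <= t for t in transactions) / n
            if s >= min_support:
                kept[c] = s
        return kept

    # Level 1: single items.
    level = frequent({frozenset([i]) for t in transactions for i in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step (downward closure): every (k-1)-subset must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


# Hypothetical toy transactions, not taken from our corpus.
toy = [
    {"climate", "policy", "carbon"},
    {"climate", "policy"},
    {"climate", "carbon"},
    {"policy", "energy"},
    {"climate", "policy", "energy"},
]
frequent_itemsets = apriori(toy, min_support=0.4)
for itemset, s in sorted(frequent_itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), round(s, 2))
```

The prune step is where the saving comes from: a three-word candidate is never counted against the data if any of its two-word subsets already failed the support threshold.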
By focusing solely on the transactional structure of the data, ARM enables the discovery of implicit relationships and dependencies between different elements present in the dataset. This approach is particularly well-suited for analyzing text data, where the relationships between words, phrases, or topics may not be explicitly defined or labeled. For our project, ARM can reveal interesting associations between various themes, topics, or sentiments expressed in the text, shedding light on the underlying patterns and relationships within the discourse surrounding climate change.
As part of data preparation for ARM, we merged the titles and descriptions to capture a comprehensive view of each article's content (Figure 2). After combining the text, we cleaned it by removing punctuation, numbers, and common stopwords so that the most meaningful words and phrases remained. We then transformed the text into numerical form using count vectorization, which represents each document as a vector of word counts. This converted the textual information into a format suitable for analysis by our ARM algorithm.
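The preparation steps above can be sketched with the standard library alone. This is a simplified stand-in and not our actual pipeline: the stopword list, the function name `to_transaction`, and the two example articles are hypothetical, and word counts are collapsed to sets on the assumption that ARM only needs to know whether a word is present.

```python
import re

# Hypothetical, deliberately short stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "for", "is", "at"}


def to_transaction(title, description):
    """Merge title and description, clean the text, and return a set of words."""
    text = f"{title} {description}".lower()
    # Delete everything except letters and whitespace (punctuation, digits).
    text = re.sub(r"[^a-z\s]", "", text)
    return {tok for tok in text.split() if tok not in STOPWORDS and len(tok) > 1}


# Made-up example articles as (title, description) pairs.
articles = [
    ("Kerry's climate plan", "The plan targets 50% cuts in emissions."),
    ("Natural gas prices rise", "Gas demand is up in 2023."),
]
transactions = [to_transaction(title, desc) for title, desc in articles]
for t in transactions:
    print(sorted(t))
```

Note that deleting apostrophes outright turns "Kerry's" into "kerrys". Cleaning of this kind is one plausible way possessive tokens end up as items in the rules, although the exact cleaning rules we applied may differ from this sketch.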
Next, we examined the top features, that is, the most frequent words in our dataset, to gain insight into the prevalent themes and topics discussed in the news articles. Identifying these key features helped us understand which words and phrases were used most across the articles (Figure 3). Based on this analysis, we selected a subset of the most relevant features to create a DataFrame, which served as input for our ARM algorithm (Figure 4). This curated dataset contained the information needed to uncover associations or patterns within the news articles, allowing us to explore connections between different topics, events, or trends related to climate change.
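A minimal sketch of the feature-selection step follows, assuming that "top features" means the words appearing in the most documents. The toy transactions, the cutoff `top_n = 3`, and the variable names are hypothetical; the rows are plain dictionaries standing in for the DataFrame described above.

```python
from collections import Counter

# Hypothetical toy transactions (one set of words per document).
transactions = [
    {"climate", "policy", "carbon"},
    {"climate", "policy"},
    {"climate", "carbon"},
    {"policy", "energy"},
    {"climate", "policy", "energy"},
]

# Document frequency: in how many transactions does each word appear?
doc_freq = Counter(word for t in transactions for word in t)

# Keep the top_n words; ties are broken alphabetically so the result is stable.
top_n = 3
top_features = sorted(doc_freq, key=lambda w: (-doc_freq[w], w))[:top_n]

# One row per document, one boolean column per selected feature.
rows = [{w: (w in t) for w in top_features} for t in transactions]

print(top_features)
for row in rows:
    print(row)
```

A table of this shape (documents as rows, selected words as true/false columns) is the typical input that frequent-itemset routines expect.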
After implementing Association Rule Mining (ARM) on our dataset, we gained valuable insights into the relationships and patterns present in the data. ARM allowed us to uncover frequent itemsets and association rules, providing a deeper understanding of the topics, themes, and sentiments prevalent in the discussions surrounding climate change.
The results of ARM revealed several interesting findings. Firstly, we identified frequent itemsets, which are combinations of terms or phrases that frequently co-occur within the dataset. These itemsets represent common topics or themes discussed in the news articles and Reddit comments related to climate change. By examining these frequent itemsets, we were able to identify key topics of interest and areas of focus within the discourse.
Additionally, ARM generated association rules that describe relationships between different terms or concepts in the dataset. These association rules provide insights into the associations and dependencies between various topics, sentiments, or viewpoints expressed in the news articles and Reddit comments. For example, we discovered association rules indicating that discussions about climate policy often co-occur with debates on scientific research findings, or that sentiments expressed in Reddit comments regarding climate change mitigation strategies are associated with specific environmental initiatives.
Thresholds used: minimum support = 22%, minimum confidence = 60%.
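To show how these two thresholds act as filters, the sketch below generates rules from toy transactions using a brute-force enumeration of itemsets (not Apriori or FP-Growth, and not our actual code). Only the values 0.22 and 0.60 come from the text; the transactions and all names are assumptions.

```python
from itertools import combinations

MIN_SUPPORT = 0.22      # minimum support reported in the text
MIN_CONFIDENCE = 0.60   # minimum confidence reported in the text

# Hypothetical toy transactions, not taken from our corpus.
transactions = [
    {"climate", "policy", "carbon"},
    {"climate", "policy"},
    {"climate", "carbon"},
    {"policy", "energy"},
    {"climate", "policy", "energy"},
]


def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


items = sorted(set().union(*transactions))
rules = []
for size in (2, 3):
    for combo in combinations(items, size):
        itemset = frozenset(combo)
        s = support(itemset)
        if s < MIN_SUPPORT:
            continue  # itemset is not frequent enough to yield rules
        # Split the itemset into every non-empty lhs and its complement rhs.
        for r in range(1, size):
            for lhs in combinations(combo, r):
                lhs = frozenset(lhs)
                rhs = itemset - lhs
                conf = s / support(lhs)
                if conf >= MIN_CONFIDENCE:
                    rules.append({
                        "lhs": lhs, "rhs": rhs, "support": s,
                        "confidence": conf, "lift": conf / support(rhs),
                    })

for rule in rules:
    print(sorted(rule["lhs"]), "->", sorted(rule["rhs"]),
          round(rule["confidence"], 2), round(rule["lift"], 2))
```

With a support floor as high as 22%, only words that appear in a large share of documents can take part in any rule, which tends to favor short rules built from very common terms.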
Figure 5 shows the top 15 rules generated, sorted by confidence. These results are not very promising, as the rules generated are single-valued.
Figure 6 shows the top 15 rules generated, sorted by lift. These results are better than those in Figure 5.
Figure 7 shows the top 15 rules generated, sorted by support. Largely the same kinds of words appear across all three figures.
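The three rankings differ only in the sort key. A small sketch, assuming the rules are held as a list of dictionaries; the rules and their metric values below are made up for illustration and are not our results, and `top_rules` is a hypothetical helper.

```python
# Hypothetical rules with made-up metric values (not our results).
rules = [
    {"rule": "a -> b", "support": 0.40, "confidence": 0.70, "lift": 1.8},
    {"rule": "b -> c", "support": 0.25, "confidence": 0.95, "lift": 1.1},
    {"rule": "c -> a", "support": 0.30, "confidence": 0.65, "lift": 2.4},
]


def top_rules(rules, metric, n=15):
    """Return the n rules with the highest value of the given metric."""
    return sorted(rules, key=lambda r: r[metric], reverse=True)[:n]


for metric in ("confidence", "lift", "support"):
    print(metric, [r["rule"] for r in top_rules(rules, metric)])
```

Even in this tiny example each metric puts a different rule first, so comparing the three views side by side is a reasonable way to judge which rules deserve attention.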
Figure 8 displays a graph. On the left side, there are six terms listed. Orange arrows connect each term to various points in the rhs section of the graph. Each orange arrow represents an association or correlation between the lhs (left-hand side) term and the rhs (right-hand side) term. For example, if we consider the term "kerrys," it is associated with various rhs elements.
Figure 9 is a network diagram for the created rules. The diagram consists of nodes represented by circles. The two most prominent nodes are labeled "john" and "kerrys." Smaller nodes include "eco," "first," and "federal." On the right side, there is another cluster of nodes, including "gas," "natural," "bidens," "house," and "digital." The nodes are connected to each other on the basis of the lift, confidence, and support metrics.
Figure 10 is also a network diagram for the created rules. The diagram consists of nodes represented by circles. As stated, most of the results from ARM are similar, with politics and agenda keywords coming out on top as part of "climate change" analysis.
Figure 11 below is a snippet of an Interactive Network graph.
It visualizes the association rules discovered through Association Rule Mining (ARM) in a graphical format. This plot provides an intuitive representation of the relationships between different items or itemsets present in the dataset.
In this plot, each node represents an item or an itemset, and the edges between nodes represent the association rules discovered by the ARM algorithm. The color of an edge indicates the strength or confidence of the association between the items. Nodes that are more strongly connected, or that co-occur more frequently in the dataset, are placed closer together, while nodes with less significant associations are placed farther apart.
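The structure behind such a plot can be sketched as a weighted directed graph. The example below uses only the standard library and made-up single-item rules; it is an assumption about how rules might be turned into nodes and edges, not the code that produced our figures.

```python
from collections import defaultdict

# Hypothetical rules as (lhs item, rhs item, confidence); not our results.
rules = [("a", "b", 0.9), ("a", "c", 0.7), ("d", "b", 0.8)]

# Directed edges lhs -> rhs, weighted by confidence (could drive edge color).
edges = defaultdict(dict)
for lhs, rhs, conf in rules:
    edges[lhs][rhs] = conf

# Nodes are all items that appear on either side of a rule.
nodes = sorted({item for lhs, rhs, _ in rules for item in (lhs, rhs)})

# Degree counts how many rules touch each node (could drive node size).
degree = {node: 0 for node in nodes}
for lhs, rhs, _ in rules:
    degree[lhs] += 1
    degree[rhs] += 1

print(dict(edges))
print(degree)
```

A plotting library could then map the edge weights to colors and the degrees to node sizes, which is roughly why items that take part in many rules stand out as the most prominent nodes.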
This interactive visualization allows users to explore the association rules in the dataset dynamically. Users can hover over nodes or edges to view additional information such as support, confidence, or lift values associated with each rule. They can also zoom in or out, pan across the plot, or filter the display to focus on specific subsets of rules or items of interest.
Using Association Rule Mining (ARM) did not prove to be a particularly valuable approach for extracting meaningful insights from our dataset of news articles and Reddit comments on climate change. It did help us identify frequent itemsets and uncover associations between items; however, we were unable to gain a significantly deeper understanding of the underlying patterns within the discourse on climate change. By preprocessing the dataset and transforming textual features into formats suitable for ARM, such as sets of words or phrases, we attempted to discover co-occurrence patterns among different topics and themes, but the effort was not fruitful.
The analysis of association rules did provide us with some valuable insights into the relationships between various aspects of climate change discourse. We learned about the interconnectedness of different topics, such as the correlation between environmental activism and policy advocacy, as well as the co-occurrence of scientific research findings with public perception debates. Additionally, the support, confidence, and lift metrics associated with each rule allowed us to quantify the strength and significance of the discovered associations, enabling us to prioritize meaningful insights for further exploration and analysis.