Association Rule Mining (ARM) is a data mining technique used to discover interesting relations between variables in large databases. It identifies strong rules in the data using different measures of interestingness. The primary purpose of ARM is to find frequent patterns, associations, or causal structures among sets of items or objects in transaction datasets.
Key Measures in ARM
Support: Support provides insight into the frequency of occurrence of an itemset within the total number of transactions. It helps identify how often a particular item or set of items appears in the dataset. The higher the support, the more frequent the itemset is. Support is crucial in filtering out less relevant or rare itemsets that may not offer actionable insights.
Confidence: Confidence measures the reliability of the rule {A} → {B}, i.e., how often item B appears in transactions that contain item A. Confidence is an indicator of the strength of the association. It helps in identifying how likely it is that items will be purchased together, offering valuable insights for strategies like cross-selling.
Lift: Lift evaluates the strength of an association rule by measuring how much more likely item B is to be purchased when item A is present, compared to how often B would be purchased independently. Lift helps assess whether the occurrence of item A influences the occurrence of item B. A lift value of 1 means A and B occur independently, while a value greater than 1 indicates a positive association between A and B, with larger values indicating a stronger association.
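As a minimal illustration of how these three measures are computed, the R sketch below evaluates the rule {Bread, Butter} → {Milk} on a handful of made-up transactions; the items and counts are purely illustrative and are not taken from the dataset analyzed here.

```r
# Toy transactions (illustrative only, not from the analyzed dataset)
transactions <- list(
  c("Bread", "Butter", "Milk"),
  c("Bread", "Butter"),
  c("Bread", "Milk"),
  c("Butter", "Milk"),
  c("Bread", "Butter", "Milk", "Eggs")
)

# Fraction of transactions that contain every item in `items`
support <- function(items) {
  mean(sapply(transactions, function(t) all(items %in% t)))
}

supp_rule <- support(c("Bread", "Butter", "Milk"))  # support of the full itemset
supp_A    <- support(c("Bread", "Butter"))          # support of the antecedent
supp_B    <- support("Milk")                        # support of the consequent

confidence <- supp_rule / supp_A   # how often Milk appears when Bread and Butter do
lift       <- confidence / supp_B  # confidence relative to Milk's baseline frequency

c(support = supp_rule, confidence = confidence, lift = lift)
```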
RULES
Association rules in Association Rule Mining (ARM) represent relationships or patterns within transaction data, expressed in the form {A} → {B}, where A is the antecedent (the "if" part of the rule) and B is the consequent (the "then" part). These rules capture co-occurrence relationships between itemsets, identifying patterns that show if one set of items (A) is present in a transaction, another set (B) is likely to be present in the same transaction as well.
For example, in a retail setting, the rule {Bread, Butter} → {Milk} suggests that customers who buy bread and butter are also likely to purchase milk. This insight can be highly valuable for businesses to drive decision-making processes, such as product placement or targeted promotions.
The Apriori algorithm is one of the most popular algorithms for mining frequent itemsets for boolean association rules. It uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation), and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found.
Apriori uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support (i.e., if an itemset is frequent, all of its subsets must also be frequent) to reduce the size of the candidate itemsets.
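A minimal, hedged sketch of running Apriori in R with the arules package is shown below; the toy transactions and the support/confidence values are illustrative assumptions rather than the settings used later in this analysis.

```r
library(arules)

# Hypothetical transactions coerced into the arules "transactions" class
trans <- as(list(
  c("Bread", "Butter", "Milk"),
  c("Bread", "Butter"),
  c("Butter", "Milk"),
  c("Bread", "Butter", "Milk")
), "transactions")

# Breadth-first mining of rules that satisfy the support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.8))

inspect(rules)  # each rule is listed with its support, confidence, and lift
```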
WORKING OF APRIORI ALGORITHM
1. Frequent Itemset Generation:
Candidate Generation: Initially, the algorithm identifies all individual items in the dataset and calculates their support (the percentage of transactions that contain the item). Items that meet the minimum support threshold are considered frequent itemsets.
Candidate Pruning: In each subsequent iteration, the algorithm generates candidate itemsets one item larger by combining the previously identified frequent itemsets (pairs, then triples, and so on). Candidates containing any infrequent subset are pruned immediately using the downward closure property, and the remaining candidates whose support falls below the minimum threshold are discarded.
This process repeats, increasing the size of the itemsets by 1 at each iteration, until no more frequent itemsets can be generated.
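In arules, this frequent-itemset stage can be run on its own by setting the mining target accordingly; the sketch below assumes the `trans` object from the previous example and an illustrative support threshold.

```r
library(arules)

# Mine only the frequent itemsets (no rules yet), assuming `trans` exists
freq_itemsets <- apriori(trans,
                         parameter = list(supp = 0.1,
                                          target = "frequent itemsets"))

# Show the most frequent itemsets first
inspect(head(sort(freq_itemsets, by = "support"), 10))
```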
2. Association Rule Generation:
Once all frequent itemsets have been identified, the algorithm generates association rules from these itemsets. An association rule is typically written as {A} → {B}, meaning that when itemset A appears, itemset B is likely to appear as well.
For each frequent itemset, the algorithm considers all possible ways to divide it into two non-overlapping subsets: a left-hand side (antecedent) and a right-hand side (consequent). For each possible rule, the algorithm calculates key metrics:
Support: How frequently the itemset appears in the dataset.
Confidence: How often items in B appear in transactions that contain A.
Lift: The ratio of observed support to expected support if A and B were independent.
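One way to carry out this step in R is arules::ruleInduction, which derives rules from previously mined frequent itemsets; the sketch below assumes the `trans` and `freq_itemsets` objects from the earlier examples and an illustrative confidence threshold.

```r
library(arules)

# Induce {A} -> {B} rules from the frequent itemsets, keeping those that
# meet the confidence threshold; support, confidence, and lift are
# reported for each resulting rule
rules <- ruleInduction(freq_itemsets, trans, confidence = 0.8)

inspect(head(sort(rules, by = "confidence")))
```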
3. Pruning by Metrics:
The rules are pruned by applying minimum thresholds for confidence, support, and/or lift. Only rules that meet these thresholds are considered valid and are retained for further analysis.
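With arules, this pruning step can be expressed as a subset operation on the mined rules; a minimal sketch, assuming a `rules` object and illustrative thresholds.

```r
library(arules)

# Keep only rules that clear the chosen confidence and lift thresholds
strong_rules <- subset(rules, subset = confidence >= 0.8 & lift > 1)

inspect(head(sort(strong_rules, by = "lift")))
```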
DATA PREPARATION
The objective of preparing the dataset for Association Rule Mining (ARM) is to transform continuous numerical data into a categorical format that facilitates the discovery of associations between different items.
Binning of Numerical Variables: Numerical variables within the dataset, including ash content, heat content, price, quantity, and sulfur content, were converted into categorical variables using a process known as binning. Binning divides the continuous range of a variable into distinct intervals, subsequently converting these intervals into categories. In this case, quartile-based binning was employed, which divides the data into four parts based on distribution quartiles. These parts correspond to the categories: "Low," "Medium," "High," and "Very High."
Discretization Process: The cut function in R was used to categorize each numerical variable. This method assigns each numerical value to a specific category based on the quartile it falls into. For instance, both the ash content and heat content variables were discretized into the categories "Low," "Medium," "High," and "Very High," according to their respective value ranges.
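A sketch of this quartile-based binning with base R's cut and quantile functions is shown below; the data frame `coal` and column names such as `heat_content` are assumed placeholders rather than the exact names used in the dataset.

```r
# Quartile-based binning of a numeric vector into four ordered categories
bin_by_quartile <- function(x) {
  cut(x,
      breaks = quantile(x, probs = seq(0, 1, 0.25), na.rm = TRUE),
      labels = c("Low", "Medium", "High", "Very High"),
      include.lowest = TRUE)
}

# Assumed data frame `coal` with numeric columns (placeholder names)
coal$ash_content_bin  <- bin_by_quartile(coal$ash_content)
coal$heat_content_bin <- bin_by_quartile(coal$heat_content)
coal$price_bin        <- bin_by_quartile(coal$price)
```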
Creation of Transactional Data: Following the discretization of numerical data, the dataset was transformed into a transactional format, similar to the data used in market basket analysis, which is commonly employed in ARM. Each row in the dataset represents a unique transaction, with categorical descriptions indicating the levels of variables such as sulfur content or coal quantity.
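Once every variable has been discretized, the data frame can be coerced into the transactional format used by arules; a minimal sketch, assuming a data frame `coal_binned` in which every column is categorical.

```r
library(arules)

# All columns must be factors; each row then becomes one transaction
# made up of "variable=level" items
coal_binned[] <- lapply(coal_binned, as.factor)
coal_trans <- as(coal_binned, "transactions")

summary(coal_trans)  # item frequencies and transaction sizes
```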
This process ensures that the data is structured appropriately for ARM, allowing for effective association rule discovery.
RESULTS AND ANALYSIS FROM ARM
The exploration through Association Rule Mining identified significant relationships between different data attributes, focusing on deriving the top rules based on support, confidence, and lift. These metrics are crucial in understanding item associations within large datasets.
Top Rules Identification
Specific thresholds were set to ensure the relevance and strength of the rules derived from the data:
Support Threshold: 0.001 (minimum fraction of transactions in which the itemset must appear)
Confidence Threshold: 0.8 (minimum conditional probability of the consequent B appearing in a transaction given that the antecedent A appears)
Lift Threshold: 1 (rules are retained only if the presence of A makes B at least as likely as it would be on its own; values above 1 indicate a positive association)
Using these thresholds, the top 15 rules for each measure were extracted: support, confidence, and lift. These rules help pinpoint the strongest and most frequent associations within the dataset, providing actionable insights for decision-making.
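A sketch of how the rules can be mined with these thresholds and the top 15 extracted per measure, assuming the `coal_trans` transactions object from the data preparation step.

```r
library(arules)

# Mine rules using the stated support and confidence thresholds
rules <- apriori(coal_trans, parameter = list(supp = 0.001, conf = 0.8))

# Apply the lift threshold, then take the top 15 rules by each measure
rules <- subset(rules, subset = lift > 1)

top_support    <- head(sort(rules, by = "support"),    15)
top_confidence <- head(sort(rules, by = "confidence"), 15)
top_lift       <- head(sort(rules, by = "lift"),       15)

inspect(top_lift)
```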
VISUALIZATIONS AND INSIGHTS
The network graph of the discovered rules illustrates how different items are interconnected based on the association rules, highlighting the most robust connections with thicker, more prominently colored edges.
Support Analysis: Focuses on the frequency of itemsets, showing which items commonly appear together in the dataset.
Confidence Analysis: Provides insights into the reliability of the implications of the rules. High confidence rules indicate a strong likelihood that the consequent item is purchased when the antecedent is present.
Lift Analysis: Examines how much more likely items are to be bought together than expected if they were statistically independent. High lift values indicate a strong, positive association between the itemsets.
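The network-style view described above can be produced with the arulesViz package; a minimal sketch, assuming the `rules` object mined earlier.

```r
library(arulesViz)

# Graph-based plot: items and rules become nodes, with size and color
# mapped to support and lift
plot(head(sort(rules, by = "lift"), 15), method = "graph")
```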
LEARNINGS
The Association Rule Mining (ARM) analysis of the coal dataset uncovered significant insights into the relationships between key attributes such as coal rank, ash content, sulfur content, heat content, and price. The study revealed patterns and associations that are crucial for understanding coal characteristics and distribution across various states.
Key Attribute Associations: The ARM analysis identified strong associations between coal properties. For instance, coal with high ash content and low heat content is commonly found in regions like Texas and Wyoming, highlighting regional variations in coal quality that can impact logistics, transportation, and operational planning for coal-based energy generation.
Influence of Coal Rank on Quality: The analysis clarified the relationships between coal rank (e.g., bituminous, lignite) and other properties, such as sulfur content and price. Lignite coal, characterized by high sulfur and ash content, is consistently linked to lower prices due to its lower energy efficiency and environmental concerns. These insights are valuable for improving coal sourcing and usage decisions.
Insights into Regional Distribution: The results identified state-specific patterns where particular types of coal with attributes like high sulfur content or low heat content are more prevalent. These regional associations can inform logistics strategies, compliance with environmental regulations, and long-term coal supply chain planning.
Improved Understanding of Market Dynamics: The analysis highlighted how market variables such as price and quantity relate to coal quality. High-sulfur coal was frequently associated with lower prices and larger quantities, indicating a correlation between environmental regulations, demand, and pricing.
Support for Sustainable Energy Transition: The associations between coal quality and price offer insights that can aid in the transition to sustainable energy. By understanding the market for lower-quality, higher-emission coal, stakeholders can strategize more effectively for cleaner energy adoption while optimizing the coal supply chain to meet specific demand.
In conclusion, the ARM analysis provided valuable insights into coal sourcing, pricing strategies, and regional trends, offering crucial information for stakeholders in the coal industry to enhance decision-making in terms of efficiency and sustainability.