Association Rule Mining (ARM) is another widely used machine learning technique. It is a form of unsupervised learning: no labels are provided, and the method discovers structure in the data on its own. The primary goal of ARM is to find patterns in the data and measure how strongly items are associated with one another. It involves finding sets of items that are frequently bought together (frequent itemsets) and generating rules that describe relationships among different items (association rules). Some of the common applications of ARM include:
Market Basket Analysis
Text Mining
Social Network Analysis
Pricing Optimization
Rules in ARM describe relationships or associations between different itemsets (variables) present in the dataset. A rule is generally expressed as an antecedent and a consequent, and states that the consequent tends to occur when the antecedent does.
For example,
In this case, the rule can be framed in this way:
If the weather is warm and the origin state is Florida, then there is a high likelihood that flights will be delayed.
This can also be written as {Warm, Florida} --> {Delay}
where {Warm, Florida} is antecedent and {Delay} is consequent.
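The antecedent/consequent structure can be made concrete with a small sketch. The `Rule` class and the sample record below are hypothetical and exist only to illustrate the idea; they are not taken from the project's data or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical container for one association rule."""
    antecedent: frozenset
    consequent: frozenset

    def fires_on(self, record):
        # The rule applies when every antecedent item is in the record.
        return self.antecedent <= set(record)

    def holds_on(self, record):
        # The rule is satisfied when antecedent and consequent are both present.
        return (self.antecedent | self.consequent) <= set(record)


rule = Rule(frozenset({"Warm", "Florida"}), frozenset({"Delay"}))

# Illustrative record only, not real flight data
record = {"Warm", "Florida", "Delay"}
print(rule.fires_on(record), rule.holds_on(record))
```

A record that contains the antecedent but not the consequent counts against the rule, which is what the measures of a rule quantify.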
There are three important measures in Association Rule Mining:
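Assuming the three measures meant here are the usual support, confidence, and lift, they can be written for a rule A --> B as follows. The symbols A and B are introduced only for this sketch and stand for the antecedent and the consequent.

```latex
\begin{align*}
\mathrm{Support}(A \cup B) &= \frac{\text{number of transactions containing both } A \text{ and } B}{\text{total number of transactions}} \\
\mathrm{Confidence}(A \rightarrow B) &= \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)} \\
\mathrm{Lift}(A \rightarrow B) &= \frac{\mathrm{Support}(A \cup B)}{\mathrm{Support}(A)\,\mathrm{Support}(B)}
\end{align*}
```

Under these definitions, a lift above 1 suggests that A and B occur together more often than would be expected if they were independent.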
The Apriori algorithm is one of the best-known algorithms in data mining for uncovering frequent itemsets within a dataset and deriving association rules from them. It is based on the 'Apriori Principle', which states:
If an itemset is frequent, then all the subsets of the itemset must also be frequent.
If an itemset is infrequent, then all its supersets will be infrequent.
The algorithm works level by level. It starts with single items, then builds larger combinations one item at a time, pruning at each level every candidate that does not meet the minimum support threshold so that only the more common itemsets remain. The process scans the dataset multiple times to count the occurrences of candidate itemsets. The search space is pruned efficiently because support is anti-monotone: adding an item to an itemset can never increase its support.
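The level-wise search described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not an optimized or reference implementation; the function name `apriori` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset
        return sum(itemset <= t for t in transactions) / n

    # Level 1: keep the single items that meet the threshold
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s
    frequent = dict(current)

    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a, b in combinations(list(current), 2)
                      if len(a | b) == k}
        # Prune step: drop candidates that have an infrequent (k-1)-subset
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current
                             for sub in combinations(c, k - 1))}
        # Count step: one pass over the data per level
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions, for illustration only
toy = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
result = apriori(toy, min_support=0.5)
```

In this toy run the candidate {A, B, C} is never counted, because its subset {B, C} is infrequent; that is the pruning the Apriori principle allows.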
Consider the following example of the Apriori algorithm:
Suppose the minimum support is set at 0.4. Apriori first computes the support of each individual item.
Support(Eggs) = 3/5 = 0.6
Support(Milk) = 2/5 = 0.4
Support(Bacon) = 2/5 = 0.4
Support(Tea) = 2/5 = 0.4
Support(Diaper) = 2/5 = 0.4
Support(Bananas) = 1/5 = 0.2
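Comparing these values with the threshold can be sketched as follows. The counts are taken from the support values listed above; the variable names are chosen only for illustration.

```python
from itertools import combinations

min_support = 0.4
n_transactions = 5

# Occurrence counts out of 5 transactions, from the support values above
counts = {"Eggs": 3, "Milk": 2, "Bacon": 2, "Tea": 2, "Diaper": 2, "Bananas": 1}

support = {item: c / n_transactions for item, c in counts.items()}
frequent_items = sorted(i for i, s in support.items() if s >= min_support)
pruned_items = sorted(i for i, s in support.items() if s < min_support)

# Every pair of surviving items becomes a two-item candidate
candidate_pairs = list(combinations(frequent_items, 2))
print(frequent_items, pruned_items, len(candidate_pairs))
```

Five items survive, so the next level has ten candidate pairs to count.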
In this scenario, the support of Bananas (0.2) does not meet the minimum threshold, so Apriori prunes it. The algorithm then forms two-item sets from the remaining items, all of which meet the minimum support threshold.
Support(Eggs,Milk) = 2/5 = 0.4
Support(Eggs,Bacon) = 2/5 = 0.4
Support(Eggs,Tea) = 1/5 = 0.2
Support(Eggs,Diaper) = 1/5 = 0.2
Support(Milk,Bacon) = 1/5 = 0.2
Support(Tea,Milk) = 0
Support(Diaper,Milk) = 1/5 = 0.2
Support(Bacon,Tea) = 1/5 = 0.2
Support(Bacon,Diaper) = 1/5 = 0.2
Support(Tea,Diaper) = 0
In this instance, most of the pairs do not meet the minimum threshold, so Apriori prunes them, leaving only {Eggs, Milk} and {Eggs, Bacon}. Joining these two surviving pairs gives a single three-item candidate, {Eggs, Milk, Bacon}. Because its subset {Milk, Bacon} is already infrequent, the Apriori principle rules this candidate out even before it is counted, and its support confirms this:
Support(Eggs, Milk, Bacon) = 1/5 = 0.2
Since no three-item set meets the minimum support threshold, the algorithm stops. The largest frequent itemsets in this case are {Eggs, Bacon} and {Eggs, Milk}.
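The pair-level pruning and the subset check on the three-item candidate can be sketched as below. The counts are copied from the pair supports in this example; the variable names are illustrative only.

```python
from itertools import combinations

min_support = 0.4
n_transactions = 5

# Pair counts out of 5 transactions, copied from the worked example
pair_counts = {
    ("Eggs", "Milk"): 2, ("Eggs", "Bacon"): 2, ("Eggs", "Tea"): 1,
    ("Eggs", "Diaper"): 1, ("Milk", "Bacon"): 1, ("Tea", "Milk"): 0,
    ("Diaper", "Milk"): 1, ("Bacon", "Tea"): 1, ("Bacon", "Diaper"): 1,
    ("Tea", "Diaper"): 0,
}

# Keep only the pairs whose support meets the threshold
frequent_pairs = {frozenset(p) for p, c in pair_counts.items()
                  if c / n_transactions >= min_support}

# Apriori principle: a candidate is viable only if all its 2-subsets are frequent
candidate = frozenset({"Eggs", "Milk", "Bacon"})
infrequent_subsets = [set(s) for s in combinations(sorted(candidate), 2)
                      if frozenset(s) not in frequent_pairs]
print(frequent_pairs, infrequent_subsets)
```

The check flags {Bacon, Milk} as an infrequent subset, which agrees with the low support of 0.2 for {Eggs, Milk, Bacon}.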
This demonstrates how the Apriori algorithm identifies the frequent itemsets in a dataset for a specified minimum support threshold without having to count every possible combination of items.
In this project, ARM is used to examine patterns in flight delays caused by adverse weather conditions. Specifically, it is employed to identify flights that are delayed and the climatic conditions under which those delays are more likely to occur. The analysis is conducted on the origin states and their corresponding weather delays, because the proportion of weather-related delays is substantially higher at the point of departure than at the destination. The project involves discovering and visualizing distinct sets of rules that highlight significant relationships between specific weather conditions in origin states and the occurrence of flight delays.