Every model and methodology expects its input in a particular format, and Association Rule Mining (ARM) is no exception: the data must be structured as transactions, with each row listing the items that occur together. Continuous columns must be discretized (binned) into categorical labels first, because ARM treats each distinct value as a separate item. The header row must also be dropped so that column names are not read as items of a transaction. Meeting these format requirements is what allows ARM to extract meaningful association rules from the dataset accurately and efficiently.
Both Python and R are used for ARM here. Python handles data preparation, putting the dataset into the required format. R then runs ARM, discovers the association rules, and creates the visualizations that represent them.
The dataset initially contained many columns, but only the features relevant to the analysis are kept: flight name, origin state, origin weather delay, and origin temperature. Because the origin weather delay is continuous, that column needs to be discretized (binned). For clarity, the origin temperature values are likewise converted into corresponding labels followed by US.
Initial data before processing for ARM
A Python function discretizes the continuous temperature values into their corresponding labels followed by US. The groups are:
Below 0°F - Very Cold
0°F to 32°F - Cold
32°F to 50°F - Cool
50°F to 70°F - Mild
70°F to 85°F - Warm
85°F to 100°F - Hot
Above 100°F - Very Hot
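The grouping above can be sketched as a small Python function. This is a minimal illustration rather than the function actually used: the name temperature_label is hypothetical, a shared boundary such as 32°F is assumed to belong to the warmer group, missing values are assumed to map to None, and the "US" suffix mentioned in the text is left out because its exact form is not given.

```python
import math

# Upper bounds (°F) paired with labels, taken from the list above.
# Assumption: a shared boundary such as 32°F goes to the warmer group.
_TEMPERATURE_BINS = [
    (0, "Very Cold"),
    (32, "Cold"),
    (50, "Cool"),
    (70, "Mild"),
    (85, "Warm"),
]


def temperature_label(temp_f):
    """Map a temperature in °F to one of the seven labels."""
    # Assumption: missing readings are returned as None, not binned.
    if temp_f is None or math.isnan(temp_f):
        return None
    for upper, label in _TEMPERATURE_BINS:
        if temp_f < upper:
            return label
    # "Above 100°F" is read strictly, so exactly 100°F is still "Hot".
    return "Hot" if temp_f <= 100 else "Very Hot"
```

With pandas, a function like this could be applied to the temperature column through Series.apply; the column name would depend on the dataset.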
Along with that, another function categorizes delays into groups:
0 or < 0 - No Delay
0 - 5 hrs - No Delay
> 5 hrs - High Delay
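The delay grouping can be sketched the same way. The function name delay_label is hypothetical, and the delay is assumed to be expressed in hours. The middle range carries the label exactly as listed above, so it is exposed as a parameter (mid_label, also hypothetical) in case a different label is wanted for that range.

```python
def delay_label(delay_hours, mid_label="No Delay"):
    """Map a weather delay in hours to a categorical label."""
    # Zero or negative delays count as no delay.
    if delay_hours <= 0:
        return "No Delay"
    # Delays above 0 and up to 5 hours use the middle label,
    # which defaults to the label listed in the text.
    if delay_hours <= 5:
        return mid_label
    # Anything above 5 hours is a high delay.
    return "High Delay"
```

Keeping the thresholds in one small function makes it easy to adjust the cut-offs later without touching the rest of the preparation code.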
After the continuous columns were discretized and the necessary feature adjustments made, the data frame was written to a CSV file without column names. The processed data is now in the required format and ready for use in ARM.
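The header-free export can be illustrated with the standard library alone. This is a sketch under stated assumptions: the function name to_transaction_csv, the field names, and the sample rows are all made up for illustration and are not taken from the dataset.

```python
import csv
import io


def to_transaction_csv(records, fields):
    """Return CSV text with one transaction per line and no header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for record in records:
        # Write values only, so no column name can be read as an item.
        writer.writerow([record[name] for name in fields])
    return buffer.getvalue()


# Hypothetical rows, shown only to illustrate the output shape.
rows = [
    {"flight": "AA", "state": "TX", "delay": "No Delay", "temp": "Warm"},
    {"flight": "DL", "state": "GA", "delay": "High Delay", "temp": "Hot"},
]
text = to_transaction_csv(rows, ["flight", "state", "delay", "temp"])
```

When the data lives in a pandas data frame, DataFrame.to_csv with header=False and index=False gives the same header-free layout.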
Snapshot of Final Data