7. Unsupervised Learning (Association Analysis)

An association rule has the form X (antecedent) -> Y (consequent), reported with (rule support, confidence)
Support refers to the % of transactions that contain X, i.e. P(X)
Rule support refers to the % of transactions in which X and Y appear together, i.e. P(X and Y)
Confidence refers to the likelihood that Y appears when X occurs, i.e. P(Y | X) = rule support / support of X
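
As a rough illustration of the three measures above, the following sketch computes them by hand for a rule X -> Y. The five transactions and the helper name `support` are made up for this example and do not come from the retail dataset used later.

```python
# Hypothetical toy transactions (not from retail_dataset.csv).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
    {"Bread", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

X, Y = {"Bread"}, {"Milk"}
support_x = support(X, transactions)         # P(X)
rule_support = support(X | Y, transactions)  # P(X and Y)
confidence = rule_support / support_x        # P(Y | X)

print(support_x, rule_support, confidence)
```

Here Bread appears in 4 of 5 transactions and Bread together with Milk in 3 of 5, so the confidence of Bread -> Milk is 3/4.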

apriori(df, min_support=0.5, use_colnames=False, max_len=None, verbose=0, low_memory=False)
df refers to a one-hot-encoded DataFrame, i.e. a DataFrame whose values are 0 and 1 or True and False
min_support refers to a floating point value between 0 and 1 that indicates the minimum support required for an itemset to be selected (# of observations containing the itemset / total # of observations)
use_colnames=True uses the DataFrame column names in the returned itemsets instead of column indices
max_len refers to the maximum length of the itemsets generated (None means no limit)
verbose shows the number of iterations if >= 1 and low_memory is True. If = 1 and low_memory is False, it shows the number of combinations
low_memory=True should only be used for large datasets when memory resources are limited
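
To make `min_support` and `max_len` concrete, here is a minimal brute-force sketch of what "frequent itemsets" means. It is not how mlxtend implements `apriori`; the function name `frequent_itemsets` and the toy transactions are assumptions for illustration only.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support=0.5, max_len=None):
    """Return {itemset: support} for every itemset with support >= min_support.

    Brute force: enumerates all combinations, so only suitable for tiny examples.
    """
    items = sorted(set().union(*transactions))
    n = len(transactions)
    longest = max_len if max_len is not None else len(items)
    result = {}
    for size in range(1, longest + 1):
        for candidate in combinations(items, size):
            # Count the transactions that contain every item of the candidate.
            count = sum(set(candidate) <= t for t in transactions)
            if count / n >= min_support:
                result[frozenset(candidate)] = count / n
    return result

# Hypothetical toy transactions.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese"},
    {"Bread", "Milk", "Cheese"},
    {"Milk"},
    {"Bread", "Milk"},
]

for itemset, sup in frequent_itemsets(transactions, min_support=0.5).items():
    print(sorted(itemset), sup)
```

Raising `min_support` shrinks the result, and `max_len=1` restricts it to single items.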

association_rules(df, metric='confidence', min_threshold=0.8, support_only=False)
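
The call above takes the frequent itemsets returned by `apriori` and keeps only the rules whose chosen `metric` reaches `min_threshold`. The sketch below mimics that filtering step for the confidence metric in plain Python. The function `rules_by_confidence` and the hard-coded supports are hypothetical, and mlxtend's actual implementation and output columns differ.

```python
from itertools import combinations

def rules_by_confidence(supports, min_threshold=0.8):
    """Return a list of (antecedent, consequent, rule_support, confidence).

    `supports` maps frozenset itemsets to their support and must also contain
    every non-empty subset of each itemset.
    """
    rules = []
    for itemset, rule_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        for size in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), size):
                antecedent = frozenset(antecedent)
                confidence = rule_support / supports[antecedent]
                if confidence >= min_threshold:
                    rules.append((antecedent, itemset - antecedent,
                                  rule_support, confidence))
    return rules

# Hypothetical supports, as a frequent-itemset step might produce them.
supports = {
    frozenset({"Bread"}): 0.8,
    frozenset({"Milk"}): 0.8,
    frozenset({"Bread", "Milk"}): 0.6,
}

for antecedent, consequent, sup, conf in rules_by_confidence(supports, min_threshold=0.7):
    print(sorted(antecedent), "->", sorted(consequent), sup, conf)
```

With these numbers both Bread -> Milk and Milk -> Bread have confidence of about 0.75, so they pass a threshold of 0.7 but not the default of 0.8.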

Example:

import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
import matplotlib.pyplot as plt

df = pd.read_csv("retail_dataset.csv")
#print(df.head())

items = df["0"].unique()  # unique items (assumes every item appears at least once in column "0")
#print(items)

# Data Preprocessing (Convert the dataset to either 0 and 1 or True and False)
encoded_vals = []
for index, row in df.iterrows():
    labels = {}
    uncommon = list(set(items) - set(row))  # items absent from this transaction
    common = list(set(items).intersection(row))  # items present in this transaction
    for uc in uncommon:
        labels[uc] = 0
    for com in common:
        labels[com] = 1
    encoded_vals.append(labels)
#print(encoded_vals[100])

# A one hot encoding is a representation of categorical variables as binary vectors
ohe_df = pd.DataFrame(encoded_vals)
#print(ohe_df)

# Applying Apriori
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, verbose=1)
print(freq_items)

# Mining Association Rules
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
print(rules)

# Visualizing results
plt.scatter(rules['support'], rules['confidence'], alpha=0.5)  # Support vs Confidence
plt.xlabel('Support')
plt.ylabel('Confidence')
plt.title('Support vs Confidence')
plt.show()

#rules.to_csv("association_analysis.csv")
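
The preprocessing loop in the example takes its item list from column "0" only and encodes one row at a time. A possible alternative, sketched here with pandas alone, collects the items from every column and builds the True/False table column by column. The small DataFrame is a hypothetical stand-in for retail_dataset.csv, whose exact layout is assumed rather than known.

```python
import pandas as pd

# Hypothetical stand-in for retail_dataset.csv: one transaction per row,
# padded with None where a basket has fewer items.
df = pd.DataFrame(
    [
        ["Bread", "Milk", None],
        ["Cheese", "Bread", "Milk"],
        ["Cheese", None, None],
    ],
    columns=["0", "1", "2"],
)

# Collect the unique items from every column, skipping missing cells.
items = sorted({x for x in df.to_numpy().ravel() if pd.notna(x)})

# One boolean column per item: True where the item occurs anywhere in the row.
ohe_df = pd.DataFrame({item: df.eq(item).any(axis=1) for item in items})
print(ohe_df)
```

The resulting `ohe_df` has one column per item and one row per transaction, which is the input shape described for `apriori` above.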
