Association Rule Mining for Music Recommendation System with YouTube API.
Overview: Association Rule Mining (ARM) is a data analysis method used to uncover valuable relationships within large datasets. ARM operates on transactional data, where each transaction contains items (e.g., products in a shopping cart). It relies on three fundamental metrics:
Support: The fraction of transactions that contain every item in the rule, indicating how often the associated items appear together.
Confidence: The likelihood that the consequent items appear in a transaction given that the antecedent items are present, providing insight into the strength of the association.
Lift: How much more likely the consequent items are to occur when the antecedent items are present than they are across the dataset as a whole. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.
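As a minimal sketch of how the three metrics are computed, the base R snippet below evaluates them by hand for one rule. The transactions and item names ("rock", "pop", "jazz") are hypothetical and are not taken from the YouTube dataset.

```r
# Toy transactions (hypothetical item names, not from the YouTube dataset)
transactions <- list(
  c("rock", "pop"),
  c("rock", "pop", "jazz"),
  c("rock", "jazz"),
  c("pop"),
  c("rock", "pop")
)

# Fraction of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

lhs <- "rock"  # antecedent
rhs <- "pop"   # consequent

rule_support    <- support(c(lhs, rhs), transactions)          # share with both items
rule_confidence <- rule_support / support(lhs, transactions)   # share of lhs transactions that also hold rhs
rule_lift       <- rule_confidence / support(rhs, transactions) # confidence relative to baseline rate of rhs

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

In this toy data the lift comes out slightly below 1, so seeing "rock" makes "pop" marginally less likely than its baseline rate.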
The Apriori algorithm, a widely used ARM technique, finds frequent itemsets level by level: it generates candidate itemsets, discards those below a minimum support threshold, and then derives rules from the surviving itemsets that meet a minimum confidence threshold.
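The level-wise search can be sketched in a few lines of base R. This is a simplified illustration on hypothetical transactions, not the implementation used by the arules package: it covers only the frequent-itemset stage and omits rule generation.

```r
# Toy transactions with hypothetical items A, B, C
transactions <- list(
  c("A", "B", "C"),
  c("A", "B"),
  c("A", "C"),
  c("B", "C"),
  c("A", "B", "C")
)
min_support <- 0.5  # illustrative threshold

# Fraction of transactions that contain every item in `items`
support <- function(items, trans) {
  mean(vapply(trans, function(t) all(items %in% t), logical(1)))
}

# Level 1: frequent single items
items    <- sort(unique(unlist(transactions)))
current  <- Filter(function(s) support(s, transactions) >= min_support, as.list(items))
frequent <- list()

while (length(current) > 0) {
  frequent <- c(frequent, current)
  k <- length(current[[1]]) + 1

  # Candidate generation: unions of two frequent (k-1)-itemsets that have exactly k items
  candidates <- list()
  if (length(current) >= 2) {
    for (p in combn(length(current), 2, simplify = FALSE)) {
      cand <- sort(union(current[[p[1]]], current[[p[2]]]))
      if (length(cand) == k) candidates[[paste(cand, collapse = ",")]] <- cand
    }
  }

  # Pruning: every (k-1)-subset must be frequent, and the candidate must meet min_support
  current_keys <- vapply(current, paste, character(1), collapse = ",")
  keep <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, function(s) paste(s, collapse = ",") %in% current_keys, logical(1))) &&
      support(cand, transactions) >= min_support
  }, candidates)
  current <- unname(keep)
}

frequent_keys <- vapply(frequent, paste, character(1), collapse = ",")
print(frequent_keys)
```

Here all single items and all pairs are frequent, while the triple {A, B, C} appears in only two of five transactions and is dropped.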
Data Prep.: The provided R code performs association rule mining (ARM) on a YouTube dataset to discover relationships between items or behaviors. It reads transactional data from a CSV file and converts it into the transaction format that ARM requires. It then applies the Apriori algorithm with adjustable support and confidence thresholds, and displays and stores the top 15 association rules ranked by support, confidence, and lift. Finally, it generates network and matrix plots to represent these associations graphically. The resulting patterns and dependencies can inform content recommendations and audience analysis. The dataset used is "youtube_data_final_1.csv".
Dataset:
Code.:
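# Install the required packages only if they are missing, then load them
if (!requireNamespace("arules", quietly = TRUE)) install.packages("arules")
if (!requireNamespace("arulesViz", quietly = TRUE)) install.packages("arulesViz")
library(arules)
library(arulesViz)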
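# Read the YouTube dataset from a local CSV file
# stringsAsFactors = TRUE because as(..., "transactions") expects factor or logical columns
data <- read.csv("/Users/vishadhvilassawnt/Downloads/youtube_data_final_1.csv", stringsAsFactors = TRUE)
## or
## (note: a Google Drive "view" link returns an HTML page rather than the raw CSV,
## so this call needs a direct-download link to work)
##data <- read.csv("https://drive.google.com/file/d/1Nzy4066pW8BNdFXIwFCKtn83I1w3NMGb/view?usp=sharing")
# Convert the data frame to transactions; each column=value pair becomes an item
transactions <- as(data, "transactions")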
# Run the Apriori algorithm
rules <- apriori(
  transactions,
  parameter = list(support = 0.02, confidence = 0.5), # Minimum support and confidence; adjust as needed
  control = list(verbose = TRUE) # Print progress information while mining
)
# View the resulting association rules
inspect(rules)
# Get the top 15 rules by support, confidence, and lift
top_support <- head(sort(rules, by = "support", decreasing = TRUE), 15)
top_confidence <- head(sort(rules, by = "confidence", decreasing = TRUE), 15)
top_lift <- head(sort(rules, by = "lift", decreasing = TRUE), 15)
# View the top rules (printing a rules object only shows a summary, so use inspect)
inspect(top_support)
inspect(top_confidence)
inspect(top_lift)
# Network and matrix plots for the top 15 rules by support
plot(top_support, method = "graph")
plot(top_support, method = "matrix")
# Network and matrix plots for the top 15 rules by confidence
plot(top_confidence, method = "graph")
plot(top_confidence, method = "matrix")
# Network and matrix plots for the top 15 rules by lift
plot(top_lift, method = "graph")
plot(top_lift, method = "matrix")
Explanation.:
This code performs Association Rule Mining (ARM) on a YouTube dataset using R. It follows these key steps:
Package Loading: To begin, essential R packages, namely "arules" and "arulesViz," are installed and loaded. These packages are pivotal for ARM and result visualization.
Data Retrieval: The code assumes the presence of a YouTube dataset in CSV format. It reads this dataset into the "data" variable.
Data Transformation: The dataset is converted into a transaction format suitable for ARM. Each row in this transformed format represents a transaction, with items within each transaction identified.
Apriori Algorithm: The Apriori algorithm is applied using the "apriori" function, with a minimum support threshold of 0.02 and a minimum confidence threshold of 0.5 to filter the association rules.
Rule Inspection: The discovered association rules are examined using the "inspect" function. This inspection provides an understanding of the patterns and associations within the data.
Top Rule Identification: The code identifies the top 15 rules by support, confidence, and lift, ordering the rules with the "sort" function and taking the first 15 with the "head" function.
Visualization: Two types of visualizations are generated for the association rules: a network plot and a matrix plot. These visual representations aid in comprehending the relationships and associations uncovered in the data.
In essence, this code reads, analyzes, and visually represents association rules within the YouTube dataset, offering valuable insights into underlying patterns and correlations among dataset items or attributes.
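To illustrate the data transformation step, the base R sketch below mimics how a data frame row can be turned into a transaction, with each column=value pair acting as one item. The data frame, its column names, and the helper to_items are hypothetical stand-ins. The real conversion is done by as(data, "transactions") in arules.

```r
# Hypothetical stand-in for the CSV; column names and values are made up
data <- data.frame(
  genre  = c("rock", "pop", "rock"),
  length = c("short", "long", "long"),
  stringsAsFactors = TRUE
)

# One transaction per row; each item is a "column=value" label
to_items <- function(df) {
  lapply(seq_len(nrow(df)), function(i) {
    values <- vapply(df[i, , drop = FALSE], as.character, character(1))
    paste0(names(df), "=", values)
  })
}

transactions <- to_items(data)
print(transactions[[1]])
```

Because every row contributes exactly one item per column, wide tables produce long transactions, which is one reason the support threshold matters for run time.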
Threshold.:
In the provided R code, the following thresholds were used for association rule mining:
1. Support Threshold: 0.02 - This threshold represents the minimum level of support that an itemset or rule must have to be considered frequent. In this code, it means that an itemset or rule must appear in at least 2% of the transactions to be considered.
2. Confidence Threshold: 0.5 - This threshold represents the minimum confidence an association rule must have to be retained. A confidence of 0.5 means that the consequent must appear in at least 50% of the transactions that contain the antecedent.
These thresholds can be adjusted to discover association rules that meet specific criteria, such as higher support or confidence, depending on the analysis requirements and goals.
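A short sketch of what these thresholds mean in terms of raw counts follows. The transaction count of 1000 and the rule counts are hypothetical, since the size of the dataset is not stated here, and the helper passes is illustrative only.

```r
n_transactions <- 1000  # hypothetical; the real value depends on the CSV
min_support    <- 0.02
min_confidence <- 0.5

# Minimum number of transactions an itemset must appear in
min_count <- ceiling(min_support * n_transactions)
print(min_count)

# Check a rule from its raw counts:
#   n_both = transactions containing antecedent and consequent
#   n_lhs  = transactions containing the antecedent
passes <- function(n_both, n_lhs, n) {
  (n_both / n >= min_support) && (n_both / n_lhs >= min_confidence)
}

rule_a <- passes(24, 60, n_transactions)  # support 0.024, confidence 0.4
rule_b <- passes(24, 40, n_transactions)  # support 0.024, confidence 0.6
print(c(rule_a = rule_a, rule_b = rule_b))
```

Both hypothetical rules clear the support threshold, but only the second also clears the confidence threshold.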
Results:
Output.:
Network and matrix plot for top_support
Network and matrix plot for top_confidence
Network and matrix plot for top_lift
Explanation.:
Two types of visualizations are generated to help users understand and interpret the discovered associations:
a. Network Plot: The code creates a network plot for the association rules. In arulesViz graph plots, items and rules are both drawn as nodes, with arrows running from the antecedent items into a rule node and from the rule node to its consequent items. By default, the size of a rule node reflects its support and its color reflects its lift.
b. Matrix Plot: The matrix plot arranges the rules in a grid, with antecedent itemsets along one axis and consequent itemsets along the other. Each filled cell marks a rule, and its color encodes an interest measure (lift by default in arulesViz).
Interpreting the Output:
In the top_support plots, the rules show high lift but low support, and the top_confidence plots show the same trend. In the top_lift plots, lift is the same for all rules and most rules have low support, although one association has high support.
Top 15 Rules:
Top 15 rules for support and lift respectively
Top 15 rules for Confidence
Explanation.:
The Apriori algorithm was applied to the YouTube dataset to discover association rules. Support and confidence thresholds were set to filter the rules, and the top 15 rules by support, confidence, and lift were then extracted and visualized.
The analysis provides insights into the relationships and patterns within the YouTube dataset. Specifically, the top rules with high support, confidence, and lift indicate strong associations between different items or attributes in the dataset. These rules can be valuable for understanding user behavior, content preferences, or other patterns relevant to YouTube content.
In the "support" table, most support values are around 0.33, although higher values appear in the initial 4 columns. Confidence and lift are around 1, and coverage is mostly 0.33. In the "lift" table, support is mostly 0.029 and coverage is 0.0379. In the "confidence" table, support and coverage both vary.
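Coverage is the support of the antecedent alone, so confidence can be recovered as support divided by coverage. The sketch below applies this to the rounded values quoted above. The results are approximate because the inputs are rounded, and the helper name implied_confidence is illustrative.

```r
# confidence = support(rule) / coverage, where coverage = support(antecedent)
implied_confidence <- function(support, coverage) support / coverage

conf_support_table <- implied_confidence(0.33, 0.33)      # "support" table values
conf_lift_table    <- implied_confidence(0.029, 0.0379)   # "lift" table values

print(round(c(support_table = conf_support_table,
              lift_table    = conf_lift_table), 3))
```

The first result matches the observation that confidence is around 1 in the "support" table, and the second suggests a confidence of roughly 0.77 for the typical rule in the "lift" table.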
Conclusion:
The code successfully identifies association rules in the YouTube dataset. The top rules with high support, confidence, and lift are extracted and visualized.
The association rules can be valuable for understanding patterns and relationships between different attributes or features in YouTube data.
Further analysis and interpretation of these rules may lead to actionable insights or recommendations for content creators or platform administrators.