Clustering for Music Recommendation System with YouTube API.
Overview: Clustering is a data analysis technique that groups similar data points together, unveiling hidden patterns within datasets. There are two main clustering approaches: partitional and hierarchical. Partitional methods, such as K-Means, divide data into distinct clusters, while hierarchical methods create a tree-like structure of nested clusters. Distance metrics like Euclidean or Cosine similarity quantify the dissimilarity between data points, determining how they are assigned to clusters. Clustering has various applications, including customer segmentation, content recommendation, and anomaly detection. It aids in data exploration by visually representing clusters through plots and dendrograms, providing valuable insights into data structures and supporting decision-making.
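The difference between the two distance metrics mentioned above can be shown in a few lines of Python (a minimal sketch using SciPy; the vectors are made-up engagement values, not taken from the dataset):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Two hypothetical engagement vectors (e.g. views, likes) that point in the
# same direction but differ in magnitude
a = np.array([1000.0, 50.0])
b = np.array([2000.0, 100.0])

# Euclidean distance grows with the difference in magnitudes
print(euclidean(a, b))  # large: the vectors are far apart in raw units

# Cosine distance (1 - cosine similarity) only reflects direction;
# these vectors are parallel, so the distance is ~0
print(cosine(a, b))
```

This is why cosine-based clustering can group a small channel and a large channel whose views-to-likes ratios are similar, while Euclidean clustering would separate them.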
Data Prep.: In the provided R code, you begin by preparing your dataset for clustering analysis. You load essential packages for clustering, including 'arules', 'cluster', 'dplyr', and 'clValid'. Then, you read your dataset, stored in a CSV file, and perform several preprocessing steps.
First, you check the data types and select only the numeric columns using the 'dplyr' package. This step ensures that clustering algorithms operate on the appropriate data type.
Next, you standardize the numeric data, a crucial preprocessing step in clustering, ensuring all variables share the same scale to prevent any single variable from dominating the clustering process.
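The standardization step can be sketched in Python as well (the R code later uses `scale`; this illustrative example uses scikit-learn's `StandardScaler`, which divides by the population standard deviation rather than R's sample standard deviation, and made-up numbers rather than the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative feature matrix: columns on very different scales
# (e.g. views in the millions vs. likes in the thousands)
X = np.array([[1_000_000, 2_000],
              [3_000_000, 8_000],
              [2_000_000, 5_000]], dtype=float)

scaled = StandardScaler().fit_transform(X)

# Each column now has mean ~0 and unit variance, so neither
# feature dominates the distance calculations
print(scaled.mean(axis=0))  # ~[0, 0]
print(scaled.std(axis=0))   # ~[1, 1]
```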
You define a range of 'k' values for K-Means and Hierarchical clustering, initializing vectors to store results. The 'for' loop iterates through each 'k' value, performing K-Means clustering and calculating silhouette scores and Within-Cluster Sum of Squares (WCSS) for each 'k'.
The silhouette method and elbow method help determine the best 'k' value. You use the silhouette method's highest score and the elbow method's "elbow point" as the optimal 'k'.
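The k-selection loop described above can be sketched in Python on synthetic data (a stand-in only; the real workflow runs on the scaled YouTube features, which are assumed to be loaded separately):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled features: 4 well-separated planted clusters
X, _ = make_blobs(n_samples=300,
                  centers=[[0, 0], [10, 10], [0, 10], [10, 0]],
                  cluster_std=1.0, random_state=42)

wcss, sil = {}, {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=25, random_state=42).fit(X)
    wcss[k] = km.inertia_                     # elbow method: WCSS for this k
    sil[k] = silhouette_score(X, km.labels_)  # mean silhouette width for this k

best_k = max(sil, key=sil.get)  # k with the highest mean silhouette width
print(best_k)  # 4 on this synthetic data
```

Plotting `wcss` against k and looking for the bend gives the elbow estimate; `best_k` is the silhouette estimate.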
Finally, you perform K-Means clustering and hierarchical clustering with the optimal 'k', visualize the results, and print the cluster assignments.
The code's data preparation ensures that clustering algorithms can analyze the numeric attributes effectively, enabling you to uncover meaningful patterns and insights within your dataset.
The dataset used is "youtube_data_final_1.csv".
Dataset:
K-Means clustering using Python
Code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Load CSV file into a DataFrame
data = pd.read_csv('youtube_data_final_1.csv')
# Select the columns to include as features
selected_features = data[['Views', 'Likes']]
# Create the feature matrix
feature_matrix = selected_features.values
# Run K-Means on the (unscaled) feature matrix; n_init and random_state fixed for reproducibility
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
clusters = kmeans.fit_predict(feature_matrix)
# Add cluster labels to the DataFrame
data['Cluster'] = clusters
# Get cluster centers
cluster_centers = kmeans.cluster_centers_
# Create a scatter plot of the data points with different colors for each cluster
plt.figure(figsize=(8, 6))
scatter = plt.scatter(data['Views'], data['Likes'], c=data['Cluster'], cmap='rainbow', label='Youtube Video')
# Plot cluster centers as black stars
plt.scatter(cluster_centers[:, 0], cluster_centers[:, 1], s=100, c='black', marker='*', label='Cluster Centers')
# Extract unique cluster labels
unique_clusters = list(set(clusters))
# Create a legend
legend_labels = [f'Cluster {cluster}' for cluster in unique_clusters]
plt.legend(handles=scatter.legend_elements()[0], title='Clusters', labels=legend_labels)
plt.title('K-Means Clustering Results with Cluster Centers')
plt.xlabel('Views')
plt.ylabel('Likes')
plt.show()
Explanation:
This Python code employs K-Means clustering to analyze a YouTube video dataset's viewer engagement patterns. The dataset, loaded as a Pandas DataFrame, consists of 'Views' and 'Likes' as the selected features for clustering.
By setting up K-Means with five clusters, the code uses the 'fit_predict' method to assign each video to one of these clusters based on its 'Views' and 'Likes.' Visual representation is achieved through a scatter plot, where each video point is color-coded according to its cluster. Cluster centers are denoted by black stars, signifying their central positions.
The code then extracts unique cluster labels to construct a clear legend, displaying labels like 'Cluster 0,' 'Cluster 1,' and so forth. This aids in easily identifying and distinguishing each cluster.
This visualization offers insights into viewer engagement trends among YouTube videos. Videos within the same cluster likely share similar characteristics, facilitating audience targeting and content strategy decisions. Videos in distinct clusters exhibit varying engagement patterns, which can inform content creators and analysts about the diverse preferences of their audience segments. Ultimately, this code provides a practical means of understanding and leveraging viewer engagement metrics for data-driven decision-making in the realm of YouTube content.
Clustering plot:
Explanation:
The provided code utilizes K-Means clustering to analyze a dataset of YouTube videos using two key features: "Views" and "Likes." It visually represents the clustering results through a scatter plot.
In the scatter plot, each data point corresponds to a YouTube video, with the x-axis representing the video's "Views" and the y-axis indicating the "Likes." The points are color-coded according to their cluster assignment, making it easy to discern which cluster each video belongs to. Additionally, black stars mark the cluster centers, serving as centroids for each cluster.
The interpretation of this output is as follows: The scatter plot reveals how K-Means has grouped videos into clusters based on their "Views" and "Likes." Videos in the same cluster share similarities in these features, while the cluster centers represent typical values within each group. This visualization provides valuable insights into audience engagement patterns and helps content creators tailor their strategies to different viewer segments.
In summary, this code offers a clear, visual understanding of how K-Means clustering has organized YouTube videos based on their "Views" and "Likes," aiding data-driven decision-making and content optimization.
Alternate Code using R:
# Load necessary libraries for clustering, silhouette analysis, and visualization
install.packages("arules")
library(arules)
library(cluster)
library(dplyr)
install.packages("clValid")
library(clValid)
install.packages("proxy")
library(proxy)
# Read the CSV file
data_matrix <- read.csv("/Users/vishadhvilassawnt/Downloads/youtube_data_final_1.csv")
#data <- read.csv("https://drive.google.com/file/d/1Nzy4066pW8BNdFXIwFCKtn83I1w3NMGb/view?usp=sharing")
# Check data types and remove non-numeric columns
numeric_data <- data_matrix %>%
select_if(is.numeric)
# Standardize the numeric data
scaled_data <- scale(numeric_data)
# Choose a range of k values for K-Means and Hierarchical clustering
k_values <- 2:10
# Initialize variables to store results
silhouette_scores <- vector()
wcss <- vector()
# Perform k-means clustering and silhouette analysis
for (k in k_values) {
# K-Means clustering
kmeans_result <- kmeans(scaled_data, centers = k, nstart = 25)
# Silhouette analysis: store the average silhouette width for this k
sil <- silhouette(kmeans_result$cluster, dist(scaled_data))
silhouette_scores[k] <- mean(sil[, "sil_width"])
# Elbow method: total within-cluster sum of squares (WCSS) for this k
wcss[k] <- kmeans_result$tot.withinss
}
# Determine the "best k" from Silhouette method
best_k_silhouette <- which.max(silhouette_scores)
cat("Best k (Silhouette method):", best_k_silhouette, "\n")
# Determine the "best k" from Elbow method: plot WCSS against k and look for the bend
plot(k_values, wcss[k_values], type = "b", xlab = "Number of clusters k", ylab = "WCSS", main = "Elbow Method")
elbow_point <- 3 # Adjust this value based on visual inspection of the plot
cat("Optimal k (Elbow method):", elbow_point, "\n")
# Perform K-Means clustering with the best k from Silhouette method
best_kmeans_result <- kmeans(scaled_data, centers = best_k_silhouette, nstart = 25)
# Convert cluster assignments to integers
cut_clusters_kmeans <- as.integer(best_kmeans_result$cluster)
# Perform hierarchical clustering with cosine distance ('proxy' provides a cosine method for dist)
hc_result_cosine <- hclust(dist(scaled_data, method = "cosine"), method = "ward.D2")
cut_clusters_hc_cosine <- cutree(hc_result_cosine, k = elbow_point)
# Perform hierarchical clustering with Euclidean distance and the optimal k from the Elbow method
hc_result <- hclust(dist(scaled_data), method = "ward.D2")
cut_clusters_hc <- cutree(hc_result, k = elbow_point)
# Visualize the K-Means clustering results
plot(scaled_data, col = cut_clusters_kmeans, pch = 19, main = "K-Means Clustering Results")
# Visualize the Hierarchical clustering results as a dendrogram
plot(hc_result, hang = -1, main = "Hierarchical Clustering Dendrogram")
# Re-plot with adjusted margins and smaller labels for readability
par(mar = c(5, 5, 2, 2)) # Adjust the margin as needed
plot(hc_result, hang = -1, main = "Hierarchical Clustering", cex = 0.8, las = 2, cex.axis = 0.7)
# Print cluster assignments for K-Means and Hierarchical clustering
cat("K-Means Cluster Assignments:\n")
print(cut_clusters_kmeans)
cat("Hierarchical Cluster Assignments:\n")
print(cut_clusters_hc)
# Show the first few rows of your data (sample)
head(data_matrix)
Explanation:
This R code performs clustering and silhouette analysis on a dataset, specifically using K-Means and Hierarchical clustering methods. Here's an explanation of the code:
1. **Load Necessary Libraries:** The code starts by installing and loading the required R packages: "arules" (an association rule mining package, loaded here but not strictly needed for clustering), "cluster" for clustering and silhouette analysis, "dplyr" for data manipulation, "clValid" for cluster validation, and "proxy" for additional distance measures such as cosine.
2. **Read the CSV Data:** The code reads a CSV file named "youtube_data_final_1.csv" from the local file system. Alternatively, you can use the commented line to read the data from a Google Drive link.
3. **Data Preprocessing:** It checks the data types of columns and selects only the numeric columns for clustering. Then, it standardizes the numeric data using the `scale` function, which scales the data to have a mean of 0 and standard deviation of 1.
4. **Choosing K Values:** The code initializes a range of k values (number of clusters) for K-Means and Hierarchical clustering. In this case, it considers k values from 2 to 10.
5. **Initialize Variables:** It initializes empty vectors to store silhouette scores and within-cluster sum of squares (WCSS) for each k value.
6. **K-Means Clustering and Silhouette Analysis:** It performs K-Means clustering for each k value and calculates the silhouette score for each clustering result. The silhouette score measures how similar an object is to its own cluster compared to other clusters. The code stores the silhouette scores in the `silhouette_scores` vector.
7. **Elbow Method (WCSS):** It calculates the WCSS for each k value. The WCSS represents the sum of squared distances between data points and their assigned cluster centroids. The code stores the WCSS values in the `wcss` vector.
8. **Determine the Best K:** The code identifies the best k value using both the silhouette method (which maximizes silhouette scores) and the elbow method (which looks for an "elbow point" in the WCSS plot).
9. **Perform Clustering with the Best K:** It performs K-Means clustering with the best k value obtained from the silhouette method and assigns cluster labels to each data point.
10. **Hierarchical Clustering:** It performs hierarchical clustering twice with the `hclust` function: once using cosine distance and once using Euclidean distance, cutting each tree at the optimal k value from the elbow method.
11. **Visualization:** The code visualizes the results. It creates scatter plots for K-Means clustering results, dendrograms for hierarchical clustering, and a dendrogram with horizontal stretch for better visualization.
12. **Print Cluster Assignments:** Finally, it prints the cluster assignments for both K-Means and Hierarchical clustering.
This code helps you explore and visualize clustering results for your dataset and determine the optimal number of clusters (k) using both silhouette and elbow methods. It also provides a visual representation of the hierarchical clustering process.
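For readers working in Python, the `hclust`/`cutree` part of this workflow has a close SciPy analogue. The sketch below runs on synthetic blobs standing in for the scaled YouTube features (the real data is assumed to be loaded and scaled separately):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled numeric columns of the YouTube data
X, _ = make_blobs(n_samples=60, centers=3, cluster_std=0.5, random_state=1)

# Ward linkage on Euclidean distance, analogous to hclust(dist(x), method = "ward.D2")
Z = linkage(X, method="ward")

# fcluster with criterion="maxclust" plays the role of R's cutree(hc, k = 3)
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))  # cluster ids 1, 2, 3
```

`scipy.cluster.hierarchy.dendrogram(Z)` would draw the same kind of tree the R code plots.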
Plots:
Hierarchical clustering using R
Code:
# Loading required libraries
library(data.table)
library(cluster)
library(proxy)
# Reading data from the CSV file (replace with your dataset path; note that a
# Google Drive "view" link returns an HTML page, not raw CSV -- use a
# direct-download link or a local file path instead)
data <- read.csv("https://drive.google.com/file/d/1Nzy4066pW8BNdFXIwFCKtn83I1w3NMGb/view?usp=sharing")
# Keeping only the numeric columns for clustering
numeric_cols <- data[sapply(data, is.numeric)]
# Sampling a subset of data (adjust the fraction to your desired subset size)
subset_size <- 0.01 # 1% of the data
set.seed(123) # For reproducibility
sampled_indices <- sample(1:nrow(numeric_cols), floor(nrow(numeric_cols) * subset_size))
sampled_data <- numeric_cols[sampled_indices, ]
# Scaling the data
scaled_data <- scale(sampled_data)
# Calculating cosine similarity
cosine_similarity <- proxy::simil(as.matrix(scaled_data), method = "cosine")
# Converting cosine similarity to distance
cosine_distance <- 1 - cosine_similarity
# Performing hierarchical clustering with different linkage methods
methods <- c("complete", "single", "average")
hclust_results <- lapply(methods, function(method) {
hclust(as.dist(cosine_distance), method = method)
})
# Plotting the dendrograms for each linkage method with clusters highlighted
for (i in 1:length(methods)) {
plot(hclust_results[[i]], main = paste0("Hierarchical Clustering (", methods[i], ")"))
# Add rectangles to highlight clusters
rect.hclust(hclust_results[[i]], k = 4)
}
Explanation:
This R code performs hierarchical clustering on a dataset after performing data preprocessing and transformation. Here's an explanation of each part of the code:
1. **Loading Required Libraries**:
- The code starts by loading three libraries: `data.table`, `cluster`, and `proxy`. These libraries provide functions and tools for data manipulation, clustering, and calculating distances.
2. **Reading Data**:
- It reads data from a CSV file. The `read.csv` function is used for this purpose. You should replace the file URL with the URL of your dataset.
3. **Sampling a Subset of Data**:
- It samples a subset of the data for clustering. This is done to reduce the dataset size and speed up the clustering process. The `subset_size` variable determines what fraction of the data is sampled. In this code, 1% of the data is sampled.
4. **Scaling the Data**:
- The sampled data is scaled using the `scale` function. Scaling standardizes the variables to have mean 0 and standard deviation of 1. This step is often necessary when performing clustering because it ensures that all variables contribute equally.
5. **Calculating Cosine Similarity**:
- The cosine similarity between the scaled data points is calculated using the `proxy::simil` function with the "cosine" method. Cosine similarity measures the cosine of the angle between two vectors and is often used for text or high-dimensional data.
6. **Converting Cosine Similarity to Distance**:
- The code converts cosine similarity to distance by subtracting it from 1. This is a common transformation when working with hierarchical clustering because clustering algorithms typically use distances to measure dissimilarity.
7. **Performing Hierarchical Clustering with Different Linkage Methods**:
- The code performs hierarchical clustering using three different linkage methods: "complete," "single," and "average." These methods determine how the distance between clusters is calculated. The results are stored in the `hclust_results` list.
8. **Plotting Dendrograms**:
- The code iterates over the results of hierarchical clustering for each linkage method and plots dendrograms. Dendrograms are tree-like diagrams that show the hierarchical structure of the clusters.
- Rectangles are added to highlight clusters, with `k` specifying the number of clusters to highlight. In this code, `k` is set to 4, which means it will highlight four clusters.
The code is useful for exploring the clustering structure of a dataset using hierarchical clustering with different linkage methods and visualizing the results using dendrograms. It's important to adapt the code to your specific dataset and analysis goals by replacing the data source and adjusting parameters as needed.
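The cosine-distance and linkage-method comparison can likewise be sketched in Python with SciPy (synthetic data stands in for the sampled, scaled subset; `fcluster` plays the role of the `rect.hclust` cut at k = 4):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sampled, scaled subset used in the R code
X, _ = make_blobs(n_samples=40, centers=4, random_state=123)
X = StandardScaler().fit_transform(X)

# Cosine distance = 1 - cosine similarity, matching the R transformation
d = pdist(X, metric="cosine")

# The same three linkage methods the R code compares
for method in ("complete", "single", "average"):
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")  # like rect.hclust(..., k = 4)
    print(method, len(np.unique(labels)))
```

Different linkage methods generally merge the same distance matrix into differently shaped clusters, which is exactly what the three dendrograms illustrate.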
Results:
Clustering output:
Explanation: In these three dendrograms, observations on the same branch are clustered together. For example, in the complete- and average-linkage dendrograms, observations 415 and 118 are clustered together, as are 179 and 526. The same pairs also appear together in the single-linkage dendrogram. Each dendrogram is cut into 4 clusters, and the cluster memberships are the same for the complete- and average-linkage graphs.
Conclusion:
Hierarchical Clustering Methods: The code explores three distinct hierarchical clustering linkage methods: "complete," "single," and "average." Each method has its unique characteristics and can yield different cluster structures.
Cluster Identification: The code identifies and highlights four clusters within each dendrogram. These clusters are determined by the results of hierarchical clustering and serve as a way to recognize distinct data groups.
Cosine Similarity: Cosine similarity is used as a measure of how similar or dissimilar data points are. This similarity metric is particularly useful for high-dimensional or text data, as it considers the angles between data vectors.
Data Scaling: Before calculating cosine similarity, the code scales the data. This preprocessing step is crucial to ensure that variables with varying scales do not disproportionately influence similarity calculations.
Dendrogram Visualization: Dendrograms are powerful visual representations for hierarchical clustering results. They illustrate how data points or objects form clusters based on their similarity or dissimilarity. The dendrogram's structure can reveal natural data groupings.
Impact of Linkage Methods: The choice of linkage method significantly affects cluster formation. "Complete" linkage tends to create compact clusters, "single" linkage can lead to elongated clusters, and "average" linkage strikes a balance between the two. Researchers must carefully select the most suitable linkage method based on their data's characteristics and analysis goals.