Clustering
Clustering is a form of unsupervised machine learning in which similar data points are grouped together based on shared features.
The first image illustrates common distance measures, such as Euclidean distance (the shortest straight-line distance) and Manhattan distance (the sum of absolute differences along each axis), which are essential to clustering algorithms. These metrics quantify how close two points are to each other in the feature space.
For clustering, the distance measure is crucial: it determines how movies with similar attributes, such as budget, ratings, and popularity, are grouped together.
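As a quick illustration, both distances can be computed directly with NumPy (a minimal sketch with two made-up feature vectors, not values from our dataset):

```python
import numpy as np

# Two points in a 2-D feature space (e.g., scaled budget and rating)
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: shortest straight-line distance
euclidean = np.sqrt(np.sum((a - b) ** 2))  # 5.0

# Manhattan distance: sum of absolute differences along each axis
manhattan = np.sum(np.abs(a - b))          # 7.0
```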
The image depicts how raw, unlabeled data is transformed into useful clusters. In our project, we remove the label "Movie_Category" so the algorithms learn patterns without prior class assignments. By applying methods such as K-Means, Hierarchical Clustering, and DBSCAN, we can discover hidden patterns in the data, grouping movies by their financial and crowd-sourced statistics. This helps segment films into successful, average, or failed clusters, providing a meaningful segmentation for further analysis. Clustering also improves film success prediction by identifying features that successful films share, without requiring predefined categories.
Before clustering, the dataset contains three principal components (PC1, PC2, PC3) produced by Principal Component Analysis (PCA) for dimensionality reduction. This projection retains the largest share of the data's variance and preserves valuable information about the financial and audience-based success of the movies. At this stage, however, there are no cluster labels: the data is merely a lower-dimensional version of the original. Without cluster labels, movies remain unorganized with respect to similarity, making it hard to analyze trends in their performance.
K-Means Clustering
The dataset was first transformed with PCA to reduce dimensionality while retaining meaningful variance. The numerical variables were then standardized with StandardScaler so that all features carry equal weight during clustering.
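This preprocessing step can be sketched as follows, assuming scikit-learn and following the order described above (PCA, then standardization); the random matrix stands in for our numeric movie features:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random matrix standing in for the numeric movie features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))

# Reduce to three principal components, then standardize them
# so each component carries equal weight during clustering
X_pca = PCA(n_components=3).fit_transform(X)
X_ready = StandardScaler().fit_transform(X_pca)
print(X_ready.shape)  # (100, 3)
```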
After K-Means clustering is applied, an extra "Cluster" column specifies the group each movie belongs to. The label is assigned based on the similarity of movies in the reduced feature space. These cluster labels sort movies into categories such as box-office hit, average, or flop, based on financial and audience metrics. Grouping films into clusters makes patterns and trends in film performance more readily apparent, enabling a more meaningful analysis of the determinants of success or failure in the film industry.
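Attaching the "Cluster" column with K-Means might look like this (a sketch using synthetic blobs in place of the PCA-reduced movie data; the column and variable names are illustrative):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced data (3 components)
X_pca, _ = make_blobs(n_samples=200, n_features=3, centers=3, random_state=42)
df = pd.DataFrame(X_pca, columns=["PC1", "PC2", "PC3"])

# Fit K-Means and attach each movie's group as a "Cluster" column
km = KMeans(n_clusters=3, n_init=10, random_state=42)
df["Cluster"] = km.fit_predict(X_pca)
```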
Hierarchical Clustering
Hierarchical clustering was performed on the PCA-transformed dataset using Ward's linkage, which minimizes within-cluster variance at each merge. A dendrogram was drawn to determine an appropriate number of clusters before labelling the dataset.
Following hierarchical clustering, a new "Cluster" column was added to the dataframe, with similar films grouped by their principal components. Hierarchical clustering builds a tree structure that makes the relationships between clusters easier to interpret. In the transformed dataframe, every film has a cluster label (i.e., 0, 1, 2, or 3) indicating which cluster it belongs to. Unlike K-Means, hierarchical clustering does not require the number of clusters to be fixed upfront, and it lets us visualize the cluster structure with dendrograms. It ensures that films with similar financial and audience-based metrics end up together, helping us make sense of patterns in the dataset.
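A sketch of this step with SciPy's Ward linkage (synthetic data again; the cut into four clusters mirrors the four labels mentioned above, though fcluster numbers them from 1):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Synthetic stand-in for the PCA-reduced data
X_pca, _ = make_blobs(n_samples=200, n_features=3, centers=4, random_state=0)

# Ward's linkage merges, at each step, the pair of clusters
# that minimizes the increase in within-cluster variance
Z = linkage(X_pca, method="ward")

# Cut the tree into four flat clusters (labels 1..4)
labels = fcluster(Z, t=4, criterion="maxclust")
```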
DBSCAN
The PCA-reduced dataset was then standardized to keep all components on the same scale before applying DBSCAN, which detects density-based clusters. Unlike K-Means, DBSCAN does not require the number of clusters to be specified in advance; instead, it uses eps (ε) and min_samples to identify dense regions as well as outliers.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) produced a clustering different from the hierarchical result, as the transformed dataframe shows. DBSCAN does not force every point into a cluster the way hierarchical clustering does. Instead, it assigns labels to points that lie in dense regions and marks noisy points as "-1", meaning they are outliers. This method is particularly useful for identifying anomalies in movie performance, i.e., films that do not fit the usual categories of box office or public reception. Clusters found by DBSCAN can take any shape and density, making it a valuable technique for uncovering hidden patterns in the data.
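The noise-labeling behavior can be seen in a small sketch (synthetic blobs plus three hand-placed outliers; the eps and min_samples values here are illustrative, not tuned for our dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Two dense blobs plus three far-away points that should become noise
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [4, 4]],
                  cluster_std=0.4, random_state=7)
X = np.vstack([X, [[10, 10], [-10, -10], [10, -10]]])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_noise = int((labels == -1).sum())  # outliers receive the label -1
```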
The outputs of the K-Means clustering are graphed for three K values (2, 3, and 4), with each graph displaying the data points after PCA dimensionality reduction. The clusters are colored differently, and the centroids, the center points of each cluster, are marked with a red X. In the K=2 graph, the data is divided into two broad clusters, likely separating high-grossing films from lower-grossing ones, with a high silhouette score of 0.758, suggesting well-separated clusters. In the K=3 graph, a third cluster separates mid-grossing films from the very successful and poorly performing ones, with a silhouette score of 0.757, comparable in effectiveness to K=2. Moving up to K=4, more subtle divisions appear, perhaps splitting films by genre or budget level, but the silhouette score drops to 0.558, reflecting some overlap between clusters. The K=2 and K=3 models produce the most distinct groupings and are therefore the best fits for this data.
The silhouette score, a measure of cluster quality, was computed for different K values. Higher silhouette values indicate more distinct clusters. The highest scores were obtained for K=2 (0.7583) and K=3 (0.7572), indicating that these values produce the most distinct clusters with minimal overlap. As K increases further, the silhouette scores decline, indicating weaker cluster separation.
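Computing silhouette scores across K values can be sketched like this (synthetic data; the scores here will not match the 0.7583 and 0.7572 values reported for our dataset):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, n_features=3, centers=3, random_state=1)

# Silhouette score ranges from -1 to 1; higher means more distinct clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # K with the clearest separation
```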
The dendrogram provides a hierarchical view of the clustering process, showing how movies are successively merged into clusters as they become more alike. The X-axis represents individual films, and the Y-axis represents the distance (or dissimilarity) between clusters. At the lowest levels, the most similar films coalesce first into small, tightly related groups, such as action-adventure movies or family animated films. As the hierarchy ascends, mid-level branches merge related groups, such as thrillers with crime films or fantasy with sci-fi. At the highest level, the largest and most dissimilar clusters are merged, which might represent the difference between high-budget blockbusters and specialty independent films. Unlike K-Means, hierarchical clustering does not require the number of clusters to be specified beforehand, which makes it valuable for discovering relationships in the data. However, because of its computational complexity, it is not well suited to very large datasets. The dendrogram also helps determine a good value of K before applying K-Means, leading to a better clustering of the movies.
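The dendrogram is built from the same linkage matrix as the clustering itself; a minimal sketch (using no_plot=True to inspect the tree structure; in a notebook one would omit it and let SciPy draw the tree via matplotlib):

```python
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

# Small synthetic stand-in so the tree stays readable
X, _ = make_blobs(n_samples=30, n_features=3, centers=3, random_state=0)
Z = linkage(X, method="ward")

# no_plot=True returns the tree structure without drawing;
# "leaves" gives the left-to-right order of films on the X-axis
tree = dendrogram(Z, no_plot=True)
leaf_order = tree["leaves"]
```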
The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) plot demonstrates its ability to detect clusters of varying density along with noise points. Unlike K-Means and Hierarchical Clustering, DBSCAN does not require the number of clusters in advance; instead, it relies on epsilon (eps), the maximum distance between two points for them to be considered neighbors, and min_samples, the minimum number of points required to form a cluster. On this data, DBSCAN identified a dense cluster in the middle region, with many similar movies grouped by their features, and marked some points as noise (cluster label -1): outliers with distinctive financial or audience-based features. This means some movies with very unusual box-office outcomes or genre profiles do not belong to any well-separated cluster. DBSCAN excels at detecting non-spherical clusters and handling noise, which makes it handy for spotting outlier movies that did exceptionally well or badly, whereas K-Means and Hierarchical Clustering make stronger assumptions about cluster shape.
K-Means effectively grouped the data into well-delineated, easily separated clusters given a pre-specified cluster count, but the outcome was sensitive to selecting the best K. Hierarchical clustering, as shown by the dendrogram, provides a non-parametric technique where the number of clusters need not be specified beforehand, making it helpful for exploratory data analysis. DBSCAN detects dense regions of data but struggles with clusters of varying densities, as the significant number of noise points (-1 labels) shows. DBSCAN is best suited for detecting abnormal or outlier clusters, whereas K-Means and hierarchical clustering work best on well-defined structures.
Understanding how movies cluster by different attributes can reveal industry trends, such as similarities among successful movies. Whether through budgets, public reception, or genre trends, clustering methods reveal what makes a movie successful at the box office and among critics. These methods can also support market segmentation, recommendation systems, and trend prediction in entertainment analytics.