Status : FINISHED
Exploratory Data Analysis | Python | Unsupervised Machine Learning | Hierarchical Clustering | EDA | Visualizations | ML Models | Customer Segmentation | Statistical Inferences | Business Consultancy
This section gives a basic overview of the project
The Hierarchical Cluster Analysis Capstone Project is an integral part of the 4-year Data Science with Python program from Coincent.ai
This is an independent Exploratory Data Analysis (EDA) project.
Deadline: 28th Feb, 2023
In machine learning, there are numerous algorithms that can be used to model data depending on various use cases, most of which fall under 3 categories: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Clustering comes under Unsupervised Learning.
What is Clustering?
In its general English meaning, a cluster is a group of similar things growing or held together, or a group of people or things that are close together.
And that’s really all it is in machine learning too: the grouping of unlabeled data. Clustering is a machine learning ‘technique’ that groups the data we provide such that data points within a cluster are more similar to each other (or we can say homogeneous) than they are to data points from other clusters. That is, data points from different clusters should be as dissimilar as possible.
We do clustering to understand the underlying structure in the data that may not be detectable at first sight.
Types of Clustering Technique
Clustering techniques are basically of two types: Flat Clustering and Hierarchical Clustering
Flat Clustering: In flat clustering, the data is partitioned into a fixed number of non-overlapping clusters. Each data point belongs to exactly one cluster. Flat clustering algorithms typically require the user to specify the number of clusters in advance.
Examples of flat clustering algorithms include K-Means, K-Medoids, and Fuzzy C-Means.
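As a quick illustration, here is a minimal flat-clustering sketch using scikit-learn's K-Means on synthetic toy data (everything below is for demonstration only); note how the number of clusters has to be fixed up front:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D toy data with 3 natural groups (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# Flat clustering: the number of clusters is specified in advance
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print(labels[:10])              # cluster id (0, 1 or 2) of the first 10 points
print(kmeans.cluster_centers_)  # coordinates of the 3 cluster centres
```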
Hierarchical Clustering: This type of clustering creates a hierarchical decomposition of the dataset. The two main types of hierarchical clustering are agglomerative (bottom-up) and divisive (top-down). Agglomerative clustering starts with each data point as a separate cluster and iteratively merges the most similar clusters until a single cluster is formed. Divisive clustering starts with the entire dataset as a single cluster and recursively divides it into smaller clusters until each data point is in its own cluster.
Our focus with this project is on Hierarchical Clustering so let's understand it a bit more deeply.
Hierarchical Clustering
As the name suggests, in Hierarchical Clustering a hierarchy is maintained in the decomposition of the dataset into clusters. This hierarchical decomposition is what gives Hierarchical Clustering an edge over other clustering techniques: we not only get clusters of the data, we also get the order in which the algorithm formed those clusters. This extra information can help in determining relations between groups within the data; that is, we get to know which group is the closest cluster to a given group, and which is the next closest. Hierarchical Clustering shows all the possible linkages between different clusters.
Types of Hierarchical Clustering
The two main types of hierarchical clustering are agglomerative (bottom-up) and divisive (top-down).
Divisive Clustering
Divisive clustering starts with the entire dataset as a single cluster and recursively divides it into smaller clusters until each data point is in its own cluster.
Divisive clustering is a "greedy" algorithm, as it makes a locally optimal choice at each step by dividing the cluster that has the largest dissimilarity among its data points. This can lead to a suboptimal clustering, as it may not consider the global structure of the data.
Despite its potential limitations, divisive clustering is a useful technique for exploring the structure of high-dimensional datasets and identifying clusters that can help in tasks such as pattern recognition and anomaly detection.
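scikit-learn does not provide a general divisive algorithm, but the greedy top-down idea can be sketched by recursively bisecting whichever cluster currently has the largest internal spread, here using 2-means for each split. This is an illustrative simplification on synthetic data, not the canonical divisive method:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def divisive_clustering(X, n_clusters):
    """Greedy top-down clustering: start with one cluster holding all
    points, then repeatedly bisect the cluster with the largest total
    within-cluster spread until n_clusters clusters remain."""
    clusters = [np.arange(len(X))]  # one cluster containing every point
    while len(clusters) < n_clusters:
        # Locally optimal (greedy) choice: split the cluster whose points
        # are most dissimilar (largest sum of squared deviations from mean)
        spreads = [((X[idx] - X[idx].mean(axis=0)) ** 2).sum() for idx in clusters]
        target = clusters.pop(int(np.argmax(spreads)))
        # Bisect the chosen cluster with 2-means
        halves = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[target])
        clusters.append(target[halves == 0])
        clusters.append(target[halves == 1])
    return clusters

X, _ = make_blobs(n_samples=200, centers=4, random_state=1)
for i, idx in enumerate(divisive_clustering(X, 4)):
    print(f"cluster {i}: {len(idx)} points")
```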
Agglomerative Clustering
This bottom-up algorithm treats each data point as a single cluster at the initial step and then successively merges (or agglomerates) pairs of clusters until all clusters have been merged into a single cluster that contains all data points.
Bottom-up hierarchical clustering is therefore called Hierarchical Agglomerative Clustering, or HAC.
HAC is more frequently used in information retrieval than top-down clustering and is the main subject of this project.
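Here is a minimal HAC sketch using scikit-learn's AgglomerativeClustering on synthetic data (Ward linkage assumed, which merges the pair of clusters giving the smallest increase in within-cluster variance):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=7)

# Bottom-up: every point starts as its own cluster; the closest pair of
# clusters is merged repeatedly until only 3 clusters remain
hac = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = hac.fit_predict(X)
print(labels[:10])
```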
The question now is: how do we visualize this hierarchical decomposition of data into clusters? ---> This is where...
Dendrograms
come into the picture. A dendrogram is a tree-like diagram that records the sequence of merges or splits that happens as the algorithm runs. By moving up from the bottom layer to the top node, a dendrogram allows us to reconstruct the history of merges that resulted in the depicted clustering.
In a dendrogram, the vertical axis represents the distance or dissimilarity between the objects being clustered, and the horizontal axis shows the objects themselves or the groups of objects at each level of the hierarchy. As you move down the dendrogram, the branches represent the groups of objects that are more and more similar to each other. The height of each branch in the dendrogram represents the distance or dissimilarity between the groups of objects being joined together.
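For instance, a dendrogram can be produced with SciPy in a few lines (synthetic data again, Ward linkage assumed):

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=30, centers=3, random_state=7)

# linkage() records every merge HAC performs; dendrogram() draws that
# merge history as a tree, with merge height = dissimilarity
Z = linkage(X, method="ward")
dendrogram(Z)
plt.xlabel("data points")
plt.ylabel("distance (Ward)")
plt.show()
```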
...
Cluster 1 - Navy Blue (Misers)
*Misers: people who hoard money and spend as little as possible*
Inference: These are customers with high annual income but a low spending score. These could be customers who are not very satisfied with the mall's products or services.
Target Potential: Quite high, as these customers have the potential to spend more money.
Cluster 2 - Orange (Normal Customers)
Inference: These are the customers with average Annual Income and Spending Score, which is the most frequent segment.
Target Potential: Not much, as there will always be a high number of average customers; however, different data analysis techniques can be used to increase the spending scores of average customers.
Cluster 3 - Green (Lavish)
Inference: These are the customers whose Annual Income is high and so is their Spending Score. These people might be regular customers of the mall and are convinced by the mall’s facilities.
Target Potential: Very High. People with high income and high spending scores create an ideal case for the mall or shops, as these people are the prime sources of profit.
Cluster 4 - Red (Spendthrift)
Inference: These are the customers with relatively low annual income but a high spending score. They are probably extremely satisfied with the mall's facilities or just love to shop.
Target Potential: High. These customers can also be treated as potential targets, but they can be unpredictable. So mall or shop owners might not proactively target these people, but will still try not to lose them.
Cluster 5 - Purple (Balanced Customers)
Inference: These are the customers who have low Annual Income and a low Spending Score, which makes sense.
Target Potential: Very Low. Mall or shop owners will be less interested in this segment of customers.
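For reference, here is a minimal sketch of how such a five-cluster segmentation could be produced with Ward-linkage HAC; the file path and column names are assumptions based on the common Mall_Customers.csv layout:

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage

# Column names assume the usual Mall_Customers.csv layout (hypothetical path)
df = pd.read_csv("Mall_Customers.csv")
X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values

# Ward-linkage HAC, cut into the 5 clusters discussed above
Z = linkage(X, method="ward")
df["Cluster"] = fcluster(Z, t=5, criterion="maxclust")

# Mean income and spending score per segment
print(df.groupby("Cluster")[["Annual Income (k$)", "Spending Score (1-100)"]].mean())
```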
After working with the mall customers data, which contains information regarding Customer_ID, Age, Gender, Annual Income, and Spending Score, and analyzing the features Annual Income and Spending Score with an unsupervised machine learning algorithm, Hierarchical Clustering, it can be safely recommended that mall owners should focus heavily on customers with a high Spending Score, as they are the direct influencers of business profitability.
An indirect influence on the profitability of the business could be the segment of customers whose Spending Score is not high but who earn a lot annually. These are potential targets, as they are more likely to spend more if the right facilities and products are provided to them. Without this analysis, this specific customer segment could have been overlooked and a potential profit opportunity lost.
Hence, business owners should concentrate their marketing heavily on these targeted segments of customers. Marketing could involve a wide range of options, such as improving the facilities that attract these customers, collecting more information about their product preferences and updating the inventory accordingly, providing them incentives to spend more, etc.
This is how powerful data analysis can be for businesses. And this is really just the tip of the iceberg: there is so much more that can be done with business data (or any data) to help us take better decisions for a more profitable future.
Where and why NOT to use Hierarchical Clustering?
There are several advantages to using Hierarchical Clustering:
→ Representing the data in a more natural and informative way, showing not only the final clusters but also the intermediate levels of similarity between data points.
→ No requirement to specify the number of clusters in advance, which can be an advantage in situations where the number of clusters is unknown or difficult to determine.
However, Hierarchical Clustering can be more computationally expensive than flat clustering, particularly for large datasets. Here are some reasons why hierarchical clustering may not be the best choice for certain applications:
Computational complexity: Hierarchical clustering can be computationally expensive, particularly for large datasets. As the number of data points grows, the time and memory required to construct the dendrogram can become prohibitive (the standard agglomerative algorithm needs O(n²) memory for the distance matrix and up to O(n³) time), which also leads to scalability issues.
Lack of flexibility: Hierarchical clustering produces a fixed hierarchy, which may not always be the best representation of the data. Moreover, it is difficult to update the hierarchy once it has been constructed, so hierarchical clustering may not be well-suited to streaming or online data analysis.
Hard to interpret for large datasets: It can also be difficult to interpret the results of hierarchical clustering, particularly when the dendrogram is complex and has many levels.
Sensitivity to noise: Hierarchical clustering is sensitive to noise and outliers, which can distort the dendrogram and produce misleading results.
In summary, while hierarchical clustering can be a powerful tool for exploratory data analysis and data mining, it may not always be the best choice for every application, and alternative clustering algorithms may need to be considered.
This is what I believe is the best part of any project that anyone completes. Projects are the best way to learn anything and to enhance one's skills and knowledge in the domain being worked on. From this project I learned quite a lot of things in a small span of time. A few of the skills that I learned and upgraded are mentioned below:
A thorough understanding of clustering, one of the most common unsupervised machine learning techniques used in practice.
Sharpened my rusty Python programming skills.
Data Exploration skills
Content writing
Improved my documentation skills, which is going to be helpful for my next big thing
Markdown formatting
Report Making
Git and Github