Customer Segmentation - Clustering
This project applied K-Means clustering to segment customers using real-world marketing data (2,240 records). After data cleaning, outlier detection, and feature engineering (e.g., total spend, tenure, family size), the dataset was standardized and key features were explored. The optimal number of clusters (k=5) was determined using the Elbow Method. Cluster profiling revealed distinct customer groups based on age, income, and spending behavior. Insights supported targeted marketing strategies, enabling better resource allocation and customer engagement.
Dataset Source: Kaggle - https://www.kaggle.com/code/karnikakapoor/customer-segmentation-clustering/input
Project Workflow:
Data Wrangling
Exploratory Data Analysis (EDA)
Feature Engineering
Clustering with K-Means
Segment Profiling
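The feature engineering step (total spend, tenure, family size) could be sketched roughly as below. This is a minimal illustration, not the project's actual code: the column names (Dt_Customer, Marital_Status, Kidhome, Teenhome, and the Mnt* spend columns), the reference date, and the rule for counting adults are all assumptions, and the demo rows are made up.

```python
import pandas as pd

def add_features(df: pd.DataFrame, reference_date: str = "2015-01-01") -> pd.DataFrame:
    """Return a copy of df with Total_Spend, Tenure_Days and Family_Size added.

    Column names and the reference date are assumptions made for this sketch.
    """
    out = df.copy()
    # Total spend: sum of every per-category spend column (assumed "Mnt" prefix).
    spend_cols = [c for c in out.columns if c.startswith("Mnt")]
    out["Total_Spend"] = out[spend_cols].sum(axis=1)
    # Tenure: days between enrollment and a fixed, assumed reference date.
    enrolled = pd.to_datetime(out["Dt_Customer"], format="%Y-%m-%d")
    out["Tenure_Days"] = (pd.Timestamp(reference_date) - enrolled).dt.days
    # Family size: 1 or 2 adults (assumed from marital status) plus children.
    adults = out["Marital_Status"].isin(["Married", "Together"]).astype(int) + 1
    out["Family_Size"] = adults + out["Kidhome"] + out["Teenhome"]
    return out

# Made-up rows, only to show the function running.
demo = pd.DataFrame({
    "Dt_Customer": ["2013-01-01", "2014-06-15"],
    "Marital_Status": ["Married", "Single"],
    "Kidhome": [1, 0],
    "Teenhome": [0, 0],
    "MntWines": [100, 20],
    "MntMeatProducts": [50, 5],
})
features = add_features(demo)
print(features[["Total_Spend", "Tenure_Days", "Family_Size"]])
```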
Outlier Detection and Data Cleaning
Visualized boxplots for detecting extreme values.
Significant outliers in features such as Income and Age could distort clustering if left untreated.
Total Spend is heavily impacted by a few high-spending customers, possibly VIP clients or bulk buyers.
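One common way to turn what a boxplot shows into a filter is the 1.5 x IQR rule, which matches the default boxplot whiskers. The sketch below assumes that rule; the project's actual treatment of outliers is not stated, and the income values are made up.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR outside the first or third quartile."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up incomes with one extreme value.
income = pd.Series([30_000, 42_000, 51_000, 58_000, 64_000, 600_000])
mask = iqr_outlier_mask(income)
cleaned = income[~mask]  # keep only rows inside the whiskers
print(f"flagged {int(mask.sum())} of {len(income)} rows")
```

Whether flagged rows should be dropped, capped, or kept (for example genuine VIP customers) is a judgment call that depends on the business question.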
Heatmap of Numerical Features
Income and Total Spend show a strong positive correlation, confirming that higher-income customers tend to spend more.
Product categories such as Wines, Meats, and Gold also correlate positively with Total Spend, indicating that these are preferred items among high spenders.
Age and Recency have little to no correlation with spending, implying that younger or older age alone is not a strong predictor of customer value.
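The numbers behind a heatmap like this are a pairwise correlation matrix. A minimal sketch with pandas is shown below; the column names and values are invented for illustration, and Pearson correlation is an assumption since the method used is not stated.

```python
import pandas as pd

# Made-up rows: spend rises with income, age is unrelated.
demo = pd.DataFrame({
    "Income": [20_000, 35_000, 50_000, 65_000, 80_000, 95_000],
    "Total_Spend": [60, 150, 420, 800, 1_300, 1_900],
    "Age": [45, 30, 60, 38, 52, 41],
})

# Pairwise Pearson correlations between all numeric columns.
corr = demo.corr(method="pearson")

# Rank the other features by their correlation with total spend.
spend_corr = corr["Total_Spend"].drop("Total_Spend").sort_values(ascending=False)
print(spend_corr.round(2))
```

The resulting matrix can be passed to a plotting function such as seaborn's heatmap to produce the figure.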
Feature Relationship Analysis
A positive correlation between income and total spending indicates that income is a strong predictor of purchasing behavior.
Age and Recency show weak or no correlation, suggesting that how recently a customer purchased is not strongly age-dependent.
Parental status does not distinctly separate customer behaviors, which hints that other factors (e.g., income, lifestyle) are more influential.
Elbow Method
The distortion score (within-cluster sum of squares) was plotted against the number of clusters (k) to determine the optimal k-value.
k = 5 was chosen because the curve starts to flatten at that point, indicating diminishing returns from additional clusters.
Adding more clusters beyond k = 5 yields only minimal reductions in the distortion score and risks over-segmenting the data into groups that are hard to interpret or act on.
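The elbow curve can be reproduced in outline with scikit-learn, where the fitted model's inertia_ attribute is the within-cluster sum of squares. The sketch below is an assumption about tooling, not the project's code, and it uses synthetic data with three obvious groups, so its elbow falls at k = 3 rather than the k = 5 found on the real data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer features: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [6, 6], [0, 6]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# K-Means is distance-based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)

# Fit K-Means for each candidate k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(k, round(value, 2))
```

Plotting inertias against k and looking for the point where the drop between consecutive values becomes small gives the elbow.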
Age vs. Total Spend (Density Plot) & Income vs. Total Spend (Scatter + Violin Plots)
Cluster 3: High-income, high-spending individuals spread across a wide age range. Likely represents loyal or premium customers who are ideal for exclusive offers and retention campaigns.
Clusters 0 & 4: Younger, low-income, and low-spending customers. May require budget-friendly promotions, onboarding campaigns, or brand awareness strategies.
Cluster 1: Mid-income, moderate spenders with stable behavior. Good candidates for upselling or loyalty-building initiatives.
Cluster 2: Mixed age group with high variability in spending despite similar income levels, which suggests diverse preferences and untapped potential.
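Profiles like the ones above are typically read off per-cluster summary statistics. A minimal sketch, assuming a Cluster label column has already been attached to the data; the column names and all values here are hypothetical.

```python
import pandas as pd

# Made-up customers with an already-assigned cluster label.
demo = pd.DataFrame({
    "Cluster": [0, 0, 1, 1, 3, 3],
    "Age": [28, 32, 45, 47, 39, 61],
    "Income": [22_000, 26_000, 48_000, 52_000, 85_000, 91_000],
    "Total_Spend": [40, 60, 500, 540, 1_700, 1_900],
})

# One row per cluster: size plus the mean of each profiling feature.
profile = demo.groupby("Cluster").agg(
    customers=("Age", "size"),
    mean_age=("Age", "mean"),
    mean_income=("Income", "mean"),
    mean_spend=("Total_Spend", "mean"),
)
print(profile)
```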
Brazilian E-Commerce
Analyzed over 100,000 orders from a real-world e-commerce dataset to uncover trends in customer behavior, payment preferences, product reviews, and regional demand.
Key Highlights:
Efficient Delivery: 97% of orders delivered; 89% delivered early.
Payment Trends: 74.5% of customers used credit cards; 19.4% used boleto.
Product Reviews: High engagement in bed & bath and health & beauty with both top (5) and low (1) ratings, indicating improvement opportunities.
Top Categories: Bed & bath and sports & leisure emerged as best-sellers.
Regional Insights: Interactive map reveals key purchase locations across Brazil, guiding location-specific strategies.
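Shares like the delivery and payment figures above can be computed along the following lines. This is only a sketch: the column names are assumptions about the dataset's schema, "early" is assumed to mean delivered before the estimated delivery date, and the rows are made up, so the printed shares do not match the reported ones.

```python
import pandas as pd

# Made-up orders; column names are assumed for this sketch.
orders = pd.DataFrame({
    "order_status": ["delivered", "delivered", "delivered", "canceled"],
    "order_delivered_customer_date": ["2018-01-10", "2018-02-20", "2018-03-05", None],
    "order_estimated_delivery_date": ["2018-01-15", "2018-02-18", "2018-03-09", "2018-04-01"],
})
for col in ["order_delivered_customer_date", "order_estimated_delivery_date"]:
    orders[col] = pd.to_datetime(orders[col])

# Share of all orders that were delivered.
delivered = orders[orders["order_status"] == "delivered"]
delivered_share = len(delivered) / len(orders)

# Share of delivered orders that arrived before the estimated date.
early_share = (
    delivered["order_delivered_customer_date"]
    < delivered["order_estimated_delivery_date"]
).mean()

# Share of payments by type (made-up values).
payments = pd.Series(["credit_card", "credit_card", "boleto", "voucher"])
payment_share = payments.value_counts(normalize=True)

print(f"delivered: {delivered_share:.0%}, early: {early_share:.0%}")
print(payment_share)
```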
Dataset Source: Kaggle - https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce/data
Gmail: nguyensyhien011201@gmail.com
Mobile: 07851771075