Predicting Customer Churn with Logistic Regression, Random Forest and XGBoost
Motivation: Having worked on the churn problem at a fintech company, I gained insights that led to a measurable improvement in retention rates. However, despite our efforts to prevent churn through various incentive schemes, I found that predicting churn remains a complex challenge: external factors beyond our control often contribute to users' decisions to stop using the product. This motivated me to deepen my understanding of predictive models and advanced analytics, complementing the traditional data analysis techniques I had been relying on.
In this project, I’ll go through the process of building customer churn prediction models and share insights from feature importance analysis.
You can find the full code on GitHub here.
About the Dataset
For this project, I used the Online Retail Customer Churn Dataset, designed to reflect online retail behavior. Some key features include:
Customer Demographics: Age, gender, and annual income.
Behavioral Metrics: Total spending, number of purchases, and returns.
Engagement: Support contacts and satisfaction score.
Promotion Response: The customer's response to the last promotional campaign (Responded, Ignored, Unsubscribed).
The target variable (Target_Churn) indicates whether a customer has stopped purchasing from the store over a given period.
Initial Exploration and Feature Engineering
Before diving into model building, the first step was to explore the data: visualizing the distribution of churn and identifying correlations between different features. For instance, I observed that customers who made fewer purchases were more likely to churn.
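A first look at the churn distribution and at purchase behavior by churn status can be sketched as follows. The column names here are hypothetical stand-ins for the dataset's actual fields:

```python
import pandas as pd

# Toy data with hypothetical column names; the real dataset has more rows and features.
df = pd.DataFrame({
    "Total_Purchases": [3, 25, 2, 18, 1, 30],
    "Target_Churn": [True, False, True, False, True, False],
})

# Share of churned vs. retained customers
print(df["Target_Churn"].value_counts(normalize=True))

# Average number of purchases for churned vs. retained customers
print(df.groupby("Target_Churn")["Total_Purchases"].mean())
```

In this toy sample, churned customers average far fewer purchases than retained ones, mirroring the pattern observed in the real data.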
The key feature engineering steps I took included:
Encoding categorical variables: Gender and promotion response were converted into numerical form using one-hot encoding.
Scaling continuous variables: Features like annual income and total spend were standardized to ensure equal treatment by the models.
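The two preprocessing steps above can be sketched with pandas and scikit-learn. The column names are my assumptions for illustration, not necessarily the dataset's exact field names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical columns standing in for the real dataset.
df = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Promotion_Response": ["Responded", "Ignored", "Unsubscribed"],
    "Annual_Income": [45000.0, 82000.0, 61000.0],
    "Total_Spend": [1200.0, 3400.0, 800.0],
})

# One-hot encode the categorical variables
df = pd.get_dummies(df, columns=["Gender", "Promotion_Response"])

# Standardize continuous variables to zero mean and unit variance
scaler = StandardScaler()
df[["Annual_Income", "Total_Spend"]] = scaler.fit_transform(
    df[["Annual_Income", "Total_Spend"]]
)
```

After this, every feature is numeric and the continuous columns are on a comparable scale, which matters for the distance- and coefficient-based behavior of logistic regression.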
Model Selection
I used three machine learning models for this task:
Logistic Regression: It is often the go-to model for binary classification problems like churn prediction. Because of its simplicity, logistic regression provides a solid baseline for model comparison.
Random Forest: It builds multiple decision trees and combines their predictions. It generally outperforms logistic regression due to its ability to capture complex interactions between features.
XGBoost: A gradient-boosting method that sequentially builds trees, each one correcting the errors of the previous iterations.
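Training the three models follows the same fit/predict pattern in scikit-learn. This is a minimal sketch on synthetic data (the real project fits the preprocessed churn features); XGBoost ships as a separate package, so it is imported defensively here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the preprocessed churn dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}
try:
    # XGBoost is a separate dependency (pip install xgboost)
    from xgboost import XGBClassifier
    models["XGBoost"] = XGBClassifier(eval_metric="logloss", random_state=42)
except ImportError:
    pass

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: trained")
```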
Hyperparameter Tuning
I used randomized search to tune hyperparameters for both Random Forest and XGBoost. For Random Forest, the tuned parameters were the number of trees, the maximum tree depth, and the minimum samples required to split a node; for XGBoost, the learning rate, maximum depth, and subsampling ratio.
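For the Random Forest, the search can be sketched with scikit-learn's `RandomizedSearchCV`. The parameter ranges and the small `n_iter` are illustrative assumptions, not the values used in the actual run:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data for illustration
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# The same three hyperparameters mentioned above, with assumed ranges
param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 15),
    "min_samples_split": randint(2, 11),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,   # kept small for illustration; use more iterations in practice
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

The XGBoost search follows the same pattern, swapping in `learning_rate`, `max_depth`, and `subsample` distributions.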
Evaluating Model Performance
Evaluation metrics such as accuracy, precision, recall, and F1-score were used to assess the models.
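All of these metrics come straight out of scikit-learn. A minimal sketch on toy labels (1 = churn, 0 = no churn):

```python
from sklearn.metrics import accuracy_score, classification_report

# Toy true and predicted labels for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")
print(classification_report(y_true, y_pred, target_names=["No Churn", "Churn"]))
```

The classification report lists precision, recall, and F1-score per class, which is exactly the per-class breakdown quoted in the results below.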
Here’s a quick summary of the results:
Logistic Regression had the lowest accuracy at 47%, precision of 41% for non-churn, and 52% for churn, indicating struggles in correctly identifying both classes.
Random Forest achieved an accuracy of 51% with better performance for churn detection, reflected in a recall of 62% for churn, and a precision of 54% for churn, making it more effective in identifying churned customers.
XGBoost yielded a lower accuracy of 48%. Its recall for churn was 57%, and precision was 53%, showing slight improvement over Logistic Regression but lagging behind Random Forest.
XGBoost’s lower performance may stem from insufficient tuning, as it requires more fine-tuning of hyperparameters than Random Forest to achieve optimal results. Additionally, XGBoost's boosting nature might have overemphasized hard-to-classify instances, leading to overfitting in this scenario.
Quick recap of the key evaluation metrics:
1. Confusion Matrix: a table comparing the predicted labels to the actual labels. It has four key components:
True Positive (TP): Correctly predicted positive cases (correctly predicted churned customers).
True Negative (TN): Correctly predicted negative cases (correctly predicted non-churned customers).
False Positive (FP): Incorrectly predicted positive cases (non-churned customers predicted as churned).
False Negative (FN): Incorrectly predicted negative cases (churned customers predicted as non-churned).
2. Classification Report:
Precision: TP / (TP + FP) – How many predicted positive cases were actually positive?
Recall: TP / (TP + FN) – How many actual positive cases were correctly identified?
F1-Score: 2 * (Precision * Recall) / (Precision + Recall) – The harmonic mean of precision and recall, balancing the two metrics.
3. Accuracy Score: the ratio of correctly predicted instances to the total number of instances:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
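Plugging hypothetical confusion-matrix counts into the formulas above makes the arithmetic concrete:

```python
# Hypothetical confusion-matrix counts for illustration
tp, tn, fp, fn = 40, 30, 20, 10

precision = tp / (tp + fp)                    # 40 / 60
recall = tp / (tp + fn)                       # 40 / 50
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 70 / 100

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how precision and recall can diverge: this model catches 80% of churners (recall) but only two-thirds of its churn flags are correct (precision).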
Insights from Feature Importance Analysis
By examining which features contributed most to the models’ decisions, we could uncover valuable insights for business stakeholders.
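Extracting and ranking feature importances from a fitted tree ensemble can be sketched as follows; the synthetic data and generic feature names are placeholders for the churn dataset:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the churn features
X, y = make_classification(n_samples=300, n_features=8, random_state=1)
feature_names = [f"feature_{i}" for i in range(8)]

rf = RandomForestClassifier(random_state=1).fit(X, y)

# feature_importances_ sums to 1.0 across all features
importances = pd.Series(rf.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
print(f"Top 5 share of total importance: {top5.sum():.2%}")
```

The "top 5 share" figure computed this way is how the 58.42% (Random Forest) and 29.96% (XGBoost) numbers below can be read: how concentrated each model's decisions are in a handful of features.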
In the Random Forest model, the top 5 features accounted for 58.42% of the total importance. These features provide valuable insight into customer behavior:
Average Transaction Amount and Total Spend were the top drivers, indicating that higher spending customers are less likely to churn.
Annual Income and Age also played significant roles, suggesting that customer demographics are linked to churn risk.
Customer Loyalty reflects the importance of a long-term relationship in reducing churn.
In contrast, the XGBoost model’s top 5 features explained only 29.96% of the total importance, showing that it relied on a more distributed set of factors:
Recency was the top predictor, with customers who had purchased recently being less likely to churn.
Gender emerged as a key demographic factor.
Average Transaction Amount was important here as well but less dominant than in Random Forest.
Number of Support Contacts indicates that frequent interactions with support might signal dissatisfaction, leading to churn.
The differences in feature importance suggest that Random Forest focused more on spending behavior, while XGBoost relied more on customer recency and demographic factors.