Welcome!
I'm Roy Cruz, an IT Support Specialist committed to using data to drive business success. This website showcases my Customer Retention Prediction project. The goal of the project is to predict which customers are at risk of leaving, allowing businesses to take proactive measures to retain them. The project utilizes technologies such as Google Cloud Platform, Apache Airflow, BigQuery, and Data Studio to analyze and visualize customer data, ultimately aiding strategic decision-making.
The motivation behind the Customer Retention Prediction project stems from the significant impact customer churn can have on businesses. Losing customers means losing potential future revenue and increasing the costs associated with acquiring new ones. By leveraging advanced data engineering techniques, such as data extraction, transformation, and loading (ETL) processes, we can effectively predict which customers are likely to leave. This enables businesses to implement targeted retention strategies, maintain their customer base, reduce acquisition costs, and increase overall profitability. The project focuses on creating robust data pipelines, ensuring data quality, and utilizing scalable infrastructure to support predictive analytics.
The primary goal of this project is to develop an accurate predictive model that identifies customers at high risk of churning by leveraging advanced data engineering techniques. Utilizing customer data and predictive analytics, the model seeks to deliver actionable insights that support strategic decision-making. This involves building robust data pipelines, ensuring data quality, and implementing scalable infrastructure for efficient data processing and storage. By doing so, businesses can enhance customer retention and maximize long-term profitability.
The technical stack for this project includes a comprehensive set of tools and technologies to ensure efficient data processing, storage, and visualization. Terraform was used for infrastructure as code, enabling automated and consistent provisioning of Google Cloud resources. Docker and Docker Compose facilitated containerization, ensuring a consistent development environment. Apache Airflow was employed for orchestration, managing ETL workflows and scheduling tasks. Data was stored and queried using BigQuery, a highly scalable data warehouse. For visualization, Google Data Studio was used to create interactive dashboards, allowing stakeholders to gain insights from the data.
The data for this project was sourced from Kaggle, specifically from the CRM + Sales + Opportunities dataset. This dataset includes various types of data, such as customer profiles, customer demographics, and sales opportunities. These data points are crucial for building an effective customer retention prediction model.
Data preprocessing was a critical step in this project. It included data cleaning, feature engineering, and normalization. Python's Pandas library was used for data manipulation, handling missing values, and normalizing features. The raw data was initially stored in Google Cloud Storage, which facilitated efficient data management and retrieval.
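As a rough illustration of how the raw files could be pulled from Cloud Storage before preprocessing, the sketch below uses the google-cloud-storage client; the bucket and object names are placeholders rather than the project's actual configuration.

```python
from google.cloud import storage
import pandas as pd

# Placeholder names -- substitute the project's real bucket and object paths.
BUCKET_NAME = "customer-retention-raw-data"
BLOB_NAME = "raw/accounts.csv"

def download_raw_csv(bucket_name: str, blob_name: str, local_path: str) -> pd.DataFrame:
    """Download a raw CSV from Cloud Storage and load it into a DataFrame."""
    client = storage.Client()  # uses application default credentials
    client.bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    return pd.read_csv(local_path)

accounts_raw = download_raw_csv(BUCKET_NAME, BLOB_NAME, "accounts.csv")
```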
Infrastructure
The following tools and technologies are used:
Cloud: Google Cloud Platform (GCP)
Infrastructure as Code (IaC): Terraform
Containerization: Docker, Docker Compose
Transformation: BigQuery SQL
Batch Processing: Python
Orchestration: Apache Airflow
Data Lake: Google Cloud Storage
Data Warehouse: BigQuery
Data Visualization: Google Data Studio
The ETL process in this project was meticulously designed to ensure data integrity and efficiency. Apache Airflow was employed to orchestrate the entire workflow, from data extraction to transformation and loading. Data extraction involved pulling raw data from Google Cloud Storage. The transformation phase included cleaning, normalizing, and aggregating data using Python. The final step involved loading the transformed data into BigQuery, where it could be efficiently queried and analyzed.
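To give a sense of how such a workflow can be wired together, here is a minimal, hypothetical Airflow DAG sketch; the DAG name, task IDs, and callables are illustrative placeholders rather than the project's actual pipeline code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder callables standing in for the real extract/transform/load logic.
def extract_from_gcs(**context):
    ...  # download raw CSVs from the Cloud Storage data lake

def transform_with_pandas(**context):
    ...  # clean, normalize, and aggregate the raw data

def load_to_bigquery(**context):
    ...  # load the transformed tables into BigQuery

with DAG(
    dag_id="customer_retention_etl",  # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_gcs)
    transform = PythonOperator(task_id="transform", python_callable=transform_with_pandas)
    load = PythonOperator(task_id="load", python_callable=load_to_bigquery)

    extract >> transform >> load
```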
Loading Data:
CSV files containing data for accounts, products, sales pipeline, and sales teams are loaded from the data directory using Pandas.
To ensure consistency, columns in these files are renamed to follow a unified schema.
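A minimal sketch of this loading and renaming step is shown below; the file names and the column mapping are illustrative assumptions, not the project's exact schema.

```python
import pandas as pd

# Assumed file names in the local data directory.
accounts = pd.read_csv("data/accounts.csv")
products = pd.read_csv("data/products.csv")
sales_pipeline = pd.read_csv("data/sales_pipeline.csv")
sales_teams = pd.read_csv("data/sales_teams.csv")

# Rename columns to a unified schema; the mapping here is an example only.
accounts = accounts.rename(columns={
    "account": "AccountName",
    "year_established": "YearEstablished",
    "revenue": "Revenue",
    "employees": "Employees",
})
```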
Handling Missing Values:
Missing values in the accounts data are replaced with zeros using the fillna method to maintain data integrity.
Data Type Conversion:
The YearEstablished column is converted to an integer type to represent the year correctly.
The Revenue column is converted to a float type to accurately reflect monetary values.
The Employees column is also converted to an integer type to represent the number of employees.
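A compact sketch of the missing-value handling and type conversions described above, assuming the unified column names:

```python
# Replace missing values in the accounts data with zeros.
accounts = accounts.fillna(0)

# Cast columns to appropriate types.
accounts["YearEstablished"] = accounts["YearEstablished"].astype(int)
accounts["Revenue"] = accounts["Revenue"].astype(float)
accounts["Employees"] = accounts["Employees"].astype(int)
```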
Feature Engineering:
A new column called Age is created to capture how many years have passed since the company was established, calculated by subtracting YearEstablished from the current year (2024).
The distribution of the Revenue column is printed to get a better understanding of its range and characteristics.
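The feature-engineering step might look roughly like this, with 2024 hard-coded as the reference year as described above:

```python
CURRENT_YEAR = 2024

# Company age in years since establishment.
accounts["Age"] = CURRENT_YEAR - accounts["YearEstablished"]

# Inspect the distribution of Revenue (count, mean, std, min, quartiles, max).
print(accounts["Revenue"].describe())
```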
Training Data Preparation:
Features such as Revenue, Employees, and Age are selected for training the machine learning model.
The target variable, Churn, is defined based on whether the Revenue is greater than 1000. Companies with revenue below this threshold are considered at risk of churn.
The data is split into training and testing sets. A RandomForestClassifier is then trained on this data and evaluated to predict customer churn accurately.
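Putting the training preparation together, a minimal scikit-learn sketch is shown below. The test split proportion and random seed are assumptions, and because the Churn label is derived directly from Revenue, which is also used as a feature, very strong scores are to be expected.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Churn flag based on the revenue threshold described above.
accounts["Churn"] = accounts["Revenue"] > 1000

# Features and target for the model.
X = accounts[["Revenue", "Employees", "Age"]]
y = accounts["Churn"]

# Assumed 80/20 split and seed; the project's actual settings may differ.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
```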
Machine Learning Model:
Used a RandomForestClassifier for predicting customer churn.
Chose this model due to its ability to handle a large number of features and its robustness against overfitting.
Preprocessing Steps:
Filled missing values in the accounts data with zeros.
Converted columns to appropriate data types, such as integers for YearEstablished and floats for Revenue.
Created a new feature Age by calculating the difference between 2024 and the year the account was established.
Feature Engineering:
Selected features Revenue, Employees, and Age for the model.
Defined the target variable Churn based on whether the Revenue is greater than 1000.
Model Selection Process:
Split the data into training and testing sets using train_test_split.
Trained the RandomForestClassifier on the training data.
Evaluated the model using accuracy, ROC-AUC score, and classification report metrics.
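Continuing the sketch above, the evaluation step could look like the following, reusing the model and split from the earlier snippet:

```python
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("Accuracy:", accuracy_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print(classification_report(y_test, y_pred))
```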
Evaluation Metrics:
Accuracy Score: 0.94
ROC-AUC Score: 1.00 (indicating excellent performance)
Detailed classification report showing precision, recall, and F1-score for each class.
On the held-out test set, the predictive model achieved an accuracy of 94% and an ROC-AUC of 1.00. These metrics indicate the model's effectiveness in identifying potential churners.
Of the 85 observations, 49 are classified as True and 36 as False.
Accuracy: The proportion of correctly predicted instances out of the total instances. Here, the accuracy is 0.94, meaning the model correctly predicted 94% of the instances.
ROC-AUC: The area under the Receiver Operating Characteristic curve, which measures the model's ability to distinguish between classes. An ROC-AUC of 1.00 means the classes are perfectly separated on the test set, which is expected here because the churn label is derived directly from Revenue, one of the model's input features.
precision: The ratio of true positive predictions to the total number of positive predictions. Precision is high (0.88 for False, 1.00 for True).
recall: The ratio of true positive predictions to the total number of actual positive instances. Recall is high (1.00 for False, 0.90 for True).
f1-score: The harmonic mean of precision and recall, providing a single metric to balance both concerns. High f1-scores (0.93 for False, 0.95 for True) indicate good performance.
support: The number of actual occurrences of each class in the test set (7 for False, 10 for True).
accuracy: Reiterates the accuracy of 0.94.
macro avg: Average of the precision, recall, and f1-score metrics for both classes.
weighted avg: Weighted average of the precision, recall, and f1-score metrics, taking into account the number of instances for each class.
This section provides summary statistics for the Revenue variable in the dataset:
count: Number of observations (85).
mean: Average revenue (1994.63).
std: Standard deviation, which measures the amount of variation or dispersion of the revenue values (2169.49).
min: Minimum revenue value (4.54).
25%: First quartile, indicating that 25% of the data points have revenue less than or equal to 497.11.
50%: Median or second quartile, indicating that 50% of the data points have revenue less than or equal to 1223.72.
75%: Third quartile, indicating that 75% of the data points have revenue less than or equal to 2741.37.
max: Maximum revenue value (11698.03).
The Customer Retention Prediction project utilized a robust technical stack to ensure efficient data processing, storage, and visualization. Terraform was implemented for infrastructure as code, automating and standardizing the provisioning of Google Cloud resources. Containerization was managed using Docker and Docker Compose, providing a consistent and reliable development environment. Apache Airflow orchestrated the ETL workflows and scheduled tasks, ensuring seamless data integration. Data storage and querying were handled by BigQuery, leveraging its scalability and performance. Finally, interactive dashboards were created using Google Data Studio, enabling stakeholders to derive actionable insights from the data. This comprehensive approach not only improved customer retention strategies but also showcased the power of modern data engineering techniques.
Key Finding: The analysis revealed that companies in certain sectors have higher churn rates.
Insight: Sectors such as retail and medical show higher churn compared to others.
Recommendation: Develop sector-specific customer engagement plans to address the unique challenges faced by high-churn sectors.
Future Improvement: Implement real-time data processing and prediction capabilities to provide instant insights and recommendations.
Next Step: Utilize streaming data technologies like Apache Kafka or Google Pub/Sub to enable real-time data ingestion and prediction, allowing for immediate customer retention actions.
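As a purely illustrative sketch of what that next step could look like with Google Pub/Sub, the snippet below publishes a customer event for downstream scoring; the project and topic names are placeholders, not an existing implementation.

```python
import json

from google.cloud import pubsub_v1

# Placeholder identifiers for illustration only.
PROJECT_ID = "my-gcp-project"
TOPIC_ID = "customer-events"

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

# Publish a customer event; a subscriber could score it for churn risk in near real time.
event = {"account": "Acme Corp", "revenue": 950.0, "employees": 42}
future = publisher.publish(topic_path, data=json.dumps(event).encode("utf-8"))
print("Published message ID:", future.result())
```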
Several external resources played a crucial role in the successful completion of this project. The Kaggle CRM + Sales + Opportunities dataset served as the project's primary data source. Google Cloud Platform documentation was invaluable for understanding and implementing various GCP services like BigQuery, Cloud Storage, and IAM roles. Apache Airflow’s official documentation provided detailed insights into orchestrating complex ETL workflows. The Terraform documentation was essential for writing infrastructure-as-code scripts to automate resource provisioning.