This project collected and explored data on customer behaviour and retention patterns in order to improve Customer Lifetime Value (CLTV) prediction. By combining a customer churn dataset with demographic data and unemployment rates, we laid the foundation for strategic decisions aimed at long-term business success.
Starting from a telecommunications customer dataset hosted in a GitHub repository, we added demographic data from the US Census Bureau and unemployment rates for California ZIP codes. Combining these sources allowed us to capture both consumer behaviour and the market conditions around it.
Dataset Description
Our dataset comprises a wide array of attributes, including customer demographics, satisfaction scores, churn indicators, and CLTV indices. This enabled a deep dive into factors influencing customer retention and turnover patterns, providing a solid base for predictive modelling.
Data Preprocessing
We used the United States Census Bureau's API to gather the demographic data for our research. Using Python's requests package, we retrieved population statistics, focusing primarily on Californian cities. Each API query specified the requested fields and the geographic boundaries, which returned a population estimate for each city and gave us a view of the potential customer base in each region.
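The query described above might look roughly like the sketch below. The report does not say which Census dataset or vintage was queried, so the endpoint (the 2020 Decennial Census redistricting file), the variable P1_001N (total population) and the helper names are assumptions chosen for illustration. The Census API returns a JSON array whose first row is the header, which is why a small conversion step is included.

```python
from __future__ import annotations

from typing import Any

# Assumed endpoint and variable: the report does not name the dataset it used.
CENSUS_URL = "https://api.census.gov/data/2020/dec/pl"
CALIFORNIA_FIPS = "06"


def build_params(api_key: str | None = None) -> dict[str, str]:
    """Query parameters: the requested fields plus the geographic filter."""
    params = {
        "get": "NAME,P1_001N",             # place name and total population
        "for": "place:*",                  # every city/place ...
        "in": f"state:{CALIFORNIA_FIPS}",  # ... within California
    }
    if api_key:
        params["key"] = api_key
    return params


def rows_to_records(payload: list[list[str]]) -> list[dict[str, Any]]:
    """Turn the header-plus-rows JSON array into a list of dictionaries."""
    header, *rows = payload
    records: list[dict[str, Any]] = [dict(zip(header, row)) for row in rows]
    for record in records:
        record["P1_001N"] = int(record["P1_001N"])  # counts arrive as strings
    return records


def fetch_city_populations(api_key: str | None = None) -> list[dict[str, Any]]:
    """Call the API and return one record per Californian place."""
    import requests  # imported here so the helpers above also work offline

    response = requests.get(CENSUS_URL, params=build_params(api_key), timeout=30)
    response.raise_for_status()
    return rows_to_records(response.json())
```

Calling fetch_city_populations() needs network access; the two helper functions can be checked without it.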
We used web scraping to extract unemployment rates for California ZIP codes from ZipAtlas. The data obtained from the API and from web scraping was then merged with the telecommunications customer churn dataset from the GitHub repository. The merged data looks as follows:
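As a minimal sketch of the merge step, assuming the churn table carries a city and a ZIP code for each customer: the column names and values below are hypothetical stand-ins, not the real data, and the report does not state which join keys or join type were used.

```python
import pandas as pd

# Toy stand-ins for the three sources; all names and values are made up.
churn = pd.DataFrame({
    "Customer ID": ["A1", "B2", "C3"],
    "City": ["Los Angeles", "San Diego", "Fresno"],
    "Zip Code": ["90001", "92101", "93650"],
})
population = pd.DataFrame({
    "City": ["Los Angeles", "San Diego"],
    "Population": [1000, 500],
})
unemployment = pd.DataFrame({
    "Zip Code": ["90001", "92101", "93650"],
    "Unemployment Rate": [7.5, 5.0, 6.2],
})

# Left joins keep every customer even when a lookup table has no match;
# validate= raises if a lookup table unexpectedly contains duplicate keys.
merged = (
    churn
    .merge(population, on="City", how="left", validate="many_to_one")
    .merge(unemployment, on="Zip Code", how="left", validate="many_to_one")
)
```

Keeping ZIP codes as strings avoids losing leading zeros and prevents type mismatches between the join keys.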
In the data cleaning stage, we made sure the dataset was complete and consistent before further analysis. First, we removed rows containing null entries so that no missing values remained. Next, we removed columns that had no bearing on our study, such as Index, paperless billing, nation, state, Under 30, senior citizen, and latlong. Finally, we identified outlier values and removed them so that they would not skew the results. The data after cleaning looks as follows:
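The three cleaning steps can be sketched as follows. The report does not say how outliers were identified, so the 1.5 x IQR rule used here is an assumption, and the function name, column names and values are illustrative only.

```python
from __future__ import annotations

import pandas as pd


def clean(df: pd.DataFrame, drop_cols: list[str], numeric_cols: list[str],
          k: float = 1.5) -> pd.DataFrame:
    # 1. Drop rows with any missing value.
    out = df.dropna()
    # 2. Drop columns not needed for the analysis (absent ones are ignored).
    out = out.drop(columns=drop_cols, errors="ignore")
    # 3. Assumed outlier rule: keep rows inside [Q1 - k*IQR, Q3 + k*IQR]
    #    for every listed numeric column.
    q1 = out[numeric_cols].quantile(0.25)
    q3 = out[numeric_cols].quantile(0.75)
    iqr = q3 - q1
    within = out[numeric_cols].ge(q1 - k * iqr) & out[numeric_cols].le(q3 + k * iqr)
    return out[within.all(axis=1)].reset_index(drop=True)


# Toy data: one row has a missing value, one has an extreme charge.
raw = pd.DataFrame({
    "Index": [0, 1, 2, 3, 4, 5],
    "Monthly Charge": [50.0, 55.0, 60.0, 65.0, None, 5000.0],
    "Tenure in Months": [10, 12, 14, 16, 18, 20],
})
cleaned = clean(raw, drop_cols=["Index"],
                numeric_cols=["Monthly Charge", "Tenure in Months"])
```

On the toy frame this leaves four rows: the row with the missing charge and the row with the extreme charge are both dropped.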
Histogram of Customer Status
The histogram shows the distribution of "Customer Status" in the dataset. Two bins represent the statuses Joined and Stayed. The majority of customers fall under Stayed, meaning they are existing customers. A smaller subset is marked as Joined, signifying that they are new customers.
Total and Monthly Charge Distribution
These histograms show the distribution of the total and monthly charges billed to customers in the dataset. The blue histogram shows total charges, which for the majority of customers fall between the low and high ends of the range. The green histogram shows monthly charges: more customers are charged higher monthly amounts, while fewer are charged lower amounts. Together, the two histograms show how charges are distributed across the customer base.
Different Charges in Total Revenue
The pie chart shows how Total Revenue is divided between two components: Total Charges, which constitute 54.5% of Total Revenue, and Total Long Distance Charges, which constitute 45.5%. Total Extra Data Charges account for 0% and therefore do not contribute to Total Revenue. The chart makes the proportionate contribution of each charge type clear: revenue comes entirely from Total Charges and Total Long Distance Charges.
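The percentages in such a pie chart are each component's column sum divided by the sum of all components. A minimal sketch, assuming column names that mirror the chart labels and using made-up values rather than the real data:

```python
import pandas as pd

# Toy values only; they are not meant to reproduce the 54.5% / 45.5% split.
charges = pd.DataFrame({
    "Total Charges": [600.0, 400.0],
    "Total Long Distance Charges": [300.0, 200.0],
    "Total Extra Data Charges": [0.0, 0.0],
})

totals = charges.sum()                            # one total per charge type
shares = (totals / totals.sum() * 100).round(1)   # percentage of combined revenue
```

A component whose share is 0, like the extra data charges here, contributes nothing to the total and would appear as an empty slice.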
Boxplots of Relevant Numerical Columns
Customer tenure spans a wide range, reflecting large differences in how long customers have been with the business. Monthly charges have a broad interquartile range, showing that customers pay a wide spectrum of monthly rates. Total charges cover a wide range and are skewed towards higher values. Average monthly long-distance fees are generally modest, with occasional exceptions. Most refunds fall within a narrow range, while total extra data fees are consistently distributed. Total long-distance charges show a slightly broader range, with outliers suggesting that some customers incur significantly higher fees. Customer satisfaction scores span a wide range, indicating varying levels of satisfaction. Churn value appears to be a binary or otherwise small-range numeric column indicating churn status.
In the scatter plot of Monthly Charges against Total Charges, data points are concentrated at lower total charges and spread out as total charges rise, indicating a wider range of monthly charges as customers' overall spending increases. The top marginal histogram shows the distribution of Total Charges, which is right-skewed, with most customers at the lower end. The marginal histogram on the right shows the distribution of Monthly Charges, which is comparatively level with a small increase toward the higher rates. The plot helps to reveal patterns in the billing data and shows how monthly charges relate to total charges.
Age vs Revenue
Because there are too many data points to show individually in a scatter plot, this hexbin plot shows their density instead. The plot is divided into hexagonal bins, and the colour of each bin indicates the number of data points within it, as given by the colour bar on the right. Darker hues represent greater concentrations of data points. Going by the section title, the plot presumably relates customer age (x-axis) to revenue (y-axis), so darker areas mark the age and revenue ranges in which customers are most concentrated.
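A hexbin plot of this kind can be drawn with matplotlib's hexbin method. The sketch below uses synthetic data; the axis assignment, grid size, colour map and function name are assumptions, since the report does not give the plotting code.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_hexbin(age, revenue, gridsize=25):
    """Draw a density plot in which darker hexagons hold more customers."""
    fig, ax = plt.subplots(figsize=(6, 4))
    # The "Blues" colour map runs from light (few points) to dark (many points).
    hb = ax.hexbin(age, revenue, gridsize=gridsize, cmap="Blues")
    ax.set_xlabel("Age")
    ax.set_ylabel("Total Revenue")
    fig.colorbar(hb, ax=ax, label="Customers per bin")
    return fig, hb


# Synthetic stand-in data, not the real customers.
rng = np.random.default_rng(0)
age = rng.integers(19, 81, size=2000)
revenue = rng.gamma(shape=2.0, scale=1500.0, size=2000)

fig, hb = plot_hexbin(age, revenue)
```

The figure can then be written out with fig.savefig("age_vs_revenue.png"), and hb.get_array() returns the per-hexagon counts behind the colours.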
Distribution of Total Revenue by Contract Type
The box plot illustrates how Total Revenue varies by contract type. One-year contracts are less variable and have a median revenue in the middle range. Two-year contracts have a higher median revenue, a wider range and numerous outliers, suggesting some very high-revenue customers. Month-to-month agreements have the smallest range and the lowest median revenue of the three, indicating more consistent but generally lower total revenue.
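The quantities a box plot draws (median, quartiles and interquartile range) can also be tabulated per contract type, which makes the comparison above easy to check numerically. This is a sketch on made-up values with assumed column names, not the report's own code or results.

```python
import pandas as pd

# Toy data: three customers per contract type.
contracts = pd.DataFrame({
    "Contract": ["Month-to-Month"] * 3 + ["One Year"] * 3 + ["Two Year"] * 3,
    "Total Revenue": [100.0, 200.0, 300.0,
                      1000.0, 1500.0, 2000.0,
                      2000.0, 3000.0, 7000.0],
})

# Quartiles per group; unstack() turns the quantile level into columns.
quartiles = (
    contracts.groupby("Contract")["Total Revenue"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]  # height of each box
```

A larger iqr corresponds to a taller box, and a higher median to a higher centre line.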
Survey Responses by Internet Service Status
This stacked bar chart compares survey responses between customers with and without internet service, grouped by satisfaction score. Each bar represents one group (No or Yes to Internet Service), and the segments inside each bar show the percentage of respondents with a particular satisfaction score. The shades of blue run from the lowest to the highest satisfaction score. Both groups show a broad range of scores, indicating similar score distributions. However, the legend is not labelled well enough to tell which colour corresponds to which score. Even so, the chart illustrates how satisfaction ratings vary with internet service status.
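The percentages behind a stacked bar chart like this one can be computed with a row-normalised cross-tabulation. The sketch below uses made-up responses and assumed column names.

```python
import pandas as pd

# Toy survey responses, not the real data.
survey = pd.DataFrame({
    "Internet Service": ["Yes", "Yes", "Yes", "Yes", "No", "No"],
    "Satisfaction Score": [1, 3, 3, 5, 3, 5],
})

# normalize="index" makes each row (each service group) sum to 1,
# so the values are the segment heights of one stacked bar.
score_shares = pd.crosstab(
    survey["Internet Service"],
    survey["Satisfaction Score"],
    normalize="index",
)
```

With matplotlib installed, score_shares.plot(kind="bar", stacked=True) draws the chart, and passing a title to the legend (for example "Satisfaction Score") would address the labelling gap noted above.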
Proportion of Customers with Internet Services
The pie chart shows the percentage of customers in two groups, according to whether or not they have internet service. The majority, 69.2%, have internet service, while 30.8% do not.
Monthly Charge by Internet Services
The scatter plot shows the monthly charges for DSL, Fiber Optic, and Cable internet services. Each dot represents a customer, and its vertical position gives the monthly charge. DSL customers generally have lower charges within a small range, while Fiber Optic customers face a wider range of charges that are usually higher than those for DSL. Cable customers pay rates similar to those of DSL customers, although they lean slightly toward higher rates. The jitter applied to the points reveals their density, showing where customers are concentrated at each price level for every type of internet service.
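Jitter means adding a small random horizontal offset to each point so that customers in the same category do not overlap in a single vertical line. A minimal sketch of that idea, with a hypothetical helper name, jitter width and category order:

```python
import numpy as np


def jittered_positions(categories, order, width=0.2, seed=0):
    """Map each category to its index in `order`, plus uniform noise in [-width, width]."""
    rng = np.random.default_rng(seed)
    base = np.array([order.index(c) for c in categories], dtype=float)
    return base + rng.uniform(-width, width, size=base.size)


# Toy example: the service types and charges are made up.
order = ["DSL", "Fiber Optic", "Cable"]
service = ["DSL", "Cable", "Fiber Optic", "DSL"]
monthly_charge = [45.0, 60.0, 95.0, 50.0]

x = jittered_positions(service, order)
```

The resulting x values can be passed to a scatter plot together with monthly_charge, using a low alpha so that dense regions appear darker.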