About the Project
This dataset was randomly collected from an Iranian telecom company's database over a period of 12 months. It contains 3150 rows, each representing a customer, with information in 13 columns.
The attributes in this dataset are call failures, frequency of SMS, number of complaints, number of distinct calls, subscription length, age group, charge amount, type of service, seconds of use, status, frequency of use, and Customer Value.
All attributes except churn are aggregated over the first 9 months. The churn labels give the state of the customers at the end of the 12 months. The remaining three months are the designated planning gap.
The Iranian churn dataset is a dataset that contains information related to customer churn in the telecommunications industry in Iran. "Churn" refers to the phenomenon where customers discontinue their services or switch to a competitor.
This dataset is commonly used for churn prediction analysis and building machine learning models to predict customer churn.
We perform exploratory data analysis and preprocessing on the dataset, then apply the classification and regression models listed in this report using sklearn. We evaluate model performance on the testing set using several metrics. Comparing the results, the Random Forest Classifier achieves the highest accuracy (0.93) and the highest F1-score, the Decision Tree Classifier achieves almost the same accuracy, and the AdaBoost Classifier reaches an accuracy of 0.91. We therefore conclude that these three models are suitable for classifying churn in this dataset.
For regression, the Random Forest Regressor and Decision Tree Regressor score higher than the other regression models, but their R² score is still only about 0.6. We should not rely on any of them, because this is primarily a classification dataset and a classification model is the most suitable choice for predicting churn. We also discuss some limitations and implications of the analysis, and provide recommendations for future work, such as using more features, applying feature selection or dimensionality reduction techniques, and exploring other machine learning models or ensemble methods.
This report uses 7 classification models (Decision Tree Classifier, Random Forest Classifier, KNeighbors Classifier, Logistic Regression, Gaussian NB, AdaBoost Classifier, and Perceptron) along with 4 regression models (Linear Regression, Decision Tree Regressor, SVR, and Random Forest Regressor).
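As a rough illustration of how three of the classifiers named above could be compared with sklearn, the following is a minimal sketch. It runs on synthetic data generated with make_classification, not on the real churn table, and the 15% positive rate, the 80/20 split, and all hyperparameters are assumptions chosen for the example. The scores it prints say nothing about the results reported here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn table (NOT the real data): 3150 rows,
# 12 numeric features, and an assumed ~15% share of positive (churn) labels.
X, y = make_classification(
    n_samples=3150, n_features=12, n_informative=6,
    weights=[0.85, 0.15], random_state=0,
)

# Assumed 80/20 split, stratified so both sets keep the same churn share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=100, random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# Fit each model and record accuracy and F1 on the held-out test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "accuracy": accuracy_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
    }

for name, scores in results.items():
    print(f"{name}: accuracy={scores['accuracy']:.3f}, f1={scores['f1']:.3f}")
```

The same loop extends to the remaining classifiers by adding entries to the models dictionary.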
The purpose of this study is to predict the state of the customers at the end of 12 months using all the attributes of the dataset. The results indicate that a customer's dissatisfaction, amount of service usage, and certain demographic characteristics have the most influence on the decision to remain or churn. The results also imply that customer status (active or inactive) mediates the relationship between churn and its causes.
This dataset contains 3150 instances and 13 columns.
1. Predictive Analytics: Predictive analytics leverages historical data and statistical modeling to make predictions about future events, such as customer churn. It often involves using machine learning algorithms to create predictive models.
2. Feature Selection: Feature selection is the process of identifying the most relevant and informative features for the churn classification task. It helps in reducing noise and improving model performance.
3. Evaluation Metrics: Evaluation metrics are used to assess the performance of churn classification models. Common metrics include accuracy, precision, recall, F1-score, and ROC-AUC.
4. Imbalanced Data: Churn datasets often suffer from class imbalance, where the number of churners is significantly lower than non-churners. Dealing with imbalanced data is a challenge in churn classification and requires appropriate handling techniques.
5. Customer Behavior: Analyzing customer behavior is crucial for churn prediction. This includes studying patterns, preferences, usage habits, and interactions with the company's products or services.
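To make the link between evaluation metrics and class imbalance concrete, here is a small worked sketch in plain Python. The labels are hypothetical (15 churners among 100 customers, invented for the example) and the helper classification_metrics is not part of any library. It shows that a model which never predicts churn can still reach high accuracy while its recall and F1-score are zero.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for binary labels (1 = churn)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical imbalanced sample: 15 churners followed by 85 non-churners.
y_true = [1] * 15 + [0] * 85

# Baseline that always predicts "no churn".
always_stay = [0] * 100

# Hypothetical model: finds 10 of the 15 churners and raises 5 false alarms.
some_model = [1] * 10 + [0] * 5 + [1] * 5 + [0] * 80

baseline_scores = classification_metrics(y_true, always_stay)
model_scores = classification_metrics(y_true, some_model)
print("always_stay:", baseline_scores)
print("some_model: ", model_scores)
```

The baseline reaches an accuracy of 0.85 without identifying a single churner, which is why recall and F1-score are worth reporting alongside accuracy on imbalanced data.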
This dataset is well suited to practicing prescriptive analysis, such as predictive prescription or predictive decision making.
The reason is that the dataset includes the Customer Value attribute, which makes it possible to assign False Positive (FP) and False Negative (FN) costs to misclassifications. Standard classification tasks assume that FPs and FNs cost the same, which does not hold in many cases.
Furthermore, even when it is recognized that FP and FN costs differ, the way their balance varies from one data object to another is often not understood or taken into consideration. This dataset gives you the opportunity to create a model that recognizes these complexities. For further information about the balance of FPs and FNs, see the first mentioned publication. You can also find more information about each attribute in one of the publications.
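One way to picture per-customer misclassification costs is the sketch below. The cost model is an assumption made for illustration, not the one defined in the cited publication: a false negative is taken to cost the missed churner's customer value, and a false positive is taken to cost a fixed, hypothetical retention offer. The customer values and predictions are invented.

```python
def misclassification_cost(y_true, y_pred, customer_values, offer_cost):
    """Total cost of errors under an ASSUMED cost model.

    False negative (missed churner): the customer's value is lost.
    False positive (false alarm): a retention offer is spent needlessly.
    """
    total = 0.0
    for actual, predicted, value in zip(y_true, y_pred, customer_values):
        if actual == 1 and predicted == 0:
            total += value
        elif actual == 0 and predicted == 1:
            total += offer_cost
    return total


# Hypothetical customers: 1 = churned, with invented customer values.
y_true = [1, 1, 0, 0, 1]
customer_values = [900.0, 50.0, 300.0, 120.0, 400.0]

# Model A makes one error: it misses the highest-value churner.
model_a = [0, 1, 0, 0, 1]
# Model B makes two errors: it misses a low-value churner and raises one false alarm.
model_b = [1, 0, 0, 1, 1]

cost_a = misclassification_cost(y_true, model_a, customer_values, offer_cost=20.0)
cost_b = misclassification_cost(y_true, model_b, customer_values, offer_cost=20.0)
print(f"Model A cost: {cost_a}, Model B cost: {cost_b}")
```

Under these assumed costs, Model A has the higher accuracy (4 of 5 correct against 3 of 5) but the larger total cost, because its single error falls on the most valuable customer.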
The target variable is Churn, a binary variable that is either 1 or 0. The dataset can be downloaded from
https://archive.ics.uci.edu/dataset/563/iranian%2Bchurn%2Bdataset .
The Random Forest Classifier, Decision Tree Classifier, and AdaBoost Classifier outperform all the other classification models on all the metrics, especially accuracy and F1-score. Among the regression models, the Decision Tree Regressor and Random Forest Regressor outperform the others.
The confusion matrix shows that the Random Forest Classifier produces fewer false positives and false negatives than the Decision Tree and AdaBoost classifiers, which means that it classifies churn in the Iranian dataset more accurately.
The Random Forest Classifier, Decision Tree Classifier, and AdaBoost Classifier are the most suitable models. The best of the three is Random Forest, which captures linear and nonlinear relationships and interactions among the features better than the other models.
In conclusion, the Random Forest Classifier is the best of the algorithms used here for classifying the Iranian Churn Dataset. Future work could involve exploring other machine learning algorithms or using different preprocessing techniques. Models or ensemble methods not covered in this report, such as SVM or Gradient Boosting, could be compared against the Random Forest Classifier.
Experiment with different machine learning algorithms and ensemble methods to potentially enhance the accuracy and robustness of churn prediction models. Conduct further in-depth analysis on specific segments or subgroups of customers to identify churn patterns and develop targeted retention strategies. Explore the use of advanced techniques like deep learning or natural language processing (if applicable) to extract additional insights from unstructured data or textual customer feedback.
1. https://www.kaggle.com/datasets/royjafari/customer-churn
2. https://www.researchgate.net/publication/227426715_Churn_analysis_for_an_Iranian_mobile_operator
5. https://www.kdnuggets.com/2019/12/random-forest-vs-neural-networks-predicting-customer-churn.html
7. https://ijcsmc.com/docs/papers/February2022/V11I2202210.pdf