Background
The purpose of this capstone project is to analyze data to determine whether a marketing campaign will be successful with customers. Specifically, the dataset comes from a bank that is marketing a new product to some of its existing customers. The end goal is to provide a framework for identifying the best customers to target, so that the business can spend its time and effort more efficiently.
Why
Since COVID-19 has caused many businesses around the world to take large hits to their bottom lines, finding ways to be more efficient is more important than ever. Upselling an existing customer costs only 24% of what it costs to acquire a new one, which makes this a great opportunity for companies [1]. By identifying and understanding the characteristics of the customers most likely to buy into this marketing campaign, time can be focused on that cluster of customers.
Existing methods like the Apriori Association Algorithm and Market Basket Analysis seem better suited to predicting which items are purchased together, as with groceries [2]. This project instead focuses on transactions of a single product within higher-quality existing customer relationships.
Dataset
The dataset is from the UCI Machine Learning Repository and is free to access [3]. It includes information about the specific customers contacted, such as their age, marital status, and education level. Additionally, there is information on the timing and duration of the calls. There are 17 columns and 45,212 rows.
EDA
While looking through the data, several trends were identifiable. The first thing that stood out was how many of the short calls resulted in no new business, so all calls shorter than three minutes (180 seconds) were removed from the data. Once those short calls were removed, more trends became identifiable.
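The filtering step can be sketched in a few lines of plain Python. This is only an illustration on made-up rows: the field names `duration` (call length in seconds) and `y` (campaign outcome) are assumptions, and the report does not state which tooling was actually used.

```python
# Toy rows standing in for the bank data. The field names `duration`
# (call length in seconds) and `y` (campaign outcome) are assumptions.
calls = [
    {"duration": 45, "y": "no"},
    {"duration": 179, "y": "no"},
    {"duration": 180, "y": "no"},
    {"duration": 400, "y": "yes"},
    {"duration": 900, "y": "yes"},
]

MIN_DURATION_SECONDS = 3 * 60  # three minutes


def drop_short_calls(rows, min_seconds=MIN_DURATION_SECONDS):
    """Keep only calls that lasted at least `min_seconds`."""
    return [row for row in rows if row["duration"] >= min_seconds]


def success_rate(rows):
    """Fraction of rows whose outcome is 'yes' (0.0 for an empty list)."""
    if not rows:
        return 0.0
    return sum(row["y"] == "yes" for row in rows) / len(rows)


short_calls = [row for row in calls if row["duration"] < MIN_DURATION_SECONDS]
long_calls = drop_short_calls(calls)

# Compare how often short and long calls ended in a sale.
print(len(long_calls), success_rate(short_calls), success_rate(long_calls))
```

Comparing the success rate of the removed rows with that of the kept rows is a quick way to confirm that dropping short calls discards very few successful outcomes.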
Calls lasting at least 720 seconds (12 minutes) were key to success.
Call volume was highest from May to July, but the success rate in those months was low. The success rate over the following three months was higher.
The under-30 and over-60 age groups were the best performers.
Single and divorced customers had higher success rates than married customers.
College graduates were the best customers.
Duration had the strongest correlation to outcome.
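Group comparisons and correlations like those listed above can be checked along the following lines. This is a minimal sketch on hypothetical rows, not the project's code: the field names, the 0/1 encoding of the outcome, and the sample values are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical sample; field names and values are illustrative only.
# `y` is the outcome encoded as 1 (success) or 0 (no sale).
rows = [
    {"marital": "single", "duration": 800, "y": 1},
    {"marital": "single", "duration": 300, "y": 0},
    {"marital": "married", "duration": 250, "y": 0},
    {"marital": "married", "duration": 200, "y": 0},
    {"marital": "divorced", "duration": 750, "y": 1},
    {"marital": "married", "duration": 900, "y": 1},
]


def success_rate_by(rows, key):
    """Return {group value: share of rows in that group with y == 1}."""
    totals = defaultdict(int)
    wins = defaultdict(int)
    for row in rows:
        totals[row[key]] += 1
        wins[row[key]] += row["y"]
    return {group: wins[group] / totals[group] for group in totals}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


rates = success_rate_by(rows, "marital")
r = pearson([row["duration"] for row in rows], [row["y"] for row in rows])
print(rates)
print(round(r, 3))
```

The same `success_rate_by` helper would apply to any categorical attribute, such as an age band or education level, and `pearson` to any numeric attribute paired with the encoded outcome.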
The Decision Tree was 71% accurate. Other algorithms will be used to try to improve prediction accuracy.
The Random Forest algorithm is likely to produce a more accurate model because it uses multiple trees to learn [4].
Trying Additional Algorithms
In Phase 2, the Decision Tree was 71.04% accurate, so the goal of Phase 3 was to see whether other algorithms would produce better outcomes. The three algorithms tested in Phase 3 were K Nearest Neighbor (KNN), Naïve Bayes, and Random Forest.
KNN produced 68.01% accuracy. Multiple combinations of attributes were removed and tested to try to find a combination that might produce a more accurate outcome, but that did not work. However, adjusting the number of neighbors used did offer marginal improvements.
Naïve Bayes improved on the KNN numbers, but only marginally. At 68.33% accuracy, this algorithm did not improve further when individual attributes or combinations of attributes were removed.
Random Forest was the most accurate of the three algorithms used in Phase 3, correctly predicting 70.78% of the outcomes. Surprisingly, removing any attribute actually had a negative impact on accuracy. Adjusting the number of estimators did improve performance by a few percentage points.
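To illustrate how the number of neighbors affects KNN predictions, here is a small standard-library sketch. It is not the project's implementation: the two features, the toy points, and the function names are hypothetical, and the accuracies it prints say nothing about the bank data.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, point, k):
    """Predict the label of `point` by majority vote of its k nearest neighbors."""
    # Sort training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], point))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_X, train_y, test_X, test_y, k):
    """Share of test points whose predicted label matches the true label."""
    hits = sum(
        knn_predict(train_X, train_y, x, k) == y for x, y in zip(test_X, test_y)
    )
    return hits / len(test_X)


# Hypothetical features: (call duration in minutes, age in decades).
train_X = [(2, 3), (3, 4), (4, 5), (10, 3), (12, 6), (14, 2), (5, 6)]
train_y = ["no", "no", "no", "yes", "yes", "yes", "yes"]
test_X = [(3, 3), (11, 4), (6, 6)]
test_y = ["no", "yes", "yes"]

# Try several odd values of k so that two-class votes cannot tie.
for k in (1, 3, 5):
    print(k, accuracy(train_X, train_y, test_X, test_y, k))
```

On real data the features would normally be scaled first, since an attribute with a large numeric range (such as duration in seconds) would otherwise dominate the distance.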
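The role of the number of estimators can be sketched with a deliberately simplified ensemble. Note the substitutions: instead of full decision trees with random feature subsets, as a real Random Forest uses, this sketch bags one-split "stumps" on bootstrap samples and takes a majority vote. All names, the toy data, and the seed are assumptions for illustration.

```python
import random
from collections import Counter


def fit_stump(X, y):
    """Find the single-feature threshold rule with the fewest training errors."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            # Try both orientations: predict 1 at/above the threshold, or 0.
            for above, below in ((1, 0), (0, 1)):
                errors = sum(
                    (above if row[feature] >= threshold else below) != label
                    for row, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above, below)
    return best[1:]


def stump_predict(stump, row):
    """Apply a (feature, threshold, above, below) rule to one row."""
    feature, threshold, above, below = stump
    return above if row[feature] >= threshold else below


def fit_forest(X, y, n_estimators, seed=0):
    """Train `n_estimators` stumps, each on a bootstrap sample of the rows."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_estimators):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def forest_predict(forest, row):
    """Majority vote over the individual stump predictions."""
    votes = Counter(stump_predict(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical rows: (call duration in minutes, age in years); 1 = sale.
X = [(2, 30), (3, 45), (4, 52), (5, 28), (10, 33), (12, 61), (14, 25), (16, 47)]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Odd estimator counts avoid tied votes between the two classes.
for n in (1, 5, 25):
    forest = fit_forest(X, y, n_estimators=n)
    correct = sum(forest_predict(forest, row) == label for row, label in zip(X, y))
    print(n, correct / len(X))
```

Because each estimator sees a different resample of the rows, individual mistakes tend to be outvoted as the number of estimators grows, which is the usual explanation for why tuning that setting can help.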
Outcome
The Decision Tree algorithm was the best performer of the four tested with this dataset. Although I was hoping for results around 80% accuracy, the model was still fairly accurate. For a business trying to predict its customers' behavior, this is a good result because the Decision Tree algorithm is easy to code and easy to understand [5]. Additionally, using the Decision Tree algorithm does not require special investment in hardware or software.
Equipped with good predictive analysis on their customer base, companies could focus their efforts where they are most likely to be fruitful. As data collection continues to grow, companies could then add their new data to help make the model even more accurate.
[1] "The Economics Of The Upsell," insightsquared.com, para. 9. [Online].
Available: https://www.insightsquared.com/blog/the-economics-of-the-upsell/. [Accessed Sep. 21, 2020].
[2] R. Moodley, F. Chiclana, F. Caraffini, and J. Carter, "A product-centric data mining algorithm for targeted promotions," Journal of Retailing and Consumer Services, vol. 54, May 2020. [Online].
Available: ScienceDirect, https://www-sciencedirect-com.proxy-bc.researchport.umd.edu/science/article/pii/S096969891831169X. [Accessed Sep. 21, 2020].
[3] "Bank Marketing Data Set," uci.edu. [Online].
Available: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing. [Accessed Sep. 9, 2020].
[4] E. Grimaldi, "Decision Tree: An Algorithm That Works Like the Human Brain," para. 12, Sep. 10, 2018. [Online]. Available: https://towardsdatascience.com/decision-tree-an-algorithm-that-works-like-the-human-brain-8bc0652f1fc6. [Accessed Oct. 28, 2020].
[5] "Advantages of a Decision Tree for Classification," pythonprogramminglanguage.com, para. 9. [Online].
Available: https://pythonprogramminglanguage.com/what-are-the-advantages-of-using-a-decision-tree-for-classification/. [Accessed Dec. 6, 2020].