The data analysed here comes from a Portuguese bank that ran a direct marketing campaign, selling a term deposit product through phone calls. The dataset was generated from the campaign's contacts with potential customers. A term deposit is a cash investment held at a financial institution at an agreed interest rate over a fixed period. The goal of this analysis is to determine whether a customer will subscribe to a term deposit or not.
This dataset has 21 variables and 41,188 records; we found that it contains 9 numerical and 11 categorical variables. The variable y is the target variable that our models attempt to predict. The variables of the dataset fall into five groups:
● Bank client data: 1- age, 2- job, 3- marital, 4- education, 5- default, 6- housing, 7- loan.
● Related to the last contact of the current campaign: 8- contact, 9- month, 10- day_of_week, 11- duration.
● Other attributes: 12- campaign, 13- pdays, 14- previous, 15- poutcome.
● Social and economic context attributes: 16- emp.var.rate, 17- cons.price.idx, 18- cons.conf.idx, 19- euribor3m, 20- nr.employed.
● Target variable: 21- y
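As a quick sanity check on this split, a structure inspection can be run once the data is loaded (a minimal sketch; df is created in the Data Wrangling section below):
str(df)                        # variable names and types for all 21 columns
table(sapply(df, is.numeric))  # how many columns are numeric vs. not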
Research Question
Will education and duration be able to predict whether the client will subscribe to a term deposit or not?
Hypotheses:
Null hypothesis: Education and call duration do not significantly predict whether a client will subscribe to a term deposit.
Alternative hypothesis: Education and call duration significantly predict whether a client will subscribe to a term deposit.
Variables of Interest:
Duration: (independent variable) duration of the last contact call made by bank employees, in seconds - numeric
Education: (independent variable) the client's education level - categorical
y: (dependent variable) whether the client subscribes to a term deposit - binary
Data Wrangling
library(readr); library(dplyr); library(gmodels)      # data import, wrangling, CrossTable()
library(ROSE); library(ggplot2); library(stargazer)   # balancing, plots, model tables
df = read_delim("bank-additional-full.csv", delim = ";")  # the UCI file is semicolon-delimited; adjust if your copy differs
dim(df)                   # 41188 rows, 21 columns
apply(is.na(df), 2, sum)  # number of NA values per column
colSums(df == "unknown")  # number of "unknown" values per column
After checking for missing and “unknown” values, we found no NA values, but several variables contain many “unknown” values, so we tallied the number of unknowns per column.
df$y = ifelse(df$y == "yes", 1, 0)  # recode the target from yes/no to 1/0
We converted the target variable to binary numeric values: it was originally yes/no, and after conversion the values are 0/1.
We cleaned the unknown values in five variables (job, marital, education, housing, default) based on how each relates to the target (dependent) variable y.
1. For job, the response rate was highest among students, at around 31%, while the other groups ranged between 6% and 15%. The unknown values here were few enough to drop.
2. For marital, singles had the highest response percentage, while the other groups were more prone to say no. We can drop the unknown values here as well.
3. For education, the response rate clearly increases with the level of education. The unknown group contains a relatively high share of subscribers, so we cannot simply drop those records; instead, the CrossTable proportions showed that the unknown group most closely resembles the university level, so we replaced the unknowns with “university.degree”.
4. For housing, the proportions showed no meaningful variation, and the chi-square proportions suggest we can drop the unknown values.
5. For default, the proportions were extremely lopsided: only 3 clients fell in the “yes” category against roughly 32,000 in “no”, so this column carries almost no information and is better removed.
CrossTable(df$job, df$y)                  # subscription proportions by job
df = df %>% filter(job != "unknown")      # drop rows with unknown job
CrossTable(df$marital, df$y)              # subscription proportions by marital status
df = df %>% filter(marital != "unknown")  # drop rows with unknown marital status
CrossTable(df$education, df$y)            # subscription proportions by education
df$education[df$education == "unknown"] = "university.degree"  # impute unknowns with the closest-matching level
CrossTable(df$default, df$y)              # subscription proportions by default status
CrossTable(df$housing, df$y)              # subscription proportions by housing loan
df = df %>% filter(housing != "unknown")  # drop rows with unknown housing
df$housing = ifelse(df$housing == "yes", 1, 0)  # recode yes/no to 1/0
df$default = ifelse(df$default == "yes", 1, 0)
df$loan = ifelse(df$loan == "yes", 1, 0)
# encode the remaining categorical variables as integer codes;
# as.numeric(as.factor(x)) assigns codes in alphabetical order of the levels
df$job <- as.numeric(as.factor(df$job))
df$marital <- as.numeric(as.factor(df$marital))
df$education <- as.numeric(as.factor(df$education))
df$month <- as.numeric(as.factor(df$month))
df$contact <- as.numeric(as.factor(df$contact))
df$poutcome <- as.numeric(as.factor(df$poutcome))
Data Balancing through ROSE (Random Over-Sampling Examples)
When we summarised the target variable, we found it heavily skewed towards “no”. Data balancing is therefore needed to remove this skew: we over-sampled the minority class until it matched the number of “no” records.
df = ovun.sample(y ~ ., data = df, method = "over", N = 70632)$data  # over-sample the minority class to 70,632 rows in total
After balancing, the target variable is perfectly balanced: the data now contain equal numbers of 0s and 1s.
(Figures: the target-variable distribution before and after balancing.)
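As a quick check of the class balance (a minimal sketch):
table(df$y)              # counts of 0 and 1; 35,316 each after balancing
prop.table(table(df$y))  # proportions; 0.5 / 0.5 once balanced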
Representative Sample
With a very large dataset it is often impractical to apply statistical operations to the whole population. To ease our methodology, we draw a sample from the population/dataset that closely represents the whole population.
It is very important to choose a sample that is not biased, because a biased sample can completely change the output.
A good example is exit polls during elections: the results depend entirely on which population the news agency sampled and whether that sample is biased. If the sample is biased, the results vary heavily, which is why channels regularly dispute whose polls are correct or closest.
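As an illustration, a simple random sample can be drawn in R as follows (the sample size of 10,000 is an arbitrary example, not a value used elsewhere in this report):
set.seed(123)                          # fix the seed so the sample is reproducible
idx <- sample(nrow(df), size = 10000)  # draw 10,000 row indices at random
df_sample <- df[idx, ]                 # a simple random sample of the dataset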
Descriptive Statistics and Visuals:
Variable 1: Education
gg <- ggplot(df, aes(x=education))
gg <- gg+ggtitle("Histogram for education")
#Change the label of the x axis
gg <- gg + labs(x="Education")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1.5, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of education
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$education, na.rm=TRUE), sd=sd(df$education, na.rm=TRUE)))
gg
#Create a qqplot
qqnorm(df$education, main="Figure 2 - QQ Plot")
qqline(df$education, col=2) #show a line on the plot
#get summary statistics
mean(df$education)    # [1] 4.845169
sd(df$education)      # [1] 2.077639
length(df$education)  # [1] 70632
eduskew<-semTools::skew(df$education)
edukurt<-semTools::kurtosis(df$education)
eduskew[1]/eduskew[2]  # -44.95352 (standardised skew)
edukurt[1]/edukurt[2]  # -62.96356 (standardised kurtosis)
zedu<- abs(scale(df$education))  # absolute z-scores for education
FSA::perc(as.numeric(zedu), 1.96, "gt")  # [1] 0 (% of |z| scores above 1.96)
Variable 2: Duration
gg <- ggplot(df, aes(x=duration))
gg <- gg+ggtitle("Histogram for duration")
#Change the label of the x axis
gg <- gg + labs(x="Duration")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of duration
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$duration, na.rm=TRUE), sd=sd(df$duration, na.rm=TRUE)))
gg
#Create a qqplot for duration
qqnorm(df$duration, main="QQ Plot for Duration")
qqline(df$duration, col=2) #show a line on the plot
#get summary statistics
mean(df$duration)    # [1] 386.2306
sd(df$duration)      # [1] 356.8631
length(df$duration)  # [1] 70632
durationskew<-semTools::skew(df$duration)
durationkurt<-semTools::kurtosis(df$duration)
durationskew[1]/durationskew[2]  # 239.915 (standardised skew)
durationkurt[1]/durationkurt[2]  # 466.4033 (standardised kurtosis)
zduration<- abs(scale(df$duration))  # absolute z-scores for duration
FSA::perc(as.numeric(zduration), 1.96, "gt")  # 5.055782 (% of |z| scores above 1.96)
FSA::perc(as.numeric(zduration), 3.29, "gt")  # 1.186431 (% of |z| scores above 3.29)
Variable 3: Age
gg <- ggplot(df, aes(x=age))
gg <- gg+ggtitle("Histogram for age")
#Change the label of the x axis
gg <- gg + labs(x="Age")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of age
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$age, na.rm=TRUE), sd=sd(df$age, na.rm=TRUE)))
gg
#Create a qqplot for age
qqnorm(df$age, main="QQ Plot for Age")
qqline(df$age, col=2) #show a line on the plot
#get summary statistics
mean(df$age)    # [1] 40.33282
sd(df$age)      # [1] 12.02635
length(df$age)  # [1] 70632
ageskew<-semTools::skew(df$age)
agekurt<-semTools::kurtosis(df$age)
ageskew[1]/ageskew[2]  # 108.4225 (standardised skew)
agekurt[1]/agekurt[2]  # 60.90218 (standardised kurtosis)
zage<- abs(scale(df$age))  # absolute z-scores for age
FSA::perc(as.numeric(zage), 1.96, "gt")  # 4.15534 (% of |z| scores above 1.96)
FSA::perc(as.numeric(zage), 3.29, "gt")  # 0.9528259 (% of |z| scores above 3.29)
psych::describeBy(df$education,df$y, mat=TRUE)
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 0 1 35316 4.663467 2.075533 4 4.829299 2.9652 1 7 6 -0.2664577 -1.246018 0.01104445
X12 2 1 1 35316 5.026872 2.063837 6 5.275218 1.4826 1 7 6 -0.5713304 -1.003954 0.01098221
psych::describeBy(df$duration,df$y, mat=TRUE)
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 0 1 35316 221.0923 207.4743 164 186.4218 121.5732 0 4918 4918 3.690761 31.708927 1.104024
X12 2 1 1 35316 551.3688 396.3808 449 496.4166 330.6198 37 4199 4162 1.790427 6.143439 2.109244
Logistic Regression
Logistic regression does not predict a continuous outcome; it estimates the probability that a point belongs to a particular class or category, and we classify the samples we have based on that probability. Here the output can be either yes or no, so the possible outcomes are {0, 1}; this is known as binary logistic regression. Suppose we have two input variables x1 and x2, and consider a point (a, b) in that space, where a is the value of x1 and b is the value of x2.
The linear predictor becomes: z = b0 + b1*a + b2*b.
There are three possible cases: the point (a, b) can lie on the positive side of this linear boundary, on the negative side, or exactly on the boundary itself.
To map the outcome of this linear relationship onto a probability, we use the odds, defined by:
odds(X) = P(X) / (1 − P(X))
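To make the mapping concrete, the sketch below evaluates the linear predictor at an illustrative point and converts it to a probability with the logistic (inverse-logit) function; all coefficient and input values here are made up for illustration:
b0 <- -2.0; b1 <- 0.5; b2 <- 0.01  # illustrative coefficients, not fitted values
a <- 3; b <- 200                   # an illustrative point (x1 = a, x2 = b)
z <- b0 + b1*a + b2*b              # linear predictor
p <- 1/(1 + exp(-z))               # logistic function maps z onto (0, 1)
p/(1 - p)                          # the odds implied by that probability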
The baseline/null model is the comparator against which any fitted model is judged. In this case the baseline model predicts purely the category that occurs most often in our dataset.
We aim to improve on this prediction by including our additional variables. An omnibus test of the model is used to check that the new model (with the explanatory variables included) is an improvement over the baseline model: it uses a chi-square test of whether there is a significant difference between the null model and the fitted model.
Assumptions made for this model:
1) The dependent variable must be binary (or nominal with multiple categories, in the multinomial case).
2) The observations must be independent of each other.
3) There must be little or no multicollinearity between the independent variables.
4) Logistic regression typically requires a large sample size.
#MODEL1
logmodel1 <- glm(y ~ education+duration, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel1)
modelChi <- logmodel1$null.deviance - logmodel1$deviance
modelChi
pseudo.R2 <- modelChi / logmodel1$null.deviance
pseudo.R2
chidf <- logmodel1$df.null - logmodel1$df.residual
chidf
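The chi-square statistic and its degrees of freedom are computed above, but the corresponding p-value is not; a minimal sketch to finish the omnibus test, using the modelChi and chidf values from above:
chisq.prob <- pchisq(modelChi, chidf, lower.tail = FALSE)  # p-value of the likelihood-ratio test vs. the null model
chisq.prob  # values below .05 indicate the model improves on the baseline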
Epi::ROC(form = df$y ~ df$education + df$duration, plot="ROC")  # ROC for model 1 (education + duration only)
exp(coefficients(logmodel1))
(Intercept) education duration
0.1046004 1.1362409 1.0048512
DescTools::PseudoR2(logmodel1, which="CoxSnell")
CoxSnell
0.2647159
DescTools::PseudoR2(logmodel1, which="Nagelkerke")
Nagelkerke
0.3529546
stargazer(logmodel1, type="text")
=============================================
Dependent variable:
---------------------------
y
---------------------------------------------
education 0.129***
(0.004)
duration 0.005***
(0.00004)
Constant -2.259***
(0.028)
---------------------------------------------
Observations 70,632
Log Likelihood -38,098.760
Akaike Inf. Crit. 76,203.520
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
generalhoslem::logitgof(df$y, fitted(logmodel1))
Hosmer and Lemeshow test (binary model)
data: df$y, fitted(logmodel1)
X-squared = 2107.7, df = 8, p-value < 2.2e-16
#Collinearity
vifmodel<-car::vif(logmodel1)#You can ignore the warning messages, GVIF^(1/(2*Df)) is the value of interest
vifmodel
education duration
1.012069 1.012069
1/vifmodel
education duration
0.9880749 0.9880749
LOGISTIC REGRESSION REPORT MODEL 1:
Reporting the results
A binary logistic regression analysis was conducted on the Portuguese bank dataset, with the outcome variable being whether the client will subscribe to a term deposit and with education and duration entered as predictors.
The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008).
From the ROC graph we can see that:
Sensitivity (true positive rate): 72%
Specificity (true negative rate): 76.6%
Positive predictive value: 26.8%
Negative predictive value: 24.5%
AUC: 82.4%
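As a cross-check on these rates, a confusion matrix can be built from the fitted probabilities; the 0.5 cutoff below is our assumption (Epi::ROC chooses an optimal cutoff, so its figures may differ):
pred <- ifelse(fitted(logmodel1) >= 0.5, 1, 0)  # classify at a 0.5 probability cutoff
tab <- table(Predicted = pred, Actual = df$y)   # 2x2 confusion matrix
tab
tab[2, 2]/sum(tab[, 2])  # sensitivity: true positives / all actual positives
tab[1, 1]/sum(tab[, 1])  # specificity: true negatives / all actual negatives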
VIF
education duration
1.012069 1.012069
TOLERANCE
education duration
0.9880749 0.9880749
The Hosmer-Lemeshow goodness-of-fit test was statistically significant (χ²(8) = 2107.7, p < .001), which formally indicates some lack of fit between the predicted probabilities and the observed outcomes; with a sample this large, however, the test flags even small deviations from linearity between the independent variables and the log odds.
#MODEL2
logmodel2 <- glm(y ~ education+age+duration, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel2)
modelChi <- logmodel2$null.deviance - logmodel2$deviance
modelChi
pseudo.R2 <- modelChi / logmodel2$null.deviance
pseudo.R2
chidf <- logmodel2$df.null - logmodel2$df.residual
chidf
Epi::ROC(form = df$y ~ df$education + df$age + df$duration, plot="ROC")  # ROC for model 2 (education + age + duration)
generalhoslem::logitgof(df$y, fitted(logmodel2))
exp(coefficients(logmodel2))
(Intercept) education age duration
0.05028127 1.15749338 1.01565996 1.00489005
DescTools::PseudoR2(logmodel2, which="CoxSnell")
CoxSnell
0.2705487
DescTools::PseudoR2(logmodel2, which="Nagelkerke")
Nagelkerke
0.3607317
stargazer(logmodel2, type="text")
=============================================
Dependent variable:
---------------------------
y
---------------------------------------------
education 0.146***
(0.004)
age 0.016***
(0.001)
duration 0.005***
(0.00004)
Constant -2.990***
(0.045)
---------------------------------------------
Observations 70,632
Log Likelihood -37,817.490
Akaike Inf. Crit. 75,642.980
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
generalhoslem::logitgof(df$y, fitted(logmodel2))
Hosmer and Lemeshow test (binary model)
data: df$y, fitted(logmodel2)
X-squared = 2256.2, df = 8, p-value < 2.2e-16
#Collinearity
vifmodel<-car::vif(logmodel2)#You can ignore the warning messages, GVIF^(1/(2*Df)) is the value of interest
vifmodel
education age duration
1.061058 1.049143 1.018890
1/vifmodel
education age duration
0.9424558 0.9531591 0.9814604
Reporting the results
A binary logistic regression analysis was conducted on the Portuguese bank dataset, with the outcome variable being whether the client will subscribe to a term deposit and with education, age and duration entered as predictors.
The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008).
From the ROC graph we can see that:
Sensitivity (true positive rate): 75.8%
Specificity (true negative rate): 73.7%
Positive predictive value: 24.7%
Negative predictive value: 25.7%
AUC: 82.8%
VIF
education age duration
1.061058 1.049143 1.018890
TOLERANCE
education age duration
0.9424558 0.9531591 0.9814604
The Hosmer-Lemeshow goodness-of-fit test was again significant (χ²(8) = 2256.2, p < .001), formally indicating some lack of fit; as with model 1, this is expected behaviour of the test at this sample size rather than clear evidence against linearity in the log odds.
stargazer(logmodel1, logmodel2, type="text") #Quick comparison of model 1 and model 2
==============================================
Dependent variable:
----------------------------
y
(1) (2)
----------------------------------------------
education 0.129*** 0.146***
(0.004) (0.004)
age 0.016***
(0.001)
duration 0.005*** 0.005***
(0.00004) (0.00004)
Constant -2.259*** -2.990***
(0.028) (0.045)
----------------------------------------------
Observations 70,632 70,632
Log Likelihood -38,098.760 -37,817.490
Akaike Inf. Crit. 76,203.520 75,642.980
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Model 1: log(p/(1-p)) = -2.259 + 0.129*education + 0.005*duration
Model 2: log(p/(1-p)) = -2.990 + 0.146*education + 0.005*duration + 0.016*age
(These coefficients are on the log-odds scale, taken from the regression tables above; the exponentiated values reported earlier, e.g. 1.136 for education in model 1, are the corresponding odds ratios.)
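To turn either equation into a predicted probability, apply the inverse logit. The sketch below does this for model 1 with illustrative inputs (an education code of 5 and a 300-second call; these example values are ours):
log_odds <- -2.259 + 0.129*5 + 0.005*300  # model 1 linear predictor for the example client
plogis(log_odds)                          # inverse logit: probability of y = 1 (about 0.47)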