The data analysed here comes from a Portuguese bank that ran a direct marketing campaign, selling a term deposit product through phone calls. The dataset was generated from the campaign's contacts with potential customers. A term deposit is a cash investment held at a financial institution at an agreed interest rate over a fixed period. The goal of this analysis is to determine whether a customer will subscribe to a term deposit or not.
This dataset has 21 variables and 41,188 records; we found that it contains 9 numerical and 11 categorical variables. The variable y is the target variable that our models attempt to predict. The variables of the dataset fall into five groups:
● Bank client data: 1- age, 2- job, 3- marital, 4- education, 5- default, 6- housing, 7- loan.
● Related to the last contact of the current campaign: 8- contact, 9- month, 10- day_of_week, 11- duration.
● Other attributes: 12- campaign, 13- pdays, 14- previous, 15- poutcome.
● Social and economic context attributes: 16- emp.var.rate, 17- cons.price.idx, 18- cons.conf.idx, 19- euribor3m, 20- nr.employed.
● Target variable: 21- y
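As a quick sanity check on this split, a structure inspection can be run once the data is loaded (a minimal sketch; df is created in the Data Wrangling section below):
str(df)                        # variable names and types for all 21 columns
table(sapply(df, is.numeric))  # how many columns are numeric vs. not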
Research Question
Will education and duration be able to predict whether the client will subscribe to a term deposit or not?
Hypotheses:
Null hypothesis: Education and call duration do not significantly predict whether a client will subscribe to a term deposit.
Alternative hypothesis: Education and call duration significantly predict whether a client will subscribe to a term deposit.
Variables of Interest:
Duration: (independent variable) duration of the last contact call made by bank employees, in seconds - numeric
Education: (independent variable) the client's education level - categorical
y: (dependent variable) whether the client subscribes to a term deposit - binary
Data Wrangling
library(readr); library(dplyr); library(gmodels)      # data import, wrangling, CrossTable()
library(ROSE); library(ggplot2); library(stargazer)   # balancing, plots, model tables
df = read_delim("bank-additional-full.csv", delim = ";")  # the UCI file is semicolon-delimited; adjust if your copy differs
dim(df)                   # 41188 rows, 21 columns
apply(is.na(df), 2, sum)  # number of NA values per column
colSums(df == "unknown")  # number of "unknown" values per column
After checking for missing and “unknown” values, we found no NA values, but several variables contain many “unknown” values, so we tallied the number of unknowns per column.
df$y = ifelse(df$y == "yes", 1, 0)  # recode the target from yes/no to 1/0
We converted the target variable to binary numeric values: it was originally yes/no, and after conversion the values are 0/1.
We cleaned the unknown values in five variables (job, marital, education, housing, default) based on how each relates to the target (dependent) variable y.
1. For job, the response rate was highest among students, at around 31%, while the other groups ranged between 6% and 15%. The unknown values here were few enough to drop.
2. For marital, singles had the highest response percentage, while the other groups were more prone to say no. We can drop the unknown values here as well.
3. For education, the response rate clearly increases with the level of education. The unknown group contains a relatively high share of subscribers, so we cannot simply drop those records; instead, the CrossTable proportions showed that the unknown group most closely resembles the university level, so we replaced the unknowns with “university.degree”.
4. For housing, the proportions showed no meaningful variation, and the chi-square proportions suggest we can drop the unknown values.
5. For default, the proportions were extremely lopsided: only 3 clients fell in the “yes” category against roughly 32,000 in “no”, so this column carries almost no information and is better removed.
CrossTable(df$job, df$y)                  # subscription proportions by job
df = df %>% filter(job != "unknown")      # drop rows with unknown job
CrossTable(df$marital, df$y)              # subscription proportions by marital status
df = df %>% filter(marital != "unknown")  # drop rows with unknown marital status
CrossTable(df$education, df$y)            # subscription proportions by education
df$education[df$education == "unknown"] = "university.degree"  # impute unknowns with the closest-matching level
CrossTable(df$default, df$y)              # subscription proportions by default status
CrossTable(df$housing, df$y)              # subscription proportions by housing loan
df = df %>% filter(housing != "unknown")  # drop rows with unknown housing
df$housing = ifelse(df$housing == "yes", 1, 0)  # recode yes/no to 1/0
df$default = ifelse(df$default == "yes", 1, 0)
df$loan = ifelse(df$loan == "yes", 1, 0)
# encode the remaining categorical variables as integer codes;
# as.numeric(as.factor(x)) assigns codes in alphabetical order of the levels
df$job <- as.numeric(as.factor(df$job))
df$marital <- as.numeric(as.factor(df$marital))
df$education <- as.numeric(as.factor(df$education))
df$month <- as.numeric(as.factor(df$month))
df$contact <- as.numeric(as.factor(df$contact))
df$poutcome <- as.numeric(as.factor(df$poutcome))
Data Balancing through ROSE (Random Over-Sampling Examples)
When we summarised the target variable, we found it heavily skewed towards “no”. Data balancing is therefore needed to remove this skew: we over-sampled the minority class until it matched the number of “no” records.
df = ovun.sample(y ~ ., data = df, method = "over", N = 70632)$data  # over-sample the minority class to 70,632 rows in total
After balancing, the target variable is perfectly balanced: the data now contain equal numbers of 0s and 1s.
(Figures: the target-variable distribution before and after balancing.)
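As a quick check of the class balance (a minimal sketch):
table(df$y)              # counts of 0 and 1; 35,316 each after balancing
prop.table(table(df$y))  # proportions; 0.5 / 0.5 once balanced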
Representative Sample
With a very large dataset it is often impractical to apply statistical operations to the whole population. To ease our methodology, we draw a sample from the population/dataset that closely represents the whole population.
It is very important to choose a sample that is not biased, because a biased sample can completely change the output.
A good example is exit polls during elections: the results depend entirely on which population the news agency sampled and whether that sample is biased. If the sample is biased, the results vary heavily, which is why channels regularly dispute whose polls are correct or closest.
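As an illustration, a simple random sample can be drawn in R as follows (the sample size of 10,000 is an arbitrary example, not a value used elsewhere in this report):
set.seed(123)                          # fix the seed so the sample is reproducible
idx <- sample(nrow(df), size = 10000)  # draw 10,000 row indices at random
df_sample <- df[idx, ]                 # a simple random sample of the dataset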
Descriptive Statistics and Visuals:
Variable 1: Education
gg <- ggplot(df, aes(x=education))
gg <- gg+ggtitle("Histogram for education")
#Change the label of the x axis
gg <- gg + labs(x="Education")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1.5, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of education
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$education, na.rm=TRUE), sd=sd(df$education, na.rm=TRUE)))
gg
#Create a qqplot
qqnorm(df$education, main="Figure 2 - QQ Plot")
qqline(df$education, col=2) #show a line on the plot
#get summary statistics
mean(df$education)    # [1] 4.845169
sd(df$education)      # [1] 2.077639
length(df$education)  # [1] 70632
eduskew<-semTools::skew(df$education)
edukurt<-semTools::kurtosis(df$education)
eduskew[1]/eduskew[2]  # -44.95352 (standardised skew)
edukurt[1]/edukurt[2]  # -62.96356 (standardised kurtosis)
zedu<- abs(scale(df$education))  # absolute z-scores for education
FSA::perc(as.numeric(zedu), 1.96, "gt")  # [1] 0 (% of |z| scores above 1.96)
Variable 2: Duration
gg <- ggplot(df, aes(x=duration))
gg <- gg+ggtitle("Histogram for duration")
#Change the label of the x axis
gg <- gg + labs(x="Duration")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of duration
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$duration, na.rm=TRUE), sd=sd(df$duration, na.rm=TRUE)))
gg
#Create a qqplot for duration
qqnorm(df$duration, main="QQ Plot for Duration")
qqline(df$duration, col=2) #show a line on the plot
#get summary statistics
mean(df$duration)    # [1] 386.2306
sd(df$duration)      # [1] 356.8631
length(df$duration)  # [1] 70632
durationskew<-semTools::skew(df$duration)
durationkurt<-semTools::kurtosis(df$duration)
durationskew[1]/durationskew[2]  # 239.915 (standardised skew)
durationkurt[1]/durationkurt[2]  # 466.4033 (standardised kurtosis)
zduration<- abs(scale(df$duration))  # absolute z-scores for duration
FSA::perc(as.numeric(zduration), 1.96, "gt")  # 5.055782 (% of |z| scores above 1.96)
FSA::perc(as.numeric(zduration), 3.29, "gt")  # 1.186431 (% of |z| scores above 3.29)
Variable 3: Age
gg <- ggplot(df, aes(x=age))
gg <- gg+ggtitle("Histogram for age")
#Change the label of the x axis
gg <- gg + labs(x="Age")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of age
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$age, na.rm=TRUE), sd=sd(df$age, na.rm=TRUE)))
gg
#Create a qqplot for age
qqnorm(df$age, main="QQ Plot for Age")
qqline(df$age, col=2) #show a line on the plot
#get summary statistics
mean(df$age)    # [1] 40.33282
sd(df$age)      # [1] 12.02635
length(df$age)  # [1] 70632
ageskew<-semTools::skew(df$age)
agekurt<-semTools::kurtosis(df$age)
ageskew[1]/ageskew[2]  # 108.4225 (standardised skew)
agekurt[1]/agekurt[2]  # 60.90218 (standardised kurtosis)
zage<- abs(scale(df$age))  # absolute z-scores for age
FSA::perc(as.numeric(zage), 1.96, "gt")  # 4.15534 (% of |z| scores above 1.96)
FSA::perc(as.numeric(zage), 3.29, "gt")  # 0.9528259 (% of |z| scores above 3.29)
psych::describeBy(df$education,df$y, mat=TRUE)
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 0 1 35316 4.663467 2.075533 4 4.829299 2.9652 1 7 6 -0.2664577 -1.246018 0.01104445
X12 2 1 1 35316 5.026872 2.063837 6 5.275218 1.4826 1 7 6 -0.5713304 -1.003954 0.01098221
psych::describeBy(df$duration,df$y, mat=TRUE)
item group1 vars n mean sd median trimmed mad min max range skew kurtosis se
X11 1 0 1 35316 221.0923 207.4743 164 186.4218 121.5732 0 4918 4918 3.690761 31.708927 1.104024
X12 2 1 1 35316 551.3688 396.3808 449 496.4166 330.6198 37 4199 4162 1.790427 6.143439 2.109244
Logistic Regression
Logistic regression does not predict a continuous outcome; it estimates the probability that a point belongs to a particular class or category, and we classify the samples we have based on that probability. Here the output can be either yes or no, so the possible outcomes are {0, 1}; this is known as binary logistic regression. Suppose we have two input variables x1 and x2, and consider a point (a, b) in that space, where a is the value of x1 and b is the value of x2.
The linear predictor becomes: z = b0 + b1*a + b2*b.
There are three possible cases: the point (a, b) can lie on the positive side of this linear boundary, on the negative side, or exactly on the boundary itself.
To map the outcome of this linear relationship onto a probability, we use the odds, defined by:
odds(X) = P(X) / (1 − P(X))
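To make the mapping concrete, the sketch below evaluates the linear predictor at an illustrative point and converts it to a probability with the logistic (inverse-logit) function; all coefficient and input values here are made up for illustration:
b0 <- -2.0; b1 <- 0.5; b2 <- 0.01  # illustrative coefficients, not fitted values
a <- 3; b <- 200                   # an illustrative point (x1 = a, x2 = b)
z <- b0 + b1*a + b2*b              # linear predictor
p <- 1/(1 + exp(-z))               # logistic function maps z onto (0, 1)
p/(1 - p)                          # the odds implied by that probability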
The baseline/null model is the comparator against which any fitted model is judged. In this case the baseline model predicts purely the category that occurs most often in our dataset.
We aim to improve on this prediction by including our additional variables. An omnibus test of the model is used to check that the new model (with the explanatory variables included) is an improvement over the baseline model: it uses a chi-square test of whether there is a significant difference between the null model and the fitted model.
Assumptions made for this model:
1) The dependent variable must be binary (or nominal with multiple categories, in the multinomial case).
2) The observations must be independent of each other.
3) There must be little or no multicollinearity between the independent variables.
4) Logistic regression typically requires a large sample size.
#MODEL1
logmodel1 <- glm(y ~ education+duration, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel1)
modelChi <- logmodel1$null.deviance - logmodel1$deviance
modelChi
pseudo.R2 <- modelChi / logmodel1$null.deviance
pseudo.R2
chidf <- logmodel1$df.null - logmodel1$df.residual
chidf
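The chi-square statistic and its degrees of freedom are computed above, but the corresponding p-value is not; a minimal sketch to finish the omnibus test, using the modelChi and chidf values from above:
chisq.prob <- pchisq(modelChi, chidf, lower.tail = FALSE)  # p-value of the likelihood-ratio test vs. the null model
chisq.prob  # values below .05 indicate the model improves on the baseline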
Epi::ROC(form = df$y ~ df$education + df$duration, plot="ROC")  # ROC for model 1 (education + duration only)
exp(coefficients(logmodel1))
(Intercept) education duration
0.1046004 1.1362409 1.0048512
DescTools::PseudoR2(logmodel1, which="CoxSnell")
CoxSnell
0.2647159
DescTools::PseudoR2(logmodel1, which="Nagelkerke")
Nagelkerke
0.3529546
stargazer(logmodel1, type="text")
=============================================
Dependent variable:
---------------------------
y
---------------------------------------------
education 0.129***
(0.004)
duration 0.005***
(0.00004)
Constant -2.259***
(0.028)
---------------------------------------------
Observations 70,632
Log Likelihood -38,098.760
Akaike Inf. Crit. 76,203.520
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
generalhoslem::logitgof(df$y, fitted(logmodel1))
Hosmer and Lemeshow test (binary model)
data: df$y, fitted(logmodel1)
X-squared = 2107.7, df = 8, p-value < 2.2e-16
#Collinearity
vifmodel<-car::vif(logmodel1)#You can ignore the warning messages, GVIF^(1/(2*Df)) is the value of interest
vifmodel
education duration
1.012069 1.012069
1/vifmodel
education duration
0.9880749 0.9880749
LOGISTIC REGRESSION REPORT MODEL 1:
Reporting the results
A binary logistic regression analysis was conducted on the Portuguese bank dataset, with the outcome variable being whether the client will subscribe to a term deposit and with education and duration entered as predictors.
The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008).
From the ROC graph we can see that:
Sensitivity (true positive rate): 72%
Specificity (true negative rate): 76.6%
Positive predictive value: 26.8%
Negative predictive value: 24.5%
AUC: 82.4%
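As a cross-check on these rates, a confusion matrix can be built from the fitted probabilities; the 0.5 cutoff below is our assumption (Epi::ROC chooses an optimal cutoff, so its figures may differ):
pred <- ifelse(fitted(logmodel1) >= 0.5, 1, 0)  # classify at a 0.5 probability cutoff
tab <- table(Predicted = pred, Actual = df$y)   # 2x2 confusion matrix
tab
tab[2, 2]/sum(tab[, 2])  # sensitivity: true positives / all actual positives
tab[1, 1]/sum(tab[, 1])  # specificity: true negatives / all actual negatives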
VIF
education duration
1.012069 1.012069
TOLERANCE
education duration
0.9880749 0.9880749
The Hosmer-Lemeshow goodness-of-fit test was statistically significant (χ²(8) = 2107.7, p < .001), which formally indicates some lack of fit between the predicted probabilities and the observed outcomes; with a sample this large, however, the test flags even small deviations from linearity between the independent variables and the log odds.
#MODEL2
logmodel2 <- glm(y ~ education+age+duration, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel2)
modelChi <- logmodel2$null.deviance - logmodel2$deviance
modelChi
pseudo.R2 <- modelChi / logmodel2$null.deviance
pseudo.R2
chidf <- logmodel2$df.null - logmodel2$df.residual
chidf
Epi::ROC(form = df$y ~ df$education + df$age + df$duration, plot="ROC")  # ROC for model 2 (education + age + duration)
generalhoslem::logitgof(df$y, fitted(logmodel2))
exp(coefficients(logmodel2))
(Intercept) education age duration
0.05028127 1.15749338 1.01565996 1.00489005
DescTools::PseudoR2(logmodel2, which="CoxSnell")
CoxSnell
0.2705487
DescTools::PseudoR2(logmodel2, which="Nagelkerke")
Nagelkerke
0.3607317
stargazer(logmodel2, type="text")
=============================================
Dependent variable:
---------------------------
y
---------------------------------------------
education 0.146***
(0.004)
age 0.016***
(0.001)
duration 0.005***
(0.00004)
Constant -2.990***
(0.045)
---------------------------------------------
Observations 70,632
Log Likelihood -37,817.490
Akaike Inf. Crit. 75,642.980
=============================================
Note: *p<0.1; **p<0.05; ***p<0.01
generalhoslem::logitgof(df$y, fitted(logmodel2))
Hosmer and Lemeshow test (binary model)
data: df$y, fitted(logmodel2)
X-squared = 2256.2, df = 8, p-value < 2.2e-16
#Collinearity
vifmodel<-car::vif(logmodel2)#You can ignore the warning messages, GVIF^(1/(2*Df)) is the value of interest
vifmodel
education age duration
1.061058 1.049143 1.018890
1/vifmodel
education age duration
0.9424558 0.9531591 0.9814604
Reporting the results
A binary logistic regression analysis was conducted on the Portuguese bank dataset, with the outcome variable being whether the client will subscribe to a term deposit and with education, age and duration entered as predictors.
The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5), as outlined in Tarling (2008).
From the ROC graph we can see that:
Sensitivity (true positive rate): 75.8%
Specificity (true negative rate): 73.7%
Positive predictive value: 24.7%
Negative predictive value: 25.7%
AUC: 82.8%
VIF
education age duration
1.061058 1.049143 1.018890
TOLERANCE
education age duration
0.9424558 0.9531591 0.9814604
The Hosmer-Lemeshow goodness-of-fit test was again significant (χ²(8) = 2256.2, p < .001), formally indicating some lack of fit; as with model 1, this is expected behaviour of the test at this sample size rather than clear evidence against linearity in the log odds.
stargazer(logmodel1, logmodel2, type="text") #Quick comparison of model 1 and model 2
==============================================
Dependent variable:
----------------------------
y
(1) (2)
----------------------------------------------
education 0.129*** 0.146***
(0.004) (0.004)
age 0.016***
(0.001)
duration 0.005*** 0.005***
(0.00004) (0.00004)
Constant -2.259*** -2.990***
(0.028) (0.045)
----------------------------------------------
Observations 70,632 70,632
Log Likelihood -38,098.760 -37,817.490
Akaike Inf. Crit. 76,203.520 75,642.980
==============================================
Note: *p<0.1; **p<0.05; ***p<0.01
Model 1: log(p/(1-p)) = -2.259 + 0.129*education + 0.005*duration
Model 2: log(p/(1-p)) = -2.990 + 0.146*education + 0.005*duration + 0.016*age
(These coefficients are on the log-odds scale, taken from the regression tables above; the exponentiated values reported earlier, e.g. 1.136 for education in model 1, are the corresponding odds ratios.)
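To turn either equation into a predicted probability, apply the inverse logit. The sketch below does this for model 1 with illustrative inputs (an education code of 5 and a 300-second call; these example values are ours):
log_odds <- -2.259 + 0.129*5 + 0.005*300  # model 1 linear predictor for the example client
plogis(log_odds)                          # inverse logit: probability of y = 1 (about 0.47)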