When to use?
Exploratory Data Analysis (EDA)
Data Partitioning
Variable Formatting (Qualitative Variable)
Simple Binary Logistic Regression
Multiple Binary Logistic Regression
1) When to use?
To investigate the relationship between variables.
The dependent variable Y must be a categorical (qualitative) variable.
Examples:
Zakat Eligibility (Eligible / Not Eligible) vs. Family Income, Number of siblings etc.
Student Performance (Pass / Fail) vs. Study hours, Number of exercise etc.
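All of these examples fit the same underlying model: the log-odds (logit) of the outcome is modeled as a linear function of the predictors. For a single predictor x:

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x,
\qquad p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
```

Because the logit is linear, each coefficient has an odds-ratio interpretation: a one-unit increase in x multiplies the odds of Y = 1 by e^(beta_1).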
Example 1 (Coronary Heart Disease)
Variable Description
AGE : Age of patients
CHD: Coronary Heart Disease (0 - absent, 1 - present)
install.packages("ACSWR") # install package ACSWR = A Course in Statistics with R
library(ACSWR) # load package
data("chdage") # load data
chdage # call data
summary(chdage) # summary() function = produce summary
Example 1 (Coronary Heart Disease)
install.packages("caret") # install package caret = Classification And REgression Training
library(caret) # load package
index1 <- createDataPartition(chdage$CHD, times = 1, p = 0.8, list = FALSE) # create index to split data for example to 80%
train1 <- chdage[index1,] # create train data
test1 <- chdage[-index1,] # create test data
Example 1 (Coronary Heart Disease)
train1$CHD[train1$CHD==0] <- "Absence" # re-label qualitative variable (0 - Absence, 1 - Presence)
train1$CHD[train1$CHD==1] <- "Presence"
test1$CHD[test1$CHD==0] <- "Absence" # re-label qualitative variable (0 - Absence, 1 - Presence)
test1$CHD[test1$CHD==1] <- "Presence"
train1$CHD <- as.factor(train1$CHD) # as.factor() function sets a variable as a qualitative (categorical) variable
test1$CHD <- as.factor(test1$CHD)
summary(train1)
summary(test1)
Example 1 (Coronary Heart Disease)
a) Fit the model
model_chd <- glm(CHD ~ ., data = train1, family = binomial) # build model using glm() function; "~ ." uses all predictors
model_chd <- glm(CHD ~ AGE, data = train1, family = binomial) # equivalent here, since AGE is the only predictor
summary(model_chd)
Alternative:
install.packages("blorr") # blorr package = Tools for Developing Binary Logistic Regression Models
library(blorr)
blr_regress(model_chd) # blr_regress() function to produce model summary.
b) Classification Matrix
An intuitively appealing way to summarize the results of a fitted logistic regression model is a classification table, which shows how well the model predicts the correct category. The table also provides the model's sensitivity and specificity: sensitivity measures the proportion of actual positives that are correctly identified, whereas specificity measures the proportion of actual negatives that are correctly identified.
blr_confusion_matrix(model_chd)
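As a sketch of what blr_confusion_matrix() reports, sensitivity and specificity can be computed by hand from the four cells of a 2 x 2 table of actual vs. predicted classes (the counts below are made up for illustration):

```r
# Hypothetical counts from a 2 x 2 classification table
TP <- 20  # actual Presence, predicted Presence (true positive)
FN <- 5   # actual Presence, predicted Absence  (false negative)
TN <- 45  # actual Absence,  predicted Absence  (true negative)
FP <- 10  # actual Absence,  predicted Presence (false positive)

sensitivity <- TP / (TP + FN)        # proportion of actual positives correctly identified
specificity <- TN / (TN + FP)        # proportion of actual negatives correctly identified
accuracy    <- (TP + TN) / (TP + FN + TN + FP)

sensitivity  # 0.8
specificity  # ~0.818
accuracy     # 0.8125
```

Note that accuracy alone can be misleading when one class dominates, which is why the classification table reports sensitivity and specificity separately.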
c) Make Predictions
probabilities <- predict(model_chd, test1, type = "response") # predict the probabilities of test set
probabilities
predicted.classes <- ifelse(probabilities > 0.5, "Presence", "Absence") # assign a class to each probability
predicted.classes
data.frame(probabilities,predicted.classes) # display in table using data.frame() function
d) Model Accuracy
mean(predicted.classes == test1$CHD) # check model accuracy
The classification prediction accuracy is about __%, which is good. The misclassification error rate is __%
Example 2 (Low Birth Weight)
Data Description: see the lowbwt help page in the ACSWR package (`?lowbwt`)
library(ACSWR)
data("lowbwt") # call data "Low Birth Weight"
head(lowbwt) # head() = display top 6 observation of the datasets
a) Data Partitioning
library(caret)
index2 <- createDataPartition(lowbwt$LOW, times = 1, p = 0.8, list = FALSE)
train2 <- lowbwt[index2,]
test2 <- lowbwt[-index2,]
b) Variable Formatting (Qualitative Variable)
train2$LOW[train2$LOW == 0] <- ">2500g" # re-label LOW (0 = >2500g, 1 = <2500g) for both train & test
train2$LOW[train2$LOW == 1] <- "<2500g"
test2$LOW[test2$LOW == 0] <- ">2500g"
test2$LOW[test2$LOW == 1] <- "<2500g"
summary(train2)
train2$SMOKE[train2$SMOKE == "0"] <- "No" # re-label SMOKE (0 = No, 1 = Yes) for both train & test
train2$SMOKE[train2$SMOKE == "1"] <- "Yes"
test2$SMOKE[test2$SMOKE == "0"] <- "No"
test2$SMOKE[test2$SMOKE == "1"] <- "Yes"
summary(train2)
train2$LOW <- as.factor(train2$LOW) #as.factor() function used to set a variable as qualitative variable
train2$RACE <- as.factor(train2$RACE)
train2$SMOKE <- as.factor(train2$SMOKE)
train2$PTL <- as.factor(train2$PTL)
train2$HT <- as.factor(train2$HT)
train2$UI <- as.factor(train2$UI)
train2$FTV <- as.factor(train2$FTV)
test2$LOW <- as.factor(test2$LOW) #as.factor() function used to set a variable as qualitative variable
test2$RACE <- as.factor(test2$RACE)
test2$SMOKE <- as.factor(test2$SMOKE)
test2$PTL <- as.factor(test2$PTL)
test2$HT <- as.factor(test2$HT)
test2$UI <- as.factor(test2$UI)
test2$FTV <- as.factor(test2$FTV)
summary(train2) # check the percentage for each category of the dependent variable
summary(test2)
c) Fit the model (Enter method)
model_enter <- glm(LOW ~ AGE + LWT + RACE + SMOKE + PTL + HT + UI + FTV, data = train2, family = binomial)
# build model using glm() function
summary(model_enter)
Alternative:
blr_regress(model_enter)
d) Classification Matrix
blr_confusion_matrix(model_enter)
e) Make Predictions
probabilities <- predict(model_enter, test2, type = "response") # predict the probabilities of test set
probabilities
predicted.classes <- ifelse(probabilities > 0.5, ">2500g", "<2500g") # glm() models the probability of the second factor level; "<2500g" sorts first, so probabilities are for ">2500g"
predicted.classes
data.frame(probabilities,predicted.classes) # display in table using data.frame() function
mean(predicted.classes == test2$LOW) # check model accuracy
f) Fit the model (Stepwise method)
library(MASS)
model_stepwise <- stepAIC(model_enter, trace = FALSE, direction = "both")
Alternative:
model_step <- blr_step_aic_both(model_enter, details = TRUE)
summary(model_stepwise)
probabilities <- predict(model_stepwise, test2, type = "response") # compute the probabilities of test set
predicted.classes <- ifelse(probabilities > 0.5, ">2500g", "<2500g") # assign a class to each probability; as above, probabilities are for the second factor level ">2500g"
mean(predicted.classes == test2$LOW) # check model accuracy
Example 2 (Low Birth Weight)
a) Goodness of Fit - Pseudo R-square
The goodness of fit of the logistic regression model can be expressed by several variants of pseudo R-squared statistics.
blr_model_fit_stats(model_stepwise)
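One common variant reported by blr_model_fit_stats() is McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model to that of an intercept-only (null) model. A minimal sketch of the computation, using the built-in mtcars data instead of the lowbwt model so it runs on its own:

```r
# Fit a binary logistic model and its intercept-only counterpart
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
null <- glm(am ~ 1,  data = mtcars, family = binomial)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden  # between 0 and 1; larger values indicate a better fit than the null model
```

Unlike R-squared in linear regression, pseudo R-squared values do not measure explained variance, so they are best used to compare candidate models on the same data rather than judged against fixed cut-offs.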
b) Goodness of Fit - Hosmer & Lemeshow Test
H0: The logistic regression model is a good fit for the data
H1: The logistic regression model is not a good fit for the data
Decision Rule: Reject H0 if p-value < significance level (alpha)
blr_test_hosmer_lemeshow(model_stepwise)