When to use?
Exploratory Data Analysis (EDA)
Data Partitioning
Variable Formatting (Qualitative Variable)
Simple Binary Logistic Regression
Multiple Binary Logistic Regression
1) When to use?
To investigate the relationship between variables.
The dependent variable Y must be a categorical (qualitative) variable.
Examples:
Zakat Eligibility (Eligible / Not Eligible) vs. Family Income, Number of siblings etc.
Student Performance (Pass / Fail) vs. Study hours, Number of exercise etc.
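All of these examples fit the same underlying model: the log-odds (logit) of the outcome is modeled as a linear function of the predictors. For a single predictor x:

```latex
\operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x,
\qquad p = P(Y = 1) = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}
```

Because the logit is linear, each coefficient has an odds-ratio interpretation: a one-unit increase in x multiplies the odds of Y = 1 by e^(beta_1).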
Example 1 (Coronary Heart Disease)
Variable Description
AGE : Age of patients
CHD: Coronary Heart Disease (0 - absent, 1 - present)
install.packages("ACSWR") # install package ACSWR = A Course in Statistics with R
library(ACSWR) # load package
data("chdage") # load data
chdage # call data
summary(chdage) # summary() function = produce summary
Example 1 (Coronary Heart Disease)
install.packages("caret") # install package caret = Classification And REgression Training
library(caret) # load package
index1 <- createDataPartition(chdage$CHD, times = 1, p = 0.8, list = FALSE) # create index to split data for example to 80%
train1 <- chdage[index1,] # create train data
test1 <- chdage[-index1,] # create test data
Example 1 (Coronary Heart Disease)
train1$CHD[train1$CHD==0] <- "Absence" # re-label qualitative variable (0 - Absence, 1 - Presence)
train1$CHD[train1$CHD==1] <- "Presence"
test1$CHD[test1$CHD==0] <- "Absence" # re-label qualitative variable (0 - Absence, 1 - Presence)
test1$CHD[test1$CHD==1] <- "Presence"
train1$CHD <- as.factor(train1$CHD) # as.factor() function sets a variable as a qualitative (categorical) variable
test1$CHD <- as.factor(test1$CHD)
summary(train1)
summary(test1)
Example 1 (Coronary Heart Disease)
a) Fit the model
model_chd <- glm(CHD ~ ., data = train1, family = binomial) # build model using glm() function; "~ ." uses all predictors
model_chd <- glm(CHD ~ AGE, data = train1, family = binomial) # equivalent here, since AGE is the only predictor
summary(model_chd)
Alternative:
install.packages("blorr") # blorr package = Tools for Developing Binary Logistic Regression Models
library(blorr)
blr_regress(model_chd) # blr_regress() function to produce model summary.
b) Classification Matrix
An intuitively appealing way to summarize the results of a fitted logistic regression model is a classification table, which shows how well the model predicts the correct category. The table also provides the model's sensitivity and specificity: sensitivity measures the proportion of actual positives that are correctly identified, whereas specificity measures the proportion of actual negatives that are correctly identified.
blr_confusion_matrix(model_chd)
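As a sketch of what blr_confusion_matrix() reports, sensitivity and specificity can be computed by hand from the four cells of a 2 x 2 table of actual vs. predicted classes (the counts below are made up for illustration):

```r
# Hypothetical counts from a 2 x 2 classification table
TP <- 20  # actual Presence, predicted Presence (true positive)
FN <- 5   # actual Presence, predicted Absence  (false negative)
TN <- 45  # actual Absence,  predicted Absence  (true negative)
FP <- 10  # actual Absence,  predicted Presence (false positive)

sensitivity <- TP / (TP + FN)        # proportion of actual positives correctly identified
specificity <- TN / (TN + FP)        # proportion of actual negatives correctly identified
accuracy    <- (TP + TN) / (TP + FN + TN + FP)

sensitivity  # 0.8
specificity  # ~0.818
accuracy     # 0.8125
```

Note that accuracy alone can be misleading when one class dominates, which is why the classification table reports sensitivity and specificity separately.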
c) Make Predictions
probabilities <- predict(model_chd, test1, type = "response") # predict the probabilities of test set
probabilities
predicted.classes <- ifelse(probabilities > 0.5, "Presence", "Absence") # assign a class to each probability
predicted.classes
data.frame(probabilities,predicted.classes) # display in table using data.frame() function
d) Model Accuracy
mean(predicted.classes == test1$CHD) # check model accuracy
The classification prediction accuracy is about __%, which is good. The misclassification error rate is __%
Example 2 (Low Birth Weight)
Data Description: see the lowbwt help page in the ACSWR package (`?lowbwt`)
library(ACSWR)
data("lowbwt") # call data "Low Birth Weight"
head(lowbwt) # head() = display top 6 observation of the datasets
a) Data Partitioning
library(caret)
index2 <- createDataPartition(lowbwt$LOW, times = 1, p = 0.8, list = FALSE)
train2 <- lowbwt[index2,]
test2 <- lowbwt[-index2,]
b) Variable Formatting (Qualitative Variable)
train2$LOW[train2$LOW == 0] <- ">2500g" # re-label LOW (0 = >2500g, 1 = <2500g) for both train & test
train2$LOW[train2$LOW == 1] <- "<2500g"
test2$LOW[test2$LOW == 0] <- ">2500g"
test2$LOW[test2$LOW == 1] <- "<2500g"
summary(train2)
train2$SMOKE[train2$SMOKE == "0"] <- "No" # re-label SMOKE (0 = No, 1 = Yes) for both train & test
train2$SMOKE[train2$SMOKE == "1"] <- "Yes"
test2$SMOKE[test2$SMOKE == "0"] <- "No"
test2$SMOKE[test2$SMOKE == "1"] <- "Yes"
summary(train2)
train2$LOW <- as.factor(train2$LOW) #as.factor() function used to set a variable as qualitative variable
train2$RACE <- as.factor(train2$RACE)
train2$SMOKE <- as.factor(train2$SMOKE)
train2$PTL <- as.factor(train2$PTL)
train2$HT <- as.factor(train2$HT)
train2$UI <- as.factor(train2$UI)
train2$FTV <- as.factor(train2$FTV)
test2$LOW <- as.factor(test2$LOW) #as.factor() function used to set a variable as qualitative variable
test2$RACE <- as.factor(test2$RACE)
test2$SMOKE <- as.factor(test2$SMOKE)
test2$PTL <- as.factor(test2$PTL)
test2$HT <- as.factor(test2$HT)
test2$UI <- as.factor(test2$UI)
test2$FTV <- as.factor(test2$FTV)
summary(train2) # check the percentage for each category of the dependent variable
summary(test2)
c) Fit the model (Enter method)
model_enter <- glm(LOW ~ AGE + LWT + RACE + SMOKE + PTL + HT + UI + FTV, data = train2, family = binomial)
# build model using glm() function
summary(model_enter)
Alternative:
blr_regress(model_enter)
d) Classification Matrix
blr_confusion_matrix(model_enter)
e) Make Predictions
probabilities <- predict(model_enter, test2, type = "response") # predict the probabilities of test set
probabilities
predicted.classes <- ifelse(probabilities > 0.5, ">2500g", "<2500g") # glm() models the probability of the second factor level; "<2500g" sorts first, so probabilities are for ">2500g"
predicted.classes
data.frame(probabilities,predicted.classes) # display in table using data.frame() function
mean(predicted.classes == test2$LOW) # check model accuracy
f) Fit the model (Stepwise method)
library(MASS)
model_stepwise <- stepAIC(model_enter, trace = FALSE, direction = "both")
Alternative:
model_step <- blr_step_aic_both(model_enter, details = TRUE)
summary(model_stepwise)
probabilities <- predict(model_stepwise, test2, type = "response") # compute the probabilities of test set
predicted.classes <- ifelse(probabilities > 0.5, ">2500g", "<2500g") # assign a class to each probability; as above, probabilities are for the second factor level ">2500g"
mean(predicted.classes == test2$LOW) # check model accuracy
Example 2 (Low Birth Weight)
a) Goodness of Fit - Pseudo R-square
The goodness of fit of the logistic regression model can be expressed by several variants of pseudo R-squared statistics.
blr_model_fit_stats(model_stepwise)
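One common variant reported by blr_model_fit_stats() is McFadden's pseudo R-squared, which compares the log-likelihood of the fitted model to that of an intercept-only (null) model. A minimal sketch of the computation, using the built-in mtcars data instead of the lowbwt model so it runs on its own:

```r
# Fit a binary logistic model and its intercept-only counterpart
fit  <- glm(am ~ wt, data = mtcars, family = binomial)
null <- glm(am ~ 1,  data = mtcars, family = binomial)

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null)
mcfadden <- 1 - as.numeric(logLik(fit)) / as.numeric(logLik(null))
mcfadden  # between 0 and 1; larger values indicate a better fit than the null model
```

Unlike R-squared in linear regression, pseudo R-squared values do not measure explained variance, so they are best used to compare candidate models on the same data rather than judged against fixed cut-offs.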
b) Goodness of Fit - Hosmer & Lemeshow Test
H0: The logistic regression model is a good fit for the data
H1: The logistic regression model is not a good fit for the data
Decision Rule: Reject H0 if p-value < significance level (alpha)
blr_test_hosmer_lemeshow(model_stepwise)