Rstudio
Rstudio EDA Implementation of Tidyverse R with some ML for titanic3 dataset
Below, we provide some R code adapted from scripts developed by Dave Langer and Hal Varian. The code sets up an Exploratory Data Analysis spanning data manipulation and visualization, with some machine learning techniques and modelling introduced and executed in RStudio. We first investigate the dataset, then apply machine learning and evaluate classification accuracy using a confusion matrix to measure error. Machine learning, classification trees, ctree and the confusion matrix are explained in Hal Varian's paper. For the purposes of this analysis, the target variable is survival in the titanic3 dataset, and we focus on it in the discussion below:
Confusion Matrix for Titanic
# I found this link below extremely useful for sifting through the titanic dataset
# I try to adapt this template here as a neat way to perform Exploratory Data Analysis
# https://www.youtube.com/channel/UCRhUp6SYaJ7zme4Bjwt28DQ
# By installing this package it is possible to introduce the titanic3 dataset into any R environment
install.packages("PASWR")
library(PASWR)
# tidyverse R: the tidyverse is a collection of R packages designed
# for data science. All packages share an underlying design philosophy,
# grammar, and data structures that are especially useful for data
# transformation (dplyr) and visualization (ggplot2)
install.packages("tidyverse")
library(tidyverse)
# Also possible to install separately
#install.packages("ggplot2")
#library(ggplot2)
#install.packages("dplyr")
#library(dplyr)
# You could also retrieve data from the titanic package from
# https://www.kaggle.com/c/titanic-dataset/data
# and this would create a titanic_train and a titanic_test dataset
# we follow here in part the approach suggested by Hal Varian
# https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3
#some base R basic syntax
str(titanic3)
titanic <- titanic3
View(titanic)
#tidyverse R syntax
glimpse(titanic)
# Some Preliminary estimation before we force survived to be a factor
summarise(titanic, SurvivalRate = sum(survived)/nrow(titanic)*100)
# Store a copy of the dataset before any changes are introduced
titanicB4Fct <- titanic
# Setting up factors is important so that data is treated as
# categories and not numbers. Factor/categorical data is often observed in
# a business context and ggplot2 offers a substantial syntax for
# mapping and visualizing data that incorporates factors or categories.
titanic$pclass <- as.factor(titanic$pclass)
titanic$survived <- as.factor(titanic$survived)
titanic$sex <- as.factor(titanic$sex)
titanic$embarked <- as.factor(titanic$embarked)
# A quick inspection to see the effect of these changes on
# the dataset
glimpse(titanic)
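# A caveat on the factor conversion above: once survived is a factor,
# sum(survived) raises an error, and as.numeric() on a factor returns the
# internal level codes (1, 2) rather than the labels (0, 1). A minimal
# sketch of the pitfall and one way around it, using a toy vector
# (hypothetical values standing in for titanic$survived):

```r
# Toy 0/1 vector (hypothetical values, not taken from titanic3)
survived_num <- c(0, 1, 1, 0, 0)
survived_fct <- as.factor(survived_num)

# The mean of a 0/1 numeric vector is the survival proportion
rate_before <- mean(survived_num) * 100

# as.numeric() on a factor gives level codes, not the original labels
codes <- as.numeric(survived_fct)

# Going through character recovers the original 0/1 values
recovered <- as.numeric(as.character(survived_fct))
rate_after <- mean(recovered) * 100

print(codes)
print(rate_after)
```

# This is also why the script keeps titanicB4Fct as an unmodified copy.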
# ggplot2 and dplyr open up an expansive set of commands to
# unearth key details, trends and patterns
# Some Motivating Questions
# Question 1 - What was the survival rate and how does survival
# correlate with other variables?
# The survived variable can be presented using a simple barchart
ggplot(titanic, aes(x = survived)) +
geom_bar()
# Numbers not surviving or surviving
table(titanic$survived)
809+500
# If you really want percentages.
prop.table(table(titanic$survived))
# Add some additional customization for labels and theme.
ggplot(titanic, aes(x = survived)) +
theme_bw() +
geom_bar() +
labs(y = "Passenger Count",
title = "Titanic Passenger Survival Rates")
# Question 2 - What was the survival rate disaggregated for
# males and females?
# Plot barplot of passenger Sex
ggplot(titanic, aes(x = sex)) +
geom_bar()
# We can apply color to investigate varying layers (i.e., dimensions)
# of the data simultaneously.
ggplot(titanic, aes(x = sex, fill = survived)) +
theme_bw() +
geom_bar() +
labs(y = "Passenger Count",
title = "Titanic Passenger Survival Rates by Sex")
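# The stacked bars above show counts, not rates. One possible way to get
# the survival rate within each sex is a row-wise prop.table(). The sketch
# below uses a small toy data frame (hypothetical rows, not titanic3);
# on the real data the same call would be
# prop.table(table(titanic$sex, titanic$survived), margin = 1).

```r
# Toy data (hypothetical): 3 women, 5 men
toy <- data.frame(
  sex      = factor(c("female", "female", "female",
                      "male", "male", "male", "male", "male")),
  survived = factor(c(1, 1, 0, 0, 0, 0, 1, 0))
)

# Cross-tabulate counts, then convert each row to proportions
counts <- table(toy$sex, toy$survived)
rates  <- prop.table(counts, margin = 1)  # margin = 1: rows sum to 1

print(rates)
```

# With margin = 1 each row (sex) sums to 1, so column "1" is the survival
# rate for that sex.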
# Question 3 - What was the survival rate by passenger class marked on ticket?
#
ggplot(titanic, aes(x = pclass, fill = survived)) +
theme_bw() +
geom_bar() +
labs(y = "Passenger Count",
title = "Titanic Survival Rates by passenger class")
# Question 4 - What was the survival rate by class of ticket and sex?
# We can leverage the facet_wrap command to further segment the data and enable
# some visual disaggregation of the data.
#
ggplot(titanic, aes(x = sex, fill = survived)) +
theme_bw() +
facet_wrap(~ pclass) +
geom_bar() +
labs(y = "Passenger Count",
title = "Titanic Survival Rates by passenger class and gender")
# We will now examine more intently continuous numerical data
# using ggplot2. Visual exploration of columns or single
# numeric variables (i.e. attributes, features) illustrate how
# ggplot2 facilitates potent visualization.
#
# Question 5 - What is the distribution of passenger ages?
#
# The histogram is a staple of visualizing numeric data as it very
# powerfully communicates the distribution of a variable
# (i.e. column).
ggplot(titanic, aes(x = age)) +
theme_bw() +
geom_histogram(binwidth = 10) +
labs(y = "Passenger Count",
x = "Age (binwidth = 10)",
title = "Passenger Age Distribution")
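# ggplot2 drops rows with a missing age before drawing the histogram and
# reports this as a warning, so it helps to know how many values are
# missing. A minimal sketch on a toy data frame (hypothetical values);
# on the real data the equivalent call would be colSums(is.na(titanic)).

```r
# Toy data with some missing entries (hypothetical values)
toy <- data.frame(
  age  = c(22, NA, 30, NA, 4),
  fare = c(7.25, 71.3, NA, 8.05, 16.7)
)

# is.na() gives a logical matrix; column sums count the NAs per variable
missing <- colSums(is.na(toy))

print(missing)
```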
# Question 6 - What are the survival rates disaggregated by age?
#
ggplot(titanic, aes(x = age, fill = survived)) +
theme_bw() +
geom_histogram(binwidth = 5) +
labs(y = "Passenger Count",
x = "Age (binwidth = 5)",
title = "Titanic Passenger Survival Rates by Age")
# Another important visualization in the tidyverse toolkit is the
# box-and-whisker plot.
# A box-and-whisker plot (box plot) exhibits the five-number summary
# of a set of data. The five-number summary is the minimum, first quartile,
# median, third quartile, and maximum. In a box plot, the range from the
# first quartile to the third quartile is typically represented as a box.
# A line cuts through the box at the median (horizontal in the plot below,
# where age is on the y-axis). Although not obviously important here
# at first glance, age proves to be a very significant predictor of survival
# https://r4ds.had.co.nz/exploratory-data-analysis.html#cat-cont
ggplot(titanic, aes(x = survived, y = age)) +
theme_bw() +
geom_boxplot() +
labs(y = "Age",
x = "Survived",
title = "Titanic Passenger Survival Rates by Age")
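# The numbers behind a box plot can be inspected directly in base R.
# The sketch below uses a toy vector of ages (hypothetical values).
# fivenum() returns the five-number summary, while boxplot.stats() ends
# the whiskers at the most extreme points within 1.5 box lengths of the
# box and reports anything beyond that as an outlier. Note that ggplot2
# computes the box from quantiles, so its box edges can differ slightly
# from the hinges shown here.

```r
# Toy ages (hypothetical values)
ages <- c(2, 18, 22, 28, 35, 40, 71)

# Minimum, lower hinge, median, upper hinge, maximum
five <- fivenum(ages)

# Whisker ends and outliers under the 1.5 x box-length rule
bp <- boxplot.stats(ages)

print(five)
print(bp$stats)
print(bp$out)
```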
titanic %>%
ggplot(mapping = aes(x = pclass, y = age)) +
geom_point(colour = "#1380A1", size = 1) +
geom_jitter(aes(colour = survived))+ #This generates multiple colours
geom_boxplot(alpha = 0.5, outlier.colour = NA)+
labs(title = "Age Distribution by Passenger Class on the Titanic",
x = "Passenger Class",
y = "Age") +
theme(plot.subtitle = element_text(
size=20))+
facet_wrap(.~sex)
# Question 7 - What are the survival rates by age when segmented by
# gender and class of ticket?
#Calculating mean and median age by Passenger Class and Gender for survivors
titanic %>%
filter(survived ==1)%>%
group_by(pclass, sex)%>%
summarise(
n = n(), #count of passengers
Average.age = mean(age, na.rm = TRUE),
Median.age = median(age, na.rm = TRUE)
)
# Calculating mean and median age by Passenger Class and Gender for those
# who perished
titanic %>%
filter(survived ==0)%>%
group_by(pclass, sex)%>%
summarise(
n = n(), #count of passengers
Average.age = mean(age, na.rm = TRUE),
Median.age = median(age, na.rm = TRUE)
)
# Density Distribution
ggplot(titanic, aes(x = age, fill = survived)) +
theme_bw() +
facet_wrap(sex ~ pclass) +
geom_density(alpha = 0.5) +
labs(y = "Density",
x = "Age",
title = "Titanic Survival Rates by Age, Pclass and Sex")
# If you prefer histograms, no problem!
ggplot(titanic, aes(x = age, fill = survived)) +
theme_bw() +
facet_wrap(sex ~ pclass) +
geom_histogram(binwidth = 5) +
labs(y = "Passenger Count",
x = "Age",
title = "Titanic Survival Rates by Age, Pclass and Sex")
# Question 8 - How do fare (a proxy for class) and sex influence survival?
# Make use of pivot tables.
# First, we will consider very simply the mean and median costs of fares
# There would appear to be some evidence for skew.
titanic %>%
summarise(meanFare = mean(fare, na.rm=TRUE))
# Check out the median Fare overall - the difference between the mean and
# the median gives a sense of skew
titanic %>%
summarise(medianFare = median(fare, na.rm=TRUE))
# There would appear to be a large discrepancy in the fares paid by men and women
# Men pay less than women on average
titanic %>%
filter(sex == "male") %>%
summarise(meanFareMen = mean(fare, na.rm=TRUE))
titanic %>%
filter(sex == "female") %>%
summarise(meanFareWomen = mean(fare, na.rm=TRUE))
# Check out median Fare for men - men typically pay less than women
titanic %>%
filter(sex == "male") %>%
summarise(medianFareMen = median(fare, na.rm=TRUE))
# Check out median Fare for women
titanic %>%
filter(sex == "female") %>%
summarise(medianFareWomen = median(fare, na.rm=TRUE))
# Pivot tables
# A pivot table allows us to explore large sets of data interactively.
# Once you create a pivot table, you can quickly transform
# huge numbers of rows and columns into a meaningful, succinctly formatted report.
# We can observe large discrepancies between male and females in terms of fares
#Mean Values
pivotfare0 <- titanic %>%
group_by(pclass, sex) %>%
summarize(MeanFare = mean(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass)
View(pivotfare0)
# Median Values
pivotfare1 <- titanic %>%
group_by(pclass, sex) %>%
summarize(MedianFare = median(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass)
View(pivotfare1)
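# The grouped summaries above are in long format (one row per group).
# A spreadsheet-style pivot layout, with classes as rows and sexes as
# columns, can be sketched in base R with tapply(). This is an
# illustrative alternative rather than part of the original script, and
# the toy data frame below uses hypothetical values.

```r
# Toy data (hypothetical values, not titanic3)
toy <- data.frame(
  pclass = factor(c(1, 1, 1, 3, 3, 3)),
  sex    = factor(c("female", "male", "female", "male", "male", "female")),
  fare   = c(80, 50, 100, 8, 10, 12)
)

# Rows = passenger class, columns = sex, cells = mean fare
wide <- tapply(toy$fare, list(pclass = toy$pclass, sex = toy$sex), mean)

print(wide)
```

# Swapping mean for median in the call gives the median-fare version.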
# Question 10 - Explore how class, sex and fares (mean and median) influence
# survival
# Mean values, disaggregated for survived/drowned
pivotfare2 <- titanic %>%
group_by(pclass, survived, sex) %>%
summarize(MeanFare = mean(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass, survived)
View(pivotfare2)
# Median values, disaggregated for survived/drowned
pivotfare3 <- titanic %>%
group_by(pclass, survived, sex) %>%
summarize(MedianFare = median(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass, survived)
View(pivotfare3)
# Question 11 - Does point of departure (embarkation) influence fare?
# Make use of pivot tables
# Proportion of passengers embarking from each port
titanic %>%
count(embarked) %>%
mutate(prop = n/sum(n)) %>%
arrange(desc(prop))
# The passenger numbers by embarkation, passenger class, their count and mean fares
pivotfare4 <- titanic %>%
group_by(pclass, embarked) %>%
summarize(MeanFare = mean(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass)
View(pivotfare4)
# The passenger numbers by embarkation, passenger class, their count and median fares
pivotfare5 <- titanic %>%
group_by(pclass, embarked) %>%
summarize(MedianFare = median(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass)
View(pivotfare5)
# Question 12 - Examine the link between fare, point of departure and survival
#Pivot tables start to become less intelligible when dimensionality increases
pivotfare6 <- titanic %>%
group_by(pclass, survived, embarked) %>%
summarize(MeanFare = mean(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass, survived)
View(pivotfare6)
pivotfare7 <- titanic %>%
group_by(pclass, survived, embarked) %>%
summarize(MedianFare = median(fare, na.rm=TRUE),
PassengerCount = n()) %>%
arrange(pclass, survived)
View(pivotfare7)
#It is useful to save some work and write it out to a spreadsheet
write.csv(pivotfare7,"pivotfare7.csv")
# A graph here might be useful. A picture is worth a thousand words.
# We add nuances or layers of complexity step by step, following the
# approach suggested by DataCamp. Please see link
# https://www.datacamp.com/community/tutorials/tidyverse-tutorial-r
# The difference is somewhat striking
# It is quite clear that men paid less than women on average and were less
# likely to survive
# Scatter plot of Age vs Fare
ggplot(titanic, aes(x = age, y = fare)) +
geom_point()
# Scatter plot of Age vs Fare colored by Sex
ggplot(titanic, aes(x = age, y = fare, color = sex)) +
geom_point()
# Scatter plot of Age vs Fare colored by Sex faceted by Survived
ggplot(titanic, aes(x = age, y = fare, color = sex)) +
geom_point() +
facet_grid(~survived)
# Question 13 - What was the role of family size?
# Before the FamilySize column is introduced
glimpse(titanic)
# We start by using base R to create a new feature (i.e., column) in the
# data frame called FamilySize
# The new column is added without using the tidyverse mutate() function
titanic$FamilySize <- 1 + titanic$sibsp + titanic$parch
glimpse(titanic)
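# For comparison, the same column can be created with dplyr's mutate().
# A minimal sketch on a toy data frame (hypothetical values); on the real
# data the call would be
# titanic <- titanic %>% mutate(FamilySize = 1 + sibsp + parch).

```r
library(dplyr)

# Toy data (hypothetical values): siblings/spouses and parents/children
toy <- data.frame(sibsp = c(0, 1, 3), parch = c(0, 2, 1))

# The passenger counts as 1, plus siblings/spouses, plus parents/children
toy <- toy %>%
  mutate(FamilySize = 1 + sibsp + parch)

print(toy)
```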
# We can now try to find the link between family size and survival
# It is clear that very large families were few; however, the evidence
# points to low survival rates for them
ggplot(titanic, aes(x = FamilySize, fill = survived)) +
theme_bw() +
facet_wrap(sex ~ pclass) +
geom_histogram(binwidth = 1)
#######################################################################
# Introduction to Machine Learning using R code suggested by Hal Varian
#######################################################################
# Decision Trees
#######################################################################
# Question 14 - Implement a number of Machine Learning models to explore
# titanic3 dataset
# We will make use of models here suggested by Hal Varian
# https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3
sex <- titanic$sex
age <- titanic$age
class <- titanic$pclass
survived <- titanic$survived
sibsp <- titanic$sibsp
# fit a simple tree
install.packages("rpart")
library("rpart")
model.rpart <- rpart(survived ~ class + age)
model.prune <- prune(model.rpart,cp=.038)
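# The cp value passed to prune() above is fixed by hand. One common way
# to inspect candidate values is the complexity table that rpart stores
# in the fitted model. The sketch below is illustrative only: it uses the
# Titanic table that ships with base R (a different, aggregated dataset
# that includes crew) as a stand-in for titanic3, and picking the cp with
# the lowest cross-validated error is an assumption, not the rule used in
# the original script.

```r
library(rpart)

set.seed(1)  # cross-validated errors in the cp table are random

# Expand the built-in Titanic contingency table to one row per person
tab <- as.data.frame(Titanic)
passengers <- tab[rep(seq_len(nrow(tab)), tab$Freq),
                  c("Class", "Sex", "Age", "Survived")]

# Fit a classification tree and look at its complexity table
fit <- rpart(Survived ~ Class + Sex + Age, data = passengers,
             method = "class")
cp_table <- fit$cptable

# Pick the cp with the smallest cross-validated error and prune with it
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

print(cp_table)
```

# printcp(fit) and plotcp(fit) show the same information as text and as
# a plot.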
# plot it
install.packages("rpart.plot")
library(rpart.plot)
# Like Figure 1 in Hal Varian's paper
rpart.plot(model.prune,type=0,extra=2)
graphics.off()
######################################################################
# ctree
######################################################################
# Some Background
# ctree identifies the structure of the branches using a sequence of
# hypothesis tests. ctrees consequently require only light pruning.
# The ctree presented below initially divides by gender. The second
# node then divides by passenger class, the third node divides by age, and
# then we form bins which graphically depict the proportion who survived
# when a given set of conditions is met. The intuitive nature of ctrees
# makes them a powerful tool for understanding machine learning.
######################################################################
install.packages("party")
library(party)
model.ctree <- ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data = titanic)
plot(model.ctree)
titanic.pred <- predict(model.ctree)
titanic.conf <- table(titanic$survived,titanic.pred)
titanic.conf
titanic.pred
titanic$survived
titanic.error <- titanic.conf[2,1]+titanic.conf[1,2]
titanic.error
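# titanic.error above is a count of misclassified passengers (the two
# off-diagonal cells of the confusion matrix). Turning it into an error
# rate and an accuracy is a short step. A minimal sketch with toy actual
# and predicted labels (hypothetical values); with the objects above the
# equivalent would be titanic.error / sum(titanic.conf).

```r
# Toy labels (hypothetical values)
actual    <- factor(c(0, 0, 0, 1, 1, 1, 0, 1))
predicted <- factor(c(0, 0, 1, 1, 0, 1, 0, 1))

# Rows = actual class, columns = predicted class
conf <- table(actual, predicted)

# Off-diagonal cells are the misclassifications
errors     <- conf[2, 1] + conf[1, 2]
error_rate <- errors / sum(conf)

# Diagonal cells are the correct predictions
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(error_rate)
print(accuracy)
```

# Because the predictions here are made on the same data used to fit the
# tree, the resulting accuracy is an in-sample figure and is likely to be
# optimistic compared with performance on held-out data.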