RStudio

RStudio EDA Implementation of Tidyverse R with some ML for the titanic3 dataset

Below, we provide some R code adapted from scripts developed by Dave Langer and Hal Varian. It sets up an exploratory data analysis (EDA) spanning data manipulation and visualization, followed by some machine learning techniques and modelling, all executed in RStudio. We first investigate the dataset, then apply machine learning and evaluate classification accuracy with reference to a confusion matrix, which measures error. Machine learning, classification trees, ctree and the confusion matrix are explained in Hal Varian's paper. For the purposes of this analysis, the target variable is survival in the titanic3 dataset, and we focus on it in the discussion below:

Confusion Matrix for Titanic

# I found this link below extremely useful for sifting through the titanic dataset

# I try to adapt this template here as a neat way to perform Exploratory Data Analysis

# https://github.com/datasciencedojo/IntroDataVisualizationWithRAndGgplot2/blob/master/IntroDataVizRAndGgplot2.R

# https://www.youtube.com/channel/UCRhUp6SYaJ7zme4Bjwt28DQ



# By installing this package it is possible to introduce the titanic3 dataset into any R environment

install.packages("PASWR")

library(PASWR)


# The tidyverse is a collection of R packages designed

# for data science. All packages share an underlying design philosophy,

# grammar, and data structures that are especially useful for data

# transformation (dplyr) and visualization (ggplot2).

install.packages("tidyverse")

library(tidyverse)


# Also possible to install separately

#install.packages("ggplot2")

#library(ggplot2)


#install.packages("dplyr")

#library(dplyr)



# You could also retrieve the data from the titanic package or from Kaggle

# https://www.kaggle.com/c/titanic-dataset/data

# which would give you a titanic_train and a titanic_test data set


# we follow here in part the approach suggested by Hal Varian

# https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3


#some base R basic syntax

str(titanic3)

titanic <- titanic3

View(titanic)


#tidyverse R syntax

glimpse(titanic)


# A preliminary estimate of the survival rate, computed before we convert

# survived to a factor (sum() needs the numeric 0/1 coding)


summarise(titanic, SurvivalRate = sum(survived)/nrow(titanic)*100)
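
# A minimal sketch, not part of the original script: while survived is still
# coded 0/1, its mean is the survival proportion, so the rate can also be
# computed without nrow(). The vector toy_survived is made up for
# illustration only.
toy_survived <- c(1, 0, 0, 1, 0)
toy_rate <- mean(toy_survived) * 100  # percentage of 1s in the toy vector
toy_rate
# The equivalent call on the real data would be:
# summarise(titanic, SurvivalRate = mean(survived) * 100)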


# Keep a copy of the dataset as it was before any changes are introduced

titanicB4Fct <- titanic


# Setting up factors is important so that data is treated as

# categories and not numbers. Factor/categorical data is often observed in

# a business context and ggplot2 offers a substantial syntax for

# mapping and visualizing data that incorporate factors or categories.

titanic$pclass <- as.factor(titanic$pclass)

titanic$survived <- as.factor(titanic$survived)

titanic$sex <- as.factor(titanic$sex)

titanic$embarked <- as.factor(titanic$embarked)
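
# A minimal sketch of what the conversion changes, using a made-up vector
# (toy_flag is hypothetical, not a column of titanic3). Once a 0/1 code is a
# factor, R treats it as a set of labels: table() counts categories, and
# as.numeric() returns the internal level index rather than the label.
toy_flag <- c(0, 1, 1, 0)
toy_factor <- as.factor(toy_flag)
levels(toy_factor)                                # the distinct categories
table(toy_factor)                                 # counts per category
toy_index <- as.numeric(toy_factor)               # level positions, not labels
toy_back <- as.numeric(as.character(toy_factor))  # recovers the original codes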


# A quick inspection just to see what the effect of any changes are on

# the dataset

glimpse(titanic)


# ggplot2 and dplyr open up an expansive set of commands to

# unearth key details, trends and patterns


# Some Motivating Questions

# Question 1 - What was the survival rate and how does survival

# correlate with other variables?


# The survived variable can be presented using a simple barchart

ggplot(titanic, aes(x = survived)) +

geom_bar()


# Numbers not surviving or surviving

table(titanic$survived)


# Total passengers: those who did not survive plus those who survived

809 + 500


# If you would rather see proportions than counts.

prop.table(table(titanic$survived))
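
# prop.table() returns proportions between 0 and 1. A small sketch of turning
# them into rounded percentages; toy_outcome is a made-up vector used only
# for illustration.
toy_outcome <- factor(c(0, 0, 0, 1, 1))
toy_pct <- round(100 * prop.table(table(toy_outcome)), 1)
toy_pct
# On the real data: round(100 * prop.table(table(titanic$survived)), 1)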


# Add some additional customization for labels and theme.

ggplot(titanic, aes(x = survived)) +

theme_bw() +

geom_bar() +

labs(y = "Passenger Count",

title = "Titanic Passenger Survival Rates")



# Question 2 - What was the survival rate disaggregated for

# males and females?



# Plot barplot of passenger Sex

ggplot(titanic, aes(x = sex)) +

geom_bar()


# We can apply color to investigate varying layers (i.e., dimensions)

# of the data simultaneously.


ggplot(titanic, aes(x = sex, fill = survived)) +

theme_bw() +

geom_bar() +

labs(y = "Passenger Count",

title = "Titanic Passenger Survival Rates by Sex")
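
# The stacked bars above show counts, so the rate within each group has to be
# judged by eye. A hedged sketch of one alternative: position = "fill"
# rescales every bar to height 1, so the fill shows within-group proportions.
# toy_bars is a made-up data frame for illustration; swap in titanic to apply
# the idea to the real data.
library(ggplot2)
toy_bars <- data.frame(
  sex      = factor(c("female", "female", "female", "male", "male", "male")),
  survived = factor(c(1, 1, 0, 0, 0, 1))
)
toy_plot <- ggplot(toy_bars, aes(x = sex, fill = survived)) +
  theme_bw() +
  geom_bar(position = "fill") +
  labs(y = "Proportion of Passengers",
       title = "Survival Proportion by Sex (toy data)")
toy_plot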


# Question 3 - What was the survival rate by passenger class marked on ticket?

#

ggplot(titanic, aes(x = pclass, fill = survived)) +

theme_bw() +

geom_bar() +

labs(y = "Passenger Count",

title = "Titanic Survival Rates by passenger class")



# Question 4 - What was the survival rate by class of ticket and sex?

# We can leverage the facet_wrap command to further segment the data and enable

# some visual disaggregation of the data.

#

ggplot(titanic, aes(x = sex, fill = survived)) +

theme_bw() +

facet_wrap(~ pclass) +

geom_bar() +

labs(y = "Passenger Count",

title = "Titanic Survival Rates by passenger class and gender")


# We will now examine more intently continuous numerical data

# using ggplot2. Visual exploration of columns or single

# numeric variables (i.e. attributes, features) illustrates how

# ggplot2 facilitates potent visualization.

#


# Question 5 - What is the distribution of passenger ages?

#

# The histogram is a staple of visualizing numeric data as it very

# powerfully communicates the distribution of a variable

# (i.e. column).


ggplot(titanic, aes(x = age)) +

theme_bw() +

geom_histogram(binwidth = 10) +

labs(y = "Passenger Count",

x = "Age (binwidth = 10)",

title = "Passenger Age Distribution")



# Question 6 - What are the survival rates disaggregated by age?

#

ggplot(titanic, aes(x = age, fill = survived)) +

theme_bw() +

geom_histogram(binwidth = 5) +

labs(y = "Passenger Count",

x = "Age (binwidth = 5)",

title = "Titanic Passenger Survival Rates by Age")


# Another important visualization in the tidyverse toolkit is the

# box-and-whisker plot.

# A box and whisker plot (box plot) exhibits the five-number summary

# of a set of data. The five-number summary is the minimum, first quartile,

# median, third quartile, and maximum. In a box plot, the range from the first

# quartile to the third quartile is represented as a box, and a line cuts

# through the box at the median. Note that geom_boxplot() draws points more than

# 1.5 times the interquartile range beyond the box as individual outliers.

# Although not obviously important here at first glance, age proves to be

# a very significant predictor of survival

# https://r4ds.had.co.nz/exploratory-data-analysis.html#cat-cont


ggplot(titanic, aes(x = survived, y = age)) +

theme_bw() +

geom_boxplot() +

labs(y = "Age",

x = "Survived",

title = "Titanic Passenger Survival Rates by Age")
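
# A small numeric companion to the box plot, on a made-up vector (toy_age is
# hypothetical). fivenum() returns the minimum, lower hinge, median, upper
# hinge and maximum; boxplot.stats() reports the whisker ends and the values
# lying more than 1.5 times the box length beyond the box, which a box plot
# draws as separate outlier points.
toy_age <- c(2, 18, 22, 25, 29, 31, 36, 40, 80)
toy_five <- fivenum(toy_age)
toy_five
toy_box <- boxplot.stats(toy_age)
toy_box$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
toy_box$out    # values that would be drawn as individual outliers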


titanic %>%

ggplot(mapping = aes(x = pclass, y = age)) +

geom_jitter(aes(colour = survived), size = 1)+ # one jittered point per passenger, coloured by survival

geom_boxplot(alpha = 0.5, outlier.colour = NA)+

labs(title = "Age Distribution by Passenger Class on the Titanic",

x = "Passenger Class",

y = "Age") +

theme(plot.subtitle = element_text(

size=20))+

facet_wrap(.~sex)



# Question 7 - What are the survival rates by age when segmented by

# gender and class of ticket?



#Calculating mean and median age by Passenger Class and Gender for survivors

titanic %>%

filter(survived ==1)%>%

group_by(pclass, sex)%>%

summarise(

n = n(), #count of passengers

Average.age = mean(age, na.rm = TRUE),

Median.age = median(age, na.rm = TRUE)

)


# Calculating mean and median age by Passenger Class and Gender for those

# who perished

titanic %>%

filter(survived ==0)%>%

group_by(pclass, sex)%>%

summarise(

n = n(), #count of passengers

Average.age = mean(age, na.rm = TRUE),

Median.age = median(age, na.rm = TRUE)

)
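
# The two summaries above differ only in the filter. A hedged sketch of a
# single pass that adds survived to the grouping instead; toy_ages is a
# made-up data frame, and the same pipeline should carry over to titanic
# with pclass added to group_by().
library(dplyr)
toy_ages <- data.frame(
  survived = factor(c(1, 1, 0, 0, 0, 1)),
  sex      = factor(c("female", "female", "male", "male", "female", "male")),
  age      = c(30, NA, 40, 22, 50, 8)
)
toy_summary <- toy_ages %>%
  group_by(survived, sex) %>%
  summarise(
    n = n(),                                 # count of passengers
    Average.age = mean(age, na.rm = TRUE),
    Median.age = median(age, na.rm = TRUE),
    .groups = "drop"                         # return an ungrouped result
  )
toy_summary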


# Density Distribution

ggplot(titanic, aes(x = age, fill = survived)) +

theme_bw() +

facet_wrap(sex ~ pclass) +

geom_density(alpha = 0.5) +

labs(y = "Density",

x = "Age",

title = "Titanic Survival Rates by Age, Pclass and Sex")


# If you prefer histograms, no problem!

ggplot(titanic, aes(x = age, fill = survived)) +

theme_bw() +

facet_wrap(sex ~ pclass) +

geom_histogram(binwidth = 5) +

labs(y = "Passenger Count",

x = "Age",

title = "Titanic Survival Rates by Age, Pclass and Sex")



# Question 8 - How do fare (a proxy for class) and sex influence survival? Make use of

# Pivot tables.


# First, we will consider very simply the mean and median costs of fares

# There would appear to be some evidence for skew.


titanic %>%

summarise(meanFare = mean(fare, na.rm=TRUE))


# Check out the median Fare overall - the difference between mean and the median

# gives a sense of skew

titanic %>%

summarise(medianFare = median(fare, na.rm=TRUE))
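
# A quick illustration of why a mean well above the median suggests right
# skew: one very large value pulls the mean up but barely moves the median.
# toy_fare is a made-up vector, not taken from titanic3.
toy_fare <- c(7, 8, 8, 10, 13, 26, 500)
toy_mean <- mean(toy_fare)
toy_median <- median(toy_fare)
c(mean = toy_mean, median = toy_median)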


# There would appear to be a large discrepancy in the fares paid by men and women

# Men pay less than women on average

titanic %>%

filter(sex == "male") %>%

summarise(meanFareMen = mean(fare, na.rm=TRUE))


titanic %>%

filter(sex == "female") %>%

summarise(meanFareWomen = mean(fare, na.rm=TRUE))


# Men typically pay less than women

titanic %>%

filter(sex == "male") %>%

summarise(medianFareMen = median(fare, na.rm=TRUE))


# Check out median Fare for women

titanic %>%

filter(sex == "female") %>%

summarise(medianFareWomen = median(fare, na.rm=TRUE))


# Pivot tables

# A pivot table allows us to explore large sets of data interactively.

# Once you create a pivot table, you can quickly transform

# huge numbers of rows and columns into a meaningful, succinctly formatted report.


# We can observe large discrepancies between males and females in terms of fares


#Mean Values

pivotfare0 <- titanic %>%

group_by(pclass, sex) %>%

summarize(MeanFare = mean(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass)

View(pivotfare0)
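
# pivotfare0 is in "long" form, with one row per class and sex. A hedged
# sketch of reshaping such a summary into the wide layout of a spreadsheet
# pivot table with tidyr::pivot_wider(); toy_long is a made-up summary with
# invented fares, standing in for pivotfare0.
library(dplyr)
library(tidyr)
toy_long <- data.frame(
  pclass   = factor(c(1, 1, 2, 2)),
  sex      = factor(c("female", "male", "female", "male")),
  MeanFare = c(100, 70, 20, 15)
)
toy_wide <- toy_long %>%
  pivot_wider(names_from = sex, values_from = MeanFare)
toy_wide  # one row per class, one fare column per sex
# On pivotfare0 itself, PassengerCount would first need to be dropped or
# added to values_from, otherwise it keeps the rows from collapsing.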


# Median Values

pivotfare1 <- titanic %>%

group_by(pclass, sex) %>%

summarize(MedianFare = median(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass)

View(pivotfare1)


# Question 10 - Explore how class, sex and fares (mean and median) influence

# survival


# Mean values, disaggregated for survived/drowned

pivotfare2 <- titanic %>%

group_by(pclass, survived, sex) %>%

summarize(MeanFare = mean(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass, survived)

View(pivotfare2)


# Median values, disaggregated for survived/drowned

pivotfare3 <- titanic %>%

group_by(pclass, survived, sex) %>%

summarize(MedianFare = median(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass, survived)

View(pivotfare3)



# Question 11 - Does point of departure (embarkation) influence fare?

# Make use of Pivot Tables


#Proportion of passengers embarking from each port

titanic %>%

count(embarked) %>%

mutate(prop = n/sum(n)) %>%

arrange(desc(prop))


# The passenger numbers by embarkation, passenger class, their count and mean fares

pivotfare4 <- titanic %>%

group_by(pclass, embarked) %>%

summarize(MeanFare = mean(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass)

View(pivotfare4)


# The passenger numbers by embarkation, passenger class, their count and median fares

pivotfare5 <- titanic %>%

group_by(pclass, embarked) %>%

summarize(MedianFare = median(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass)

View(pivotfare5)


# Question 12 - Examine the link between fare, point of departure and survival


#Pivot tables start to become less intelligible when dimensionality increases

pivotfare6 <- titanic %>%

group_by(pclass, survived, embarked) %>%

summarize(MeanFare = mean(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass, survived)

View(pivotfare6)


pivotfare7 <- titanic %>%

group_by(pclass, survived, embarked) %>%

summarize(MedianFare = median(fare, na.rm=TRUE),

PassengerCount = n()) %>%

arrange(pclass, survived)

View(pivotfare7)


#It is useful to save some work and write it out to a spreadsheet

write.csv(pivotfare7,"pivotfare7.csv")


# A graph here might be useful. A picture is worth a thousand words


# We follow DataCamp here and add nuances or layers of complexity

# one plot at a time. Please see link

# https://www.datacamp.com/community/tutorials/tidyverse-tutorial-r

# The difference is somewhat striking


# It is quite clear that men paid less than women and were less

# likely to survive

# Scatter plot of Age vs Fare

ggplot(titanic, aes(x = age, y = fare)) +

geom_point()


# Scatter plot of Age vs Fare colored by Sex

ggplot(titanic, aes(x = age, y = fare, color = sex)) +

geom_point()


# Scatter plot of Age vs Fare colored by Sex faceted by Survived

ggplot(titanic, aes(x = age, y = fare, color = sex)) +

geom_point() +

facet_grid(~survived)



# Question 13 - What was the role of family size?

# Inspect the data before the FamilySize column is introduced

glimpse(titanic)


# We start by using base R to add a new feature (i.e., column) to the

# data frame called FamilySize: the passenger plus any siblings, spouses,

# parents and children aboard.

# The new column is added without using the tidyverse mutate() function

titanic$FamilySize <- 1 + titanic$sibsp + titanic$parch

glimpse(titanic)
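
# For comparison, a sketch of the tidyverse equivalent using dplyr::mutate().
# toy_family is a made-up data frame that borrows the sibsp and parch column
# names from titanic3.
library(dplyr)
toy_family <- data.frame(sibsp = c(0, 1, 3), parch = c(0, 2, 1))
toy_family <- toy_family %>%
  mutate(FamilySize = 1 + sibsp + parch)  # the passenger plus relatives aboard
toy_family
# On the real data: titanic <- titanic %>% mutate(FamilySize = 1 + sibsp + parch)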


# We can now try to find the link between family size and survival


# It is clear that very large families were few; however, the evidence points

# to low survival rates for them

ggplot(titanic, aes(x = FamilySize, fill = survived)) +

theme_bw() +

facet_wrap(sex ~ pclass) +

geom_histogram(binwidth = 1)


#######################################################################

# Introduction to Machine Learning using R code suggested by Hal Varian

#######################################################################

# Decision Trees

#######################################################################

# Question 14 - Implement a number of Machine Learning models to explore

# titanic3 dataset


# We will make use of models here suggested by Hal Varian

# https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3


sex <- titanic$sex

age <- titanic$age

class <- titanic$pclass

survived <- titanic$survived

sibsp <- titanic$sibsp



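# Fit a simple classification tree of survival on passenger class and age,

# then prune it back using the complexity parameter cp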

install.packages("rpart")

library("rpart")

model.rpart <- rpart(survived ~ class + age)

model.prune <- prune(model.rpart,cp=.038)
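
# The value cp = .038 above is taken as given. A hedged sketch of how a
# complexity parameter can be inspected and chosen from the cross-validation
# table that rpart stores with every fit. It uses the kyphosis data that
# ships with rpart so that it runs on its own; the names demo.rpart, demo.cp
# and demo.prune are illustrative only.
library(rpart)
set.seed(1)  # the cross-validation folds are random
demo.rpart <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
printcp(demo.rpart)  # CP, number of splits and cross-validated error (xerror)
demo.best <- which.min(demo.rpart$cptable[, "xerror"])
demo.cp <- demo.rpart$cptable[demo.best, "CP"]  # cp with the lowest xerror
demo.prune <- prune(demo.rpart, cp = demo.cp)
# One possible design choice; picking the row with the lowest xerror is a
# common rule of thumb, not the method used in the original script.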


# plot it

install.packages("rpart.plot")

library(rpart.plot)

# Like Figure 1 in Hal Varian's paper

rpart.plot(model.prune,type=0,extra=2)


graphics.off()


######################################################################

# ctree

######################################################################

# Some Background

# ctree identifies the structure of the branches using a sequence of

# hypothesis tests. ctrees consequently require only light pruning

# The ctree presented below initially divides by gender. The second

# node then divides by passenger class, the third node divides by age, and

# then we form bins which graphically depict the proportion who survived

# following a given set of conditions being met. The intuitive nature of ctree

# makes it a powerful tool for understanding machine learning.

######################################################################


install.packages("party")

library(party)

model.ctree <- ctree(as.factor(survived) ~ pclass + sex + age + sibsp, data = titanic)

plot(model.ctree)



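# In-sample predictions from the fitted tree

titanic.pred <- predict(model.ctree)

# Confusion matrix: actual survival (rows) against predicted survival (columns)

titanic.conf <- table(titanic$survived,titanic.pred)

titanic.conf

# Print the predictions and the actual values for inspection

titanic.pred

titanic$survived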


# Number of misclassified passengers: the two off-diagonal cells

titanic.error <- titanic.conf[2,1]+titanic.conf[1,2]

titanic.error
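
# titanic.error above is a count of misclassified passengers. A minimal
# sketch of turning a confusion matrix into rates, using a made-up 2 x 2
# table (toy.conf and its counts are invented for illustration).
toy.conf <- matrix(c(50, 10,
                     5, 35),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual = c("0", "1"),
                                   predicted = c("0", "1")))
toy.accuracy <- sum(diag(toy.conf)) / sum(toy.conf)  # share classified correctly
toy.error.rate <- 1 - toy.accuracy                   # share misclassified
c(accuracy = toy.accuracy, error.rate = toy.error.rate)
# On the real table: sum(diag(titanic.conf)) / sum(titanic.conf)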