Titanic Tidyverse

Titanic Sinks Four Hours After Hitting Iceberg; 866 Rescued By Carpathia, Probably 1,250 Perish; Ismay Safe, Mrs. Astor Maybe, Noted Names Missing. Expect to Pick Up the Few Hundreds Who Took to the Lifeboats. WOMEN AND CHILDREN FIRST. Carpathia Rushing to New York with the Survivors. Rescuers too late.....

RMS Titanic and Digital Storytelling

RMS Titanic was a Belfast-built liner operated by the White Star Line that sank off the coast of Newfoundland in the early morning of April 15th, 1912, during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of the most dramatic peacetime maritime disasters. The Titanic was constructed by the Harland & Wolff shipyard, and the passengers who drowned hailed from both sides of the Atlantic. The disaster was covered heavily by media around the world and subsequently became the subject of many documentaries and films, with huge public interest that never seems to wane. The most vivid depiction is probably the 1997 blockbuster tragic romance directed by James Cameron. Its storyline incorporates both historical and romanticized themes, with Leonardo DiCaprio and Kate Winslet starring as members of different social strata who fall in love aboard the ship during the doomed odyssey. In recent years, the Titanic Quarter was redeveloped close to the Harland & Wolff shipyard in Belfast, Northern Ireland, which formerly specialized in shipbuilding. Harland & Wolff was famous for having built the majority of the ocean liners for the White Star Line. Today, the shipyard operates in the ship repair, conversion and offshore construction sectors and has found a niche in certain types of marine activity. The keen interest in the Titanic catastrophe is still evident in the large throngs of tourists who have flocked to the Titanic Quarter every year since its inception.

Courtesy: https://www.kiln.digital/

Titanic by the Numbers.pptx

Visualisation and Data Transformation tools to grasp the titanic human tragedy

To appreciate the full potential of the R tidyverse suite, it is worth exploring how the ggplot2 and dplyr packages work together. A unified treatment reveals how the different packages complement one another. I have chosen to show how the powerful features of the tidyverse combine, using the titanic3 dataset available from Hal Varian. The titanic3 dataset is also available from Kaggle (https://www.kaggle.com/c/titanic/data) and from the PASWR package in the CRAN repository. The Kaggle portal furnishes a great deal of R and Python code for Exploratory Data Analysis and ML modelling. From a pedagogic perspective, the titanic3 dataset is relatively intuitive because most people have some domain knowledge from having seen the film(s) or read the book(s). In many respects, the dataset looks like a small HR or sales database with names, gender, age, addresses, class and fares. The titanic3 dataset is a staple of most Data Science courses, whether in professional training or in academia. The classic description of the titanic3 dataset takes the following form:

1309 Passengers (Rows), 14 Variables (Columns)

The data frame captures the survival status of passengers aboard the RMS Titanic. The titanic3 data frame does not include information relating to the crew. It includes actual and estimated ages for almost 80% of the passengers. Using the glimpse command from the R tidyverse, we obtained the following:

Observations: 1,309

Variables: 14

$ Pclass <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...

$ Survived <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1...

$ Name <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor", "Al...

$ Sex <chr> "female", "male", "female", "male", "female", "male", "female", "male"...

$ Age <dbl> 29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00, 39.00, 53.00, 71.00, 47...

$ SibSp <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0...

$ Parch <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0...

$ Ticket <chr> "24160", "113781", "113781", "113781", "113781", "19952", "13502", "11...

$ Fare <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77.9583, 0....

$ Cabin <chr> "B5", "C22 C26", "C22 C26", "C22 C26", "C22 C26", "E12", "D7", "A36", ...

$ Embarked <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "C", "C", "C", "C", "S", ...

$ boat <chr> "2", "11", "", "", "", "3", "10", "", "D", "", "", "4", "9", "6", "B",...

$ body <int> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA, NA, NA, NA, NA, ...

$ home.dest <chr> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal, PQ / Che...

where

pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)

survival: Survival (0 = No; 1 = Yes)

name: Name

sex: Sex

age: Age

sibsp: Number of Siblings/Spouses Aboard

parch: Number of Parents/Children Aboard

ticket: Ticket Number

fare: Passenger Fare (British pound)

cabin: Cabin

embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

boat: Lifeboat

body: Body Identification Number

home.dest: Home/Destination
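For readers working on the Python side of the Colab notebooks, a similar first look at the data can be sketched with pandas. This is only an illustrative sketch: the frame name sample is hypothetical, and it holds just the first seven passengers copied from the glimpse output above, restricted to the columns whose values are fully visible there.

```python
import pandas as pd

# First seven passengers, copied from the glimpse output above.
# "sample" is a hypothetical name; only some of the 14 columns are reproduced.
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 1, 1, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 1],
    "Sex":      ["female", "male", "female", "male", "female", "male", "female"],
    "Age":      [29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00],
    "SibSp":    [0, 1, 1, 1, 1, 0, 1],
    "Parch":    [0, 2, 2, 2, 2, 0, 0],
    "Fare":     [211.3375, 151.55, 151.55, 151.55, 151.55, 26.55, 77.9583],
})

# Rough counterpart of glimpse(): dimensions first, then one line per column
# showing its name, storage type and leading values.
print(f"Observations: {len(sample)}")
print(f"Variables: {sample.shape[1]}")
for name in sample.columns:
    print(f"$ {name} <{sample[name].dtype}> {sample[name].tolist()}")

# Missing values per column. This excerpt has no gaps, but on the full
# dataset Age is only recorded for part of the passengers.
missing = sample.isna().sum()
print(missing)
```

On the full titanic3 data the same calls would report 1,309 observations, and the missing-value count is a quick way to see how incomplete the Age column is.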

Exploratory Data Analysis using RStudio and ggplot2

Below we set up the tidyverse commands in RStudio to perform some Exploratory Data Analysis. We follow the ggplot2 commands suggested in David Langer's GitHub repository. We employ dplyr commands and pivot tables to summarize data from titanic3. I found the Vanderbilt.edu link useful for appreciating some aspects of the dataset. Please see the entire playlist here.

Titanic Tidyverse - a whirlwind data odyssey

R Code for Titanic Dataset Analysis

Titanic Tidyverse 2

EDA

Titanic Tidyverse 3

EDA

Effective Digital storytelling with Tidyverse and Machine Learning

Titanic Tidyverse 4

EDA

Titanic Tidyverse 5

Machine Learning

R Tidyverse in Google Colab

Below, we set out a series of Google Colab notebooks for running the R Tidyverse operations. Each notebook is accompanied by a video clip explaining the implementation. For the most part we follow David Langer's approach to graphing and visualizations. We also combine R and Python commands to demonstrate the versatility of the Colab notebook environment.

Install Tidyverse R suite of packages in Google Colab

Tidyverse is a leading collection of R packages for Data Transformation and Visualization. Tidyverse Data Manipulation and Visualization can be easily set up in Google Colab and shared as a cloud project for teams working together.

Google Colaboratory

Some interesting worked examples of Tidyverse applications to the titanic dataset

I found the following links useful when applying the tidyverse suite to the titanic3 dataset.

https://datascienceplus.com/getting-started-with-dplyr-in-r-using-titanic-dataset/

https://github.com/datasciencedojo/IntroDataVisualizationWithRAndGgplot2/blob/master/IntroDataVizRAndGgplot2.R

https://rstudio-pubs-static.s3.amazonaws.com/331601_26b5dcff888944b29a0081aac9e30858.html

https://www.kaggle.com/varimp/a-mostly-tidyverse-tour-of-the-titanic

Considering the effect of Class and Sex on Passenger Survival Prospects

Below, we set out a series of Tidyverse R commands that organize the titanic3 dataset into those who survived and those who drowned, broken down by passenger class and sex. Visualizations through ggplot2 assist in explaining the relevance of class and sex. The precise figures in the data tables become more intuitive when graphed.
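The grouping logic behind such tables can be sketched in pandas, roughly mirroring a dplyr group_by and summarise pipeline. This is a minimal sketch rather than the notebook's actual code: the frame name sample is hypothetical and holds only the first seven passengers from the glimpse output, all of whom travelled first class, so the class dimension is trivial here.

```python
import pandas as pd

# First seven passengers from the glimpse output ("sample" is a hypothetical name).
sample = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 1, 1, 1],
    "Survived": [1, 1, 0, 0, 0, 1, 1],
    "Sex":      ["female", "male", "female", "male", "female", "male", "female"],
})

# Head count for every class / sex / outcome combination,
# comparable to dplyr's count(Pclass, Sex, Survived).
counts = (
    sample.groupby(["Pclass", "Sex", "Survived"])
    .size()
    .reset_index(name="n")
)

# Group size and survival rate per class and sex, comparable to
# group_by(Pclass, Sex) followed by summarise(n = n(), rate = mean(Survived)).
rates = (
    sample.groupby(["Pclass", "Sex"])["Survived"]
    .agg(n="count", survival_rate="mean")
    .reset_index()
)

print(counts)
print(rates)
```

Because Survived is coded 0/1, its mean within a group is the survival rate for that group. Run on the full data, the same two statements would produce one row per class and sex combination.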

Google Colaboratory

Combining R and Python together in Google Colab: the correlation matrix for titanic3

One of the key attractions of Google Colab is the interoperability of R and Python commands on the platform. Google Colab permits the data libraries of both programming languages to be used in the same project. In the Colaboratory below, we run Tidyverse R together with the Python pandas, numpy, matplotlib and seaborn libraries. Calculations and data can be shared between the respective environments. We estimate the correlation matrix for the titanic3 numerical variables below. You might find the following Kaggle project useful for executing Python code and estimating a correlation matrix:
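As a rough illustration of the Python half of that workflow, a correlation matrix can be computed with pandas as follows. This is a sketch under stated assumptions: the frame name numeric is hypothetical, the values are the first seven passengers from the glimpse output, and Pclass is left out because every passenger in this excerpt is first class, which makes its correlation undefined.

```python
import pandas as pd

# Numeric columns for the first seven passengers from the glimpse output.
# "numeric" is a hypothetical name. Pclass is omitted because it is constant
# in this excerpt, so its correlation with other columns would be undefined.
numeric = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 1],
    "Age":      [29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00],
    "SibSp":    [0, 1, 1, 1, 1, 0, 1],
    "Parch":    [0, 2, 2, 2, 2, 0, 0],
    "Fare":     [211.3375, 151.55, 151.55, 151.55, 151.55, 26.55, 77.9583],
})

# Pairwise Pearson correlations (the pandas default); rows with missing
# values are dropped pair by pair, which matters for Age on the full data.
corr = numeric.corr()
print(corr.round(2))
```

With only seven rows the coefficients are not meaningful in themselves; the point is the mechanics. The resulting frame can be passed to a plotting library to draw a heatmap.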

Google Colaboratory

What was the effect of Age on Titanic Survival - The Importance of EDA

Below we consider the effect of age on survival. We run a logistic regression model and find that age is not a statistically significant factor in determining survival, yet much of our understanding of the events around the sinking suggests that age would have been an important criterion for getting on the lifeboats. Below we demonstrate why EDA is important for framing and extending our understanding of the catastrophe, and why having a good visualization tool is always a great plus when parsing through the data.
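One way to see why a regression on raw age can miss a pattern is to compare survival rates across age bands: a single linear age term makes the fitted log-odds rise or fall steadily with age, whereas banding lets each group differ freely. The sketch below illustrates the banding step only. The frame name sample, the band edges and the band labels are all assumptions chosen for illustration, and the seven rows come from the glimpse output.

```python
import pandas as pd

# First seven passengers from the glimpse output ("sample" is a hypothetical name).
sample = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 1],
    "Age":      [29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00],
})

# Hypothetical age bands: (0, 12], (12, 18], (18, 60], (60, 120].
# Passengers with a missing Age would receive no band and drop out of the groups.
bands = pd.cut(
    sample["Age"],
    bins=[0, 12, 18, 60, 120],
    labels=["child", "teen", "adult", "senior"],
)

# Group size and survival rate per band; observed=True keeps only the bands
# that actually occur in the data.
by_band = (
    sample.assign(AgeBand=bands)
    .groupby("AgeBand", observed=True)["Survived"]
    .agg(n="count", survival_rate="mean")
)
print(by_band)
```

On the full dataset a table like this, or the equivalent bar chart, can show whether particular age groups fared differently even when the overall age coefficient is not significant.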

Google Colaboratory

What was the effect of Family Size on Titanic Survival - The Importance of EDA

The survival rate increases with family size at first but then drops dramatically. A family size of more than 4 may be linked to class or other factors.
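A family-size variable of this kind is commonly derived from the SibSp and Parch columns. The sketch below assumes the convention FamilySize = SibSp + Parch + 1, counting the passenger themselves; that definition, the column name FamilySize and the frame name sample are assumptions, and the data are again the first seven passengers from the glimpse output.

```python
import pandas as pd

# First seven passengers from the glimpse output ("sample" is a hypothetical name).
sample = pd.DataFrame({
    "Survived": [1, 1, 0, 0, 0, 1, 1],
    "SibSp":    [0, 1, 1, 1, 1, 0, 1],
    "Parch":    [0, 2, 2, 2, 2, 0, 0],
})

# Assumed definition: siblings/spouses + parents/children + the passenger.
sample["FamilySize"] = sample["SibSp"] + sample["Parch"] + 1

# Group size and survival rate for each family size.
by_family = (
    sample.groupby("FamilySize")["Survived"]
    .agg(n="count", survival_rate="mean")
)
print(by_family)
```

Keeping the group size n next to the rate is a deliberate choice: large families are rare, so their survival rates rest on few passengers and should be read with caution.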

Google Colaboratory

R Tidyverse and leveraging the Python visualization tools of Matplotlib and Seaborn

Tidyverse and ggplot2 offer stunning visualization tools that can be invoked using a relatively simple syntax. ggplot2 is extremely practical and produces professional-looking representations with minimal effort, which makes it ideal for writing business reports. More stunning graphics can of course be produced using a much more elaborate syntax. Python possesses a number of powerful libraries for visualization, but more coding effort is required. That effort is justified when you need to dress to impress. In the Google Colab below, I head over to Kaggle and find an awesome Notebook for visualizing Age and Titanic Survival prospects:

Google Colaboratory