RMS Titanic was a Belfast-built liner operated by the White Star Line that sank off the coast of Newfoundland in the early morning of April 15, 1912, during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of the deadliest peacetime maritime disasters. The ship was built at the Harland & Wolff shipyard, and the victims came from both sides of the Atlantic. The disaster was covered extensively by the press around the world and subsequently became the subject of many documentaries and films, with a level of public interest that never seems to wane. The most vivid depiction is probably the 1997 blockbuster tragic romance directed by James Cameron. Its storyline blends historical and romanticized themes and stars Leonardo DiCaprio and Kate Winslet as members of different social strata who fall in love aboard the ship during the doomed voyage.
In recent years, the Titanic Quarter was redeveloped close to the Harland & Wolff shipyard in Belfast, Northern Ireland, which formerly specialized in shipbuilding. Harland & Wolff was famous for having built the majority of the ocean liners for the White Star Line. Today, the shipyard operates in the ship repair, conversion and offshore construction sectors and has found a niche in certain types of marine activity. The keen interest in the Titanic catastrophe is still evident in the throngs of tourists who have flocked to the Titanic Quarter every year since its inception.
To appreciate the full potential of the R tidyverse suite, it is worth exploring how the ggplot2 and dplyr packages work together. A unified treatment reveals how the different packages complement one another. I have chosen to show how the features of the tidyverse combine using the titanic3 dataset available from Hal Varian. The titanic3 dataset is also available from Kaggle (https://www.kaggle.com/c/titanic/data) and in the PASWR package on CRAN. The Kaggle portal furnishes a great deal of R and Python code for Exploratory Data Analysis (EDA) and ML modelling. From a pedagogic perspective, the titanic3 dataset is relatively intuitive because most people already have some domain knowledge from having seen the film(s) or read the book(s). In many respects, the dataset looks like a small HR or sales database with names, gender, age, addresses, class and fares. The titanic3 dataset is a staple of most Data Science courses, whether in professional training or in academia. The classic description of the titanic3 dataset takes the following form:
1309 Passengers (Rows), 14 Variables (Columns)
The data frame captures the survival status of passengers aboard the RMS Titanic. The titanic3 data frame does not include information relating to the crew. It includes actual and estimated ages for almost 80% of the passengers. Using the glimpse command from the tidyverse in R, we obtain the following:
Observations: 1,309
Variables: 14
$ Pclass <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ Survived <int> 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1...
$ Name <chr> "Allen, Miss. Elisabeth Walton", "Allison, Master. Hudson Trevor", "Al...
$ Sex <chr> "female", "male", "female", "male", "female", "male", "female", "male"...
$ Age <dbl> 29.00, 0.92, 2.00, 30.00, 25.00, 48.00, 63.00, 39.00, 53.00, 71.00, 47...
$ SibSp <int> 0, 1, 1, 1, 1, 0, 1, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0...
$ Parch <int> 0, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0...
$ Ticket <chr> "24160", "113781", "113781", "113781", "113781", "19952", "13502", "11...
$ Fare <dbl> 211.3375, 151.5500, 151.5500, 151.5500, 151.5500, 26.5500, 77.9583, 0....
$ Cabin <chr> "B5", "C22 C26", "C22 C26", "C22 C26", "C22 C26", "E12", "D7", "A36", ...
$ Embarked <chr> "S", "S", "S", "S", "S", "S", "S", "S", "S", "C", "C", "C", "C", "S", ...
$ boat <chr> "2", "11", "", "", "", "3", "10", "", "D", "", "", "4", "9", "6", "B",...
$ body <int> NA, NA, NA, 135, NA, NA, NA, NA, NA, 22, 124, NA, NA, NA, NA, NA, NA, ...
$ home.dest <chr> "St Louis, MO", "Montreal, PQ / Chesterville, ON", "Montreal, PQ / Che...
where
pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
survival: Survival (0 = No; 1 = Yes)
name: Name
sex: Sex
age: Age
sibsp: Number of Siblings/Spouses Aboard
parch: Number of Parents/Children Aboard
ticket: Ticket Number
fare: Passenger Fare (British pound)
cabin: Cabin
embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
boat: Lifeboat
body: Body Identification Number
home.dest: Home/Destination
Below we set up in RStudio the tidyverse commands to perform some Exploratory Data Analysis (EDA). We follow the ggplot2 commands suggested by David Langer on GitHub. We employ dplyr commands and apply pivot tabling to summarize data from titanic3. I found the Vanderbilt.edu link useful for understanding some aspects of the dataset. Please see the entire playlist here:
Titanic Tidyverse 2
EDA
Titanic Tidyverse 3
EDA
Titanic Tidyverse 4
EDA
Titanic Tidyverse 5
Machine Learning
Below, we set out a series of Google Colab notebooks for running the R tidyverse operations. Each Colab is accompanied by a video clip explaining the implementation. For the most part we follow David Langer's approach to graphing and visualization. We also combine R and Python commands to demonstrate the versatility of the Colab notebook environment.
The tidyverse is a leading collection of R packages for data transformation and visualization. Tidyverse data manipulation and visualization can be set up easily in Google Colab and shared as cloud projects by teams working together.
I found the following links useful when applying the tidyverse suite to the titanic3 dataset:
https://datascienceplus.com/getting-started-with-dplyr-in-r-using-titanic-dataset/
https://rstudio-pubs-static.s3.amazonaws.com/331601_26b5dcff888944b29a0081aac9e30858.html
https://www.kaggle.com/varimp/a-mostly-tidyverse-tour-of-the-titanic
Below, we set out a series of tidyverse R commands that organize the titanic3 dataset into those who survived and those who drowned, broken down by passenger class and sex. Visualizations through ggplot2 help to explain the relevance of class and sex. The precise figures in the data tables can be made more intuitive by graphing them.
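The grouping described above can also be sketched in pandas, one of the Python libraries the Colab notebooks run alongside the tidyverse. This is a minimal illustration rather than the notebook's own code: the rows are made up for the example (they are not titanic3 records), the column names are taken from the glimpse output, and the helper name survival_by_class_and_sex is hypothetical.

```python
import pandas as pd

# Illustrative rows only, NOT real titanic3 records.
# Column names follow the glimpse output (Pclass, Sex, Survived).
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Sex":      ["female", "male", "female", "female", "male",
                 "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 1, 0, 0, 1, 0, 1, 0],
})


def survival_by_class_and_sex(df):
    """Count passengers and survivors and compute the survival rate per class and sex."""
    return (
        df.groupby(["Pclass", "Sex"])["Survived"]
          .agg(passengers="count", survivors="sum", survival_rate="mean")
          .reset_index()
    )


summary = survival_by_class_and_sex(toy)
print(summary)

# Pivot-table view: classes as rows, sexes as columns, survival rate in the cells.
pivot = toy.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean")
print(pivot)
```

The long summary table is the shape a bar chart usually wants, while the pivot view is easier to read at a glance.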
One of the key attractions of Google Colab is the interoperability of R and Python commands on the platform. Google Colab permits use of the data libraries of both programming languages in the same project. In the Colab notebook below, we run tidyverse R together with the Python pandas, numpy, matplotlib and seaborn libraries. Calculations and data can be shared between the two environments. We estimate the correlation matrix below for the titanic3 numerical variables. You might find the following Kaggle project useful for executing Python code and estimating a correlation matrix:
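As a rough sketch of the correlation step, the pandas snippet below keeps only the numeric columns and computes their pairwise Pearson correlations. The rows are invented for illustration (they are not titanic3 records), and one age is deliberately left missing, as it is for many passengers in titanic3; pandas then computes each correlation from the rows where both variables are present.

```python
import pandas as pd

# Illustrative rows only, NOT real titanic3 records.
# Column names follow the glimpse output; one Age is deliberately missing.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 1],
    "Survived": [1, 0, 1, 0, 0, 1, 0, 1],
    "Sex":      ["female", "male", "female", "male",
                 "male", "female", "male", "female"],
    "Age":      [29.0, 30.0, 24.0, float("nan"), 22.0, 4.0, 40.0, 63.0],
    "SibSp":    [0, 1, 0, 0, 1, 1, 0, 1],
    "Parch":    [0, 2, 0, 0, 0, 1, 0, 0],
    "Fare":     [211.3, 151.6, 26.0, 13.0, 7.9, 8.1, 7.2, 77.9],
})

# Keep the numeric variables only; text columns such as Sex are dropped.
numeric = toy.select_dtypes(include="number")

# Pearson correlation matrix; missing values are excluded pair by pair.
corr = numeric.corr()
print(corr.round(2))
```

The resulting square table can be passed to a heatmap function in seaborn or matplotlib for display.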
Below we consider the effect of age on survival. We run a logistic regression model and find that age is not a statistically significant determinant of survival, yet much of our understanding of the events around the sinking suggests that age would have been an important criterion for getting a place on a lifeboat. Below we demonstrate why EDA is important for framing and extending our understanding of the catastrophe, and why a good visualization tool is always a great plus when parsing through the data.
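To make the regression step concrete, here is a minimal numpy sketch of a logistic regression of survival on age, fitted by Newton-Raphson, with a Wald z-test on the age coefficient. It is not the code used in the notebook (in R this would typically be a call to glm with a binomial family). The function name fit_logit is hypothetical and the ages and outcomes are invented, so the numbers it prints say nothing about the real titanic3 result.

```python
import math
import numpy as np


def fit_logit(x, y, n_iter=25):
    """Fit P(y=1) = 1 / (1 + exp(-(b0 + b1*x))) by Newton-Raphson.

    Returns the coefficients (intercept, slope) and their standard errors.
    """
    X = np.column_stack([np.ones(len(x)), x])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)                           # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])    # observed information
        beta = beta + np.linalg.solve(hess, grad)
    # Standard errors from the inverse information at the fitted coefficients.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    info = X.T @ (X * (p * (1.0 - p))[:, None])
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se


# Illustrative values only, NOT real titanic3 records.
age = np.array([2, 4, 18, 22, 24, 29, 30, 35, 40, 48, 54, 63], dtype=float)
survived = np.array([1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0], dtype=float)

# With the real data, rows with a missing age would have to be dropped first.
keep = ~np.isnan(age)
beta, se = fit_logit(age[keep], survived[keep])

z = beta[1] / se[1]                            # Wald statistic for the age slope
p_value = math.erfc(abs(z) / math.sqrt(2.0))   # two-sided normal p-value
print("intercept, slope:", beta)
print("std. errors:", se)
print("z for age:", z, "p-value:", p_value)
```

A p-value above the chosen threshold (commonly 0.05) is what "not statistically significant" means here. It does not show that age had no effect, which is one reason the visual exploration that follows is worthwhile.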
The survival rate rises with family size at first but then drops sharply for larger families. A family size of more than 4 may be linked to class or other factors.
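One way to build such a family-size variable is sketched below in pandas. It assumes that family size means SibSp + Parch + 1 (the passenger plus relatives aboard), a common convention but not one the text confirms. The rows are invented and were chosen to mimic the pattern described, so they are not evidence for it.

```python
import pandas as pd

# Illustrative rows only, NOT real titanic3 records.
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 0, 2, 4, 5, 0, 1, 3],
    "Parch":    [0, 0, 2, 0, 1, 2, 2, 0, 1, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
})

# Assumed definition: the passenger plus siblings/spouses plus parents/children.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Passenger count and survival rate for each family size.
rate = (
    toy.groupby("FamilySize")["Survived"]
       .agg(passengers="count", survival_rate="mean")
)
print(rate)
```

Keeping the passenger count next to the rate matters because large families are rare, so their survival rates rest on very few observations.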
The tidyverse and ggplot2 offer stunning visualization tools that can be invoked with a relatively simple syntax. ggplot2 is extremely practical and produces professional-looking graphics with minimal effort, which makes it ideal for writing business reports. More elaborate graphics can of course be produced with a more elaborate syntax. Python also has a number of powerful visualization libraries, but they require more coding effort. That effort is justified when you need to dress to impress. In the Google Colab below, I head over to Kaggle and find an awesome notebook for visualizing age and Titanic survival prospects: