Although a small dataset, titanic3 offers an initial glimpse of how to approach problems where we must predict binary outcomes. Using the Tidyverse suite in R, we previously trawled through the Titanic data and established that being a woman, travelling in 1st class, or being a child dramatically improved the odds of survival. What if we could relegate that trawling to a series of independently learned variables and map them out autonomously? Machine learning can be used to build algorithms that independently combine these variables to predict outcomes. In the titanic3 dataset the target variable is survival, so we are confronted with a classification problem. This is, of course, a common setting: we frequently grapple with predicting binary-choice outcomes. Solving these problems without direct, hands-on human guidance in real time is the magic sauce that helps configure the optimal menu choice. This capacity to learn and identify patterns autonomously has fuelled today's digital revolution. The capacity to reveal and apply algorithms that accurately capture consumer leanings, viewing preferences and so on has transformed the internet from a gigantic repository of film and reading material into an indispensable daily tool with human-friendly interfaces. To understand the capabilities embedded in machine learning, it is worth taking a second look at the titanic3 dataset, where it appears possible to unearth and map out the key drivers of survival. See the video link below for an explanation and the R code in Google Colab.
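Before handing the problem over to tree-based learners, a useful baseline is a logistic regression on the drivers identified earlier. The snippet below is a minimal sketch, assuming titanic3 is already loaded as a data frame (for example from the CSV used in the Colab notebook) with the standard titanic3 variable names:

```r
library(dplyr)

# Keep the target and the three drivers flagged by the Tidyverse analysis,
# dropping rows with missing ages.
titanic <- titanic3 %>%
  select(survived, pclass, sex, age) %>%
  na.omit()

# Logistic regression: survived is coded 0/1, so family = binomial.
fit <- glm(survived ~ pclass + sex + age, data = titanic, family = binomial)

summary(fit)     # coefficient signs mirror the woman / 1st class / child story
exp(coef(fit))   # exponentiated coefficients read as odds ratios
```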
The visualizations from rpart and ctree broadly tally with the insights obtained from the Tidyverse analysis. To get a feel for how machine learning uncovers key nuances, we follow the approach set out by Hal Varian, who provides a bird's-eye view. Conditional inference trees and decision trees take a shape that graphically illustrates the impacts and nonlinear interactions of the predictors when modelling a target variable. Decision trees break the data down into smaller and smaller subsets, revealing the key contours that shape outcomes; the approach is also referred to as recursive partitioning. These findings clearly resonate with the detail garnered from the exploratory data analysis carried out with the Tidyverse suite. Machine-learned drivers are revealed almost instantaneously, which evidently matters wherever decision makers want to be presented, in real time, with the pertinent factors or the most appropriate menu choice. A domain expert typically wants to know which key drivers predict a target variable such as sales.
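To make recursive partitioning concrete, the minimal sketch below fits a classification tree with rpart on the same prepared data frame as above; rpart.plot is assumed here as a convenience for drawing the tree:

```r
library(rpart)
library(rpart.plot)

# Recursive partitioning: the data is split into smaller and smaller
# subsets, each split chosen to best separate survivors from casualties.
tree <- rpart(factor(survived) ~ pclass + sex + age,
              data = titanic, method = "class")

rpart.plot(tree)   # each node shows the predicted class and survival rate
```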
The Titanic dataset is valuable as a pedagogic tool in that it has the look and feel of a small business dataset, and the data analytics applied to titanic3 can easily be applied elsewhere to yield business intelligence. The binary outcome of survive/perish is similar to the binary outcome of a mouse click crystallizing into a sale or no sale, for instance. The small scale of titanic3 makes it easier to grasp how machine learning techniques can be used to garner insights into consumer decision making and salient influences, or the web content and keywords that resonate with a target demographic. CTree is a non-parametric class of regression trees embedding tree-structured regression models; it is applicable to all kinds of regression problems where the target variable is nominal, ordinal or numeric. Below, we exploit the flexible and extensible computational tools in the party package for fitting and visualizing conditional inference trees (see the video clip for an explanation). To introduce some R libraries that produce visualizations of decision trees and conditional inference trees, please see Hal Varian's R code in the Google Colab below (Varian is Google's chief economist). The R code comes largely from his paper in the Journal of Economic Perspectives: https://www.aeaweb.org/articles?id=10.1257/jep.28.2.3 The CTree visualization below is useful for grasping how a machine learning algorithm gets to work:
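As a rough sketch of how such a figure is produced (the exact code is in Varian's Colab), a conditional inference tree can be fitted and plotted with party as follows, reusing the prepared data frame from above:

```r
library(party)

# Conditional inference tree: splits are chosen via permutation tests,
# with the p-value of each split reported at the inner nodes.
ct <- ctree(factor(survived) ~ pclass + sex + age, data = titanic)

plot(ct)   # terminal nodes show the proportion surviving in each subset
```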
Below we extend the Varian (2014) analysis above and apply Python's sklearn libraries to generate predictions of survival. Importantly, we split the data into training and test sets and evaluate confusion matrices on the held-out data, in line with standard practice.
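The linked notebook uses Python's sklearn; the sketch below shows the same train/test split and confusion-matrix workflow in R, under the same assumptions as the earlier snippets:

```r
set.seed(42)  # arbitrary seed so the split is reproducible

# Hold out 30% of the rows for testing.
train_idx <- sample(nrow(titanic), size = floor(0.7 * nrow(titanic)))
train <- titanic[train_idx, ]
test  <- titanic[-train_idx, ]

tree <- rpart(factor(survived) ~ pclass + sex + age,
              data = train, method = "class")
pred <- predict(tree, newdata = test, type = "class")

# Confusion matrix: predicted vs. actual outcomes on unseen data.
table(predicted = pred, actual = test$survived)
```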
If you are eager to test the limits of these technologies with more advanced machine learning applied to titanic3, you might consider the following tutorial in Python and also this Kaggle submission by Subin An. You will need to install a few libraries if you do not have them already:
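The Python tutorial lists its own pip requirements; for the R snippets sketched in this post, a one-off set-up along these lines suffices (the package names are the ones used above):

```r
install.packages(c("dplyr", "rpart", "rpart.plot", "party", "randomForest"))
```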
A nice R notebook on Kaggle that expands the machine learning toolkit set out above is available at this link. For a very complete overview in base R, please check out Dave Langer (his GitHub is replete with code and video explanations). Please also check out this very intuitive Google Colab.
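One natural way such notebooks extend the toolkit is with ensembles of trees. A minimal random-forest sketch, assuming the randomForest package and the train/test split from above, might look like this:

```r
library(randomForest)

# A forest of 500 trees, each grown on a bootstrap sample of the data.
rf <- randomForest(factor(survived) ~ pclass + sex + age,
                   data = train, ntree = 500)

pred_rf <- predict(rf, newdata = test)
table(predicted = pred_rf, actual = test$survived)

varImpPlot(rf)   # which drivers the forest leans on most
```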
An emerging powerhouse in programming neural networks is TensorFlow, an open-source library from Google. This library is the foundation for many recent key advances in machine learning, including training computer programs to create unique works of music and visual art. To see the TensorFlow set-up in Google Colab, please follow the link.
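For completeness, and as a hedged sketch only, TensorFlow can also be reached from R via the tensorflow package (the linked Colab uses Python directly):

```r
install.packages("tensorflow")
library(tensorflow)
install_tensorflow()   # installs the underlying Python TensorFlow

# Smoke test: build and print a constant tensor.
hello <- tf$constant("Hello, TensorFlow!")
print(hello)
```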
https://www.kaggle.com/mjbhobe/titanic-chance-of-survival-acc-0-97-auc-0-99