In this section we are going to work through a small machine learning project end-to-end.
Here is an overview of what we are going to cover:
Installing the R platform.
Loading the dataset.
Summarizing the dataset.
Visualizing the dataset.
Evaluating some algorithms.
Making some predictions.
Download R - You can download R from The R Project webpage.
Install R - R is easy to install and I’m sure you can handle it. There are no special requirements. If you have questions or need help installing, see R Installation and Administration.
Start R - Open your command line, change to (or create) your project directory, and start R by typing: R
Or do it in the cloud: open an account with RStudio Cloud.
Install R Packages - Packages are third party add-ons or libraries that we can use in R.
The caret package provides a consistent interface into hundreds of machine learning algorithms and provides useful convenience methods for data visualization, data resampling, model tuning and model comparison, among other features. It’s a must have tool for machine learning projects in R.
install.packages("caret")
library(caret)
install.packages("tidyverse")
library(tidyverse)
#add the ellipse package for the multivariate plots (scatterplot matrix)
install.packages("ellipse")
library(ellipse)
The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.
You can learn more about this dataset on Wikipedia.
Load Data The Easy Way
Fortunately, the R platform provides the iris dataset for us. Load the dataset as follows:
# attach the iris dataset to the environment
data(iris)
#rename the dataset
dataset <- iris
You now have the iris data loaded in R and accessible via the dataset variable.
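If you would rather load the data from a CSV file instead (for example, a local copy of the UCI iris data), here is a minimal sketch; the filename and the lack of a header row are assumptions:
# load the dataset from a local CSV file (hypothetical filename, no header row assumed)
filename <- "iris.csv"
dataset <- read.csv(filename, header=FALSE)
# set the column names to match the built-in iris dataset
colnames(dataset) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")
# make the class a factor, as it is in the built-in data
dataset$Species <- as.factor(dataset$Species)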
Create a Validation Dataset
#create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(dataset$Species, p=.80, list=FALSE)
#select 20% of the data for validation
validation <- dataset[-validation_index,]
#use the remaining 80% of the data for training and testing the models
dataset <- dataset[validation_index,]
You now have training data in the dataset variable and a validation set we will use later in the validation variable.
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways:
Dimensions of the dataset.
Types of the attributes.
Peek at the data itself.
Levels of the class attribute.
Breakdown of the instances in each class.
Statistical summary of all attributes.
Dimensions of Dataset
#dimensions of the dataset: 120 instances and 5 attributes
dim(dataset)
Types of Attributes
#list the type of each attribute
sapply(dataset, class)
Peek at the Data
# take a peek at the first 6 rows of data
head(dataset)
Levels of The Class
The class variable is a factor. A factor is a class that has multiple class labels, or levels. Let’s look at the levels:
#list the levels for the class
levels(dataset$Species)
Class Distribution
#Summarize the class distribution
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq=table(dataset$Species), percentage=percentage)
We can see that each class has the same number of instances (40 or 33% of the dataset):
Statistical Summary
Now finally, we can take a look at a summary of each attribute.
This includes the mean, the min and max values, as well as some percentiles (25th, 50th or median, and 75th, i.e. the values at these points if we ordered all the values for an attribute).
#Summarize attribute distributions
summary(dataset)
We can see that all of the numerical values have the same scale (centimeters) and similar ranges [0,8] centimeters:
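One quick way to confirm those ranges programmatically is to compute the min and max of each numeric attribute; a small sketch:
# min and max of each numeric attribute (first four columns)
sapply(dataset[,1:4], range)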
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
Univariate plots to better understand each attribute.
Multivariate plots to better understand the relationships between attributes.
Univariate Plots
We start with some univariate plots, that is, plots of each individual variable.
It is helpful with visualization to have a way to refer to just the input attributes and just the output attributes. Let’s set that up and call the inputs attributes x and the output attribute (or class) y.
#split input and output
x <- dataset[,1:4]
y <- dataset[,5]
Given that the input variables are numeric, we can create box and whisker plots of each.
#boxplot for each attribute on one image:
par(mfrow=c(1,4))
for(i in 1:4){
boxplot(x[,i], main=names(iris)[i])
}
This gives us a much clearer idea of the distribution of the input attributes:
We can also create a barplot of the Species class variable to get a graphical representation of the class distribution (generally uninteresting in this case because they’re even).
#barplot for class breakdown
plot(y)
This confirms what we learned in the last section, that the instances are evenly distributed across the three classes:
Multivariate Plots
Now we can look at the interactions between the variables.
First let’s look at scatterplots of all pairs of attributes and color the points by class. In addition, because the scatterplots show that points for each class are generally separate, we can draw ellipses around them.
#scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")
We can also look at box and whisker plots of each input variable again, but this time broken down into separate plots for each class. This can help to tease out obvious linear separations between the classes.
#box and whisker plots for each attribute
featurePlot(x=x, y=y, plot="box")
This is useful to see that there are clearly different distributions of the attributes for each class value:
Next we can get an idea of the distribution of each attribute, again like the box and whisker plots, broken down by class value. Sometimes histograms are good for this, but in this case we will use some probability density plots to give nice smooth lines for each distribution.
#density plots for each attribute by class value
scales <- list(x=list(relation="free"),y=list(relation="free"))
featurePlot(x=x, y=y, plot="density", scales=scales)
Like the boxplots, we can see the difference in distribution of each attribute by class value. We can also see the Gaussian-like distribution (bell curve) of each attribute.
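If you did want to try plain histograms instead, as mentioned above, a minimal base-R sketch (one histogram per input attribute, not broken down by class value):
#histogram for each input attribute on one image
par(mfrow=c(1,4))
for(i in 1:4){
hist(x[,i], main=names(x)[i], xlab=names(x)[i])
}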
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
Set up the test harness to use 10-fold cross-validation.
Build 5 different models to predict species from flower measurements.
Select the best model.
Test Harness
We will use 10-fold cross-validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1, and repeat for all combinations of train-test splits. (For an even more stable estimate you could also repeat the whole process several times with different splits of the data into 10 groups; see the sketch after the code below.)
#Run Algorithms using 10-fold cross validation
control <- trainControl(method="cv", number=10)
metric <- "Accuracy"
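If you wanted the repeated version mentioned above, caret can repeat the cross-validation for you; a minimal sketch, not used in the rest of this tutorial, assuming 3 repeats:
#optional: repeated 10-fold cross validation (3 repeats) for a more stable estimate
control <- trainControl(method="repeatedcv", number=10, repeats=3)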
We are using the metric of “Accuracy” to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the metric variable when we build and evaluate each model next.
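For example, with hypothetical counts (not from this tutorial), the accuracy calculation is simply:
#hypothetical example: 114 correct predictions out of 120 instances
correct <- 114
total <- 120
correct / total * 100   # 95% accuracy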
Build Models
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
Let’s evaluate 5 different algorithms:
Linear Discriminant Analysis (LDA).
Classification and Regression Trees (CART).
k-Nearest Neighbors (kNN).
Support Vector Machines (SVM) with a radial (RBF) kernel.
Random Forest (RF).
This is a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
Let’s build our five models:
# a) linear algorithms
#LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# b) nonlinear algorithms
# CART
set.seed(7)
fit.cart <- train(Species~.,data=dataset, method="rpart", metric=metric, trControl=control)
# kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# c) advanced algorithms
# SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
#Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
Select Best Model
We now have 5 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
We can report on the accuracy of each model by first creating a list of the created models and using the summary function.
#summarize accuracy of models
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results)
We can see the accuracy of each classifier and also other metrics like Kappa:
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
#Compare accuracy of Models
dotplot(results)
We can see that the most accurate model in this case was LDA:
The results for just the LDA model can be summarized.
#Summarize the best model
print(fit.lda)
This gives a nice summary of what was used to train the model and the mean and standard deviation (SD) accuracy achieved, specifically 97.5% accuracy +/- 4%.
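If you want the individual per-fold accuracies behind that mean and standard deviation, caret keeps them on the fitted model object; a quick sketch:
#resampled Accuracy and Kappa for each cross-validation fold of the LDA model
fit.lda$resample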
The LDA was the most accurate model. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
We can run the LDA model directly on the validation set and summarize the results in a confusion matrix.
#Estimate skill of LDA on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
We can see that the accuracy is 100%. It was a small validation dataset (20%), but this result is within our expected margin of 97% +/- 4%, suggesting we may have an accurate and reliably accurate model.
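Finally, the same fitted model can be used to classify brand new measurements. A minimal sketch, using example values chosen purely for illustration:
#predict the species of a single new flower measurement (illustrative values)
new_flower <- data.frame(Sepal.Length=5.1, Sepal.Width=3.5, Petal.Length=1.4, Petal.Width=0.2)
predict(fit.lda, new_flower)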