The random forest is a model made up of many decision trees. Rather than simply averaging the predictions of the trees (which we could call a “forest”), this model uses two key concepts that give it the name random (both are sketched in R just after this list):
Random sampling of training data points when building trees
Random subsets of features considered when splitting nodes
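As a minimal base-R illustration of these two sources of randomness (the data frame here is hypothetical, invented just for this sketch):

# Hypothetical training data: 100 rows, 16 feature columns
train_data <- data.frame(matrix(rnorm(100 * 16), ncol = 16))

# 1. Bootstrap sample of the rows: drawn with replacement, so some rows repeat
boot_rows <- sample(nrow(train_data), nrow(train_data), replace = TRUE)
boot_sample <- train_data[boot_rows, ]

# 2. Random subset of the features considered at one split (sqrt(16) = 4 of them)
split_features <- sample(names(train_data), size = floor(sqrt(ncol(train_data))))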
When training, each tree in a random forest learns from a random sample of the data points. The samples are drawn with replacement, known as bootstrapping, which means that some samples will be used multiple times in a single tree. The idea is that, by training each tree on a different sample, each individual tree may have high variance with respect to its particular subset of the training data, but the forest as a whole ends up with lower variance, and not at the cost of increased bias.
At test time, predictions are made by averaging the predictions of each decision tree. This procedure of training each individual learner on different bootstrapped subsets of the data and then averaging the predictions is known as bagging, short for bootstrap aggregating.
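A minimal sketch of bagging in R, using rpart trees on the built-in mtcars data (a regression example, so the tree predictions are simply averaged); this illustrates the idea rather than reproducing the randomForest implementation:

library(rpart)

set.seed(42)
n_trees <- 25
trees <- vector("list", n_trees)

# Fit each tree on its own bootstrap sample of the training rows
for (i in seq_len(n_trees)) {
  boot <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  trees[[i]] <- rpart(mpg ~ ., data = boot)
}

# At test time, average the predictions of the individual trees
pred_matrix <- sapply(trees, predict, newdata = mtcars)
bagged_pred <- rowMeans(pred_matrix)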
The other main concept in the random forest is that only a subset of all the features is considered for splitting each node in each decision tree. Generally this is set to sqrt(n_features) for classification, meaning that if there are 16 features, only 4 randomly chosen features will be considered at each node in each tree. (The random forest can also be trained considering all the features at every node, as is common in regression. These options can be controlled in the Scikit-Learn Random Forest implementation.)
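In R's randomForest package the corresponding knob is the mtry argument (roughly analogous to Scikit-Learn's max_features); a short sketch on the built-in iris data, which has 4 features:

library(randomForest)

set.seed(1)
# Classification default: floor(sqrt(number of features)) candidates per split
rf_sqrt <- randomForest(Species ~ ., data = iris, mtry = floor(sqrt(4)))

# Consider every feature at every split (the behaviour more common in regression)
rf_all <- randomForest(Species ~ ., data = iris, mtry = 4)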
Random forest is a flexible, easy-to-use machine learning algorithm that produces a great result most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and versatility (it can be used for both classification and regression tasks).
Random forests are an example of an ensemble learner built on decision trees. ... In machine learning implementations of decision trees, the questions asked at each node generally take the form of axis-aligned splits in the data: that is, each node in the tree splits the data into two groups using a cutoff value within one of the features.
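For example, printing a single rpart tree fitted to the built-in iris data shows each node as exactly such a cutoff on one feature (a sketch for illustration only):

library(rpart)

# Each printed node is an axis-aligned split, e.g. Petal.Length < 2.45
tree <- rpart(Species ~ ., data = iris)
print(tree)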
In a random forest we use multiple randomized decision trees to get better accuracy. Random forest is an ensemble bagging algorithm that aims for low prediction error. It reduces the variance of the individual decision trees by training each tree on a random sample of the data and then either averaging their predictions or picking the class that gets the most votes.
It works in four steps (sketched in R after the list):
Select random samples from a given dataset.
Construct a decision tree for each sample and get a prediction result from each decision tree.
Perform a vote for each predicted result.
Select the prediction result with the most votes as the final prediction.
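A sketch of these four steps in R, using rpart classification trees on the built-in iris data and a majority vote (illustrative only; the randomForest package performs these steps, plus the per-split feature sampling, internally):

library(rpart)

set.seed(7)
n_trees <- 25

# Steps 1 and 2: draw a bootstrap sample and fit a decision tree to it
trees <- lapply(seq_len(n_trees), function(i) {
  boot <- iris[sample(nrow(iris), replace = TRUE), ]
  rpart(Species ~ ., data = boot, method = "class")
})

# Step 2 (continued): collect each tree's predicted class for every row
votes <- sapply(trees, function(t) as.character(predict(t, newdata = iris, type = "class")))

# Steps 3 and 4: tally the votes per row and keep the majority class
final_pred <- apply(votes, 1, function(v) names(which.max(table(v))))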
Source: Cardiography.csv
R Code:
library(randomForest)

# Read data
data <- read.csv("~/Desktop/CTG.csv", header = TRUE)
str(data)

# NSP is the class label; convert it to a factor and inspect the class counts
data$NSP <- as.factor(data$NSP)
table(data$NSP)

# The original snippet calls MDSplot(rf, train$NSP) without defining rf or train;
# the 70/30 split and model fit below are assumed setup so that the call can run
set.seed(123)
ind <- sample(2, nrow(data), replace = TRUE, prob = c(0.7, 0.3))
train <- data[ind == 1, ]
test <- data[ind == 2, ]
rf <- randomForest(NSP ~ ., data = train, proximity = TRUE)  # proximity is required by MDSplot

# Multidimensional scaling plot of the proximity matrix, coloured by class
MDSplot(rf, train$NSP)