Random forest is a flexible, easy-to-use machine learning algorithm that produces good results most of the time, even without hyper-parameter tuning. It is also one of the most widely used algorithms because of its simplicity and versatility (it can be used for both classification and regression tasks).
Random forest is a supervised learning algorithm. The "forest" it builds is an ensemble of decision trees, usually trained with the "bagging" method. The general idea of bagging is that a combination of learning models improves the overall result.
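The bagging idea above can be sketched in a few lines, assuming scikit-learn is available: `RandomForestClassifier` internally trains many decision trees on bootstrap samples and combines their votes. The dataset here is synthetic, purely for illustration.

```python
# Minimal sketch of a bagged ensemble of decision trees, assuming
# scikit-learn. RandomForestClassifier trains each tree on a bootstrap
# sample and majority-votes the predictions.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic toy data, just for illustration.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An ensemble ("forest") of 100 trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # held-out accuracy
```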
Andrew wants to decide where to go during a one-year vacation, so he asks the people who know him best for suggestions. The first friend he seeks out asks him about the likes and dislikes of his past travels. Based on the answers, he gives Andrew some advice.
This is a typical decision tree algorithm approach. Andrew's friend created rules to guide his decision about what he should recommend, by using Andrew's answers.
Afterwards, Andrew asks more and more of his friends for advice, and they again ask him different questions from which they can derive recommendations. Finally, Andrew chooses the places that were recommended to him most often, which is the typical random forest algorithm approach.
One big advantage of random forest is that it can be used for both classification and regression problems, which form the majority of current machine learning systems.
Another great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature on the prediction.
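One way to read those relative importances, assuming scikit-learn's impurity-based `feature_importances_` attribute (other implementations expose permutation importance instead):

```python
# Sketch of per-feature importance scores, assuming scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(iris.data, iris.target)

# Importances sum to 1; a higher score means the feature reduced
# impurity more across the forest's splits.
for name, score in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```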
Overfitting is normally a big concern in data science. Random forests are relatively resistant to overfitting, provided the hyperparameters are sensible (e.g., a large number of trees and limited tree depth).
If you are short of data, the validation set can be skipped: you can use the out-of-bag error as your validation error instead.
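A sketch of the out-of-bag (OOB) estimate, assuming scikit-learn: each tree's bootstrap sample leaves out roughly one third of the rows, and those left-out rows act as free validation data for that tree.

```python
# Using the out-of-bag score instead of a held-out validation set,
# assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, random_state=1)

clf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
clf.fit(X, y)
print(clf.oob_score_)  # accuracy estimated on out-of-bag rows only
```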
The random forest algorithm is used in a lot of different fields, like banking, the stock market, medicine and e-commerce.
In finance, for example, it is used to detect customers more likely to repay their debt on time, or use a bank's services more frequently. In this domain it is also used to detect fraudsters out to scam the bank. In trading, the algorithm can be used to determine a stock's future behavior.
In the healthcare domain it is used to identify the correct combination of components in medicine and to analyze a patient’s medical history to identify diseases.
Random forest is used in e-commerce to determine whether a customer will actually like the product or not.
Compared with a single decision tree, the random forest algorithm randomly selects observations and features, builds several decision trees, and then averages the results.
Another difference is that "deep" decision trees might suffer from overfitting. Most of the time, random forest prevents this by creating random subsets of the features and building smaller trees from those subsets.
RF involves several hyperparameters. Some control the structure of each individual tree, such as the minimal size (nodesize) a node must have to be split. Others control the structure and size of the forest (e.g., the number of trees) as well as its level of randomness (e.g., the number of variables, mtry, considered as candidate splitting variables at each split, or the sampling scheme used to generate the datasets on which the trees are built).
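The nodesize and mtry names come from the R randomForest package; a rough mapping to scikit-learn's constructor parameters looks like the sketch below (the mapping is approximate, since definitions differ slightly between implementations).

```python
# Approximate scikit-learn equivalents of the hyperparameters above.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,     # size of the forest: number of trees
    max_features="sqrt",  # ~ mtry: candidate splitting variables per split
    min_samples_leaf=5,   # ~ nodesize: minimal size of a terminal node
    bootstrap=True,       # sample rows with replacement per tree
    max_samples=0.8,      # sampling scheme: fraction of rows per tree
    random_state=0,
)
print(rf.get_params()["n_estimators"])
```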
Hyperparameter tuning is an important part of getting good accuracy. Hyperparameters include the number of decision trees in the forest and the number of features each tree considers when splitting a node.
The best way to think about hyperparameters is like the settings of an algorithm that can be adjusted to optimize performance, just as we might turn the knobs of an AM radio to get a clear signal (or your parents might have!).
Hyperparameter tuning relies more on experimental results than on theory, so the best method to determine the optimal settings is to try many different combinations and evaluate the performance of each model. The paper linked in the references discusses tuning in detail.
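The "try many combinations and evaluate each" approach can be sketched with scikit-learn's `RandomizedSearchCV`; the parameter grid below is an illustrative assumption, not a recommendation.

```python
# Sketch of randomized hyperparameter search, assuming scikit-learn.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=3)

# Illustrative search space only.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2", None],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=3),
    param_distributions,
    n_iter=5,       # evaluate 5 random combinations
    cv=3,           # 3-fold cross-validation per combination
    random_state=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```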
Tips
It is important to note that RF may be used in practice for two different purposes.
In some RF applications, the focus is on the construction of a classification or regression rule with good accuracy that is intended to be used as a prediction tool on future data.
In other RF applications, the goal is not to derive a classification or regression rule but to investigate the relevance of the candidate predictor variables for the prediction problem at hand or, in other words, to assess their respective contribution to the prediction of the response variable.
The two objectives above have to be kept in mind when investigating the effect of parameters. Note, however, that the two objectives may overlap.
Reference
https://builtin.com/data-science/random-forest-algorithm
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
https://syncedreview.com/2017/10/24/how-random-forest-algorithm-works-in-machine-learning/
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
https://arxiv.org/pdf/1804.03515.pdf
https://datascience.stackexchange.com/questions/61418/are-validation-sets-necessary-for-random-forest-classifier