Ensembles are predictive models that combine the predictions of two or more other classifier models. Even if each classifier is a weak learner (meaning it does only slightly better than random guessing), the ensemble can still be a strong learner (achieving high accuracy).
If you would love to know the mathematics behind why an ensemble performs well even with weak classifiers, then this document is for you. It also discusses various uses of ensembles in data science.
Many decisions we make in life are based on the opinions of multiple other people.
This includes choosing a book to read based on reviews, or choosing a course of action based on the advice of multiple doctors.
Ensembles are predictive models that combine predictions from two or more other models.
Ensemble learning methods are popular and the go-to technique when the best performance on a predictive modeling project is the most important outcome.
The models that contribute to the ensemble, referred to as ensemble members, may be the same type or different types and may or may not be trained on the same training data.
Performance: An ensemble can make better predictions and achieve better performance than any single contributing model.
Robustness: An ensemble reduces the spread or dispersion of the predictions and model performance.
The reason why ensemble learning performs better even with weak classifiers is explained below.
Let's consider the max-voting approach (the ensemble outputs 1 if the majority of classifiers vote 1).
We can answer this by mapping the problem to flipping many slightly biased coins simultaneously. Consider a binary classification problem.
Let's say each classifier gives the correct result only 51% of the time. If we use 1,000 classifiers of such weak performance, then for the ensemble's prediction to be correct we need 500 or more classifiers to be correct.
Let's check what the probability of this is.
Using the binomial distribution, we can work out that this probability is about 74%.
To verify, use the cumulative distribution function (CDF) of the binomial distribution, F(k; n, p) = sum over i = 0..k of C(n, i) * p^i * (1 - p)^(n - i), with n = 1000, p = 0.51 and k = 499 (i.e. 499 or fewer correct classifiers).
Note that you need to subtract the calculated CDF value from 1, since F(499; 1000, 0.51) is the probability that 499 or fewer classifiers are correct, and we want the complementary event (500 or more correct).
Alternatively, you can use this link to calculate the value automatically.
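As a quick check, here is a minimal sketch (assuming SciPy is available) that computes the same tail probability numerically:

```python
# Minimal sketch: probability that 500 or more of 1000 weak classifiers
# (each independently correct 51% of the time) are correct.
from scipy.stats import binom

n = 1000   # number of weak classifiers
p = 0.51   # probability that each classifier is correct
k = 500    # majority threshold used above (500 or more correct)

# P(X >= 500) = 1 - F(499; n, p), where F is the binomial CDF
prob_majority_correct = 1 - binom.cdf(k - 1, n, p)
print(round(prob_majority_correct, 3))  # ~0.747, consistent with the ~74% figure above
```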
All classifiers must be trained on different datasets. The calculation above is only valid if all classifiers are perfectly independent and make uncorrelated errors, which is clearly not the case if they are all trained on the same data.
A sufficient number of weak classifiers must be used.
The predictions made by the ensemble members may be combined using simple statistics, such as the mode (max voting) or the mean, or by more sophisticated methods that learn how much to trust each member and under what conditions.
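As an illustration, here is a minimal sketch of max voting (hard voting) over the predictions of several already-trained models; the example predictions below are made up:

```python
# Minimal sketch of max voting: for each sample, take the most common
# (mode) class label predicted across the ensemble members.
import numpy as np

def max_vote(predictions):
    """predictions: array-like of shape (n_models, n_samples) with integer class labels."""
    predictions = np.asarray(predictions)
    return np.apply_along_axis(
        lambda votes: np.bincount(votes).argmax(), axis=0, arr=predictions
    )

# Hypothetical predictions from three models on four samples
preds = [
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
print(max_vote(preds))  # -> [1 0 1 1]
```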
These models are naturally trained on different subsets of the training data through the use of resampling methods such as cross-validation and the bootstrap, which are designed to estimate the average performance of a model on unseen data. Refer here for details of the cross-validation approach.
If you create all the models on the same set of data and combine them, will that be useful? There is a high chance that these models will give the same result, since they are getting the same input.
So how can we solve this problem? One of the techniques is bootstrapping. Bootstrapping is a sampling technique in which we create subsets of observations from the original dataset, with replacement. The size of the subsets is the same as the size of the original set.
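Here is a minimal sketch of bootstrap sampling, using a tiny made-up dataset of ten observations:

```python
# Minimal sketch of bootstrapping: each subset is drawn from the original
# dataset with replacement and has the same size as the original dataset.
import numpy as np

rng = np.random.default_rng(42)
X = np.arange(10)  # stand-in for a dataset of 10 observations

n_models = 3  # one bootstrap sample per ensemble member
for i in range(n_models):
    idx = rng.choice(len(X), size=len(X), replace=True)  # sample with replacement
    print(f"bootstrap sample {i}: {X[idx]}")
```

Because sampling is done with replacement, each subset contains some duplicates and misses some observations, so each model sees a slightly different view of the data.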
Sometimes, the best performing model, e.g. the best expert, is sufficiently superior compared to other models that combining its predictions with other models can result in worse performance. As such, selecting models, even ensemble models, still requires carefully controlled experiments on a robust test harness.
Similarly, if a data point is incorrectly predicted by the first model, and also by the next (probably all models), will combining the predictions provide better results?
Boosting is an ensemble technique that attempts to create a strong classifier from a number of weak classifiers. Boosting is a sequential process, where each subsequent model attempts to correct the errors of the previous model. Boosting primarily reduces bias, and can also reduce variance.
This is done by building a model from the training data, then creating a second model that attempts to correct the errors from the first model. Models are added until the training set is predicted perfectly or a maximum number of models are added.
AdaBoost and XGBoost are examples of such models.
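A minimal sketch of this idea using scikit-learn's AdaBoostClassifier with decision stumps as the weak learners (the dataset here is synthetic, and in scikit-learn versions before 1.2 the estimator parameter is named base_estimator):

```python
# Minimal sketch: AdaBoost builds a strong classifier by sequentially adding
# decision stumps, each one weighted towards the examples the previous
# models misclassified.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # a weak learner (decision stump)
    n_estimators=100,
    random_state=0,
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```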
The following widely used ML models are based on ensemble learning:
Random forest. Refer here for details (see also the sketch after this list).
AdaBoost
XGBoost
Gradient boosting
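For instance, here is a minimal sketch on a synthetic dataset contrasting a single decision tree with a random forest, which bags many trees trained on bootstrap samples:

```python
# Minimal sketch: compare a single decision tree with a random forest
# (an ensemble of trees trained on bootstrap samples) on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)

single_tree = DecisionTreeClassifier(random_state=0)
forest = RandomForestClassifier(n_estimators=200, random_state=0)

print("single tree  :", cross_val_score(single_tree, X, y).mean())
print("random forest:", cross_val_score(forest, X, y).mean())
```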
Most neural network algorithms achieve sub-optimal performance due to the existence of an overwhelming number of sub-optimal local minima. If we take a set of neural networks that have converged to different local minima and average their predictions, we can construct an improved estimate.
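A minimal sketch of this averaging, where `models` is a hypothetical list of already-trained networks exposing a `predict_proba`-style interface:

```python
# Minimal sketch of model averaging: average the predicted class probabilities
# of several independently trained networks, then take the most probable class.
import numpy as np

def average_ensemble_predict(models, X):
    # Each model is assumed to return an (n_samples, n_classes) probability array.
    probs = np.mean([m.predict_proba(X) for m in models], axis=0)
    return probs.argmax(axis=1)
```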
This article discusses this approach for a handwritten-digit dataset.
For NLP, this paper discusses an ensemble approach for BERT.
This article discusses a sentiment analysis use case that ensembles multiple NLP algorithms.
Although ensemble predictions will have a lower variance, they are not guaranteed to have better performance than any single contributing member.
https://youtu.be/-ckKBa4EB64?t=1221
https://machinelearningmastery.com/why-use-ensemble-learning/
https://machinelearningmastery.com/what-is-ensemble-learning/
https://analyticsindiamag.com/hands-on-guide-to-create-ensemble-of-convolutional-neural-networks/
https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
https://machinelearningmastery.com/boosting-and-adaboost-for-machine-learning/
https://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
https://www.youtube.com/watch?v=FakVn1RgDms
https://images.app.goo.gl/XHMnNCHdTah3Tid78
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/understanding-random-forest-for-machine-learning
https://analyticsindiamag.com/random-forest-vs-xgboost-comparing-tree-based-algorithms-with-codes/
https://datascience.stackexchange.com/questions/23789/why-do-we-need-xgboost-and-random-forest
https://sites.google.com/site/jbsakabffoi12449ujkn/home/machine-intelligence/role-of-cross-validation-data-in-machine-learning
https://machinelearningmastery.com/how-to-create-a-random-split-cross-validation-and-bagging-ensemble-for-deep-learning-in-keras/
https://builtin.com/data-science/random-forest-algorithm
https://www.linkedin.com/posts/dpkumar_machinelearning-datascience-algorithms-activity-6754777287792697344--YNF
https://en.wikipedia.org/wiki/Binomial_distribution
https://www.amazon.in/Hands-Machine-Learning-Scikit-Learn-TensorFlow/dp/1491962291
https://images.app.goo.gl/mt6eqprsvGR3pCPp6
https://machinelearningmastery.com/ensemble-methods-for-deep-learning-neural-networks/