1.9. Ensembles

Applied machine learning often involves fitting and evaluating models on a dataset. To measure how well our models perform, we use the ROC-AUC score, which summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds [1].
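One way to read the ROC-AUC score is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The following is a minimal sketch of that pairwise interpretation in plain Python; the function name roc_auc and the toy labels and scores are illustrative assumptions, not part of the original example, and in practice sklearn.metrics.roc_auc_score does this work.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a correct pair.
    """
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (hypothetical): two negatives and two positives
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ordered correctly
print(roc_auc(y_true, scores))  # 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why values such as 0.988 later in this section indicate strong models.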

Given that we cannot know which model will perform best on the dataset beforehand, this may involve a lot of trial and error until we find a model that performs well or best for our project.

An alternative approach is to prepare multiple different models and then combine their predictions. This is called an ensemble machine learning model, or simply an ensemble, and the process of finding a well-performing ensemble model is referred to as “ensemble learning” [2].

It has been recognized since the early days of ML research that ensembles of classifiers can be more accurate than individual models. In ML, ensembles are effectively committees that aggregate the predictions of individual classifiers. They are effective for much the same reasons that a committee of experts works in human decision-making: the members bring different expertise to bear, and the averaging effect can reduce errors [3].

There are two main, related reasons to use an ensemble over a single model [2]:

Reliability: Ensembles can reduce the variance of the predictions.
Skill: Ensembles can achieve better performance than a single model, avoiding model bias error.
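The variance-reduction claim can be illustrated with a small simulation. This is a sketch under a strong assumption: each model's prediction is the true value plus independent Gaussian noise. All constants below are hypothetical. Real models trained on the same data make correlated errors, so the reduction is usually smaller than shown here.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_VALUE = 0.7   # hypothetical quantity every model tries to predict
NOISE_STD = 0.2    # hypothetical spread of a single model's error
N_MODELS = 10      # ensemble size
N_TRIALS = 2000    # number of repeated experiments


def single_prediction():
    """One model's prediction: the true value plus independent noise."""
    return random.gauss(TRUE_VALUE, NOISE_STD)


def ensemble_prediction(n_models):
    """Average the predictions of n_models independent models."""
    return statistics.fmean(single_prediction() for _ in range(n_models))


single = [single_prediction() for _ in range(N_TRIALS)]
ensemble = [ensemble_prediction(N_MODELS) for _ in range(N_TRIALS)]

var_single = statistics.variance(single)
var_ensemble = statistics.variance(ensemble)
print("single model variance: %.4f" % var_single)
print("ensemble variance:     %.4f" % var_ensemble)
```

With independent errors, the variance of the average shrinks roughly in proportion to the number of models, which is the "reliability" benefit listed above.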

The next figure helps to illustrate how the previous two features could be combined in four cases [4].

Types of Ensembles

There are different types of ensemble methods, and each one brings a set of advantages and disadvantages. Before diving into each method, let's understand two components employed in all models [4]:

Base learners: the first level of an ensemble learning architecture; each one is trained to make individual predictions.
Meta learners: the second level; they are trained on the output of the base learners.
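The two levels can be sketched with scikit-learn as follows. This is an illustrative arrangement, not the method used later in this section: the choice of a decision tree and k-nearest neighbors as base learners, logistic regression as the meta learner, and 5-fold out-of-fold predictions are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (hypothetical settings)
X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Level 1: base learners, each trained to make individual predictions
base_learners = {
    "tree": DecisionTreeClassifier(max_depth=3, random_state=1),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

# Out-of-fold probabilities keep the meta learner from seeing predictions
# that a base learner made on its own training rows
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_learners.values()
])

# Refit each base learner on the full training set to predict the test set
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_learners.values()
])

# Level 2: the meta learner is trained on the base learners' outputs
meta_learner = LogisticRegression()
meta_learner.fit(meta_train, y_train)
accuracy = meta_learner.score(meta_test, y_test)
print("meta learner test accuracy: %.3f" % accuracy)
```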

The next figure helps to illustrate how these elements could be combined.

The main ensemble methods are [2]:

Bagging Ensemble: works by creating bootstrap samples of the training dataset, fitting a decision tree on each sample, and combining the trees' predictions.
Random forest Ensemble: like bagging, the random forest ensemble fits a decision tree on different bootstrap samples of the training dataset. Unlike bagging, the random forest also samples a random subset of the features (columns) when fitting each tree.
AdaBoost Ensemble: Boosting involves adding models sequentially to the ensemble where new models attempt to correct the errors made by prior models already added to the ensemble. As such, the more ensemble members that are added, the fewer errors the ensemble is expected to make, at least to a limit supported by the data and before overfitting the training dataset.
Gradient boosting Ensemble: a framework for boosting ensemble algorithms and an extension of AdaBoost. It reframes boosting as an additive model under a statistical framework, allows the use of arbitrary loss functions to make it more flexible, and uses loss penalties (shrinkage) to reduce overfitting.
Voting Ensemble: uses simple statistics to combine the predictions from multiple models. Typically, this involves fitting multiple different model types on the same training dataset, and then calculating the average prediction in the case of regression, or the class label with the most votes in the case of classification (called hard voting).
Stacking Ensemble: involves combining the predictions of multiple different types of base models, much like voting. The important difference from voting is that another machine learning model is used to learn how to best combine the predictions of the base models. This is often a linear model, such as linear regression for regression problems or logistic regression for classification, but can be any machine learning model you like.
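Each method in the list above has a counterpart in sklearn.ensemble. The sketch below fits all six on a small synthetic dataset to show how they are constructed; the dataset settings, the hyperparameters, and the choice of base estimators are illustrative assumptions, and the printed accuracies should not be read as a ranking of the methods.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (hypothetical settings)
X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)


def base_estimators():
    """Fresh (name, model) pairs for the voting and stacking ensembles."""
    return [("logistic", LogisticRegression()),
            ("tree", DecisionTreeClassifier(max_depth=3, random_state=1)),
            ("nb", GaussianNB())]


ensembles = {
    # bagging: the default base estimator is a decision tree
    "bagging": BaggingClassifier(n_estimators=25, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=25, random_state=1),
    "adaboost": AdaBoostClassifier(n_estimators=25, random_state=1),
    # learning_rate is the shrinkage applied to each added tree
    "gradient boosting": GradientBoostingClassifier(
        n_estimators=25, learning_rate=0.1, random_state=1),
    "voting (hard)": VotingClassifier(estimators=base_estimators(),
                                      voting="hard"),
    # stacking: a logistic regression learns how to combine the base models
    "stacking": StackingClassifier(estimators=base_estimators(),
                                   final_estimator=LogisticRegression()),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print("%-18s accuracy: %.3f" % (name, scores[name]))
```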

A numerical example using Voting Ensemble

The next code creates the same dataset as in the previous section, but with a different parameter value: cluster_std equals 5 instead of 3.

from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def get_train_test(test_size=0.33, SEED=1):
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=SEED, cluster_std=5)
    # split into train and test sets; a fixed random_state makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=SEED)
    return X_train, X_test, y_train, y_test

SEED = 1
# pass SEED by keyword so it is not mistaken for test_size
X_train, X_test, y_train, y_test = get_train_test(SEED=SEED)

# A look at the data
print('X_test = ', X_test)
print('y_test = ', y_test)

X_test = [[ 3.99858703 12.00557395] [ -5.11786366 2.42272223] [ -2.61873767 -0.03165495] [-15.99405267 0.36337804] [ -5.01287134 6.2943088 ] [ -5.09542341 0.18046166] [ 4.33502949 5.33227196] [ -0.85937456 8.78733447] [ 0.45791186 4.79319021] [ -3.34042694 -5.38988786] [ -2.70403107 7.33960582] [-10.67993622 -4.54861949] [ -4.75956413 7.89665004] [ -9.60860686 -0.86144724] [ -6.70246352 -12.09054025] [-12.58318479 -8.93848269] [ -7.3721509 2.65977626] [ -0.66806131 5.0015331 ] [ -4.30041867 -0.95835324] [-10.19119005 -12.03221032] [ -8.88012893 1.88416055] [ 2.53535716 9.06200028] [ -0.23162328 8.83219569] [ -7.48178844 2.51278087] [-15.47227173 -3.10643638] [-10.93056136 -4.46207791] [ -4.39062396 -1.90884586] [ -0.15870831 2.64524064] [ 2.8483937 6.91896156] [-10.90231401 -6.97295169] [-16.36050628 -2.38560995] [ 7.06449892 0.60045536] [ -1.05876514 7.49250542]]

y_test = [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0]

The next code trains two models, logistic regression and a Gaussian mixture, predicts probabilities on the test set, and evaluates the score of each model.

import numpy as np
import pandas as pd

# ROC and AUC Score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

def get_models():
    """Generate a library of base learners."""
    lr = LogisticRegression(C=100, random_state=SEED)
    gm = GaussianMixture(n_components=2, random_state=SEED)
    models = {'logistic': lr,
              'gaussian': gm
              }
    return models

def train_predict(model_list):
    """Fit models in list on training set and return preds"""
    P = np.zeros((y_test.shape[0], len(model_list)))
    P = pd.DataFrame(P)
    print("Fitting models.")
    cols = list()
    # iterate over the argument rather than the global `models`
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end=" ", flush=False)
        # GaussianMixture is unsupervised: its fit() ignores y_train, so column 1
        # of predict_proba is the second mixture component, not necessarily class 1
        m.fit(X_train, y_train)
        P.iloc[:, i] = m.predict_proba(X_test)[:, 1]
        cols.append(name)
        print("done")
    P.columns = cols
    print("Done.\n")
    return P

def score_models(P, y):
    """Score model in prediction DF"""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")

models = get_models()
P = train_predict(models)
print('P = ', P)
score_models(P, y_test)

Fitting models.

logistic... done

gaussian... done

Done.

P = logistic gaussian

0 0.000201 0.000127

1 0.562759 0.321583

2 0.451088 0.337373

3 0.999179 0.999934

4 0.215037 0.036274

5 0.750645 0.650172

6 0.002165 0.003317

7 0.010276 0.001585

8 0.022628 0.010160

9 0.906035 0.933811

10 0.048193 0.006970

11 0.997605 0.999668

12 0.113842 0.011478

13 0.982316 0.989557

14 0.998790 0.999860

15 0.999846 0.999998

16 0.804899 0.622029

17 0.038477 0.013044

18 0.749700 0.691259

19 0.999823 0.999996

20 0.927950 0.889530

21 0.001407 0.000671

22 0.007142 0.001306

23 0.822709 0.661517

24 0.999711 0.999993

25 0.997846 0.999722

26 0.819469 0.803962

27 0.069383 0.038095

28 0.002692 0.001934

29 0.999166 0.999947

30 0.999767 0.999995

31 0.002917 0.016437

32 0.018732 0.003600

Scoring models.

logistic : 0.988

gaussian : 0.988

Done.
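One caveat about the Gaussian mixture column: GaussianMixture is an unsupervised model, so its fit method ignores the labels, and nothing guarantees that mixture component 1 corresponds to class 1. In the run above the two happen to agree. The sketch below shows one possible way to make the mapping explicit by checking it on the training set; the variable positive_component and the 0.5 cut-off rule are assumptions introduced for illustration.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Same dataset settings as in the example above
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# The mixture is fitted on the features only; labels play no role here
gm = GaussianMixture(n_components=2, random_state=1).fit(X_train)

# Use the training labels (never the test labels) to decide which mixture
# component behaves like the positive class
train_proba = gm.predict_proba(X_train)
auc_component_1 = roc_auc_score(y_train, train_proba[:, 1])
positive_component = 1 if auc_component_1 >= 0.5 else 0

# Score the test set with the component chosen on the training set
test_score = gm.predict_proba(X_test)[:, positive_component]
test_auc = roc_auc_score(y_test, test_score)
print("positive component: %d" % positive_component)
print("test ROC-AUC: %.3f" % test_auc)
```

Choosing the component on the training set avoids leaking test labels into the model selection.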

We now have a prediction matrix P, where each column holds the predictions made by a given model, and a score for each model against the test set. To visualize how the models' predictions correlate, let's first install the library mlens.

!pip install mlens


The next code plots the correlation matrix of the predictions stored in P.

# You need ML-Ensemble for this figure: you can install it with: pip install mlens

from mlens.visualization import corrmat

import matplotlib.pyplot as plt

corrmat(P.corr(), inflate=False)

plt.show()

To create an ensemble, we average the predictions of the individual models. Averaging is a simple process, and if we store model predictions, we can start with a simple ensemble and increase its size on the fly as we train new models. The next code shows that the voting ensemble by averaging scores slightly higher than either model on its own (0.992 versus 0.988).

print("Ensemble ROC-AUC score: %.3f" % roc_auc_score(y_test, P.mean(axis=1)))

Ensemble ROC-AUC score: 0.992
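Averaging probabilities, as done above, is often called soft voting, whereas hard voting counts thresholded class labels. The two rules can disagree, as the following sketch shows. The probability matrix P_demo is made up for illustration and is unrelated to the matrix P computed earlier; the 0.5 threshold is likewise an assumption.

```python
import numpy as np

# Hypothetical positive-class probabilities: one row per sample,
# one column per model (three models here)
P_demo = np.array([
    [0.90, 0.80, 0.70],
    [0.45, 0.40, 0.95],
    [0.10, 0.20, 0.30],
    [0.55, 0.60, 0.05],
])

# Soft voting: average the probabilities, then threshold the average
soft_score = P_demo.mean(axis=1)
soft_label = (soft_score >= 0.5).astype(int)

# Hard voting: threshold each model first, then take the majority label
votes = (P_demo >= 0.5).astype(int)
hard_label = (votes.sum(axis=1) > P_demo.shape[1] / 2).astype(int)

print("soft voting labels:", soft_label.tolist())  # [1, 1, 0, 0]
print("hard voting labels:", hard_label.tolist())  # [1, 0, 0, 1]
```

In the second row a single very confident model outweighs two mildly negative ones under soft voting, while in the fourth row two weak positive votes win under hard voting. Soft voting also keeps a continuous score, which is what roc_auc_score needs.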

The ensemble's better performance can also be checked visually by plotting the ROC curves, as done in the next code.

from sklearn.metrics import roc_curve

def plot_roc_curve(y_test, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    plt.figure(figsize=(10, 8))
    plt.plot([0, 1], [0, 1], 'k--')

    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]

    for i in range(P_base_learners.shape[1]):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(y_test, p)
        plt.plot(fpr, tpr, label=labels[i], c=cm[i + 1])

    fpr, tpr, _ = roc_curve(y_test, P_ensemble)
    plt.plot(fpr, tpr, label=ens_label, c=cm[0])

    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(frameon=False)
    plt.show()

plot_roc_curve(y_test, P.values, P.mean(axis=1), list(P.columns), "ensemble")

The next code modifies the previous one to draw each curve in its own subplot, which makes it easier to compare the performance of each individual method with that of the ensemble formed by the two.

from sklearn.metrics import roc_curve

def plot_roc_curve(y_test, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    #plt.figure(figsize=(10, 8))
    fig = plt.figure()
    gs = fig.add_gridspec(3, hspace=0)
    axs = gs.subplots(sharex=True, sharey=True)
    fig.suptitle('ROC Curve')
    axs[P_base_learners.shape[1]].plot([0, 1], [0, 1], 'k--')

    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]

    for i in range(P_base_learners.shape[1]):