Applied machine learning often involves fitting and evaluating models on a dataset. To measure how well our models perform, we use the ROC-AUC score, which summarizes the trade-off between the true positive rate and the false positive rate across all classification thresholds [1].
Given that we cannot know which model will perform best on the dataset beforehand, this may involve a lot of trial and error until we find a model that performs well or best for our project.
An alternative approach is to prepare multiple different models and then combine their predictions. This is called an ensemble machine learning model, or simply an ensemble, and the process of finding a well-performing ensemble model is referred to as "ensemble learning" [2].
It has been recognized since the early days of ML research that ensembles of classifiers can be more accurate than individual models. In ML, ensembles are effectively committees that aggregate the predictions of individual classifiers. They are effective for much the same reasons a committee of experts works in human decision-making: they can bring different expertise to bear, and the averaging effect can reduce errors [3].
There are two main, related reasons to use an ensemble over a single model [2]:
Reliability: Ensembles can reduce the variance of the predictions.
Skill: Ensembles can achieve better performance than a single model, avoiding model bias error.
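The variance-reduction claim can be checked with a small simulation. The sketch below is illustrative only: it assumes each "model" is the true value plus independent, zero-mean Gaussian noise, which is an idealization (real models are usually correlated, so the reduction is smaller in practice). The names true_value, n_trials, and n_models are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

true_value = 1.0   # quantity every model tries to predict (hypothetical)
n_trials = 10_000  # number of repeated experiments
n_models = 10      # ensemble size (hypothetical)

# Assumption: each model's prediction is the true value plus independent noise.
predictions = true_value + rng.normal(0.0, 1.0, size=(n_trials, n_models))

# Variance of a single model versus variance of the averaged ensemble.
single_var = predictions[:, 0].var()
ensemble_var = predictions.mean(axis=1).var()

print(f"variance of one model:              {single_var:.3f}")
print(f"variance of the {n_models}-model average: {ensemble_var:.3f}")
```

Under the independence assumption, the variance of the average shrinks roughly in proportion to the number of models.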
The next figure helps to illustrate how the previous two features could be combined in four cases [4].
There are different types of ensemble methods, and each one brings its own advantages and disadvantages. Before diving into each method, let's understand two components of an ensemble learning architecture [4]:
Base learners: the first level of an ensemble learning architecture. Each one is trained to make individual predictions.
Meta learners: the second level. They are trained on the output of the base learners.
The next figure helps to illustrate how these elements could be combined.
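In code, the two-level structure can be sketched as follows. This is a minimal illustration, not the method of the cited tutorial: the choice of base learners (k-nearest neighbors and a shallow decision tree), the logistic regression meta learner, and the 5-fold out-of-fold scheme are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (illustrative settings).
X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Level 1: base learners (hypothetical choices).
base_learners = [
    KNeighborsClassifier(n_neighbors=5),
    DecisionTreeClassifier(max_depth=3, random_state=1),
]

# Out-of-fold predictions on the training set, so the meta learner is
# trained on predictions the base learners made for unseen rows.
meta_train = np.column_stack([
    cross_val_predict(m, X_train, y_train, cv=5, method="predict_proba")[:, 1]
    for m in base_learners
])

# Refit each base learner on the full training set, then predict the test set.
meta_test = np.column_stack([
    m.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    for m in base_learners
])

# Level 2: the meta learner is trained on the base learners' outputs.
meta_learner = LogisticRegression()
meta_learner.fit(meta_train, y_train)
accuracy = meta_learner.score(meta_test, y_test)
print(f"meta learner test accuracy: {accuracy:.3f}")
```

Using out-of-fold predictions for the meta learner's training data is one common way to keep it from simply memorizing base learners that have already seen the labels.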
The main ensemble methods are [2]:
Bagging Ensemble: works by creating samples of the training dataset and fitting a decision tree on each sample.
Random forest Ensemble: like bagging, the random forest ensemble fits a decision tree on different bootstrap samples of the training dataset. Unlike bagging, the random forest will also sample the features (columns) of each dataset.
AdaBoost Ensemble: Boosting involves adding models sequentially to the ensemble where new models attempt to correct the errors made by prior models already added to the ensemble. As such, the more ensemble members that are added, the fewer errors the ensemble is expected to make, at least to a limit supported by the data and before overfitting the training dataset.
Gradient boosting Ensemble: is a framework for boosting ensemble algorithms and an extension to AdaBoost. It re-frames boosting as an additive model under a statistical framework and allows for the use of arbitrary loss functions to make it more flexible and loss penalties (shrinkage) to reduce overfitting.
Voting Ensemble: uses simple statistics to combine the predictions from multiple models. Typically, this involves fitting multiple different model types on the same training dataset, and then calculating the average prediction in the case of regression, or the class label with the most votes in the case of classification (called hard voting).
Stacking Ensemble: involves combining the predictions of multiple different types of base models, much like voting. The important difference from voting is that another machine learning model is used to learn how to best combine the predictions of the base models. This is often a linear model, such as linear regression for regression problems or logistic regression for classification, but can be any machine learning model you like.
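scikit-learn ships an estimator for each of the methods listed above. The sketch below fits all six on a synthetic dataset and prints their ROC-AUC scores. The hyperparameters (50 estimators, tree depth 3) and the base models used for voting and stacking are arbitrary choices for illustration, not recommendations.

```python
from sklearn.datasets import make_blobs
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data (illustrative settings).
X, y = make_blobs(n_samples=200, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Base models shared by the voting and stacking ensembles (hypothetical choices).
base = [("knn", KNeighborsClassifier()),
        ("tree", DecisionTreeClassifier(max_depth=3, random_state=1))]

ensembles = {
    # BaggingClassifier uses a decision tree as its default base estimator.
    "bagging": BaggingClassifier(n_estimators=50, random_state=1),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=1),
    "adaboost": AdaBoostClassifier(n_estimators=50, random_state=1),
    "gradient boosting": GradientBoostingClassifier(n_estimators=50, random_state=1),
    # Soft voting averages predicted probabilities instead of counting labels.
    "voting (soft)": VotingClassifier(estimators=base, voting="soft"),
    # Stacking trains a logistic regression on the base models' predictions.
    "stacking": StackingClassifier(estimators=base,
                                   final_estimator=LogisticRegression()),
}

scores = {}
for name, model in ensembles.items():
    model.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"{name:<18}: {scores[name]:.3f}")
```

On a dataset this small the scores will be close to one another, so the output shows only how each estimator is used, not which method is better.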
The next code creates the same dataset as in the previous section, but with a different parameter value: cluster_std equal to 5 instead of 3.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def get_train_test(test_size=0.33, seed=1):
    """Generate a 2D two-class dataset and split it into train and test sets."""
    # generate 2d classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=seed, cluster_std=5)
    # fixing random_state makes the train-test split repeatable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=seed)
    return X_train, X_test, y_train, y_test

SEED = 1
# pass the seed by keyword so it is not mistaken for test_size
X_train, X_test, y_train, y_test = get_train_test(seed=SEED)
# A look at the data
print('X_test = ', X_test)
print('y_test = ', y_test)
X_test = [[  3.99858703  12.00557395]
 [ -5.11786366   2.42272223]
 [ -2.61873767  -0.03165495]
 [-15.99405267   0.36337804]
 [ -5.01287134   6.2943088 ]
 [ -5.09542341   0.18046166]
 [  4.33502949   5.33227196]
 [ -0.85937456   8.78733447]
 [  0.45791186   4.79319021]
 [ -3.34042694  -5.38988786]
 [ -2.70403107   7.33960582]
 [-10.67993622  -4.54861949]
 [ -4.75956413   7.89665004]
 [ -9.60860686  -0.86144724]
 [ -6.70246352 -12.09054025]
 [-12.58318479  -8.93848269]
 [ -7.3721509    2.65977626]
 [ -0.66806131   5.0015331 ]
 [ -4.30041867  -0.95835324]
 [-10.19119005 -12.03221032]
 [ -8.88012893   1.88416055]
 [  2.53535716   9.06200028]
 [ -0.23162328   8.83219569]
 [ -7.48178844   2.51278087]
 [-15.47227173  -3.10643638]
 [-10.93056136  -4.46207791]
 [ -4.39062396  -1.90884586]
 [ -0.15870831   2.64524064]
 [  2.8483937    6.91896156]
 [-10.90231401  -6.97295169]
 [-16.36050628  -2.38560995]
 [  7.06449892   0.60045536]
 [ -1.05876514   7.49250542]]
y_test = [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0]
The next code trains two models, logistic regression and a Gaussian mixture, predicts probabilities on the test set, and evaluates each model's ROC-AUC score.
import numpy as np
import pandas as pd

# ROC and AUC Score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

def get_models():
    """Generate a library of base learners."""
    lr = LogisticRegression(C=100, random_state=SEED)
    gm = GaussianMixture(n_components=2, random_state=SEED)
    models = {'logistic': lr,
              'gaussian': gm
              }
    return models

def train_predict(model_list):
    """Fit models in list on training set and return preds"""
    P = np.zeros((y_test.shape[0], len(model_list)))
    P = pd.DataFrame(P)
    print("Fitting models.")
    cols = list()
    # iterate over the argument, not the global `models`
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end=" ", flush=False)
        # GaussianMixture is unsupervised: its fit() ignores y_train, so
        # column 1 of predict_proba is a mixture component, not necessarily class 1
        m.fit(X_train, y_train)
        P.iloc[:, i] = m.predict_proba(X_test)[:, 1]
        cols.append(name)
        print("done")
    P.columns = cols
    print("Done.\n")
    return P

def score_models(P, y):
    """Score model in prediction DF"""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")

models = get_models()
P = train_predict(models)
print('P = ', P)
score_models(P, y_test)
Fitting models.
logistic... done
gaussian... done
Done.
P = logistic gaussian
0 0.000201 0.000127
1 0.562759 0.321583
2 0.451088 0.337373
3 0.999179 0.999934
4 0.215037 0.036274
5 0.750645 0.650172
6 0.002165 0.003317
7 0.010276 0.001585
8 0.022628 0.010160
9 0.906035 0.933811
10 0.048193 0.006970
11 0.997605 0.999668
12 0.113842 0.011478
13 0.982316 0.989557
14 0.998790 0.999860
15 0.999846 0.999998
16 0.804899 0.622029
17 0.038477 0.013044
18 0.749700 0.691259
19 0.999823 0.999996
20 0.927950 0.889530
21 0.001407 0.000671
22 0.007142 0.001306
23 0.822709 0.661517
24 0.999711 0.999993
25 0.997846 0.999722
26 0.819469 0.803962
27 0.069383 0.038095
28 0.002692 0.001934
29 0.999166 0.999947
30 0.999767 0.999995
31 0.002917 0.016437
32 0.018732 0.003600
Scoring models.
logistic : 0.988
gaussian : 0.988
Done.
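One caveat about the Gaussian mixture score: GaussianMixture is an unsupervised model, so its fit method does not use the labels, and the index of each mixture component is arbitrary. Column 1 of predict_proba happens to track class 1 in the output above, but that is not guaranteed. The sketch below is one possible way to make the mapping explicit, assuming the component is chosen by its ROC-AUC on the training labels; the variable name positive_component is hypothetical.

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import roc_auc_score
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

# Same dataset settings as in the code above.
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

gm = GaussianMixture(n_components=2, random_state=1)
gm.fit(X_train)  # labels are not used: the model is unsupervised

# Score each component's membership probability against the training labels.
train_proba = gm.predict_proba(X_train)
aucs = [roc_auc_score(y_train, train_proba[:, k]) for k in range(2)]

# Assumption: treat the component that best tracks class 1 as "positive".
positive_component = max(range(2), key=lambda k: aucs[k])

test_scores = gm.predict_proba(X_test)[:, positive_component]
aligned_auc = roc_auc_score(y_test, test_scores)
print("component treated as class 1:", positive_component)
print(f"test ROC-AUC after alignment: {aligned_auc:.3f}")
```

With two components the two training AUCs sum to roughly 1, so a score well below 0.5 is a sign that the components are flipped relative to the class labels.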
The prediction matrix P has one column per model, holding that model's predicted probabilities on the test set. Ensembles help most when the individual models are not strongly correlated, so it is useful to look at the correlation between the columns of P. First, let's install the library mlens.
!pip install mlens
Collecting mlens
  Downloading mlens-0.2.3-py2.py3-none-any.whl (227 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 227.7/227.7 kB 4.7 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.11 in /usr/local/lib/python3.10/dist-packages (from mlens) (1.23.5)
Requirement already satisfied: scipy>=0.17 in /usr/local/lib/python3.10/dist-packages (from mlens) (1.11.3)
Installing collected packages: mlens
Successfully installed mlens-0.2.3
The next code plots the correlation matrix of the predictions stored in P.
# You need ML-Ensemble for this figure: you can install it with: pip install mlens
from mlens.visualization import corrmat
import matplotlib.pyplot as plt
corrmat(P.corr(), inflate=False)
plt.show()
To create an ensemble, we average the predictions of the individual models, and as we might expect, the ensemble outperforms the baseline. Averaging is a simple process, and if we store model predictions, we can start with a simple ensemble and increase its size on the fly as we train new models. The next code shows that the voting ensemble built by averaging scores higher than either individual model.
print("Ensemble ROC-AUC score: %.3f" % roc_auc_score(y_test, P.mean(axis=1)))
Ensemble ROC-AUC score: 0.992
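Growing the ensemble "on the fly" can be sketched as follows: predictions are kept in a DataFrame, each newly trained model adds one column, and the ensemble is simply the row-wise mean of whatever columns exist. The extra models used here (Gaussian naive Bayes and k-nearest neighbors) are hypothetical additions for illustration and are not part of the original experiment.

```python
import pandas as pd
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Same dataset settings as in the code above.
X, y = make_blobs(n_samples=100, centers=2, n_features=2,
                  random_state=1, cluster_std=5)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# Start with the stored predictions of a single model.
first = LogisticRegression(C=100).fit(X_train, y_train)
P = pd.DataFrame({"logistic": first.predict_proba(X_test)[:, 1]})

# Later, add models one at a time and re-average (hypothetical model choices).
new_models = {"naive_bayes": GaussianNB(),
              "knn": KNeighborsClassifier(n_neighbors=5)}
for name, model in new_models.items():
    P[name] = model.fit(X_train, y_train).predict_proba(X_test)[:, 1]
    ensemble_auc = roc_auc_score(y_test, P.mean(axis=1))
    print(f"ensemble of {P.shape[1]} models: ROC-AUC = {ensemble_auc:.3f}")
```

Adding a model does not always raise the score. A weak or highly correlated model can leave it unchanged or lower it, so each addition is worth checking on held-out data.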
The ensemble's better performance can also be checked visually by plotting the ROC curves, as done in the next code.
from sklearn.metrics import roc_curve

def plot_roc_curve(y_test, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    plt.figure(figsize=(10, 8))
    plt.plot([0, 1], [0, 1], 'k--')
    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, P_base_learners.shape[1] + 1)]
    for i in range(P_base_learners.shape[1]):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(y_test, p)
        plt.plot(fpr, tpr, label=labels[i], c=cm[i + 1])
    fpr, tpr, _ = roc_curve(y_test, P_ensemble)
    plt.plot(fpr, tpr, label=ens_label, c=cm[0])
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.title('ROC curve')
    plt.legend(frameon=False)
    plt.show()

plot_roc_curve(y_test, P.values, P.mean(axis=1), list(P.columns), "ensemble")
The next code modifies the previous one to draw each method's ROC curve in its own panel, followed by a final panel that overlays the ensemble's curve on the individual curves for comparison.
from sklearn.metrics import roc_curve

def plot_roc_curve(y_test, P_base_learners, P_ensemble, labels, ens_label):
    """Plot the roc curve for base learners and ensemble."""
    n_learners = P_base_learners.shape[1]
    fig = plt.figure()
    # one panel per base learner plus a final panel for the comparison
    gs = fig.add_gridspec(n_learners + 1, hspace=0)
    axs = gs.subplots(sharex=True, sharey=True)
    fig.suptitle('ROC Curve')
    axs[n_learners].plot([0, 1], [0, 1], 'k--')
    cm = [plt.cm.rainbow(i)
          for i in np.linspace(0, 1.0, n_learners + 1)]
    for i in range(n_learners):
        p = P_base_learners[:, i]
        fpr, tpr, _ = roc_curve(y_test, p)
        # draw each learner in its own panel and again in the comparison panel
        axs[i].plot(fpr, tpr, label=labels[i], c=cm[i + 1])
        axs[n_learners].plot(fpr, tpr, label=labels[i], c=cm[i + 1])
    fpr, tpr, _ = roc_curve(y_test, P_ensemble)
    axs[n_learners].plot(fpr, tpr, label=ens_label, c=cm[0])
    # the labels and legend below apply to the last (comparison) panel
    plt.xlabel('False positive rate')
    plt.ylabel('True positive rate')
    plt.legend(frameon=False)
    plt.show()

plot_roc_curve(y_test, P.values, P.mean(axis=1), list(P.columns), "ensemble")
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1guqdIBa7PmeOIxFWL9twwE_rX51v2gdx?usp=sharing
[1] https://www.dataquest.io/blog/introduction-to-ensembles/
[2] https://machinelearningmastery.com/ensemble-machine-learning-with-python-7-day-mini-course/
[3] https://towardsdatascience.com/ensembles-in-machine-learning-9128215629d1
[4] https://www.datacamp.com/tutorial/ensemble-learning-python
Historical and visual examples
https://towardsdatascience.com/ensembles-in-machine-learning-9128215629d1
Types
https://scikit-learn.org/stable/modules/ensemble.html
Python code
Numerical example of customer segmentation using an LLM, K-means, and LLM + K-means
https://towardsdatascience.com/mastering-customer-segmentation-with-llm-3d9008235f41