1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture x K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
Splitting your dataset is essential for an unbiased evaluation of prediction performance [1]. In most cases, it’s enough to split your dataset randomly into three subsets:
The training set is used to train, or fit, your model. For example, you use the training set to find the optimal weights, or coefficients, for linear regression, logistic regression, or neural networks.
The validation set is used for unbiased model evaluation during hyperparameter tuning. For example, when you want to find the optimal number of neurons in a neural network or the best kernel for a support vector machine, you experiment with different values. For each candidate setting of the hyperparameters, you fit the model on the training set and assess its performance on the validation set.
The test set is needed for an unbiased evaluation of the final model. You shouldn’t use it for fitting or validation.
In less complex cases, when you don’t have to tune hyperparameters, it’s okay to work with only the training and test sets. In summary, the three sets are employed in the workflow described in [2]; a minimal split is sketched below.
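The sketch below uses a toy make_blobs dataset and an illustrative 60/20/20 proportion (neither is prescribed by [1] or [2]); the three subsets are obtained with two chained calls to scikit-learn's train_test_split:
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# toy data for illustration only
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1)
# first hold out the test set (20% of the samples)
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=1)
# then split the remainder into training (60%) and validation (20%) sets
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
The validation set is then used to compare hyperparameter settings, and the test set only once, at the very end.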
When evaluating different settings (“hyperparameters”) for estimators there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance [3].
To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set. However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.
A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets (other approaches are described below, but generally follow the same principles). The following procedure is followed for each of the k “folds”:
A model is trained using k-1 of the folds as training data;
The resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).
The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
The next figure illustrates the k-fold cross-validation approach.
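As a complement to the figure, the sketch below shows how scikit-learn's KFold partitions a toy array of six samples (an illustrative choice) into three folds:
import numpy as np
from sklearn.model_selection import KFold
X_demo = np.arange(12).reshape(6, 2)  # 6 samples, 2 features (toy data)
kf = KFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_demo)):
    # each fold holds out a different third of the samples for validation
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")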
There are several methods for performing cross-validation [4, 5]; a short usage sketch of two of them follows the list:
K-Fold
Stratified K-Fold
Group K-Fold
Stratified Group K-Fold
Leave-One-Out
Leave-One-Group-Out
Leave-P-Out
Leave-P-Groups-Out
Shuffle Split
Stratified Shuffle Split
Group Shuffle Split
Monte Carlo cross-validation
Bootstrapping
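As mentioned above, here is a short usage sketch of two of these splitters. Most of them live in sklearn.model_selection and share the same split() interface, so they can be passed to cross_val_score interchangeably (the group labels below are purely illustrative, e.g. samples coming from the same subject):
import numpy as np
from sklearn.model_selection import GroupKFold, ShuffleSplit
X_demo = np.arange(16).reshape(8, 2)
y_demo = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])  # which subject each sample came from
# GroupKFold keeps all samples of a group in the same fold
for train_idx, val_idx in GroupKFold(n_splits=2).split(X_demo, y_demo, groups):
    print("GroupKFold validation groups:", np.unique(groups[val_idx]))
# ShuffleSplit draws independent random train/validation partitions
for train_idx, val_idx in ShuffleSplit(n_splits=2, test_size=0.25, random_state=1).split(X_demo):
    print("ShuffleSplit validation indices:", val_idx)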
Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold [5].
StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set [5].
The next figure illustrates how the StratifiedKFold works [4].
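In addition to the figure, here is a minimal sketch of StratifiedKFold on an imbalanced toy label vector, showing that each validation fold keeps roughly the same class proportions as the complete set:
import numpy as np
from sklearn.model_selection import StratifiedKFold
X_demo = np.zeros((12, 2))             # the features do not influence the stratified split
y_demo = np.array([0] * 9 + [1] * 3)   # imbalanced labels: 75% class 0, 25% class 1
skf = StratifiedKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(skf.split(X_demo, y_demo)):
    # every validation fold gets three samples of class 0 and one of class 1
    print(f"fold {fold}: validation labels = {y_demo[val_idx]}")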
More cross-validation techniques for time series data can be found in [6].
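One such technique that ships with scikit-learn is TimeSeriesSplit, where each fold trains only on past observations and validates on the observations that immediately follow. A minimal sketch on a toy ordered series:
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
X_series = np.arange(20).reshape(10, 2)  # 10 time-ordered samples (toy data)
tscv = TimeSeriesSplit(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X_series)):
    # the training window always precedes the validation window
    print(f"fold {fold}: train={train_idx}, validation={val_idx}")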
A small numerical example is provided to illustrate the previous concepts [7]. First, let's generate the data.
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1, cluster_std=3)
X, y
(array([[ 0.93666298, -2.49812622],
[ -7.45923056, -6.53189637],
[ -5.99190132, 2.89309228], ...
[ -7.56485749, -0.82002226],
[ 3.57487539, 2.12286917]]),
array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]))
The next code helps visualize the data.
import pandas as pd
import matplotlib.pyplot as plt
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()
Another important consideration is that rows are assigned to the train and test sets randomly; fixing random_state makes the split repeatable.
# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
y, y_train, y_test
(array([0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]),
array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1]),
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]))
Now, let's fit the model using the training dataset.
# train-test split evaluation of logistic regression on the blobs dataset
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# fit the model
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
y_train
array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1])
After training, the logistic regression classification model can be used for prediction.
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
Accuracy: 0.970
Now, let's compare the true test labels with the predicted values.
[y_test, yhat]
[array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]),
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0])]
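One compact way to summarize this comparison (an addition to the example in [7], not part of the original code) is a confusion matrix, which counts correct and incorrect predictions per class, reusing y_test and yhat from above:
from sklearn.metrics import confusion_matrix
# rows are the true classes, columns the predicted classes;
# off-diagonal entries are the misclassified test samples
print(confusion_matrix(y_test, yhat))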
It is possible to create a graphic to visualize the performance of the logistic regression classifier by plotting its decision surface.
# decision surface for logistic regression on a binary classification dataset
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# define bounds of the domain
min1, max1 = X[:, 0].min()-1, X[:, 0].max()+1
min2, max2 = X[:, 1].min()-1, X[:, 1].max()+1
# define the x and y scale
x1grid = np.arange(min1, max1, 0.1)
x2grid = np.arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = np.meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = np.hstack((r1,r2))
# define the model
model = LogisticRegression()
# fit the model
model.fit(X_train, y_train)
# make predictions for the grid
yhat = model.predict(grid)
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
plt.contourf(xx, yy, zz, cmap='Paired')
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Paired')
plt.show()
It is possible to adapt the previous visualization to show the predicted probability that each sample belongs to class 0.
# probability decision surface for logistic regression on a binary classification dataset
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
# generate dataset
X, y = make_blobs(n_samples=1000, centers=2, n_features=2, random_state=1, cluster_std=3)
# define bounds of the domain
min1, max1 = X[:, 0].min()-1, X[:, 0].max()+1
min2, max2 = X[:, 1].min()-1, X[:, 1].max()+1
# define the x and y scale
x1grid = np.arange(min1, max1, 0.1)
x2grid = np.arange(min2, max2, 0.1)
# create all of the lines and rows of the grid
xx, yy = np.meshgrid(x1grid, x2grid)
# flatten each grid to a vector
r1, r2 = xx.flatten(), yy.flatten()
r1, r2 = r1.reshape((len(r1), 1)), r2.reshape((len(r2), 1))
# horizontal stack vectors to create x1,x2 input for the model
grid = np.hstack((r1,r2))
# define the model
model = LogisticRegression()
# fit the model
model.fit(X, y)
# make predictions for the grid
yhat = model.predict_proba(grid)
# keep just the probabilities for class 0
yhat = yhat[:, 0]
# reshape the predictions back into a grid
zz = yhat.reshape(xx.shape)
# plot the grid of x, y and z values as a surface
c = plt.contourf(xx, yy, zz, cmap='RdBu')
# add a legend, called a color bar
plt.colorbar(c)
# create scatter plot for samples from each class
for class_value in range(2):
    # get row indexes for samples with this class
    row_ix = np.where(y == class_value)
    # create scatter of these samples
    plt.scatter(X[row_ix, 0], X[row_ix, 1], cmap='Paired')
plt.show()
One big problem with simply doing a train-test split is that you're setting aside a chunk of your data, so you won't be able to use it to train your algorithm. And since your data is sampled at random, it has a chance of being skewed in some way, not representing the whole dataset properly. K-fold cross-validation addresses these problems. To do that, first you split the data into several subsets (10, for example, if k = 10), called folds. Then you train and evaluate your model 10 times, setting aside each one of the folds in turn and training the model on the remaining 9 folds. This procedure returns an array of 10 different performance scores, which you can summarize by calculating their mean and standard deviation [7].
That way you'll know the average score (which will be more accurate) and the spread of the scores. The obvious downside of cross-validation is that you have to train your model multiple times (10 in this case), which can be very slow if your dataset is large.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
num_folds = 10
kfold = KFold(n_splits=num_folds, random_state=None)
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=kfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Mean: 1.0
Standard deviation: 0.0
The estimate of model performance via k-fold cross-validation can be noisy. This means that each time the procedure is run, a different split of the dataset into k-folds can be implemented, and in turn, the distribution of performance scores can be different, resulting in a different mean estimate of model performance [8].
One solution to reduce the noise in the estimated model performance is to increase the value of k. This will reduce the bias in the model's estimated performance, although it will increase the variance, tying the result more closely to the specific dataset used in the evaluation.
An alternate approach is to repeat the k-fold cross-validation process multiple times and report the mean performance across all folds and all repeats. This approach is generally referred to as repeated k-fold cross-validation.
from sklearn.model_selection import RepeatedStratifiedKFold
# define evaluation
rskfold = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)
model = LogisticRegression()
scores = cross_val_score(model, X_train, y_train, cv=rskfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Mean: 1.0
Standard deviation: 0.0
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1vajxwWZszpjqpgAe0haBQXUoB_NwyRgG?usp=sharing
[1] https://realpython.com/train-test-split-python-data/
[2] https://www.v7labs.com/blog/train-validation-test-set
[3] https://scikit-learn.org/stable/modules/cross_validation.html
[5] https://scikit-learn.org/stable/modules/cross_validation.html
[6] https://python.plainenglish.io/cross-validation-techniques-for-time-series-data-d1ad7a3a680b
[7] https://digitalmind.io/post/train-test-split-and-cross-validation
[8] https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/