1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
According to [4], a supervised model is said to underfit when it is too simple to capture the complexities of the data. Underfitting represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and the test data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. Common reasons for underfitting are:
The model is too simple, so it may not be capable of representing the complexities in the data.
The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
The training dataset is too small.
Excessive regularization is used to prevent overfitting, which constrains the model so much that it cannot capture the data well.
Features are not scaled.
Techniques to reduce underfitting are:
Increase model complexity.
Increase the number of features by performing feature engineering (see the sketch after this list).
Remove noise from the data.
Increase the number of epochs or the duration of training to get better results.
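As a minimal sketch of the first two remedies (using a hypothetical one-dimensional dataset, not taken from the text), adding polynomial features lets a linear model fit a curved relationship that it would otherwise underfit:
# Minimal sketch: reducing underfitting by increasing model capacity and adding features
# (hypothetical 1-D data; uses scikit-learn's PolynomialFeatures and LinearRegression)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=100)  # quadratic relationship plus noise

# A plain linear model is too simple and underfits the quadratic data
simple = LinearRegression().fit(X, y)
print('Linear model R^2:', simple.score(X, y))

# Engineering polynomial features increases capacity and removes the underfit
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print('Polynomial model R^2:', richer.score(X, y))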
From [4], a supervised model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained too closely on the training data, it starts learning from the noise and inaccurate data entries in the data set, so testing on unseen data results in high variance.
In a nutshell, overfitting is a problem where the performance of a machine learning algorithm on the training data differs markedly from its performance on unseen data. Reasons for overfitting are:
High variance and low bias.
The model is too complex.
The training dataset is too small.
Techniques to reduce overfitting are:
Increase training data.
Reduce model complexity.
Early stopping during the training phase (monitor the loss over the training period and stop training as soon as the loss begins to increase).
For linear regression, employ Ridge or Lasso regularization (see the sketch after this list).
Use dropout for neural networks to tackle overfitting.
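As a minimal sketch of the regularization remedy (using hypothetical random data, not from the text), scikit-learn's Ridge and Lasso estimators add L2 and L1 penalties whose strength is controlled by the alpha parameter:
# Minimal sketch: Ridge (L2) and Lasso (L1) regularization for linear regression
# (hypothetical random data; alpha controls the regularization strength)
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(1)
X = rng.normal(size=(50, 20))              # few samples, many features: prone to overfitting
y = X[:, 0] + 0.1 * rng.normal(size=50)    # only the first feature is truly informative

ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty drives irrelevant coefficients to zero
print('Ridge coefficients:', np.round(ridge.coef_, 3))
print('Lasso coefficients:', np.round(lasso.coef_, 3))
Larger alpha values constrain the model more; a value that is too large can swing back into underfitting.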
The next figure extracted from [3] summarizes in visual terms both effects and the right adjustment of the models (in the middle).
According to [5], when you’re building a new machine learning model, you can fine-tune it to gain more insights from the training data. The trick is to know when to stop fine-tuning. After a certain point, the model will start to overfit. That is, it may perform well on the training data. But it’ll give disappointing results when it encounters unseen data. How do you stop overfitting? How can you measure your model’s expected performance in the real world? The "Train-Test split" technique is the answer to these questions.
The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model [2]. In summary, train-test split is a model validation procedure that allows you to simulate how a model would perform on new/unseen data. Here is how the procedure works [1]: the dataset is divided into two subsets, a training set used to fit the model and a test set that is held back and used only to evaluate the fitted model's predictions.
It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced (see the sketch below).
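For the imbalanced-classification case just mentioned, a minimal sketch (with hypothetical labels) of the additional configuration is to pass the stratify argument to scikit-learn's train_test_split so that both subsets keep the original class proportions:
# Minimal sketch: stratified train-test split for an imbalanced classification dataset
# (hypothetical labels; stratify=y keeps the class proportions equal in both subsets)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # 10 hypothetical samples with 2 features
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced labels: 80% class 0, 20% class 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)
print('train labels:', y_train, 'test labels:', y_test)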
Common split percentages include:
* Train: 80%, Test: 20%
* Train: 67%, Test: 33%
* Train: 50%, Test: 50%
Another important consideration is that rows are assigned to the train and test sets randomly. This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.
When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset. This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset.
The sklearn.datasets module provides the make_blobs() function, which can be used to generate blobs of points with a Gaussian distribution. You can control how many blobs to generate and the number of samples, as well as a host of other properties.
The generated dataset is suitable for linear classification algorithms given the linearly separable nature of the blobs.
The example below generates a 2D dataset of 100 samples with three blobs as a multi-class classification prediction problem. Each observation has two input values and a class label of 0, 1, or 2.
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=3, n_features=2)
X, y
(array([[ 9.97311297, -0.97851102],
        [ 7.41386573,  5.00235941],
        [ 7.51682746,  7.59502601],
        [ 8.05202055,  9.43975074],
        [ 7.09376237,  7.6233349 ],
        [ 6.99291892,  7.58410502],
        [ 9.06480093, -0.09575033],
        ...
        [ 6.42139017,  7.53751411],
        [-0.85171844, 10.24318899]]),
 array([0, 1, 1, 1, 1, 1, 0, 2, 1, 1, 0, 2, 1, 0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 0, 1, 0, 1, 0, 2, 2, 2, 2, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 2, 0, 0, 1, 1, 2, 1, 2, 0, 2, 0, 2, 1, 1, 1, 2, 0, 0, 2, 1, 0, 1, 1, 2, 0, 2, 2, 2, 2, 0, 1, 2, 0, 0, 2, 2, 1, 0, 0, 2, 1, 2, 1, 1, 2, 1, 0, 2, 1, 2, 2, 0, 0, 2, 2, 1, 2]))
The next figure illustrates how the previous data is distributed in a 2D plot.
import pandas as pd
import matplotlib.pyplot as plt
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()
The previous numerical dataset example can be used to illustrate the train-test split, as shown in the next code.
# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
[[  3.79151698 -10.26047806]
 [  0.19956406   0.6193105 ]
 [ -2.59640747  -1.08876243]
 [ -1.29976893  -1.31720619]
 [  2.62546206   0.93525546]]
[[  3.79151698 -10.26047806]
 [  0.19956406   0.6193105 ]
 [ -2.59640747  -1.08876243]
 [ -1.29976893  -1.31720619]
 [  2.62546206   0.93525546]]
We will demonstrate how to use the train-test split to evaluate a regression model (here, multiple linear regression) on the housing dataset. The housing dataset is a standard machine learning dataset composed of 506 rows of data with 13 numerical input variables and a numerical target variable. The dataset involves predicting the house price given details of the house's suburb in the American city of Boston:
Housing Dataset (housing.csv)
Housing Description (housing.names)
There is no need to download the dataset manually; it will be downloaded automatically as part of the worked examples. The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
dataframe
Obtaining the size of the input and output data.
# split into inputs (all columns but the last) and outputs (last column)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
(506, 13) (506,)
Now, let's split the data into two subsets: train and test.
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(339, 13) (167, 13) (339,) (167,)
Before creating a linear regression model with the training dataset, let's verify which input variables have a stronger correlation with the output variable.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.show() below
# Concatenate X_train and y_train
concatenated_data = np.column_stack((X_train, y_train))
# Create a DataFrame from the concatenated data
concatenated_df = pd.DataFrame(concatenated_data)
# Now, you can use sns.pairplot on concatenated_df
sns.pairplot(concatenated_df, diag_kind='kde')
# Display the pairplot
plt.show()
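The pairplot gives a visual impression; a more quantitative check (a small sketch that reuses concatenated_df defined above, where the last column holds y_train) is to compute the correlation of each input column with the target:
# Sketch: correlation of each input column with the target (last column of concatenated_df)
target_col = concatenated_df.columns[-1]
correlations = concatenated_df.corr()[target_col].drop(target_col)
# sort by absolute correlation strength (the key argument requires pandas >= 1.1)
print(correlations.sort_values(key=abs, ascending=False))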
Now, let's create a multiple linear regression model.
# fit the model
#from sklearn.ensemble import RandomForestRegressor
#model = RandomForestRegressor(random_state=1)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
It is possible to extract the model parameters using the scikit-learn library.
b0 = model.intercept_
b1 = model.coef_
print("Model parameters")
print(f"intercept: {b0}")
print(f"slope: {b1}")
xs = X_train
ys = y_train
r2 = model.score(xs, ys)
r2adjusted = 1 - (1-r2)*(len(ys)-1)/(len(ys)-xs.shape[1]-1)
print("Model adjustment")
print('r2 = ',r2)
print('Adjusted r2 = ',r2adjusted)
Model parameters
intercept: 39.68280792749965
slope: [-9.91995332e-02  6.27806786e-02  7.25812126e-02  3.01077411e+00
 -2.06617880e+01  3.44496595e+00  3.78004138e-03 -1.44827470e+00
  3.09305140e-01 -1.16485640e-02 -9.46002716e-01  7.20017685e-03
 -5.28368116e-01]
Model adjustment
r2 =  0.7217346524579973
Adjusted r2 =  0.7106040385563173
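The adjusted R² printed above follows the standard formula adjusted R² = 1 - (1 - R²)·(n - 1)/(n - p - 1), where n is the number of training samples (len(ys)) and p is the number of input variables (xs.shape[1]).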
Another possibility is to use the statsmodels library to extract the model parameters.
# with statsmodels
import statsmodels.api as sm
xm = sm.add_constant(X_train) # adding a constant
model2 = sm.OLS(y_train, xm).fit()
print_model = model2.summary()
print(print_model)
The previous table presents a column "P > |t|" with each coefficient's P-value. This indicates whether a coefficient is statistically significant and, therefore, whether it should be kept in the linear equation that relates the input variables to the output variable.
For P-values lower than 0.01, there is enough evidence to reject the null hypothesis that the coefficient is zero at a 99% confidence level (1 - α, where α = 0.01). The variables with such low P-values therefore show statistical evidence of an effect on the output variable. This means:
house_price = const + 0.0628*x2 + 3.0108*x4 - 20.6618*x5 + 3.4450*x6 - 1.4483*x8 + 0.3093*x9 - 0.9460*x11 -0.5284*x13
Then, for a house with an unknown price, the multiple linear regression model would predict its value from the other feature values: x2, x4, x5, x6, x8, x9, x11, and x13.
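As a sketch of how this selection could be automated (reusing model2 and xm fitted above; the names const, x1, ..., x13 mirror the naming used in the statsmodels summary table), the coefficients can be filtered by their P-values:
# Sketch: keep only the statistically significant coefficients of the statsmodels fit
significance_level = 0.01
names = ['const'] + [f'x{i}' for i in range(1, xm.shape[1])]  # same naming as the summary table
for name, coef, pval in zip(names, model2.params, model2.pvalues):
    if pval < significance_level:
        print(f'{name}: {coef:.4f} (p = {pval:.4g})')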
The last sequence of commands for this example calculates the regression prediction metrics on the test set.
from sklearn.metrics import mean_absolute_error, mean_squared_error
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mse = mean_squared_error(y_test, yhat)
mae = mean_absolute_error(y_test, yhat)
rmse = mse**0.5
print('MSE: %.3f' % mse)
print('MAE: %.3f' % mae)
print('RMSE: %.3f' % rmse)
MSE: 20.698
MAE: 3.417
RMSE: 4.550
The main difference between logistic and linear regression is that logistic regression provides a categorical (discrete) output, while linear regression provides a continuous output [6].
In logistic regression, the outcome, or dependent variable, has only two possible values. However, in linear regression, the outcome is continuous, which means that it can have any one of an infinite number of possible values. Logistic regression is used when the response variable is categorical, such as yes/no, true/false, and pass/fail. Linear regression is used when the response variable is continuous, such as hours, height, and weight.
For example, given data on the time a student spent studying and that student's exam scores, logistic regression and linear regression can predict different things.
With logistic regression predictions, only specific values or categories are allowed. Therefore, logistic regression predicts whether the student passed or failed. Since linear regression predictions are continuous, such as numbers in a range, it can predict the student's test score on a scale of 0 to 100.
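A minimal sketch of this contrast, using made-up study-hours numbers purely for illustration, fits both models on the same input:
# Sketch: logistic vs. linear regression on made-up study-hours data
# (logistic regression passes the linear combination through the sigmoid 1/(1+e^(-z))
#  to produce a probability of passing, then a pass/fail class)
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # categorical outcome: fail (0) / pass (1)
score = np.array([35, 42, 50, 58, 66, 74, 81, 90])   # continuous outcome: exam score

clf = LogisticRegression().fit(hours, passed)        # predicts a category
reg = LinearRegression().fit(hours, score)           # predicts a continuous value
print('Pass/fail prediction for 4.5 hours of study:', clf.predict([[4.5]]))
print('Score prediction for 4.5 hours of study:', reg.predict([[4.5]]))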
The next figure helps to illustrate a comparison between both regressions [7].
Further mathematical development of logistic regression can be found in [7].
We will demonstrate how to use the train-test split to evaluate a classification model (here, logistic regression) on the sonar dataset. The sonar dataset is a standard machine learning dataset composed of 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification. The dataset involves predicting whether sonar returns indicate a rock or a simulated mine:
Sonar Dataset (sonar.csv),
Sonar Dataset Description (sonar.names).
The example below downloads the dataset and summarizes its shape.
# train-test split evaluation of a classification model on the sonar dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
dataframe
Let's separate the data into input and output variables.
# split into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
(208, 60) (208,)
Then, let's do the Train-Test Split.
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(139, 60) (69, 60) (139,) (69,)
Finally, it is possible to fit the logistic regression model.
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
Using the test dataset, it is possible to evaluate the model's accuracy.
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
Accuracy: 0.754
The next code gives a detailed element-wise comparison between the true values and the model predictions on the test dataset.
[y_test, yhat]
[array(['M', 'M', 'M', 'M', 'R', 'R', 'M', 'R', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'R', 'R', 'R', 'M', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'M', 'M', 'M', 'R', 'R', 'M', 'M', 'M', 'M', 'R', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'R', 'M', 'M'], dtype=object),
array(['M', 'R', 'R', 'M', 'R', 'R', 'R', 'R', 'M', 'R', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'M', 'M', 'M', 'R', 'R', 'M', 'R', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'M', 'M', 'M', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'M', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'R', 'M'], dtype=object)]
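A more compact way to summarize this comparison (a small sketch, not part of the original notebook) is a confusion matrix and classification report from sklearn.metrics:
# Sketch: summarize the element-wise comparison with a confusion matrix and report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, yhat, labels=['M', 'R']))
print(classification_report(y_test, yhat))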
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1XyS0H_UFLUCFbWLCpezpMeW6gsaiwnre?usp=sharing
[1] https://builtin.com/data-science/train-test-split
[2] https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
[3] https://towardsai.net/p/l/underfitting-and-overfitting-with-python-examples
[4] https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
[5] https://proclusacademy.com/blog/train-test-split/
[6] https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression
[7] https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
[8] https://realpython.com/train-test-split-python-data/
[9] https://www.w3schools.com/python/python_ml_train_test.asp
[10] https://www.geeksforgeeks.org/understanding-logistic-regression/
Machine Learning using Python (Kindle Edition)