1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture versus K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
According to [4], a supervised model is said to underfit when it is too simple to capture the complexities of the data. Underfitting represents the inability of the model to learn the training data effectively, resulting in poor performance on both the training and the test data. In simple terms, an underfit model is inaccurate, especially when applied to new, unseen examples. Common reasons for underfitting are:
The model is too simple, so it may not be capable of representing the complexities in the data.
The input features used to train the model are not adequate representations of the underlying factors influencing the target variable.
The training dataset is too small.
Excessive regularization is used to prevent overfitting, which constrains the model so much that it cannot capture the data well.
Features are not scaled.
Techniques to reduce underfitting are:
Increase model complexity.
Increase the number of features by performing feature engineering (see the sketch after this list).
Remove noise from the data.
Increase the number of epochs or the duration of training to get better results.
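As a minimal sketch of the first two remedies (using a hypothetical one-dimensional dataset, not taken from the text), adding polynomial features lets a linear model fit a curved relationship that it would otherwise underfit:
# Minimal sketch: reducing underfitting by increasing model capacity and adding features
# (hypothetical 1-D data; uses scikit-learn's PolynomialFeatures and LinearRegression)
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(1)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = 0.5 * X[:, 0] ** 2 + rng.normal(scale=0.2, size=100)  # quadratic relationship plus noise

# A plain linear model is too simple and underfits the quadratic data
simple = LinearRegression().fit(X, y)
print('Linear model R^2:', simple.score(X, y))

# Engineering polynomial features increases capacity and removes the underfit
richer = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print('Polynomial model R^2:', richer.score(X, y))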
From [4], a supervised model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained too closely on the training data, it starts learning from the noise and inaccurate data entries in the data set, so testing on unseen data results in high variance.
In a nutshell, overfitting is a problem where the performance of a machine learning algorithm on the training data differs markedly from its performance on unseen data. Reasons for overfitting are:
High variance and low bias.
The model is too complex.
The training dataset is too small.
Techniques to reduce overfitting are:
Increase training data.
Reduce model complexity.
Early stopping during the training phase (monitor the loss over the training period and stop training as soon as the loss begins to increase).
For linear regression, employ Ridge or Lasso regularization (see the sketch after this list).
Use dropout for neural networks to tackle overfitting.
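As a minimal sketch of the regularization remedy (using hypothetical random data, not from the text), scikit-learn's Ridge and Lasso estimators add L2 and L1 penalties whose strength is controlled by the alpha parameter:
# Minimal sketch: Ridge (L2) and Lasso (L1) regularization for linear regression
# (hypothetical random data; alpha controls the regularization strength)
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.RandomState(1)
X = rng.normal(size=(50, 20))              # few samples, many features: prone to overfitting
y = X[:, 0] + 0.1 * rng.normal(size=50)    # only the first feature is truly informative

ridge = Ridge(alpha=1.0).fit(X, y)         # L2 penalty shrinks all coefficients
lasso = Lasso(alpha=0.1).fit(X, y)         # L1 penalty drives irrelevant coefficients to zero
print('Ridge coefficients:', np.round(ridge.coef_, 3))
print('Lasso coefficients:', np.round(lasso.coef_, 3))
Larger alpha values constrain the model more; a value that is too large can swing back into underfitting.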
The next figure extracted from [3] summarizes in visual terms both effects and the right adjustment of the models (in the middle).
According to [5], when you’re building a new machine learning model, you can fine-tune it to gain more insights from the training data. The trick is to know when to stop fine-tuning. After a certain point, the model will start to overfit. That is, it may perform well on the training data. But it’ll give disappointing results when it encounters unseen data. How do you stop overfitting? How can you measure your model’s expected performance in the real world? The "Train-Test split" technique is the answer to these questions.
The train-test split procedure is used to estimate the performance of machine learning algorithms when they are used to make predictions on data not used to train the model [2]. In summary, train-test split is a model validation procedure that allows you to simulate how a model would perform on new/unseen data. Here is how the procedure works [1]: the dataset is divided into two subsets, a training set used to fit the model and a test set that is held back and used only to evaluate the fitted model's predictions.
It is a fast and easy procedure to perform, and its results allow you to compare the performance of machine learning algorithms for your predictive modeling problem. Although simple to use and interpret, there are times when the procedure should not be used, such as when you have a small dataset, and situations where additional configuration is required, such as when it is used for classification and the dataset is not balanced (see the sketch below).
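For the imbalanced-classification case just mentioned, a minimal sketch (with hypothetical labels) of the additional configuration is to pass the stratify argument to scikit-learn's train_test_split so that both subsets keep the original class proportions:
# Minimal sketch: stratified train-test split for an imbalanced classification dataset
# (hypothetical labels; stratify=y keeps the class proportions equal in both subsets)
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)               # 10 hypothetical samples with 2 features
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])   # imbalanced labels: 80% class 0, 20% class 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1, stratify=y)
print('train labels:', y_train, 'test labels:', y_test)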
Common split percentages include:
* Train: 80%, Test: 20%
* Train: 67%, Test: 33%
* Train: 50%, Test: 50%
Another important consideration is that rows are assigned to the train and test sets randomly. This is done to ensure that datasets are a representative sample (e.g. random sample) of the original dataset, which in turn, should be a representative sample of observations from the problem domain.
When comparing machine learning algorithms, it is desirable (perhaps required) that they are fit and evaluated on the same subsets of the dataset. This can be achieved by fixing the seed for the pseudo-random number generator used when splitting the dataset.
The sklearn.datasets module provides the make_blobs() function, which can be used to generate blobs of points with a Gaussian distribution. You can control how many blobs to generate and the number of samples, as well as a host of other properties.
The generated dataset is suitable for linear classification algorithms given the linearly separable nature of the blobs.
The example below generates a 2D dataset of 100 samples with three blobs as a multi-class classification prediction problem. Each observation has two input values and a class label of 0, 1, or 2.
from sklearn.datasets import make_blobs
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=3, n_features=2)
X, y
(array([[ 9.97311297, -0.97851102],
        [ 7.41386573,  5.00235941],
        [ 7.51682746,  7.59502601],
        [ 8.05202055,  9.43975074],
        [ 7.09376237,  7.6233349 ],
        [ 6.99291892,  7.58410502],
        [ 9.06480093, -0.09575033],
        ...
        [ 6.42139017,  7.53751411],
        [-0.85171844, 10.24318899]]),
 array([0, 1, 1, 1, 1, 1, 0, 2, 1, 1, 0, 2, 1, 0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 0, 1, 0, 1, 0, 2, 2, 2, 2, 0, 1, 0, 1, 0, 1, 2, 0, 0, 0, 2, 0, 0, 1, 1, 2, 1, 2, 0, 2, 0, 2, 1, 1, 1, 2, 0, 0, 2, 1, 0, 1, 1, 2, 0, 2, 2, 2, 2, 0, 1, 2, 0, 0, 2, 2, 1, 0, 0, 2, 1, 2, 1, 1, 2, 1, 0, 2, 1, 2, 2, 0, 0, 2, 2, 1, 2]))
The next figure illustrates how the previous data is distributed in a 2D plot.
import pandas as pd
import matplotlib.pyplot as plt
# scatter plot, dots colored by class value
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()
The previous numerical dataset example can be used to illustrate the train-test split, as shown in the next code.
# demonstrate that the train-test split procedure is repeatable
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# create dataset
X, y = make_blobs(n_samples=100)
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
# split again, and we should see the same split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
# summarize first 5 rows
print(X_train[:5, :])
[[  3.79151698 -10.26047806]
 [  0.19956406   0.6193105 ]
 [ -2.59640747  -1.08876243]
 [ -1.29976893  -1.31720619]
 [  2.62546206   0.93525546]]
[[  3.79151698 -10.26047806]
 [  0.19956406   0.6193105 ]
 [ -2.59640747  -1.08876243]
 [ -1.29976893  -1.31720619]
 [  2.62546206   0.93525546]]
We will demonstrate how to use the train-test split to evaluate a regression model (here, multiple linear regression) on the housing dataset. The housing dataset is a standard machine learning dataset composed of 506 rows of data with 13 numerical input variables and a numerical target variable. The dataset involves predicting the house price given details of the house's suburb in the American city of Boston:
Housing Dataset (housing.csv)
Housing Description (housing.names)
There is no need to download the dataset manually; it will be downloaded automatically as part of the worked examples. The example below downloads and loads the dataset as a Pandas DataFrame and summarizes the shape of the dataset.
# load and summarize the housing dataset
from pandas import read_csv
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/housing.csv'
dataframe = read_csv(url, header=None)
dataframe
Obtaining the size of the input and output data.
# split into inputs (all columns but the last) and outputs (last column)
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
(506, 13) (506,)
Now, let's split the data into two subsets: train and test.
# split into train test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(339, 13) (167, 13) (339,) (167,)
Before creating a linear regression model with the training dataset, let's verify which input variables have a stronger correlation with the output variable.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # needed for plt.show() below
# Concatenate X_train and y_train
concatenated_data = np.column_stack((X_train, y_train))
# Create a DataFrame from the concatenated data
concatenated_df = pd.DataFrame(concatenated_data)
# Now, you can use sns.pairplot on concatenated_df
sns.pairplot(concatenated_df, diag_kind='kde')
# Display the pairplot
plt.show()
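The pairplot gives a visual impression; a more quantitative check (a small sketch that reuses concatenated_df defined above, where the last column holds y_train) is to compute the correlation of each input column with the target:
# Sketch: correlation of each input column with the target (last column of concatenated_df)
target_col = concatenated_df.columns[-1]
correlations = concatenated_df.corr()[target_col].drop(target_col)
# sort by absolute correlation strength (the key argument requires pandas >= 1.1)
print(correlations.sort_values(key=abs, ascending=False))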
Now, let's create a multiple linear regression model.
# fit the model
#from sklearn.ensemble import RandomForestRegressor
#model = RandomForestRegressor(random_state=1)
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
It is possible to extract the model parameters using the scikit-learn library.
b0 = model.intercept_
b1 = model.coef_
print("Model parameters")
print(f"intercept: {b0}")
print(f"slope: {b1}")
xs = X_train
ys = y_train
r2 = model.score(xs, ys)
r2adjusted = 1 - (1-r2)*(len(ys)-1)/(len(ys)-xs.shape[1]-1)
print("Model adjustment")
print('r2 = ',r2)
print('Adjusted r2 = ',r2adjusted)
Model parameters
intercept: 39.68280792749965
slope: [-9.91995332e-02  6.27806786e-02  7.25812126e-02  3.01077411e+00
 -2.06617880e+01  3.44496595e+00  3.78004138e-03 -1.44827470e+00
  3.09305140e-01 -1.16485640e-02 -9.46002716e-01  7.20017685e-03
 -5.28368116e-01]
Model adjustment
r2 =  0.7217346524579973
Adjusted r2 =  0.7106040385563173
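The adjusted R² printed above follows the standard formula adjusted R² = 1 - (1 - R²)·(n - 1)/(n - p - 1), where n is the number of training samples (len(ys)) and p is the number of input variables (xs.shape[1]).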
Another possibility is to use the statsmodels library to extract the model parameters.
# with statsmodels
import statsmodels.api as sm
xm = sm.add_constant(X_train) # adding a constant
model2 = sm.OLS(y_train, xm).fit()
print_model = model2.summary()
print(print_model)
The previous table presents a column "P > |t|" with each coefficient's P-value. This indicates whether a coefficient is statistically significant and, therefore, whether it should be kept in the linear equation that relates the input variables to the output variable.
For P-values lower than 0.01, there is enough evidence to reject the null hypothesis that the coefficient is zero at a 99% confidence level (1 - α, where α = 0.01). The variables with such low P-values therefore show statistical evidence of an effect on the output variable. This means:
house_price = const + 0.0628*x2 + 3.0108*x4 - 20.6618*x5 + 3.4450*x6 - 1.4483*x8 + 0.3093*x9 - 0.9460*x11 -0.5284*x13
Then, for a house with an unknown price, the multiple linear regression model would predict its value from the other feature values: x2, x4, x5, x6, x8, x9, x11, and x13.
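As a sketch of how this selection could be automated (reusing model2 and xm fitted above; the names const, x1, ..., x13 mirror the naming used in the statsmodels summary table), the coefficients can be filtered by their P-values:
# Sketch: keep only the statistically significant coefficients of the statsmodels fit
significance_level = 0.01
names = ['const'] + [f'x{i}' for i in range(1, xm.shape[1])]  # same naming as the summary table
for name, coef, pval in zip(names, model2.params, model2.pvalues):
    if pval < significance_level:
        print(f'{name}: {coef:.4f} (p = {pval:.4g})')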
The last sequence of commands for this example calculates the regression prediction metrics on the test set.
from sklearn.metrics import mean_absolute_error, mean_squared_error
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
mse = mean_squared_error(y_test, yhat)
mae = mean_absolute_error(y_test, yhat)
rmse = mse**0.5
print('MSE: %.3f' % mse)
print('MAE: %.3f' % mae)
print('RMSE: %.3f' % rmse)
MSE: 20.698
MAE: 3.417
RMSE: 4.550
The main difference between logistic and linear regression is that logistic regression provides a categorical (discrete) output, while linear regression provides a continuous output [6].
In logistic regression, the outcome, or dependent variable, has only two possible values. However, in linear regression, the outcome is continuous, which means that it can have any one of an infinite number of possible values. Logistic regression is used when the response variable is categorical, such as yes/no, true/false, and pass/fail. Linear regression is used when the response variable is continuous, such as hours, height, and weight.
For example, given data on the time a student spent studying and that student's exam scores, logistic regression and linear regression can predict different things.
With logistic regression predictions, only specific values or categories are allowed. Therefore, logistic regression predicts whether the student passed or failed. Since linear regression predictions are continuous, such as numbers in a range, it can predict the student's test score on a scale of 0 to 100.
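A minimal sketch of this contrast, using made-up study-hours numbers purely for illustration, fits both models on the same input:
# Sketch: logistic vs. linear regression on made-up study-hours data
# (logistic regression passes the linear combination through the sigmoid 1/(1+e^(-z))
#  to produce a probability of passing, then a pass/fail class)
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])          # categorical outcome: fail (0) / pass (1)
score = np.array([35, 42, 50, 58, 66, 74, 81, 90])   # continuous outcome: exam score

clf = LogisticRegression().fit(hours, passed)        # predicts a category
reg = LinearRegression().fit(hours, score)           # predicts a continuous value
print('Pass/fail prediction for 4.5 hours of study:', clf.predict([[4.5]]))
print('Score prediction for 4.5 hours of study:', reg.predict([[4.5]]))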
The next figure helps to illustrate a comparison between both regressions [7].
Further mathematical development of logistic regression can be found in [7].
We will demonstrate how to use the train-test split to evaluate a classification model (here, logistic regression) on the sonar dataset. The sonar dataset is a standard machine learning dataset composed of 208 rows of data with 60 numerical input variables and a target variable with two class values, i.e. binary classification. The dataset involves predicting whether sonar returns indicate a rock or a simulated mine:
Sonar Dataset (sonar.csv),
Sonar Dataset Description (sonar.names).
The example below downloads the dataset and summarizes its shape.
# train-test split evaluation of a classification model on the sonar dataset
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# load dataset
url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/sonar.csv'
dataframe = read_csv(url, header=None)
dataframe
Let's separate the data into input and output variables.
# split into inputs and outputs
data = dataframe.values
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
(208, 60) (208,)
Then, let's do the Train-Test Split.
# split into train test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(139, 60) (69, 60) (139,) (69,)
Finally, it is possible to fit the logistic regression model.
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
Using the test dataset, it is possible to evaluate the model's accuracy.
# make predictions
yhat = model.predict(X_test)
# evaluate predictions
acc = accuracy_score(y_test, yhat)
print('Accuracy: %.3f' % acc)
Accuracy: 0.754
The next code gives a detailed element-wise comparison between the true values and the model predictions on the test dataset.
[y_test, yhat]
[array(['M', 'M', 'M', 'M', 'R', 'R', 'M', 'R', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'R', 'R', 'R', 'M', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'R', 'R', 'R', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'M', 'M', 'M', 'R', 'R', 'M', 'M', 'M', 'M', 'R', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'R', 'M', 'M'], dtype=object),
array(['M', 'R', 'R', 'M', 'R', 'R', 'R', 'R', 'M', 'R', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'M', 'M', 'M', 'R', 'R', 'M', 'R', 'R', 'R', 'R', 'M', 'M', 'M', 'R', 'M', 'M', 'M', 'M', 'R', 'M', 'M', 'R', 'M', 'R', 'R', 'M', 'M', 'M', 'R', 'R', 'R', 'M', 'M', 'M', 'M', 'M', 'M', 'R', 'R', 'M'], dtype=object)]
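A more compact way to summarize this comparison (a small sketch, not part of the original notebook) is a confusion matrix and classification report from sklearn.metrics:
# Sketch: summarize the element-wise comparison with a confusion matrix and report
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, yhat, labels=['M', 'R']))
print(classification_report(y_test, yhat))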
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1XyS0H_UFLUCFbWLCpezpMeW6gsaiwnre?usp=sharing
[1] https://builtin.com/data-science/train-test-split
[2] https://machinelearningmastery.com/train-test-split-for-evaluating-machine-learning-algorithms/
[3] https://towardsai.net/p/l/underfitting-and-overfitting-with-python-examples
[4] https://www.geeksforgeeks.org/underfitting-and-overfitting-in-machine-learning/
[5] https://proclusacademy.com/blog/train-test-split/
[6] https://www.techtarget.com/searchbusinessanalytics/definition/logistic-regression
[7] https://stackabuse.com/linear-regression-in-python-with-scikit-learn/
[8] https://realpython.com/train-test-split-python-data/
[9] https://www.w3schools.com/python/python_ml_train_test.asp
[10] https://www.geeksforgeeks.org/understanding-logistic-regression/
Machine Learning using Python (Kindle Edition)