1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture x K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
In previous sections, the binary classification problem was addressed in Section 1.4 using Logistic Regression, in Section 1.5 using a neural network, and in Section 1.7 using K-means and a Gaussian Mixture Model. The principle is that some input data is given with two possible labels: Positive or Negative. A suitable model is then applied to classify the input data, leading to four possible outcomes:
True Positive,
True Negative,
False Negative,
False Positive.
The following figure illustrates the steps of this procedure.
The previous figure helps to visualize the four possible outcomes of a binary classification model. These combinations of outcomes can be organized into the schematic described in the next figure, and, more compactly, into another representation: the confusion matrix.
A confusion matrix is a table that is used to describe the performance of a classifier on a set of test data, for which the true values are known. The confusion matrix indicates the actual values vs. predicted values and summarizes the true negative (TN), false positive (FP), false negative (FN), and true positive (TP) values in a matrix format [1].
The definitions of these outcomes are [2] (a small counting example follows the list):
True Positive(TP) [Correct Detection]:
A result that was predicted as positive by the classification model and also is positive,
True Negative(TN) [Correct Rejection]:
A result that was predicted as negative by the classification model and also is negative,
False Positive(FP) [Incorrect Detection]:
A result that was predicted as positive by the classification model but actually is negative, also referred to as Type I Error,
False Negative(FN) [Incorrect Rejection]:
A result that was predicted as negative by the classification model but actually is positive, also referred to as Type II Error.
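As a minimal sketch of these definitions, the snippet below counts the four outcomes from two short, made-up label lists (the lists y_true and y_pred are hypothetical, used only for illustration; 1 means Positive and 0 means Negative).
# hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correct detections
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejections
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors
print('TP =', TP, ' TN =', TN, ' FP =', FP, ' FN =', FN)
TP = 3  TN = 3  FP = 1  FN = 1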
The next figure presents how a confusion matrix organizes the possible outcomes of a specific model.
These outcomes are useful to define measures that evaluate the performance of classification models [3, 4]:
Accuracy: is the total number of true (correct) classifications (TP and TN) divided by the total number of classifications (TP + TN + FP + FN).
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision or Positive Predictive Value (PPV): is the proportion of the positive predictions (TP + FP) that are actually positive (TP).
Precision = TP/(TP + FP)
Recall or True Positive Rate (TPR): is the proportion of the actual positives (TP + FN) that are correctly predicted by the model (TP).
Recall = TP/(TP + FN)
Specificity or True Negative Rate (TNR): is the proportion of the actual negatives (TN + FP) that are correctly predicted as negative (TN); in a medical test, it is the probability that a person without the disease tests negative.
Specificity = TN/(TN+FP)
F1 Score: is the harmonic mean of precision and recall, so it’s an overall measure of the quality of a classifier’s predictions. It is usually the metric of choice for most people because it captures both precision and recall. While Precision tries to minimize FPs and Recall tries to minimize FNs, the F-1 metric maintains a balance between precision and recall and is defined as a harmonic mean between the two [4].
F1 = 2/((1/Precision)+(1/Recall)) = 2(Precision*Recall)/(Precision + Recall)
First, let's load a dataset with two classes and make a Train-Test Split.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1, cluster_std=3)
# split into train and test sets (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
Let's train the logistic regression model and make predictions on the test dataset.
from sklearn.linear_model import LogisticRegression
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
yhat
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0])
Now it is possible to obtain and plot the confusion matrix. Note that its layout is inverted with respect to the schematic shown earlier: scikit-learn orders rows and columns by label value, so class 0 (negative) appears first and class 1 (positive) appears second.
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
print("confusion matrix")
cm=confusion_matrix(y_test, yhat)
print(cm)
print('\n')
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(10, 10),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
confusion matrix
[[19  1]
 [ 0 13]]
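If the mlxtend package is not available, a similar figure can be obtained with scikit-learn's own ConfusionMatrixDisplay (a minimal sketch, reusing the matrix cm computed above):
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# plot the same confusion matrix with scikit-learn's built-in display
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap='Blues')
plt.show()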
With the confusion matrix computed above, the next code derives a report using the equations defined previously.
# extract the four outcomes from the scikit-learn confusion matrix
# (rows are actual values, columns are predicted values, ordered 0 then 1)
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
Specificity = TN/(TN + FP)
F1 = 2*(Precision*Recall)/(Precision + Recall)
print('TP = ', TP)
print('FP = ', FP)
print('FN = ', FN)
print('TN = ', TN)
print('Accuracy = ',Accuracy)
print('Precision = ',Precision)
print('Recall = ',Recall)
print('Specificity = ',Specificity)
print('F1 = ',F1)
TP = 13
FP = 1
FN = 0
TN = 19
Accuracy = 0.9696969696969697
Precision = 0.9285714285714286
Recall = 1.0
Specificity = 0.95
F1 = 0.962962962962963
It is also possible to create the same report using functions from the scikit-learn library.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, yhat)
recall = recall_score(y_test, yhat)
precision = precision_score(y_test, yhat)
# specificity is the recall of the negative class (label 0)
specificity = recall_score(y_test, yhat, pos_label=0)
# avoid shadowing the f1_score function with a variable of the same name
f1 = f1_score(y_test, yhat)
print('Accuracy = ', accuracy)
print('Precision = ', precision)
print('Recall = ', recall)
print('Specificity = ', specificity)
print('F1_score = ', f1)
Accuracy = 0.9696969696969697
Precision = 0.9285714285714286
Recall = 1.0
Specificity = 0.95
F1_score = 0.962962962962963
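The classification_report function, already imported above, produces a similar summary (precision, recall, and F1 for each class) in a single call; a minimal sketch:
from sklearn.metrics import classification_report
# per-class precision, recall, F1, and support in one table
print(classification_report(y_test, yhat))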
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1hR7qRr8a8AVP8rJ4aepyQMZ54ovE5Ia6?usp=sharing
The previous subsection discussed the performance metrics that can be applied to the assessment of a classifier. To review: most classifiers produce a score, which is then thresholded to decide the classification. If a classifier produces a score between 0.0 (definitely negative) and 1.0 (definitely positive), it is common to consider anything over 0.5 as positive.
However, any threshold applied to a dataset (in which PP is the positive population and NP is the negative population) is going to produce true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as shown in the next Figure [5].
The threshold value raises an important choice: in a classification problem, we may decide to predict the class values directly, or, more flexibly, predict a probability for each class instead.
After computing the probability, it is possible to choose and even calibrate the threshold for how to interpret the predicted probabilities.
For example, a common default is a threshold of 0.5, meaning that a probability below 0.5 is mapped to the negative outcome (0) and a probability of 0.5 or above is mapped to the positive outcome (1). This threshold can be adjusted to tune the behavior of the model for a specific problem, for example, to reduce one type of error at the expense of the other.
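A minimal sketch of this default rule, assuming a hypothetical array of predicted probabilities called probs:
import numpy as np
# hypothetical predicted probabilities of the positive class
probs = np.array([0.12, 0.48, 0.50, 0.93])
# default rule: probability of 0.5 or above is mapped to the positive class (1)
labels = (probs >= 0.5).astype(int)
print(labels)
[0 0 1 1]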
When predicting a binary or two-class classification problem, there are two types of errors that we could make:
False Positive. Predict an event when there was no event.
False Negative. Predict no event when in fact there was an event.
By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model [6], i.e., an appropriate choice of a threshold value. Two common metrics employed are:
Recall = True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = 1 - Specificity where: Specificity = True Negatives / (True Negatives + False Positives)
One way to visualize the impact of the threshold on a classification method is the following: the TPR (sensitivity) can be plotted against the FPR (1 - specificity) for each threshold used. The resulting graph is called a Receiver Operating Characteristic (ROC) curve. ROC curves were developed for use in signal detection in radar returns in the 1950s, and have since been applied to a wide range of problems [5].
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; by analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease [8].
There are four representative cases of how the ROC curve and the AUC are connected [7]:
Case 1: ROC curve with an AUC = 1. This is the ideal situation. When the two class distributions do not overlap at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class.
Case 2: ROC curve with an AUC = 0.7. When two distributions overlap, we introduce type 1 and type 2 errors. Depending upon the threshold, we can minimize or maximize them. When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.
Case 3: ROC curve with an AUC = 0.5. This is the worst situation. When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between positive class and negative class.
Case 4: ROC curve with an AUC = 0. When AUC is approximately 0, the model reciprocates the classes. It means the model is predicting a negative class as a positive class and vice versa.
The next figure illustrates these four cases. Observe that the red distribution curve is that of the positive class (patients with the disease), and the blue distribution curve is that of the negative class (patients with no disease).
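Case 4 can be checked with a quick sketch: inverting the scores of a classifier turns an AUC of a into 1 - a (the labels and scores below are made up, used only to illustrate the effect):
from sklearn.metrics import roc_auc_score
# made-up labels and scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, scores))                   # 0.75
print(roc_auc_score(y_true, [1 - s for s in scores]))  # 0.25, i.e., 1 - 0.75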
First, start loading the data and making a Train-Test Split.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1, cluster_std=3)
# split into train and test sets (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
Now, let's train the logistic regression model and make predictions in the test dataset. In a classification problem, we may decide to predict the class values directly. Alternatively, it can be more flexible to predict the probabilities for each class instead.
from sklearn.linear_model import LogisticRegression
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
# predict class probabilities instead of hard class labels
y_predict_prob = model.predict_proba(X_test)
# Extracting predicted probability of class 1 (positive)
y_predict_prob_class_1 = y_predict_prob[:,1]
y_predict_prob_class_1
array([6.22613278e-07, 8.83529383e-02, 4.67263385e-02, 9.99999380e-01, 1.22132884e-02, 2.30430623e-01, 1.45052820e-05, 1.38690821e-04, 3.84580408e-04, 9.99408588e-01, 1.21112942e-03, 9.99996908e-01, 4.46221526e-03, 9.99953385e-01, 9.99998572e-01, 9.99999925e-01, 3.29780967e-01, 8.26576477e-04, 2.24465063e-01, 9.99999901e-01, 7.01590605e-01, 8.69737921e-06, 8.33646557e-05, 9.98645234e-01, 9.99999842e-01, 9.99997339e-01, 9.98449327e-01, 1.85465989e-03, 2.03953049e-05, 9.99999246e-01, 9.99999885e-01, 1.96449285e-05, 3.13539713e-04])
After computing the probability, it is possible to choose and even calibrate the threshold for how to interpret the predicted probabilities. For example, a default might be to use a threshold of 0.5, meaning that a probability in [0.0, 0.49] is a negative outcome (0) and a probability in [0.5, 1.0] is a positive outcome (1).
y_predict_class_tes = {}
# Define threshold 0.1
tes = 0.1
y_predict_class_tes[0.1] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
# Define threshold 0.5
tes = 0.5
y_predict_class_tes[0.5] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
# Define threshold 0.9
tes = 0.9
y_predict_class_tes[0.9] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
for key, value in y_predict_class_tes.items():
    print('y_predict_class with threshold = ' + str(key*100) + '%: ', value)
y_predict_class with threshold = 10.0%:
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_predict_class with threshold = 50.0%:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_predict_class with threshold = 90.0%:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
The threshold can be adjusted to tune the behavior of the model for a specific problem, for example, to reduce one type of error (false positives versus false negatives) at the expense of the other. Two metrics capture this trade-off:
Recall = True Positive Rate (TPR) = True Positives / (True Positives + False Negatives),
False Positive Rate (FPR) = 1 - Specificity, where: Specificity = True Negatives / (True Negatives + False Positives).
By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model.
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_predict_class_tes[0.1])
specificity = recall_score(y_test, y_predict_class_tes[0.1], pos_label = 0)
print('TPR = ',recall)
print('FPR = ',1-specificity)
TPR = 1.0
FPR = 0.19999999999999996
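The same two quantities can be computed for each of the thresholds defined above, which already illustrates the trade-off (a short sketch reusing the y_predict_class_tes dictionary):
from sklearn.metrics import recall_score
# TPR and FPR for each threshold defined previously
for key, y_predict_class in y_predict_class_tes.items():
    tpr_t = recall_score(y_test, y_predict_class)
    fpr_t = 1 - recall_score(y_test, y_predict_class, pos_label=0)
    print('threshold =', key, ' TPR =', tpr_t, ' FPR =', fpr_t)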
The next commands automatically obtain TPR and FPR for several values of threshold.
from sklearn.metrics import roc_curve, roc_auc_score
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, y_predict_prob_class_1)
print('thresholds = ',thresholds)
print('tpr = ',tpr)
print('fpr = ',fpr)
# calculate scores
log_auc = roc_auc_score(y_test, y_predict_prob_class_1)
print('Logistic: ROC AUC=%.3f' % (log_auc))
thresholds = [1.99999992e+00 9.99999925e-01 9.98449327e-01 6.22613278e-07]
tpr = [0. 0.07692308 1. 1. ]
fpr = [0. 0. 0. 1.]
Logistic: ROC AUC=1.000
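As a cross-check, the same area can also be computed directly from the (fpr, tpr) points returned by roc_curve, using the auc function of scikit-learn (a minimal sketch):
from sklearn.metrics import auc
# area under the ROC curve obtained from the (fpr, tpr) points
print('AUC from the curve points = %.3f' % auc(fpr, tpr))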
The Receiver Operating Characteristic curve, or ROC curve, is a useful tool when predicting the probability of a binary outcome; the next commands plot it for the logistic regression model.
import matplotlib.pyplot as plt
# plot the roc curve for the model
plt.plot(fpr, tpr, linestyle='--', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
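A common addition to this plot (not part of the original snippet) is the diagonal of a no-skill classifier, which makes it easier to judge how far the curve is from random guessing:
import matplotlib.pyplot as plt
# ROC curve of the model together with the no-skill diagonal
plt.plot([0, 1], [0, 1], linestyle=':', label='No Skill')  # random-guess reference
plt.plot(fpr, tpr, linestyle='--', label='Logistic')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()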
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1hYlyyOKeLrFYHafA9O7Mji9QXOzmIGov?usp=sharing
[1] https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
[3] https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
[4] https://pub.towardsai.net/deep-dive-into-confusion-matrix-6b8111d5c3f7
[5] https://machinelearningmastery.com/assessing-comparing-classifier-performance-roc-curves-2/
[7] https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
[8] https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
Additional references and notes:
True Positive Rate (TPR), Sensitivity, Recall: the probability that a person who has the disease tests positive; in other words, recall is the proportion of examples of a particular class that the model predicts as belonging to that class.
Text about the performance of classification methods:
https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
Evaluating classification methods:
https://medium.com/@Coursesteach/binary-classification-model-evaluation-d4232ad55a48
About the confusion matrix equations:
https://pub.towardsai.net/deep-dive-into-confusion-matrix-6b8111d5c3f7
Summary figure of the equations employed to build the confusion matrix:
https://devopedia.org/confusion-matrix
Python code to obtain a colored confusion matrix:
ROC curve and its relation with the confusion matrix:
https://www.v7labs.com/blog/confusion-matrix-guide
Expanding to multiple classes:
https://www.v7labs.com/blog/confusion-matrix-guide
https://devopedia.org/confusion-matrix
Very didactic, with a customs example:
https://medium.com/@Coursesteach/binary-classification-model-evaluation-d4232ad55a48
ROC Curve
Explains the ROC curve and the related equations:
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Details the meaning of the AUC in a didactic example:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=pt-br
A small numerical example and interesting ROC curve graphics.
A small example of logistic regression with one input variable:
https://www.w3schools.com/python/python_ml_logistic_regression.asp
A small example of logistic regression with a confusion matrix as a heatmap:
https://www.datacamp.com/tutorial/understanding-logistic-regression-python
A small example of logistic regression with a train-test scheme and also a graphical class separation:
https://www.geeksforgeeks.org/ml-logistic-regression-using-python/
Confusion matrix example with handwriting data and logistic regression.
Some didactic curves and a comparison of KNN versus logistic regression:
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
A small numerical example:
https://stackabuse.com/understanding-roc-curves-with-python/
Another small example:
Illustrated small numerical example:
https://medium.com/@nesrine.ammar/multiple-confusion-matrices-into-one-curve-roc-77f5c3d4e357
Example with unbalanced data:
https://www.w3schools.com/python/python_ml_auc_roc.asp
More code, graphics, and an excellent explanation:
https://medium.com/computer-architecture-club/what-is-the-auc-roc-curve-47fbdcbf7a4a
https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5