1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture x K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
In previous sections, the binary classification problem was addressed in Section 1.4 using Logistic Regression, in Section 1.5 using a neural network, and in Section 1.7 using K-means and a Gaussian Mixture Model. The principle is that some input data is given with two possible labels: Positive or Negative. A suitable model is then applied to classify the input data, leading to four possible outcomes:
True Positive,
True Negative,
False Negative,
False Positive.
The following figure illustrates the steps of this procedure.
The previous figure helps to visualize the four possible outcomes of a binary classification model. These combinations of outcomes can be organized into the schematic described in the next figure, and, more compactly, into another representation: the confusion matrix.
A confusion matrix is a table that is used to describe the performance of a classifier on a set of test data, for which the true values are known. The confusion matrix indicates the actual values vs. predicted values and summarizes the true negative (TN), false positive (FP), false negative (FN), and true positive (TP) values in a matrix format [1].
The definitions of these outcomes are [2] (a small counting example follows the list):
True Positive(TP) [Correct Detection]:
A result that was predicted as positive by the classification model and also is positive,
True Negative(TN) [Correct Rejection]:
A result that was predicted as negative by the classification model and also is negative,
False Positive(FP) [Incorrect Detection]:
A result that was predicted as positive by the classification model but actually is negative, also referred to as Type I Error,
False Negative(FN) [Incorrect Rejection]:
A result that was predicted as negative by the classification model but actually is positive, also referred to as Type II Error.
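As a minimal sketch of these definitions, the snippet below counts the four outcomes from two short, made-up label lists (the lists y_true and y_pred are hypothetical, used only for illustration; 1 means Positive and 0 means Negative).
# hypothetical ground-truth and predicted labels (1 = positive, 0 = negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correct detections
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correct rejections
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors
print('TP =', TP, ' TN =', TN, ' FP =', FP, ' FN =', FN)
TP = 3  TN = 3  FP = 1  FN = 1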
The next figure presents how a confusion matrix organizes the possible outcomes of a specific model.
These outcomes are useful to define measures that evaluate the performance of classification models [3, 4]:
Accuracy: is the total number of true (correct) classifications (TP and TN) divided by the total number of classifications (TP + TN + FP + FN).
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision or Positive Predictive Value (PPV): is the proportion of the positive predictions (TP + FP) that are actually positive (TP).
Precision = TP/(TP + FP)
Recall or True Positive Rate (TPR): is the proportion of the actual positives (TP + FN) that are correctly predicted by the model (TP).
Recall = TP/(TP + FN)
Specificity or True Negative Rate (TNR): is the proportion of the actual negatives (TN + FP) that are correctly predicted as negative (TN); in a medical test, it is the probability that a person without the disease tests negative.
Specificity = TN/(TN+FP)
F1 Score: is the harmonic mean of precision and recall, so it’s an overall measure of the quality of a classifier’s predictions. It is usually the metric of choice for most people because it captures both precision and recall. While Precision tries to minimize FPs and Recall tries to minimize FNs, the F-1 metric maintains a balance between precision and recall and is defined as a harmonic mean between the two [4].
F1 = 2/((1/Precision)+(1/Recall)) = 2(Precision*Recall)/(Precision + Recall)
First, let's load a dataset with two classes and make a Train-Test Split.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1, cluster_std=3)
# split into train and test sets (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
Let's train the logistic regression model and make predictions on the test dataset.
from sklearn.linear_model import LogisticRegression
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
# make predictions
yhat = model.predict(X_test)
yhat
array([0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0])
Now it is possible to obtain and plot the confusion matrix. Note that its layout is inverted with respect to the schematic shown earlier: scikit-learn orders rows and columns by label value, so class 0 (negative) appears first and class 1 (positive) appears second.
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
print("confusion matrix")
cm=confusion_matrix(y_test, yhat)
print(cm)
print('\n')
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(10, 10),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
confusion matrix
[[19  1]
 [ 0 13]]
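If the mlxtend package is not available, a similar figure can be obtained with scikit-learn's own ConfusionMatrixDisplay (a minimal sketch, reusing the matrix cm computed above):
from sklearn.metrics import ConfusionMatrixDisplay
import matplotlib.pyplot as plt
# plot the same confusion matrix with scikit-learn's built-in display
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=[0, 1])
disp.plot(cmap='Blues')
plt.show()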
With the confusion matrix computed above, the next code derives a report using the equations defined previously.
# extract the four outcomes from the scikit-learn confusion matrix
# (rows are actual values, columns are predicted values, ordered 0 then 1)
TN = cm[0][0]
FP = cm[0][1]
FN = cm[1][0]
TP = cm[1][1]
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
Specificity = TN/(TN + FP)
F1 = 2*(Precision*Recall)/(Precision + Recall)
print('TP = ', TP)
print('FP = ', FP)
print('FN = ', FN)
print('TN = ', TN)
print('Accuracy = ',Accuracy)
print('Precision = ',Precision)
print('Recall = ',Recall)
print('Specificity = ',Specificity)
print('F1 = ',F1)
TP = 13
FP = 1
FN = 0
TN = 19
Accuracy = 0.9696969696969697
Precision = 0.9285714285714286
Recall = 1.0
Specificity = 0.95
F1 = 0.962962962962963
It is also possible to create the same report using functions from the scikit-learn library.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
accuracy = accuracy_score(y_test, yhat)
recall = recall_score(y_test, yhat)
precision = precision_score(y_test, yhat)
# specificity is the recall of the negative class (label 0)
specificity = recall_score(y_test, yhat, pos_label=0)
# avoid shadowing the f1_score function with a variable of the same name
f1 = f1_score(y_test, yhat)
print('Accuracy = ', accuracy)
print('Precision = ', precision)
print('Recall = ', recall)
print('Specificity = ', specificity)
print('F1_score = ', f1)
Accuracy = 0.9696969696969697
Precision = 0.9285714285714286
Recall = 1.0
Specificity = 0.95
F1_score = 0.962962962962963
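The classification_report function, already imported above, produces a similar summary (precision, recall, and F1 for each class) in a single call; a minimal sketch:
from sklearn.metrics import classification_report
# per-class precision, recall, F1, and support in one table
print(classification_report(y_test, yhat))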
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1hR7qRr8a8AVP8rJ4aepyQMZ54ovE5Ia6?usp=sharing
The previous subsection discussed the performance metrics that can be applied to the assessment of a classifier. To review: most classifiers produce a score, which is then thresholded to decide the classification. If a classifier produces a score between 0.0 (definitely negative) and 1.0 (definitely positive), it is common to consider anything over 0.5 as positive.
However, any threshold applied to a dataset (in which PP is the positive population and NP is the negative population) is going to produce true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) as shown in the next Figure [5].
The threshold value raises an important choice: in a classification problem, we may decide to predict the class values directly, or, more flexibly, predict a probability for each class instead.
After computing the probability, it is possible to choose and even calibrate the threshold for how to interpret the predicted probabilities.
For example, a common default is a threshold of 0.5, meaning that a probability below 0.5 is mapped to the negative outcome (0) and a probability of 0.5 or above is mapped to the positive outcome (1). This threshold can be adjusted to tune the behavior of the model for a specific problem, for example, to reduce one type of error at the expense of the other.
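A minimal sketch of this default rule, assuming a hypothetical array of predicted probabilities called probs:
import numpy as np
# hypothetical predicted probabilities of the positive class
probs = np.array([0.12, 0.48, 0.50, 0.93])
# default rule: probability of 0.5 or above is mapped to the positive class (1)
labels = (probs >= 0.5).astype(int)
print(labels)
[0 0 1 1]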
When predicting a binary or two-class classification problem, there are two types of errors that we could make:
False Positive. Predict an event when there was no event.
False Negative. Predict no event when in fact there was an event.
By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model [6], i.e., an appropriate choice of a threshold value. Two common metrics employed are:
Recall = True Positive Rate (TPR) = True Positives / (True Positives + False Negatives)
False Positive Rate (FPR) = 1 - Specificity where: Specificity = True Negatives / (True Negatives + False Positives)
One way to visualize the impact of the threshold on a classification method is the following: the TPR (sensitivity) can be plotted against the FPR (1 - specificity) for each threshold used. The resulting graph is called a Receiver Operating Characteristic (ROC) curve. ROC curves were developed for use in signal detection in radar returns in the 1950s, and have since been applied to a wide range of problems [5].
The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. The ROC is a probability curve, and the AUC represents the degree or measure of separability: it tells how capable the model is of distinguishing between classes. The higher the AUC, the better the model is at predicting 0 classes as 0 and 1 classes as 1; by analogy, the higher the AUC, the better the model is at distinguishing between patients with the disease and without the disease [8].
There are four representative cases of how the ROC curve and the AUC are connected [7]:
Case 1: ROC curve with an AUC = 1. This is the ideal situation. When the two class distributions do not overlap at all, the model has an ideal measure of separability: it is perfectly able to distinguish between the positive class and the negative class.
Case 2: ROC curve with an AUC = 0.7. When two distributions overlap, we introduce type 1 and type 2 errors. Depending upon the threshold, we can minimize or maximize them. When AUC is 0.7, it means there is a 70% chance that the model will be able to distinguish between positive class and negative class.
Case 3: ROC curve with an AUC = 0.5. This is the worst situation. When AUC is approximately 0.5, the model has no discrimination capacity to distinguish between positive class and negative class.
Case 4: ROC curve with an AUC = 0. When AUC is approximately 0, the model reciprocates the classes. It means the model is predicting a negative class as a positive class and vice versa.
The next figure illustrates these four cases. Observe that the red distribution curve is that of the positive class (patients with the disease), and the blue distribution curve is that of the negative class (patients with no disease).
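Case 4 can be checked with a quick sketch: inverting the scores of a classifier turns an AUC of a into 1 - a (the labels and scores below are made up, used only to illustrate the effect):
from sklearn.metrics import roc_auc_score
# made-up labels and scores for the positive class
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, scores))                   # 0.75
print(roc_auc_score(y_true, [1 - s for s in scores]))  # 0.25, i.e., 1 - 0.75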
First, start loading the data and making a Train-Test Split.
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=1, cluster_std=3)
# split into train and test sets (a fixed random_state makes the split reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
Now, let's train the logistic regression model and make predictions in the test dataset. In a classification problem, we may decide to predict the class values directly. Alternatively, it can be more flexible to predict the probabilities for each class instead.
from sklearn.linear_model import LogisticRegression
# fit the model
#model = RandomForestClassifier(random_state=1)
model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)
# predict class probabilities instead of hard class labels
y_predict_prob = model.predict_proba(X_test)
# Extracting predicted probability of class 1 (positive)
y_predict_prob_class_1 = y_predict_prob[:,1]
y_predict_prob_class_1
array([6.22613278e-07, 8.83529383e-02, 4.67263385e-02, 9.99999380e-01, 1.22132884e-02, 2.30430623e-01, 1.45052820e-05, 1.38690821e-04, 3.84580408e-04, 9.99408588e-01, 1.21112942e-03, 9.99996908e-01, 4.46221526e-03, 9.99953385e-01, 9.99998572e-01, 9.99999925e-01, 3.29780967e-01, 8.26576477e-04, 2.24465063e-01, 9.99999901e-01, 7.01590605e-01, 8.69737921e-06, 8.33646557e-05, 9.98645234e-01, 9.99999842e-01, 9.99997339e-01, 9.98449327e-01, 1.85465989e-03, 2.03953049e-05, 9.99999246e-01, 9.99999885e-01, 1.96449285e-05, 3.13539713e-04])
After computing the probability, it is possible to choose and even calibrate the threshold for how to interpret the predicted probabilities. For example, a default might be to use a threshold of 0.5, meaning that a probability in [0.0, 0.49] is a negative outcome (0) and a probability in [0.5, 1.0] is a positive outcome (1).
y_predict_class_tes = {}
# Define threshold 0.1
tes = 0.1
y_predict_class_tes[0.1] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
# Define threshold 0.5
tes = 0.5
y_predict_class_tes[0.5] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
# Define threshold 0.9
tes = 0.9
y_predict_class_tes[0.9] = [1 if prob > tes else 0 for prob in y_predict_prob_class_1]
for key, value in y_predict_class_tes.items():
    print('y_predict_class with threshold = ' + str(key*100) + '%: ', value)
y_predict_class with threshold = 10.0%:
[0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_predict_class with threshold = 50.0%:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
y_predict_class with threshold = 90.0%:
[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
The threshold can be adjusted to tune the behavior of the model for a specific problem, for example, to reduce one type of error (false positives versus false negatives) at the expense of the other. Two metrics capture this trade-off:
Recall = True Positive Rate (TPR) = True Positives / (True Positives + False Negatives),
False Positive Rate (FPR) = 1 - Specificity, where: Specificity = True Negatives / (True Negatives + False Positives).
By predicting probabilities and calibrating a threshold, a balance of these two concerns can be chosen by the operator of the model.
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_predict_class_tes[0.1])
specificity = recall_score(y_test, y_predict_class_tes[0.1], pos_label = 0)
print('TPR = ',recall)
print('FPR = ',1-specificity)
TPR = 1.0
FPR = 0.19999999999999996
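The same two quantities can be computed for each of the thresholds defined above, which already illustrates the trade-off (a short sketch reusing the y_predict_class_tes dictionary):
from sklearn.metrics import recall_score
# TPR and FPR for each threshold defined previously
for key, y_predict_class in y_predict_class_tes.items():
    tpr_t = recall_score(y_test, y_predict_class)
    fpr_t = 1 - recall_score(y_test, y_predict_class, pos_label=0)
    print('threshold =', key, ' TPR =', tpr_t, ' FPR =', fpr_t)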
The next commands automatically obtain TPR and FPR for several values of threshold.
from sklearn.metrics import roc_curve, roc_auc_score
# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, y_predict_prob_class_1)
print('thresholds = ',thresholds)
print('tpr = ',tpr)
print('fpr = ',fpr)
# calculate scores
log_auc = roc_auc_score(y_test, y_predict_prob_class_1)
print('Logistic: ROC AUC=%.3f' % (log_auc))
thresholds = [1.99999992e+00 9.99999925e-01 9.98449327e-01 6.22613278e-07]
tpr = [0. 0.07692308 1. 1. ]
fpr = [0. 0. 0. 1.]
Logistic: ROC AUC=1.000
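As a cross-check, the same area can also be computed directly from the (fpr, tpr) points returned by roc_curve, using the auc function of scikit-learn (a minimal sketch):
from sklearn.metrics import auc
# area under the ROC curve obtained from the (fpr, tpr) points
print('AUC from the curve points = %.3f' % auc(fpr, tpr))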
The Receiver Operating Characteristic curve, or ROC curve, is a useful tool when predicting the probability of a binary outcome; the next commands plot it for the logistic regression model.
import matplotlib.pyplot as plt
# plot the roc curve for the model
plt.plot(fpr, tpr, linestyle='--', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
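A common addition to this plot (not part of the original snippet) is the diagonal of a no-skill classifier, which makes it easier to judge how far the curve is from random guessing:
import matplotlib.pyplot as plt
# ROC curve of the model together with the no-skill diagonal
plt.plot([0, 1], [0, 1], linestyle=':', label='No Skill')  # random-guess reference
plt.plot(fpr, tpr, linestyle='--', label='Logistic')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()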
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1hYlyyOKeLrFYHafA9O7Mji9QXOzmIGov?usp=sharing
[1] https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
[3] https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
[4] https://pub.towardsai.net/deep-dive-into-confusion-matrix-6b8111d5c3f7
[5] https://machinelearningmastery.com/assessing-comparing-classifier-performance-roc-curves-2/
[7] https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
[8] https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc
Additional references and notes:
True Positive Rate (TPR), Sensitivity, Recall: the probability that a person who has the disease tests positive; in other words, recall is the proportion of examples of a particular class that the model predicts as belonging to that class.
Text about the performance of classification methods:
https://pub.towardsai.net/quantify-the-performance-of-classifiers-f73c33199631
Evaluating classification methods:
https://medium.com/@Coursesteach/binary-classification-model-evaluation-d4232ad55a48
About the confusion matrix equations:
https://pub.towardsai.net/deep-dive-into-confusion-matrix-6b8111d5c3f7
Summary figure of the equations employed to build the confusion matrix:
https://devopedia.org/confusion-matrix
Python code to obtain a colored confusion matrix:
ROC curve and its relation with the confusion matrix:
https://www.v7labs.com/blog/confusion-matrix-guide
Expanding to multiple classes:
https://www.v7labs.com/blog/confusion-matrix-guide
https://devopedia.org/confusion-matrix
Very didactic, with a customs example:
https://medium.com/@Coursesteach/binary-classification-model-evaluation-d4232ad55a48
ROC Curve
Explains the ROC curve and the related equations:
https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5
Details the meaning of the AUC in a didactic example:
https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc?hl=pt-br
A small numerical example and interesting ROC curve graphics.
A small example of logistic regression with one input variable:
https://www.w3schools.com/python/python_ml_logistic_regression.asp
A small example of logistic regression with a confusion matrix as a heatmap:
https://www.datacamp.com/tutorial/understanding-logistic-regression-python
A small example of logistic regression with a train-test scheme and also a graphical class separation:
https://www.geeksforgeeks.org/ml-logistic-regression-using-python/
Confusion matrix example with handwriting data and logistic regression.
Some didactic curves and a comparison of KNN versus logistic regression:
https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
A small numerical example:
https://stackabuse.com/understanding-roc-curves-with-python/
Another small example:
Illustrated small numerical example:
https://medium.com/@nesrine.ammar/multiple-confusion-matrices-into-one-curve-roc-77f5c3d4e357
Example with unbalanced data:
https://www.w3schools.com/python/python_ml_auc_roc.asp
More code, graphics, and an excellent explanation:
https://medium.com/computer-architecture-club/what-is-the-auc-roc-curve-47fbdcbf7a4a
https://medium.com/analytics-vidhya/what-is-a-confusion-matrix-d1c0f8feda5