1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture vs. K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
Financial organizations are concerned with reducing the risk of default, so it is important that credit risk is assessed. Credit risk is the possibility that a borrower fails to make timely payments and defaults on a debt. In other words, assessing this risk allows financial organizations to estimate the possibility that they will not receive the interest or the amount owed to them within the agreed term. Mathematical models such as logistic regression and neural networks can be used to generate a risk probability from metrics about an individual. Such an approach can also be extended to analyze products or companies, and to the customs context.
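As a toy illustration of how such a model works (with made-up, illustrative weights rather than coefficients fitted to any data), logistic regression combines an individual's metrics into a linear score and passes it through the sigmoid function to produce a probability between 0 and 1:
import numpy as np

def sigmoid(z):
    # maps any real-valued score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical client metrics (illustrative values, already scaled)
x = np.array([-0.5, 0.3, 2.0])
# hypothetical coefficients and intercept (not fitted to any data)
w = np.array([-0.8, -0.1, 0.9])
b = -1.2
score = np.dot(w, x) + b      # linear score
p_default = sigmoid(score)    # probability of default
print('P(default) = %.3f' % p_default)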
Information about individuals, which can be used to study the performance of machine learning methods, is available in the following database:
https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset/
The code given below aims to:
Perform data cleaning and preparation,
Apply classification methods to the data,
Analyze the performance of the classification method on the predicted risk metrics.
The next code performs data cleaning and preparation for the application of the logistic regression model. First, let's load the libraries and read the dataset.
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Reading the data
# https://drive.google.com/file/d/1vtlqZ791L0uXcUde4jS3769EdTJ35tVB/view?usp=sharing
url = 'https://drive.google.com/uc?export=download&id=1vtlqZ791L0uXcUde4jS3769EdTJ35tVB'
credit_risk = pd.read_csv(url)
credit_risk.head()
Next, we create a copy of the original data and inspect the type of each column.
# Copy the original data into a data frame called df
df = credit_risk.copy()
# Let's see the information of the data (column types and non-null counts)
df.info()
The 'ID' column should be removed, since it plays no role in building a model to predict payment in the next month.
# As we have seen, the ID column has no meaning here, so we remove it
df.drop(["ID"], axis=1, inplace=True)  # axis=1 removes a column; inplace=True changes the original data frame
# Let's check the statistics of the data
df.describe()
The next code checks whether 'df' has any missing values.
# checking for missing values
df.isnull().sum()
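This dataset happens to have no missing values, but if a column did contain any, a simple option would be to impute them before modeling, for example with the column median (a hypothetical example; df_filled is an illustrative name):
# Hypothetical example: impute missing numeric values with the column median
# (not needed for this dataset, which has no missing values)
df_filled = df.fillna(df.median(numeric_only=True))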
To create a logistic regression model, it is necessary to separate the dataset columns into the input and output data to be used by logistic regression.
# Independent features
X = df.drop(['default.payment.next.month'], axis=1)
# Dependent feature
y = df['default.payment.next.month']
X.head()
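Before scaling, it is also worth checking the class balance of the target; defaults (label 1) are the minority class here:
# Check the class balance of the target (1 = default, 0 = no default)
print(y.value_counts(normalize=True))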
Scaling the independent features is important so that the model is not biased toward features with a higher range of values; standardization puts all features on the same scale.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
Now it is possible to split the data into training and test datasets.
# split into train and test sets (the fixed random_state makes the split repeatable)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# inspect the full target and the two splits
y, y_train, y_test
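One caveat: above, the scaler was fitted on the full dataset before splitting, which lets test-set statistics leak into training. A stricter optional variant (a sketch; X_raw, X_tr, X_te are illustrative names) fits the scaler on the training portion only:
# Optional, stricter variant: fit the scaler on the training data only,
# then apply the same transformation to the test data
X_raw = df.drop(['default.payment.next.month'], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.20, random_state=42)
scaler2 = StandardScaler()
X_tr = scaler2.fit_transform(X_tr)  # fit the scaling statistics on the training data
X_te = scaler2.transform(X_te)      # reuse the training statistics on the test data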
The next code creates a classification model using logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# fit the model on the training data
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
Now it is possible to use the trained model to predict whether a client will pay or not, using the test dataset.
# make predictions from the predicted probabilities
# (equivalently: yhat_logistic = logistic_model.predict(X_test))
# probability of belonging to the positive class (label 1)
yhat = logistic_model.predict_proba(X_test)[:, 1]
# apply the default 0.5 threshold to obtain class labels
yhat_logistic = (yhat > 0.5).astype(int)
# evaluate predictions
acc = accuracy_score(y_test, yhat_logistic)
print('Accuracy: %.3f' % acc)
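Because the classes are imbalanced (most clients do not default), accuracy alone can be misleading: a naive model that always predicts "no default" already reaches an accuracy equal to the majority-class proportion. A quick sanity check:
# Accuracy of a naive classifier that always predicts the majority class (0)
baseline_acc = (y_test == 0).mean()
print('Naive baseline accuracy: %.3f' % baseline_acc)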
The next code employs K-fold cross-validation to verify that the train-test split is not skewed or unrepresentative of the whole dataset.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
num_folds = 10
kfold = KFold(n_splits=num_folds)  # random_state only matters when shuffle=True
model_kfold = LogisticRegression()
scores = cross_val_score(model_kfold, X_train, y_train, cv=kfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
To evaluate the performance of the logistic regression model, one useful tool is the confusion matrix, which can be computed with the following code.
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
print("confusion matrix")
cm=confusion_matrix(y_test, yhat_logistic)
print(cm)
print('\n')
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(10, 10),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
confusion matrix
[[4549  138]
 [1004  309]]
With the confusion matrix, it is possible to compute the following related metrics: accuracy, precision, specificity, recall, and F1 score.
# In scikit-learn's convention, rows are actual classes and columns are
# predicted classes, so cm = [[TN, FP], [FN, TP]]
TP = cm[1][1]
FP = cm[0][1]
FN = cm[1][0]
TN = cm[0][0]
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
Specificity = TN/(TN + FP)
F1 = 2*(Precision*Recall)/(Precision + Recall)
print('TP = ', TP)
print('FP = ', FP)
print('FN = ', FN)
print('TN = ', TN)
print('Accuracy = ',Accuracy)
print('Precision = ',Precision)
print('Recall = ',Recall)
print('Specificity = ',Specificity)
print('F1 = ',F1)
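These hand-computed values can be cross-checked with scikit-learn's classification_report (already imported above), which prints precision, recall, and F1 per class:
# Cross-check the hand-computed metrics against scikit-learn's report
print(classification_report(y_test, yhat_logistic))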
A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic (ROC) curve. To obtain the ROC curve, it is necessary to compute the probability of each model output belonging to each class: the first value is the probability of a given output belonging to the class labeled “0”, and the second value is the probability of it belonging to the class labeled “1”.
# predicted class probabilities for each test sample
y_predict_prob_logistic = logistic_model.predict_proba(X_test)
y_predict_prob_logistic
array([[0.76860217, 0.23139783],
[0.83705663, 0.16294337],
[0.79600644, 0.20399356],
...,
[0.75081168, 0.24918832],
[0.69824336, 0.30175664],
[0.86139885, 0.13860115]])
Next, we keep only the probability of each sample belonging to the class labeled “1”:
# keep probabilities for the positive outcome only
y_prob_logistic = y_predict_prob_logistic[:, 1]
y_prob_logistic
array([0.23139783, 0.16294337, 0.20399356, ..., 0.24918832, 0.30175664,
0.13860115])
With the predicted probabilities, it is possible to compute the points of the ROC curve (the false and true positive rates at each threshold) and the AUC score for the logistic regression model.
from sklearn.metrics import roc_curve, roc_auc_score
# predicted class probabilities for each test sample
y_predict_prob_logistic = logistic_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
y_prob_logistic = y_predict_prob_logistic[:, 1]
# calculate roc curve
fpr_logistic, tpr_logistic, thresholds_logistic = roc_curve(y_test, y_prob_logistic)
print('thresholds = ',thresholds_logistic)
print('tpr = ',tpr_logistic)
print('fpr = ',fpr_logistic)
# calculate scores
log_auc = roc_auc_score(y_test, y_prob_logistic)
print('Logistic: ROC AUC=%.3f' % (log_auc))
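The thresholds returned by roc_curve can also be used to choose an operating point. One common heuristic (optional, shown here as a sketch) is Youden's J statistic, which picks the threshold maximizing TPR − FPR:
# Pick the threshold that maximizes Youden's J = TPR - FPR
j_scores = tpr_logistic - fpr_logistic
best_idx = np.argmax(j_scores)
best_threshold = thresholds_logistic[best_idx]
print("Best threshold by Youden's J: %.3f" % best_threshold)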
Using the previously computed ROC curve points, it is possible to plot the ROC curve for the logistic regression model.
import matplotlib.pyplot as plt
# plot the roc curve for the model
plt.plot(fpr_logistic, tpr_logistic, linestyle='--', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.grid()
plt.show()
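For reference, one can also draw the diagonal "no-skill" line, corresponding to a classifier that guesses at random (AUC = 0.5); a variant of the plot above:
# Re-plot with a no-skill diagonal for visual reference
plt.plot(fpr_logistic, tpr_logistic, linestyle='--', label='Logistic (AUC = %.3f)' % log_auc)
plt.plot([0, 1], [0, 1], linestyle=':', color='gray', label='No skill')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()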
The Python code with all the steps is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1Z0AFU2Xnkm9Z7IYx0SVkYvEen_eZKGGg?usp=sharing