1. Concepts & Definitions
1.1. Regression versus Classification
1.3. Parameter versus Hyperparameter
1.4. Training, Validation, and Test
2. Problem & Solution
2.1. Gaussian Mixture vs. K-means on HS6 Weight
2.2. Evaluation of classification method using ROC curve
2.3. Comparing logistic regression, neural network, and ensemble
2.4. Fruits or not, split or encode and scale first?
Financial organizations are concerned with reducing the risk of default, so it is important that credit risk is assessed. Credit risk is the possibility that a borrower fails to make timely payments and defaults on a debt. In other words, assessing this risk allows financial organizations to estimate the possibility that they will not receive the interest or the amount owed to them within the agreed term. Mathematical models such as logistic regression and neural networks can be used to generate a risk probability from metrics about an individual. Such an approach can also be extended to analyze products or companies, and to the customs context.
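As a toy illustration of how such a model works (with made-up, illustrative weights rather than coefficients fitted to any data), logistic regression combines an individual's metrics into a linear score and passes it through the sigmoid function to produce a probability between 0 and 1:
import numpy as np

def sigmoid(z):
    # maps any real-valued score into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical client metrics (illustrative values, already scaled)
x = np.array([-0.5, 0.3, 2.0])
# hypothetical coefficients and intercept (not fitted to any data)
w = np.array([-0.8, -0.1, 0.9])
b = -1.2
score = np.dot(w, x) + b      # linear score
p_default = sigmoid(score)    # probability of default
print('P(default) = %.3f' % p_default)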
Information about individuals, which can be used to study the performance of machine learning methods, is available in the following database:
https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset/
The code given below aims to:
Perform data cleaning and preparation,
Apply classification methods to the data,
Analyze the performance of the classification method on the predicted risk metrics.
The next code performs data cleaning and preparation for the application of the logistic regression model. First, let's load the libraries and read the dataset.
# Importing libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Reading the data
# https://drive.google.com/file/d/1vtlqZ791L0uXcUde4jS3769EdTJ35tVB/view?usp=sharing
url = 'https://drive.google.com/uc?export=download&id=1vtlqZ791L0uXcUde4jS3769EdTJ35tVB'
credit_risk = pd.read_csv(url)
credit_risk.head()
Next, we create a copy of the original data and inspect the type of each column.
# Copy the original data into a data frame called df
df = credit_risk.copy()
# Let's see the information of the data (column types and non-null counts)
df.info()
The 'ID' column should be removed, since it plays no role in building a model to predict payment in the next month.
# As we have seen, the ID column has no meaning here, so we remove it
df.drop(["ID"], axis=1, inplace=True)  # axis=1 removes a column; inplace=True changes the original data frame
# Let's check the statistics of the data
df.describe()
The next code checks whether 'df' has any missing values.
# checking for missing values
df.isnull().sum()
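This dataset happens to have no missing values, but if a column did contain any, a simple option would be to impute them before modeling, for example with the column median (a hypothetical example; df_filled is an illustrative name):
# Hypothetical example: impute missing numeric values with the column median
# (not needed for this dataset, which has no missing values)
df_filled = df.fillna(df.median(numeric_only=True))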
To create a logistic regression model, it is necessary to separate the dataset columns into the input and output data to be used by logistic regression.
# Independent features
X = df.drop(['default.payment.next.month'], axis=1)
# Dependent feature
y = df['default.payment.next.month']
X.head()
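Before scaling, it is also worth checking the class balance of the target; defaults (label 1) are the minority class here:
# Check the class balance of the target (1 = default, 0 = no default)
print(y.value_counts(normalize=True))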
Scaling the independent features is important so that the model is not biased toward features with a higher range of values; standardization puts all features on the same scale.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
Now it is possible to split the data into training and test datasets.
# split into train and test sets (the fixed random_state makes the split repeatable)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# inspect the full target and the two splits
y, y_train, y_test
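One caveat: above, the scaler was fitted on the full dataset before splitting, which lets test-set statistics leak into training. A stricter optional variant (a sketch; X_raw, X_tr, X_te are illustrative names) fits the scaler on the training portion only:
# Optional, stricter variant: fit the scaler on the training data only,
# then apply the same transformation to the test data
X_raw = df.drop(['default.payment.next.month'], axis=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y, test_size=0.20, random_state=42)
scaler2 = StandardScaler()
X_tr = scaler2.fit_transform(X_tr)  # fit the scaling statistics on the training data
X_te = scaler2.transform(X_te)      # reuse the training statistics on the test data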
The next code creates a classification model using logistic regression.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# fit the model on the training data
logistic_model = LogisticRegression(random_state=42)
logistic_model.fit(X_train, y_train)
Now it is possible to use the trained model to predict whether a client will pay or not, using the test dataset.
# make predictions from the predicted probabilities
# (equivalently: yhat_logistic = logistic_model.predict(X_test))
# probability of belonging to the positive class (label 1)
yhat = logistic_model.predict_proba(X_test)[:, 1]
# apply the default 0.5 threshold to obtain class labels
yhat_logistic = (yhat > 0.5).astype(int)
# evaluate predictions
acc = accuracy_score(y_test, yhat_logistic)
print('Accuracy: %.3f' % acc)
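Because the classes are imbalanced (most clients do not default), accuracy alone can be misleading: a naive model that always predicts "no default" already reaches an accuracy equal to the majority-class proportion. A quick sanity check:
# Accuracy of a naive classifier that always predicts the majority class (0)
baseline_acc = (y_test == 0).mean()
print('Naive baseline accuracy: %.3f' % baseline_acc)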
The next code employs K-fold cross-validation to verify that the train-test split is not skewed or unrepresentative of the whole dataset.
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
num_folds = 10
kfold = KFold(n_splits=num_folds)  # random_state only matters when shuffle=True
model_kfold = LogisticRegression()
scores = cross_val_score(model_kfold, X_train, y_train, cv=kfold)
print("Scores:", scores)
print("Mean:", scores.mean())
print("Standard deviation:", scores.std())
To evaluate the performance of the logistic regression model, one useful tool is the confusion matrix, which can be computed with the following code.
from mlxtend.plotting import plot_confusion_matrix
from sklearn.metrics import classification_report, confusion_matrix
print("confusion matrix")
cm=confusion_matrix(y_test, yhat_logistic)
print(cm)
print('\n')
fig, ax = plot_confusion_matrix(conf_mat=cm, figsize=(10, 10),
                                show_absolute=True,
                                show_normed=True,
                                colorbar=True)
confusion matrix
[[4549  138]
 [1004  309]]
With the confusion matrix, it is possible to compute the following related metrics: accuracy, precision, specificity, recall, and F1 score.
# In scikit-learn's convention, rows are actual classes and columns are
# predicted classes, so cm = [[TN, FP], [FN, TP]]
TP = cm[1][1]
FP = cm[0][1]
FN = cm[1][0]
TN = cm[0][0]
Accuracy = (TP + TN)/(TP + TN + FP + FN)
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
Specificity = TN/(TN + FP)
F1 = 2*(Precision*Recall)/(Precision + Recall)
print('TP = ', TP)
print('FP = ', FP)
print('FN = ', FN)
print('TN = ', TN)
print('Accuracy = ',Accuracy)
print('Precision = ',Precision)
print('Recall = ',Recall)
print('Specificity = ',Specificity)
print('F1 = ',F1)
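These hand-computed values can be cross-checked with scikit-learn's classification_report (already imported above), which prints precision, recall, and F1 per class:
# Cross-check the hand-computed metrics against scikit-learn's report
print(classification_report(y_test, yhat_logistic))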
A useful tool when predicting the probability of a binary outcome is the Receiver Operating Characteristic (ROC) curve. To obtain the ROC curve, it is necessary to compute the probability of each model output belonging to each class: the first value is the probability of a given output belonging to the class labeled “0”, and the second value is the probability of it belonging to the class labeled “1”.
# predicted class probabilities for each test sample
y_predict_prob_logistic = logistic_model.predict_proba(X_test)
y_predict_prob_logistic
array([[0.76860217, 0.23139783],
[0.83705663, 0.16294337],
[0.79600644, 0.20399356],
...,
[0.75081168, 0.24918832],
[0.69824336, 0.30175664],
[0.86139885, 0.13860115]])
Next, we keep only the probability of each sample belonging to the class labeled “1”:
# keep probabilities for the positive outcome only
y_prob_logistic = y_predict_prob_logistic[:, 1]
y_prob_logistic
array([0.23139783, 0.16294337, 0.20399356, ..., 0.24918832, 0.30175664,
0.13860115])
With the predicted probabilities, it is possible to compute the points of the ROC curve (the false and true positive rates at each threshold) and the AUC score for the logistic regression model.
from sklearn.metrics import roc_curve, roc_auc_score
# predicted class probabilities for each test sample
y_predict_prob_logistic = logistic_model.predict_proba(X_test)
# keep probabilities for the positive outcome only
y_prob_logistic = y_predict_prob_logistic[:, 1]
# calculate roc curve
fpr_logistic, tpr_logistic, thresholds_logistic = roc_curve(y_test, y_prob_logistic)
print('thresholds = ',thresholds_logistic)
print('tpr = ',tpr_logistic)
print('fpr = ',fpr_logistic)
# calculate scores
log_auc = roc_auc_score(y_test, y_prob_logistic)
print('Logistic: ROC AUC=%.3f' % (log_auc))
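The thresholds returned by roc_curve can also be used to choose an operating point. One common heuristic (optional, shown here as a sketch) is Youden's J statistic, which picks the threshold maximizing TPR − FPR:
# Pick the threshold that maximizes Youden's J = TPR - FPR
j_scores = tpr_logistic - fpr_logistic
best_idx = np.argmax(j_scores)
best_threshold = thresholds_logistic[best_idx]
print("Best threshold by Youden's J: %.3f" % best_threshold)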
Using the previously computed ROC curve points, it is possible to plot the ROC curve for the logistic regression model.
import matplotlib.pyplot as plt
# plot the roc curve for the model
plt.plot(fpr_logistic, tpr_logistic, linestyle='--', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.grid()
plt.show()
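For reference, one can also draw the diagonal "no-skill" line, corresponding to a classifier that guesses at random (AUC = 0.5); a variant of the plot above:
# Re-plot with a no-skill diagonal for visual reference
plt.plot(fpr_logistic, tpr_logistic, linestyle='--', label='Logistic (AUC = %.3f)' % log_auc)
plt.plot([0, 1], [0, 1], linestyle=':', color='gray', label='No skill')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.grid()
plt.show()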
The Python code with all the steps is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1Z0AFU2Xnkm9Z7IYx0SVkYvEen_eZKGGg?usp=sharing