1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
In the article "Risk management systems: using data mining in developing countries’ customs administrations", Bertrand Laporte developed a risk management system that uses statistical scoring techniques [1]. Ideally, the score should reflect the risk of infraction (or even the probability of an infraction occurring). The article also compiled three tables corresponding to three ways of computing the score: simple average, weighted average, and maximum value. For each of these three scores, the tables give, per score interval, the cumulative proportion (which can be read as a probability) of declarations with an infraction.
The three score tables from the article are compiled and summarized in an Excel file:
For a more detailed description of the context of the problem addressed in [1], article [2] may be useful.
From the data, one interesting question emerges: what is the probability of an infraction given a certain risk score level? This probability can be estimated with a simple linear regression of the form score = b0 + b1 · Category, as done in the next Python commands.
The next commands read the data from a Google Drive spreadsheet and transform it into a dataframe.
import pandas as pd
# Original shared link: https://docs.google.com/spreadsheets/d/1lVEJjvqxdOaDSTpjWu8812vb__ymIG4z/edit?usp=sharing&ouid=106640872116257813737&rtpof=true&sd=true
url = "https://drive.google.com/file/d/1lVEJjvqxdOaDSTpjWu8812vb__ymIG4z/view?usp=sharing"
url2 = 'https://drive.google.com/uc?id=' + url.split('/')[-2]  # build a direct-download URL from the file id
df = pd.read_excel(url2)
df
Before employing the columns to build a linear regression model, it is necessary to verify the data type of each column.
# get the data types of each column
print("\nData types of each column:")
print(df.dtypes)
Since the data types of all columns are appropriate for building a linear regression model, we can proceed.
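If some column had instead been read as text (for example, because of a stray separator), it would need to be coerced to a numeric type before modeling. A minimal sketch with a hypothetical toy column, not part of this dataset:
import pandas as pd
# Hypothetical toy column read as text; 'n/a' becomes NaN after coercion.
toy = pd.DataFrame({'Score': ['10.5', '20.1', 'n/a']})
toy['Score'] = pd.to_numeric(toy['Score'], errors='coerce')
print(toy.dtypes)  # Score is now float64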
First, let's obtain the names of the columns with variable values.
list(df.columns)[1:]
['Category', 'Simple Average (1)', 'Weighted Average (2)', 'Maximum Value (3)']
Now, a visual inspection makes it possible to check which variables are linearly correlated.
import seaborn as sns
# Select the category column and the three score columns
X = df[list(df.columns)[1:]]
sns.pairplot(X);
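The visual check can be complemented numerically with the correlation matrix of the same columns; a minimal sketch:
# Pearson correlation among the category and the three score columns.
print(X.corr())
sns.heatmap(X.corr(), annot=True);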
Since the three scores present a linear correlation with the Category variable, with a notable exception in the first category, a linear regression can be built to predict the probability that a declaration has an infraction given that it belongs to a certain score interval.
import pandas as pd
from sklearn import linear_model

def linear_regression_report(x, y):
    # Fit a linear regression model with sklearn.
    regr_ = linear_model.LinearRegression()
    regr_.fit(x, y)
    return regr_

def model_report(model, name='None'):
    # Print the intercept and coefficients of a fitted model.
    print('-------------------------------')
    print(name, 'model')
    print('-------------------------------')
    print('Intercept: \n', model.intercept_)
    print('Coefficients: \n', model.coef_)
# Defining independent and dependent variables
x = df[['Category']]
y_sa = df['Simple Average (1)']
y_wa = df['Weighted Average (2)']
y_mv = df['Maximum Value (3)']
# Linear regression with sklearn to predict score using simple average
model_sa = linear_regression_report(x, y_sa)
# Report about the linear regression model using score as an output
model_report(model_sa, 'Score using Simple Average')
# Linear regression with sklearn to predict score using weighted average
model_wa = linear_regression_report(x, y_wa)
# Report about the linear regression model using score as an output
model_report(model_wa, 'Score using Weighted Average')
# Linear regression with sklearn to predict score using maximum value
model_mv = linear_regression_report(x, y_mv)
# Report about the linear regression model using score as an output
model_report(model_mv, 'Score using Maximum Value')
-------------------------------
Score using Simple Average model
-------------------------------
Intercept:
3.886666666666656
Coefficients:
[10.21151515]
-------------------------------
Score using Weighted Average model
-------------------------------
Intercept:
16.146666666666647
Coefficients:
[9.10060606]
-------------------------------
Score using Maximum Value model
-------------------------------
Intercept:
33.873333333333335
Coefficients:
[7.67030303]
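With the fitted sklearn models it is also possible to predict the score of an arbitrary category. A minimal sketch for category 5 (the expected value follows from the intercept and coefficient printed above):
# Predict the Simple Average score for category 5:
# approximately 3.8867 + 5 * 10.2115 ≈ 54.94.
new_x = pd.DataFrame({'Category': [5]})
print(model_sa.predict(new_x))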
A more detailed linear regression report can be obtained with the following commands.
import statsmodels.api as sm

def get_model(x, y):
    # Fit an OLS linear regression using statsmodels.
    xm = sm.add_constant(x)  # adding a constant (intercept) term
    model = sm.OLS(y, xm).fit()
    return model, xm

def print_model(model, name='None'):
    # Print the full statsmodels summary of a fitted model.
    summary = model.summary()
    print('--------------------------------------')
    print(name,'model')
    print(summary)
    print('--------------------------------------')
# Obtaining the linear regression using statsmodels: Simple Average is the output
model_sa, xm_sa = get_model(x, y_sa)
print_model(model_sa,'Simple Average')
# Obtaining the linear regression using statsmodels: Weighted Average is the output
model_wa, xm_wa = get_model(x, y_wa)
print_model(model_wa,'Weighted Average')
# Obtaining the linear regression using statsmodels: Maximum Value is the output
model_mv, xm_mv = get_model(x, y_mv)
print_model(model_mv,'Maximum value')
--------------------------------------
Simple Average model
OLS Regression Results
==============================================================================
Dep. Variable: Simple Average (1) R-squared: 0.959
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 186.8
Date: Mon, 06 Nov 2023 Prob (F-statistic): 7.91e-07
Time: 20:29:56 Log-Likelihood: -32.222
No. Observations: 10 AIC: 68.44
Df Residuals: 8 BIC: 69.05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.8867 4.636 0.838 0.426 -6.804 14.577
Category 10.2115 0.747 13.668 0.000 8.489 11.934
==============================================================================
Omnibus: 4.516 Durbin-Watson: 1.548
Prob(Omnibus): 0.105 Jarque-Bera (JB): 1.623
Skew: -0.958 Prob(JB): 0.444
Kurtosis: 3.471 Cond. No. 13.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
--------------------------------------
Weighted Average model
OLS Regression Results
================================================================================
Dep. Variable: Weighted Average (2) R-squared: 0.937
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 119.8
Date: Mon, 06 Nov 2023 Prob (F-statistic): 4.31e-06
Time: 20:29:56 Log-Likelihood: -33.291
No. Observations: 10 AIC: 70.58
Df Residuals: 8 BIC: 71.19
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 16.1467 5.159 3.130 0.014 4.251 28.043
Category 9.1006 0.831 10.946 0.000 7.183 11.018
==============================================================================
Omnibus: 8.987 Durbin-Watson: 1.240
Prob(Omnibus): 0.011 Jarque-Bera (JB): 3.858
Skew: -1.413 Prob(JB): 0.145
Kurtosis: 4.126 Cond. No. 13.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
--------------------------------------
Maximum value model
OLS Regression Results
==============================================================================
Dep. Variable: Maximum Value (3) R-squared: 0.771
Model: OLS Adj. R-squared: 0.742
Method: Least Squares F-statistic: 26.92
Date: Mon, 06 Nov 2023 Prob (F-statistic): 0.000834
Time: 20:29:56 Log-Likelihood: -39.046
No. Observations: 10 AIC: 82.09
Df Residuals: 8 BIC: 82.70
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 33.8733 9.173 3.693 0.006 12.721 55.025
Category 7.6703 1.478 5.189 0.001 4.261 11.079
==============================================================================
Omnibus: 9.301 Durbin-Watson: 1.114
Prob(Omnibus): 0.010 Jarque-Bera (JB): 3.984
Skew: -1.416 Prob(JB): 0.136
Kurtosis: 4.242 Cond. No. 13.7
==============================================================================
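The statistics in these summaries are also available as attributes of the fitted results; a minimal sketch comparing the three R-squared values programmatically:
# R-squared of each statsmodels fit, for a quick side-by-side comparison.
for name, m in [('Simple Average', model_sa),
                ('Weighted Average', model_wa),
                ('Maximum Value', model_mv)]:
    print(name, 'R-squared =', round(m.rsquared, 3))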
It is also possible to obtain a graphical comparison between the values from the dataset and the model's predictions.
import matplotlib.pyplot as plt

def model_predict(model, xm):
    # Predict with a fitted statsmodels model.
    y_hat = model.predict(xm)
    return y_hat

def draw_model(model, y, yname, y_hat):
    # Plot the observed data against the model's predictions.
    xp = list(range(1, len(y) + 1))
    plt.plot(xp, y, 'ob', label='Data')
    plt.plot(xp, y, '-r')
    plt.plot(xp, y_hat, '--g', label='Prediction')
    plt.title('Prediction on ' + str(yname))
    plt.xlabel('Category')
    plt.ylabel(yname)
    plt.legend()
    plt.grid()
    plt.show()
# Predict the values used to fit the Simple Average model.
y_hat_sa = model_predict(model_sa, xm_sa)
# print(y_hat_sa)
# Plot comparing data and prediction for the Simple Average score.
draw_model(model_sa, y_sa, 'SA', y_hat_sa)
# Predict the values used to fit the Weighted Average model.
y_hat_wa = model_predict(model_wa, xm_wa)
# print(y_hat_wa)
# Plot comparing data and prediction for the Weighted Average score.
draw_model(model_wa, y_wa, 'WA', y_hat_wa)
# Predict the values used to fit the Maximum Value model.
y_hat_mv = model_predict(model_mv, xm_mv)
# print(y_hat_mv)
# Plot comparing data and prediction for the Maximum Value score.
draw_model(model_mv, y_mv, 'MV', y_hat_mv)
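As a complement to these plots, a residual plot (data minus prediction) helps to check whether the linear fit is adequate; a minimal sketch for the Simple Average model:
# Residuals of the Simple Average model: data minus prediction.
residuals = y_sa - y_hat_sa
plt.stem(df['Category'], residuals)
plt.title('Residuals of the Simple Average model')
plt.xlabel('Category')
plt.ylabel('Residual')
plt.grid()
plt.show()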
It is possible to extract the models' performance metrics using the following commands.
from sklearn import metrics
import numpy as np

# Extracting model performance metrics.
def get_metrics(y, y_hat):
    mae = metrics.mean_absolute_error(y, y_hat)
    mse = metrics.mean_squared_error(y, y_hat)
    rmse = np.sqrt(mse)
    metrics_dict = {}
    metrics_dict['MAE'] = mae
    metrics_dict['MSE'] = mse
    metrics_dict['RMSE'] = rmse
    return metrics_dict

# Print a summary of a model's performance metrics.
def print_metrics(metrics_dict, name):
    print(name + ' model metrics')
    for (key, value) in metrics_dict.items():
        print(key + ' = ' + str(value))
    print('---------------------------')
# Metrics for sa model
sa_metrics = get_metrics(y_sa, y_hat_sa)
print_metrics(sa_metrics,'Simple Average')
# Metrics for wa model
wa_metrics = get_metrics(y_wa, y_hat_wa)
print_metrics(wa_metrics,'Weighted Average')
# Metrics for mv model
mv_metrics = get_metrics(y_mv, y_hat_mv)
print_metrics(mv_metrics,'Maximum Value')
Simple Average model metrics
MAE = 4.439999999999996
MSE = 36.841406060606076
RMSE = 6.06971218927274
---------------------------
Weighted Average model metrics
MAE = 5.070424242424241
MSE = 45.620496969696966
RMSE = 6.754294705570446
---------------------------
Maximum Value model metrics
MAE = 9.09236363636364
MSE = 144.23162424242423
RMSE = 12.009647132302607
---------------------------
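For an easier side-by-side comparison, the three metric dictionaries can be collected into a single table; a minimal sketch:
# Collect the metric dictionaries into one DataFrame for comparison.
metrics_df = pd.DataFrame({'Simple Average': sa_metrics,
                           'Weighted Average': wa_metrics,
                           'Maximum Value': mv_metrics})
print(metrics_df)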
Since the linear regression model based on the Simple Average score showed the best metrics, this model will be used to show the probability that a declaration has an infraction given that it falls in a certain score interval. First, let's obtain the model's prediction for the first score interval.
y_hat_sa = model_predict(model_sa, xm_sa.iloc[0,:])
y_hat_sa
None 14.098182
dtype: float64
Now, the fitted linear regression is used to obtain the probability of each category through a decoding function, as done in the next code. The decoding function relies on the concept of cumulative probability: since the reported proportions are cumulative, the probability of a category is the difference between the predictions for that category and the previous one (the values are on a 0-100 percentage scale).
def decode_Category(model, xm, category):
    # Probability of a category = difference between the cumulative
    # predictions for this category and for the previous one.
    if category > 1:
        value = xm.iloc[category - 2, :]
        a = model_predict(model, value)
        value = xm.iloc[category - 1, :]
        b = model_predict(model, value)
    else:
        a = 0
        value = xm.iloc[category - 1, :]
        b = model_predict(model, value)
    prob = b - a
    return prob

category = list(range(1, 11))
for cat in category:
    prob = decode_Category(model_sa, xm_sa, cat)
    print('Class = ', cat, 'Probability = ', prob)
Class = 1 Probability = None 14.098182
dtype: float64
Class = 2 Probability = None 10.211515
dtype: float64
Class = 3 Probability = None 10.211515
dtype: float64
Class = 4 Probability = None 10.211515
dtype: float64
Class = 5 Probability = None 10.211515
dtype: float64
Class = 6 Probability = None 10.211515
dtype: float64
Class = 7 Probability = None 10.211515
dtype: float64
Class = 8 Probability = None 10.211515
dtype: float64
Class = 9 Probability = None 10.211515
dtype: float64
Class = 10 Probability = None 10.211515
dtype: float64
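Since the decoding subtracts consecutive cumulative predictions, the per-class probabilities telescope back to the model's cumulative prediction at the last category. A quick sanity check (note that the total slightly exceeds 100, since an unconstrained linear fit can overshoot the percentage scale):
# The per-class probabilities telescope: their sum equals the model's
# cumulative prediction at category 10 (about 106 here).
total = sum(decode_Category(model_sa, xm_sa, cat).iloc[0] for cat in range(1, 11))
print(total)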
The Python code with all the steps is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1e31Pv5_3Q98G-pHJPv7Y8AKaoBKLuF96?usp=sharing