1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
In the article "Risk management systems: using data mining in developing countries’ customs administrations", Bertrand Laporte developed a risk management system that uses statistical scoring techniques [1]. Ideally, the score should reflect the risk of infraction (or even the probability of an infraction occurring). The article also compiled three tables corresponding to three ways of computing the score: simple average, weighted average, and maximum value. For each of these three scores, the tables give, per score interval, the cumulative proportion (which can be read as a probability) of declarations with an infraction.
The three score tables from the article are compiled and summarized in an Excel file:
For a more detailed description of the context of the problem addressed in [1], article [2] may be useful.
From the data, one interesting question emerges: what is the probability of an infraction given a certain risk score level? This probability can be estimated with a simple linear regression of the form score = b0 + b1 · Category, as done in the next Python commands.
The next commands read the data from a Google Drive spreadsheet and transform it into a dataframe.
import pandas as pd
# Original shared link: https://docs.google.com/spreadsheets/d/1lVEJjvqxdOaDSTpjWu8812vb__ymIG4z/edit?usp=sharing&ouid=106640872116257813737&rtpof=true&sd=true
url = "https://drive.google.com/file/d/1lVEJjvqxdOaDSTpjWu8812vb__ymIG4z/view?usp=sharing"
url2 = 'https://drive.google.com/uc?id=' + url.split('/')[-2]  # build a direct-download URL from the file id
df = pd.read_excel(url2)
df
Before employing the columns to build a linear regression model, it is necessary to verify the data type of each column.
# get the data types of each column
print("\nData types of each column:")
print(df.dtypes)
Since the data types of all columns are appropriate for building a linear regression model, we can proceed.
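If some column had instead been read as text (for example, because of a stray separator), it would need to be coerced to a numeric type before modeling. A minimal sketch with a hypothetical toy column, not part of this dataset:
import pandas as pd
# Hypothetical toy column read as text; 'n/a' becomes NaN after coercion.
toy = pd.DataFrame({'Score': ['10.5', '20.1', 'n/a']})
toy['Score'] = pd.to_numeric(toy['Score'], errors='coerce')
print(toy.dtypes)  # Score is now float64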
First, let's obtain the names of the columns with variable values.
list(df.columns)[1:]
['Category', 'Simple Average (1)', 'Weighted Average (2)', 'Maximum Value (3)']
Now, a visual inspection makes it possible to check which variables are linearly correlated.
import seaborn as sns
# Select the category column and the three score columns
X = df[list(df.columns)[1:]]
sns.pairplot(X);
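The visual check can be complemented numerically with the correlation matrix of the same columns; a minimal sketch:
# Pearson correlation among the category and the three score columns.
print(X.corr())
sns.heatmap(X.corr(), annot=True);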
Since the three scores present a linear correlation with the Category variable, with a notable exception in the first category, a linear regression can be built to predict the probability that a declaration has an infraction given that it belongs to a certain score interval.
import pandas as pd
from sklearn import linear_model

def linear_regression_report(x, y):
    # Fit a linear regression model with sklearn.
    regr_ = linear_model.LinearRegression()
    regr_.fit(x, y)
    return regr_

def model_report(model, name='None'):
    # Print the intercept and coefficients of a fitted model.
    print('-------------------------------')
    print(name, 'model')
    print('-------------------------------')
    print('Intercept: \n', model.intercept_)
    print('Coefficients: \n', model.coef_)
# Defining independent and dependent variables
x = df[['Category']]
y_sa = df['Simple Average (1)']
y_wa = df['Weighted Average (2)']
y_mv = df['Maximum Value (3)']
# Linear regression with sklearn to predict score using simple average
model_sa = linear_regression_report(x, y_sa)
# Report about the linear regression model using score as an output
model_report(model_sa, 'Score using Simple Average')
# Linear regression with sklearn to predict score using weighted average
model_wa = linear_regression_report(x, y_wa)
# Report about the linear regression model using score as an output
model_report(model_wa, 'Score using Weighted Average')
# Linear regression with sklearn to predict score using maximum value
model_mv = linear_regression_report(x, y_mv)
# Report about the linear regression model using score as an output
model_report(model_mv, 'Score using Maximum Value')
-------------------------------
Score using Simple Average model
-------------------------------
Intercept:
3.886666666666656
Coefficients:
[10.21151515]
-------------------------------
Score using Weighted Average model
-------------------------------
Intercept:
16.146666666666647
Coefficients:
[9.10060606]
-------------------------------
Score using Maximum Value model
-------------------------------
Intercept:
33.873333333333335
Coefficients:
[7.67030303]
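With the fitted sklearn models it is also possible to predict the score of an arbitrary category. A minimal sketch for category 5 (the expected value follows from the intercept and coefficient printed above):
# Predict the Simple Average score for category 5:
# approximately 3.8867 + 5 * 10.2115 ≈ 54.94.
new_x = pd.DataFrame({'Category': [5]})
print(model_sa.predict(new_x))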
A more detailed linear regression report can be obtained with the following commands.
import statsmodels.api as sm

def get_model(x, y):
    # Fit an OLS linear regression using statsmodels.
    xm = sm.add_constant(x)  # adding a constant (intercept) term
    model = sm.OLS(y, xm).fit()
    return model, xm

def print_model(model, name='None'):
    # Print the full statsmodels summary of a fitted model.
    summary = model.summary()
    print('--------------------------------------')
    print(name,'model')
    print(summary)
    print('--------------------------------------')
# Obtaining the linear regression using statsmodels: Simple Average is the output
model_sa, xm_sa = get_model(x, y_sa)
print_model(model_sa,'Simple Average')
# Obtaining the linear regression using statsmodels: Weighted Average is the output
model_wa, xm_wa = get_model(x, y_wa)
print_model(model_wa,'Weighted Average')
# Obtaining the linear regression using statsmodels: Maximum Value is the output
model_mv, xm_mv = get_model(x, y_mv)
print_model(model_mv,'Maximum value')
--------------------------------------
Simple Average model
OLS Regression Results
==============================================================================
Dep. Variable: Simple Average (1) R-squared: 0.959
Model: OLS Adj. R-squared: 0.954
Method: Least Squares F-statistic: 186.8
Date: Mon, 06 Nov 2023 Prob (F-statistic): 7.91e-07
Time: 20:29:56 Log-Likelihood: -32.222
No. Observations: 10 AIC: 68.44
Df Residuals: 8 BIC: 69.05
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 3.8867 4.636 0.838 0.426 -6.804 14.577
Category 10.2115 0.747 13.668 0.000 8.489 11.934
==============================================================================
Omnibus: 4.516 Durbin-Watson: 1.548
Prob(Omnibus): 0.105 Jarque-Bera (JB): 1.623
Skew: -0.958 Prob(JB): 0.444
Kurtosis: 3.471 Cond. No. 13.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
--------------------------------------
Weighted Average model
OLS Regression Results
================================================================================
Dep. Variable: Weighted Average (2) R-squared: 0.937
Model: OLS Adj. R-squared: 0.930
Method: Least Squares F-statistic: 119.8
Date: Mon, 06 Nov 2023 Prob (F-statistic): 4.31e-06
Time: 20:29:56 Log-Likelihood: -33.291
No. Observations: 10 AIC: 70.58
Df Residuals: 8 BIC: 71.19
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 16.1467 5.159 3.130 0.014 4.251 28.043
Category 9.1006 0.831 10.946 0.000 7.183 11.018
==============================================================================
Omnibus: 8.987 Durbin-Watson: 1.240
Prob(Omnibus): 0.011 Jarque-Bera (JB): 3.858
Skew: -1.413 Prob(JB): 0.145
Kurtosis: 4.126 Cond. No. 13.7
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
--------------------------------------
--------------------------------------
Maximum value model
OLS Regression Results
==============================================================================
Dep. Variable: Maximum Value (3) R-squared: 0.771
Model: OLS Adj. R-squared: 0.742
Method: Least Squares F-statistic: 26.92
Date: Mon, 06 Nov 2023 Prob (F-statistic): 0.000834
Time: 20:29:56 Log-Likelihood: -39.046
No. Observations: 10 AIC: 82.09
Df Residuals: 8 BIC: 82.70
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 33.8733 9.173 3.693 0.006 12.721 55.025
Category 7.6703 1.478 5.189 0.001 4.261 11.079
==============================================================================
Omnibus: 9.301 Durbin-Watson: 1.114
Prob(Omnibus): 0.010 Jarque-Bera (JB): 3.984
Skew: -1.416 Prob(JB): 0.136
Kurtosis: 4.242 Cond. No. 13.7
==============================================================================
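The statistics in these summaries are also available as attributes of the fitted results; a minimal sketch comparing the three R-squared values programmatically:
# R-squared of each statsmodels fit, for a quick side-by-side comparison.
for name, m in [('Simple Average', model_sa),
                ('Weighted Average', model_wa),
                ('Maximum Value', model_mv)]:
    print(name, 'R-squared =', round(m.rsquared, 3))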
It is also possible to obtain a graphical comparison between the values from the dataset and the model's predictions.
import matplotlib.pyplot as plt

def model_predict(model, xm):
    # Predict with a fitted statsmodels model.
    y_hat = model.predict(xm)
    return y_hat

def draw_model(model, y, yname, y_hat):
    # Plot the observed data against the model's predictions.
    xp = list(range(1, len(y) + 1))
    plt.plot(xp, y, 'ob', label='Data')
    plt.plot(xp, y, '-r')
    plt.plot(xp, y_hat, '--g', label='Prediction')
    plt.title('Prediction on ' + str(yname))
    plt.xlabel('Category')
    plt.ylabel(yname)
    plt.legend()
    plt.grid()
    plt.show()
# Predict the values used to fit the Simple Average model.
y_hat_sa = model_predict(model_sa, xm_sa)
# print(y_hat_sa)
# Plot comparing data and prediction for the Simple Average score.
draw_model(model_sa, y_sa, 'SA', y_hat_sa)
# Predict the values used to fit the Weighted Average model.
y_hat_wa = model_predict(model_wa, xm_wa)
# print(y_hat_wa)
# Plot comparing data and prediction for the Weighted Average score.
draw_model(model_wa, y_wa, 'WA', y_hat_wa)
# Predict the values used to fit the Maximum Value model.
y_hat_mv = model_predict(model_mv, xm_mv)
# print(y_hat_mv)
# Plot comparing data and prediction for the Maximum Value score.
draw_model(model_mv, y_mv, 'MV', y_hat_mv)
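As a complement to these plots, a residual plot (data minus prediction) helps to check whether the linear fit is adequate; a minimal sketch for the Simple Average model:
# Residuals of the Simple Average model: data minus prediction.
residuals = y_sa - y_hat_sa
plt.stem(df['Category'], residuals)
plt.title('Residuals of the Simple Average model')
plt.xlabel('Category')
plt.ylabel('Residual')
plt.grid()
plt.show()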
It is possible to extract the models' performance metrics using the following commands.
from sklearn import metrics
import numpy as np

# Extracting model performance metrics.
def get_metrics(y, y_hat):
    mae = metrics.mean_absolute_error(y, y_hat)
    mse = metrics.mean_squared_error(y, y_hat)
    rmse = np.sqrt(mse)
    metrics_dict = {}
    metrics_dict['MAE'] = mae
    metrics_dict['MSE'] = mse
    metrics_dict['RMSE'] = rmse
    return metrics_dict

# Print a summary of a model's performance metrics.
def print_metrics(metrics_dict, name):
    print(name + ' model metrics')
    for (key, value) in metrics_dict.items():
        print(key + ' = ' + str(value))
    print('---------------------------')
# Metrics for sa model
sa_metrics = get_metrics(y_sa, y_hat_sa)
print_metrics(sa_metrics,'Simple Average')
# Metrics for wa model
wa_metrics = get_metrics(y_wa, y_hat_wa)
print_metrics(wa_metrics,'Weighted Average')
# Metrics for mv model
mv_metrics = get_metrics(y_mv, y_hat_mv)
print_metrics(mv_metrics,'Maximum Value')
Simple Average model metrics
MAE = 4.439999999999996
MSE = 36.841406060606076
RMSE = 6.06971218927274
---------------------------
Weighted Average model metrics
MAE = 5.070424242424241
MSE = 45.620496969696966
RMSE = 6.754294705570446
---------------------------
Maximum Value model metrics
MAE = 9.09236363636364
MSE = 144.23162424242423
RMSE = 12.009647132302607
---------------------------
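For an easier side-by-side comparison, the three metric dictionaries can be collected into a single table; a minimal sketch:
# Collect the metric dictionaries into one DataFrame for comparison.
metrics_df = pd.DataFrame({'Simple Average': sa_metrics,
                           'Weighted Average': wa_metrics,
                           'Maximum Value': mv_metrics})
print(metrics_df)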
Since the linear regression model based on the Simple Average score showed the best metrics, this model will be used to show the probability that a declaration has an infraction given that it falls in a certain score interval. First, let's obtain the model's prediction for the first score interval.
y_hat_sa = model_predict(model_sa, xm_sa.iloc[0,:])
y_hat_sa
None 14.098182
dtype: float64
Now, the fitted linear regression is used to obtain the probability of each category through a decoding function, as done in the next code. The decoding function relies on the concept of cumulative probability: since the reported proportions are cumulative, the probability of a category is the difference between the predictions for that category and the previous one (the values are on a 0-100 percentage scale).
def decode_Category(model, xm, category):
    # Probability of a category = difference between the cumulative
    # predictions for this category and for the previous one.
    if category > 1:
        value = xm.iloc[category - 2, :]
        a = model_predict(model, value)
        value = xm.iloc[category - 1, :]
        b = model_predict(model, value)
    else:
        a = 0
        value = xm.iloc[category - 1, :]
        b = model_predict(model, value)
    prob = b - a
    return prob

category = list(range(1, 11))
for cat in category:
    prob = decode_Category(model_sa, xm_sa, cat)
    print('Class = ', cat, 'Probability = ', prob)
Class = 1 Probability = None 14.098182
dtype: float64
Class = 2 Probability = None 10.211515
dtype: float64
Class = 3 Probability = None 10.211515
dtype: float64
Class = 4 Probability = None 10.211515
dtype: float64
Class = 5 Probability = None 10.211515
dtype: float64
Class = 6 Probability = None 10.211515
dtype: float64
Class = 7 Probability = None 10.211515
dtype: float64
Class = 8 Probability = None 10.211515
dtype: float64
Class = 9 Probability = None 10.211515
dtype: float64
Class = 10 Probability = None 10.211515
dtype: float64
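Since the decoding subtracts consecutive cumulative predictions, the per-class probabilities telescope back to the model's cumulative prediction at the last category. A quick sanity check (note that the total slightly exceeds 100, since an unconstrained linear fit can overshoot the percentage scale):
# The per-class probabilities telescope: their sum equals the model's
# cumulative prediction at category 10 (about 106 here).
total = sum(decode_Category(model_sa, xm_sa, cat).iloc[0] for cat in range(1, 11))
print(total)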
The Python code with all the steps is summarized in this Google Colab notebook (click on the link):
https://colab.research.google.com/drive/1e31Pv5_3Q98G-pHJPv7Y8AKaoBKLuF96?usp=sharing