1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
This section builds on the paper "A modified regression model for forecasting the volumes of Taiwan’s import containers" [1], which proposes a modified regression model and compares its accuracy with that of the traditional regression model using data for the period 1989–2001.
This kind of prediction is especially useful for verifying trends or detecting abnormal growth in volume for a specific category of products. An abnormal increase in import volumes could be explained by common types of customs fraud which, according to [2], include:
Transshipment: routing a shipment through a third country to disguise its origin.
Structuring: splitting a shipment into multiple shipments.
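As a minimal sketch of how a forecast could support this kind of screening, the snippet below flags periods where the observed volume exceeds the predicted volume by more than a relative tolerance. The function name flag_abnormal_increase, the 25% tolerance, and the sample numbers are all hypothetical and are not taken from [1] or [2].

```python
def flag_abnormal_increase(actual, predicted, tolerance=0.25):
    """Return the positions where actual exceeds predicted by more than `tolerance`.

    `tolerance` is a hypothetical relative threshold (0.25 means 25%).
    """
    flagged = []
    for i, (a, p) in enumerate(zip(actual, predicted)):
        # Relative excess of the observed volume over the forecast
        if p > 0 and (a - p) / p > tolerance:
            flagged.append(i)
    return flagged

# Hypothetical volumes: the third observation is far above its forecast
actual = [100.0, 104.0, 150.0, 112.0]
predicted = [101.0, 105.0, 108.0, 111.0]
print(flag_abnormal_increase(actual, predicted))  # [2]
```

A flagged period is only a candidate for closer inspection; a large gap between data and forecast may equally come from a poor model fit.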
The Python code developed below fits a multiple linear regression to the available data.
First, let's read the data from an Excel file.
import pandas as pd
# Original shared link: https://docs.google.com/spreadsheets/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/edit?usp=sharing&ouid=106640872116257813737&rtpof=true&sd=true
url = "https://drive.google.com/file/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/view?usp=sharing"
url2='https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_excel(url2)
df
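The link conversion above relies on the file id being the second-to-last path segment of the share link. A slightly more explicit sketch is shown below; the helper name drive_share_to_download is hypothetical, and whether Google Drive serves the file from the resulting address is an assumption that depends on the file's sharing settings.

```python
def drive_share_to_download(url):
    """Turn a '.../file/d/<id>/view...' share link into a 'uc?id=<id>' link."""
    parts = url.split('/')
    # The file id is the path segment that follows 'd'
    file_id = parts[parts.index('d') + 1]
    return 'https://drive.google.com/uc?id=' + file_id

share_url = "https://drive.google.com/file/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/view?usp=sharing"
print(drive_share_to_download(share_url))
# https://drive.google.com/uc?id=1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ
```

Reading an .xlsx file with pd.read_excel also requires the openpyxl package to be installed.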
Next, check the data type of each column.
# get the data types of each column
print("\nData types of each column:")
print(df.dtypes)
Some columns store numerical information as 'object' type, so their values must be converted to float. Before that, all comma characters (",") must be removed from the values. The next code creates a function, df_remove_commas, which replaces ',' with '' in every column of 'object' type.
# Function that removes commas (thousands separators) from string columns
def df_remove_commas(df):
    # Extract all column names
    columns = df.columns
    # Check all columns
    for column in columns:
        # Only columns stored as 'object' (strings) need cleaning
        if df[column].dtypes == 'object':
            print('Converting column: ', column)
            df[column] = df[column].str.replace(',', '')
    return df

df = df_remove_commas(df)
df
The next code converts every column of 'object' type to a numeric (float) type using a new function called df_string2numeric.
# Function that converts string columns to numeric values
def df_string2numeric(df):
    # Extract all column names
    columns = df.columns
    # Check all columns
    for column in columns:
        # Only columns still stored as 'object' need converting
        if df[column].dtypes == 'object':
            print('Converting column: ', column)
            df[column] = pd.to_numeric(df[column], downcast="float")
    return df

df = df_string2numeric(df)
df
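The two cleaning steps can be checked on a small example before applying them to the real file. The sketch below uses a hypothetical two-row dataframe (the column names and values are invented for illustration) and chains the comma removal and the numeric conversion in one expression.

```python
import pandas as pd

# Hypothetical data: 'volume' holds numbers formatted with thousands separators
toy = pd.DataFrame({
    'year': [1989, 1990],
    'volume': ['1,234.5', '2,000'],
})

# Remove the commas literally (regex=False), then parse the strings as numbers
toy['volume'] = pd.to_numeric(toy['volume'].str.replace(',', '', regex=False))

print(toy.dtypes)
print(toy['volume'].tolist())  # [1234.5, 2000.0]
```

If a column contains text that cannot be parsed, pd.to_numeric raises an error by default, which is a useful signal that the column should not be converted.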
Now that all the data in the dataframe has the correct format, the next step is to obtain the names of the columns holding the output variables (volume of export and import containers, variables 1 and 2) and the input variables (variables 3 to 11).
list(df.columns)[1:]
It is possible to check which input variables have a linear relationship with the output variable "volume of import container". This can be done by visually inspecting the pairwise relationships between the output (volume of import container, variable 2) and the input variables (variables 3 to 11).
import seaborn as sns
# Select every column except the first one
X = df[list(df.columns)[1:]]
sns.pairplot(X);
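A numerical complement to the visual check is the Pearson correlation matrix, which pandas computes with DataFrame.corr(). The sketch below uses a hypothetical dataframe (the column names output, input_a and input_b are invented) to show how to rank inputs by their correlation with one output column.

```python
import pandas as pd

# Hypothetical data: input_a is exactly linear in output, input_b is not
toy = pd.DataFrame({
    'output': [10.0, 20.0, 30.0, 40.0],
    'input_a': [1.0, 2.0, 3.0, 4.0],
    'input_b': [5.0, 1.0, 4.0, 2.0],
})

# Pairwise Pearson correlations between all numeric columns
corr = toy.corr()

# Correlation of every column with 'output', strongest first
print(corr['output'].sort_values(ascending=False))
```

On the real data, the same call on df[list(df.columns)[1:]] would give the coefficients behind each panel of the pair plot. A high correlation among the inputs themselves is also worth noting, since it makes individual regression coefficients harder to interpret.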
The next code obtains the linear regression coefficients using variables 3 to 11 as input data. It builds two models: one whose output variable is the volume of export containers (variable 1), and another whose output variable is the volume of import containers (variable 2).
import pandas as pd
from sklearn import linear_model

# Fit an ordinary least squares model and return it
def linear_regression_report(x, y):
    regr_ = linear_model.LinearRegression()
    regr_.fit(x, y)
    return regr_

# Print the intercept and the coefficients of a fitted model
def model_report(model, name='None '):
    print('-------------------------------')
    print(name, ' model')
    print('-------------------------------')
    print('Intercept: \n', model.intercept_)
    print('Coefficients: \n', model.coef_)

# Defining independent and dependent variables
x = df[['Population (variable 3)',
        'Industrial production index (variable 4)', 'GNP (variable 5)',
        'GNP per capita (variable 6)', 'Wholesale price (variable 7)',
        'GDP (variable 8)', 'Agricultural GDP (variable 9)',
        'Industrial GDP (variable 10)', 'Service GDP (variable 11)']]
y_export = df['Volume of export container (variable 1)']
y_import = df['Volume of import container (variable 2)']

# Linear regression with sklearn to predict exportation
model_export = linear_regression_report(x, y_export)
# Report about the linear regression model using exportation as an output
model_report(model_export, 'Exportation')

# Linear regression with sklearn to predict importation
model_import = linear_regression_report(x, y_import)
# Report about the linear regression model using importation as an output
model_report(model_import, 'Importation ')
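To make the meaning of the intercept and coefficients concrete, the sketch below solves the same kind of ordinary least squares problem directly with NumPy on hypothetical, noise-free data generated from y = 2 + 3*x1 - x2. The data and variable names are invented for illustration; the fit should recover the generating values.

```python
import numpy as np

# Hypothetical inputs (two predictors, five observations)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 5.0],
              [4.0, 3.0],
              [5.0, 4.0]])
# Output built from known coefficients, without noise
y = 2.0 + 3.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so that the first coefficient is the intercept
A = np.column_stack([np.ones(len(X)), X])

# Least squares solution of A @ beta = y
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
intercept, coefs = beta[0], beta[1:]
print('Intercept:', intercept)
print('Coefficients:', coefs)
```

With real data the relationship is not exact, so the coefficients are the values that minimize the sum of squared differences between the observed and predicted outputs.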
The next code uses statsmodels to produce a more detailed report on the linear regressions fitted above.
import statsmodels.api as sm

# Fit a multiple linear regression with statsmodels for the given output y
def get_model(x, y):
    xm = sm.add_constant(x)  # adding a constant (intercept) column
    model = sm.OLS(y, xm).fit()
    return model, xm

# Print the statsmodels summary of a fitted model
def print_model(model, name='None'):
    summary = model.summary()
    print('--------------------------------------')
    print(name, 'model')
    print(summary)
    print('--------------------------------------')

# Obtaining multiple linear regression using statsmodels: exportation is the output
model_export, xm_export = get_model(x, y_export)
print_model(model_export, 'Exportation ')

# Obtaining multiple linear regression using statsmodels: importation is the output
model_import, xm_import = get_model(x, y_import)
print_model(model_import, 'Importation ')
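Among the values in the statsmodels summary are R-squared and Adj. R-squared. As a rough sketch of what they measure for a model with an intercept, the snippet below computes both from observed and predicted values. The function name r_squared and the sample numbers are hypothetical.

```python
import numpy as np

def r_squared(y, y_hat, n_predictors):
    """Return (R^2, adjusted R^2) for a model with an intercept."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation around the mean
    r2 = 1.0 - ss_res / ss_tot
    n = len(y)
    # Penalize R^2 for the number of predictors used
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
    return r2, adj_r2

# Hypothetical observations and predictions from a one-predictor model
r2, adj_r2 = r_squared([1.0, 2.0, 3.0, 4.0, 5.0],
                       [1.1, 1.9, 3.2, 3.9, 4.9],
                       n_predictors=1)
print(r2, adj_r2)
```

The adjusted value is the more informative of the two when the number of observations is small relative to the nine predictors used here, because plain R-squared never decreases when predictors are added.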
The next code supports a graphical evaluation of the predictions made by the two linear regression models. It plots the model predictions against the collected dataset.
import matplotlib.pyplot as plt

# Predict the output for the given design matrix (inputs plus constant)
def model_predict(model, xm):
    y_hat = model.predict(xm)
    return y_hat

# Plot the observed data and the model predictions in observation order
def draw_model(model, y, yname, y_hat):
    xp = list(range(1, len(y) + 1))
    plt.plot(xp, y, 'ob', label='Data')
    plt.plot(xp, y, '-r')
    plt.plot(xp, y_hat, '--g', label='Prediction')
    plt.title('Prediction on ' + str(yname))
    plt.xlabel('Time')
    plt.ylabel(yname)
    plt.legend()
    plt.grid()
    plt.show()

# Predict the values used to build the exportation model coefficients.
y_hat_export = model_predict(model_export, xm_export)
# Drawing to compare data x prediction in exportation data.
draw_model(model_export, y_export, 'Exportation', y_hat_export)

# Predict the values used to build the importation model coefficients.
y_hat_import = model_predict(model_import, xm_import)
# Drawing to compare data x prediction in importation data.
draw_model(model_import, y_import, 'Importation', y_hat_import)
The next code defines and applies functions to evaluate the numerical performance of both linear regression models.
from sklearn import metrics
import numpy as np

# Extracting model performance metrics.
def get_metrics(y, y_hat):
    mae = metrics.mean_absolute_error(y, y_hat)
    mse = metrics.mean_squared_error(y, y_hat)
    rmse = np.sqrt(mse)
    metrics_dict = {}
    metrics_dict['MAE'] = mae
    metrics_dict['MSE'] = mse
    metrics_dict['RMSE'] = rmse
    return metrics_dict

# Print a summary about models performance metrics.
def print_metrics(metrics_dict, name):
    print(name + ' model metrics')
    for (key, value) in metrics_dict.items():
        print(key + ' = ' + str(value))
    print('---------------------------')

# Metrics for exportation model
exp_metrics = get_metrics(y_export, y_hat_export)
print_metrics(exp_metrics, 'Exportation')

# Metrics for importation model
imp_metrics = get_metrics(y_import, y_hat_import)
print_metrics(imp_metrics, 'Importation')
Exportation model metrics
MAE = 18757.246813260594
MSE = 559465941.0911782
RMSE = 23653.032386803563
---------------------------
Importation model metrics
MAE = 19053.69773872082
MSE = 677631064.988573
RMSE = 26031.34773669187
---------------------------
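The three metrics reported above have simple definitions, sketched below with NumPy on hypothetical values (the function name basic_metrics and the numbers are invented for illustration). MAE is the mean absolute error, MSE is the mean squared error, and RMSE is the square root of MSE, which brings the error back to the units of the output.

```python
import numpy as np

def basic_metrics(y, y_hat):
    """Compute MAE, MSE and RMSE from observed and predicted values."""
    err = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    mae = np.mean(np.abs(err))   # average size of the errors
    mse = np.mean(err ** 2)      # average squared error
    rmse = np.sqrt(mse)          # same units as y
    return {'MAE': mae, 'MSE': mse, 'RMSE': rmse}

# Hypothetical example: the errors are 1, 0 and -2
m = basic_metrics([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
print(m)
```

Note that the values reported for the two models were computed on the same data used to fit them, so they describe the in-sample fit and not the forecasting accuracy on unseen periods.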
All the steps above are collected in this Google Colab notebook:
https://colab.research.google.com/drive/1xoehKr6_dXfM_0g8Sk_aGBTKA0w5_bU2?usp=sharing