2.1. Exportation and Importation Volume Prediction

Multiple linear regression with dummy variables

This section builds on the paper "A modified regression model for forecasting the volumes of Taiwan’s import containers" [1], which proposes a modified regression model and compares its accuracy with that of the traditional regression model using data for the period 1989–2001.

This kind of prediction is especially useful for verifying trends or detecting abnormal growth in volume for a specific category of products. An abnormal increase in import volumes could be explained by common types of customs fraud which, according to [2], include:

Transshipment: routing a shipment through a third country to disguise its origin.
Structuring: splitting a shipment into multiple shipments.

The Python code developed below fits a multiple linear regression to the available data.

Reading data

First, let's read the data from an Excel file.

import pandas as pd

# Original shared link: https://docs.google.com/spreadsheets/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/edit?usp=sharing&ouid=106640872116257813737&rtpof=true&sd=true

url = "https://drive.google.com/file/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/view?usp=sharing"

url2='https://drive.google.com/uc?id=' + url.split('/')[-2]

df = pd.read_excel(url2)
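The URL rewrite above can be wrapped in a small helper. This is a minimal sketch that assumes the share link has the form ".../file/d/<id>/view..."; the function name drive_direct_url is hypothetical and not part of the original code.

```python
def drive_direct_url(share_url: str) -> str:
    """Turn a Google Drive 'file/d/<id>/view' share link into a direct-download link."""
    parts = share_url.split("/")
    # The file id is the path segment that follows "d" in ".../file/d/<id>/view".
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?id=" + file_id


share_url = "https://drive.google.com/file/d/1XAgkM8MjONGCx5eSzo4ElzjYTihz-vOJ/view?usp=sharing"
direct_url = drive_direct_url(share_url)
print(direct_url)
```

Looking up the segment after "d" instead of counting from the end keeps the helper working if the link carries extra trailing path segments.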

Detecting improper column types

Now it is time to check the data type of each column.

# get the data types of each column

print("\nData types of each column:")

print(df.dtypes)
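To illustrate what this check can reveal, the sketch below uses a small, hypothetical dataframe (not the spreadsheet data) in which one column holds numbers written with thousands separators. Such a column is not recognized as numeric and has to be converted before any regression.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one stored as text with commas.
toy = pd.DataFrame({
    "Year": [1989, 1990, 1991],
    "Volume": ["1,250,000", "1,310,500", "1,402,750"],
})

# Collect the columns that pandas does not treat as numeric.
non_numeric = [c for c in toy.columns if not pd.api.types.is_numeric_dtype(toy[c])]
print(toy.dtypes)
print("Columns needing conversion:", non_numeric)
```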

Since some columns store numerical information as 'object' type, their values must be converted to float. Before that, all comma characters (",") must be removed from the values. The next code creates a function df_remove_commas, which replaces ',' with '' in every column of 'object' type.

# Creating a function to remove commas from string columns
def df_remove_commas(df):
    # Extract all column names
    columns = df.columns
    # Check all columns
    for column in columns:
        # Verify if the selected column should be cleaned
        if df[column].dtypes == 'object':
            print('Removing commas from column: ', column)
            df[column] = df[column].str.replace(',', '')
    return df

df = df_remove_commas(df)
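The effect of the comma removal can be seen on a few hypothetical values written in the same thousands-separator style. This is only a sketch on made-up data; passing regex=False is an optional choice that makes the comma a literal character instead of a pattern.

```python
import pandas as pd

# Hypothetical values in the same "thousands separator" style as the spreadsheet.
raw = pd.Series(["1,250,000", "980,400", "75"])

# regex=False treats the comma as a literal character rather than a pattern.
cleaned = raw.str.replace(",", "", regex=False)
print(cleaned.tolist())
```

The values are still strings after this step; they only become numbers once they are converted.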

Changing from object to float type

The next code converts every column of 'object' type to 'float' type using a new function called df_string2numeric.

# Creating a function to convert from string to numeric
def df_string2numeric(df):
    # Extract all column names
    columns = df.columns
    # Check all columns
    for column in columns:
        # Verify if the selected column should be converted
        if df[column].dtypes == 'object':
            print('Converting column: ', column)
            # downcast="float" lets pandas pick the smallest float dtype that fits
            df[column] = pd.to_numeric(df[column], downcast="float")
            #df[column] = pd.to_numeric(df[column])
    return df

df = df_string2numeric(df)
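If a column still contains a cell that is not a number, pd.to_numeric raises an error by default. The sketch below, on hypothetical values, shows the errors="coerce" option as one possible way to turn such cells into NaN so that they can be counted and inspected. Whether this is appropriate for the spreadsheet used here is an assumption.

```python
import pandas as pd

# Hypothetical cleaned strings; "n/a" stands in for a cell that is not a number.
cleaned = pd.Series(["1250000", "980400.5", "n/a"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
values = pd.to_numeric(cleaned, errors="coerce")
n_missing = int(values.isna().sum())
print(values.tolist(), "missing:", n_missing)
```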

Selecting model input variables

Now that all data in the dataframe has the correct format, the next step is to obtain the names of the columns holding the output variables (volume of export and import containers, variables 1 and 2) and the input variables (variables 3 to 11).

list(df.columns)[1:]

It is possible to check which input variables have a linear relation with the output variable "volume of import container". This can be done by visually inspecting the correlation between the output (volume of import container, variable 2) and the input variables (variables 3 to 11).

import seaborn as sns

# Selection: every column except the first one

X = df[list(df.columns)[1:]]

sns.pairplot(X);
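The visual check can be complemented with a numerical one. The sketch below computes the Pearson correlation matrix on hypothetical toy data; the column names "a", "b" and "c" are made up for illustration. On the real dataframe the same call would be X.corr().

```python
import pandas as pd

# Hypothetical toy data: "b" is an exact linear function of "a", "c" moves against "a".
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [3.0, 5.0, 7.0, 9.0, 11.0],
    "c": [10.0, 8.0, 6.5, 4.0, 2.5],
})

# Pearson correlation between every pair of columns (values lie in [-1, 1]).
corr = toy.corr()
print(corr.round(3))
```

Values close to 1 or -1 indicate a strong linear relation, while values near 0 indicate a weak one.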

Building linear regression models for importation and exportation

The next code obtains the linear regression coefficients using variables 3 to 11 as input data. It builds one model whose output variable is the volume of export containers (variable 1) and another whose output variable is the volume of import containers (variable 2).

import pandas as pd

from sklearn import linear_model

def linear_regression_report(x, y):
    # Fit an ordinary least squares model with scikit-learn
    regr_ = linear_model.LinearRegression()
    regr_.fit(x, y)
    return regr_

def model_report(model, name='None '):
    # Print the intercept and coefficients of a fitted model
    print('-------------------------------')
    print(name, ' model')
    print('-------------------------------')
    print('Intercept: \n', model.intercept_)
    print('Coefficients: \n', model.coef_)

# Defining independent and dependent variables

x = df[['Population (variable 3)',

'Industrial production index (variable 4)', 'GNP (variable 5)',

'GNP per capita (variable 6)', 'Wholesale price (variable 7)',

'GDP (variable 8)', 'Agricultural GDP (variable 9)',

'Industrial GDP (variable 10)', 'Service GDP (variable 11)']]

y_export = df['Volume of export container (variable 1)']

y_import = df['Volume of import container (variable 2)']

# Linear regression with sklearn to predict exportation

model_export = linear_regression_report(x, y_export)

# Report about the linear regression model using exportation as an output

model_report(model_export, 'Exportation')

# Linear regression with sklearn to predict importation

model_import = linear_regression_report(x, y_import)

# Report about the linear regression model using importation as an output

model_report(model_import, 'Importation ')
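To see how the intercept and coefficients should be read, the following sketch fits the same scikit-learn estimator to hypothetical noise-free data generated from a known equation. The data and the equation are assumptions made only for illustration.

```python
import numpy as np
from sklearn import linear_model

# Hypothetical noise-free data generated from y = 4 + 2*x1 - 3*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y = 4.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

regr = linear_model.LinearRegression()
regr.fit(X, y)

# intercept_ estimates the constant term, coef_ holds one value per input column.
print("Intercept:", regr.intercept_)
print("Coefficients:", regr.coef_)
```

Because the toy data contains no noise, the fit recovers the generating equation up to floating-point error. With real data the coefficients are only estimates.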

A more detailed report about linear regression

The next code produces a more detailed report about the linear regression made previously.

import statsmodels.api as sm

def get_model(x, y):
    # Fit a multiple linear regression with statsmodels for the given output y
    xm = sm.add_constant(x) # adding a constant
    model = sm.OLS(y, xm).fit()
    return model, xm

def print_model(model, name='None'):
    # Print the full statsmodels summary of a fitted model
    summary = model.summary()
    print('--------------------------------------')
    print(name, 'model')
    print(summary)
    print('--------------------------------------')

# Obtaining multiple linear regression using statsmodels: exportation is the output

model_export, xm_export = get_model(x, y_export)

print_model(model_export,'Exportation ')

# Obtaining multiple linear regression using statsmodels: importation is the output

model_import, xm_import = get_model(x, y_import)

print_model(model_import,'Importation ')
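The "adding a constant" step is what gives the model an intercept: a column of ones is placed in front of the inputs and ordinary least squares is solved on the extended matrix. The sketch below reproduces that idea with NumPy only, on hypothetical noise-free data. It illustrates the principle and is not the statsmodels implementation.

```python
import numpy as np

# Hypothetical noise-free data generated from y = 1.5 + 0.5*x1 + 2*x2.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]])
y = 1.5 + 0.5 * X[:, 0] + 2.0 * X[:, 1]

# Prepend a column of ones so the first fitted parameter acts as the intercept.
X_const = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_const @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(beta)
```

The first entry of beta plays the role of the intercept and the remaining entries are the slopes of the two inputs.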

Graphical evaluation of the models predictions

The next code supports a graphical evaluation of the predictions made by the two linear regression models. It plots the model predictions against the collected dataset.

import matplotlib.pyplot as plt

def model_predict(model, xm):
    # Predict with the fitted model on the design matrix (constant included)
    y_hat = model.predict(xm)
    return y_hat

def draw_model(model, y, yname, y_hat):
    # Observation index used as the time axis
    xp = list(range(1, len(y) + 1))
    plt.plot(xp, y, 'ob', label='Data')
    plt.plot(xp, y, '-r')
    plt.plot(xp, y_hat, '--g', label='Prediction')
    plt.title('Prediction on ' + str(yname))
    plt.xlabel('Time')
    plt.ylabel(yname)
    plt.legend()
    plt.grid()
    plt.show()

# Predict the values used to build the exportation model coefficients.

y_hat_export = model_predict(model_export, xm_export)

#print(y_hat_export)

# Drawing to compare data x prediction in exportation data.

draw_model(model_export, y_export, 'Exportation', y_hat_export)

# Predict the values used to build the importation model coefficients.

y_hat_import = model_predict(model_import, xm_import)

#print(y_hat_import)

# Drawing to compare data x prediction in importation data.

draw_model(model_import, y_import, 'Importation', y_hat_import)
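Besides the plots, the gap between data and prediction (the residual) can be inspected numerically to point at observations that deviate abnormally from the fitted trend. The sketch below uses hypothetical values and a two-standard-deviation rule. Both the data and the threshold are assumptions for illustration, not part of the original analysis.

```python
import numpy as np

# Hypothetical observed and fitted values; the fifth observation jumps abnormally.
y = np.array([100.0, 110.0, 121.0, 130.0, 190.0, 150.0])
y_hat = np.array([101.0, 109.0, 120.0, 131.0, 140.0, 151.0])

residuals = y - y_hat
# Flag points whose residual lies more than two standard deviations from the mean residual.
threshold = 2.0 * residuals.std()
flagged = np.where(np.abs(residuals - residuals.mean()) > threshold)[0]
print("Flagged positions (0-based):", flagged)
```

With only a handful of yearly observations, a rule like this is a rough screening aid, so any flagged point would still need to be checked by hand.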

Extracting models performance metrics

The next code defines and uses Python functions to assess the numerical performance of both linear regression models.

from sklearn import metrics

import numpy as np

#Extracting model performance metrics.

def get_metrics(y, y_hat):
    # Mean absolute error, mean squared error and root mean squared error
    mae = metrics.mean_absolute_error(y, y_hat)
    mse = metrics.mean_squared_error(y, y_hat)
    rmse = np.sqrt(mse)
    metrics_dict = {}
    metrics_dict['MAE'] = mae
    metrics_dict['MSE'] = mse
    metrics_dict['RMSE'] = rmse
    return metrics_dict

# Print a summary about models performance metrics.
def print_metrics(metrics_dict, name):
    print(name + ' model metrics')
    for (key, value) in metrics_dict.items():
        print(key + ' = ' + str(value))
    print('---------------------------')

# Metrics for exportation model

exp_metrics = get_metrics(y_export, y_hat_export)

print_metrics(exp_metrics,'Exportation')

# Metrics for importation model

imp_metrics = get_metrics(y_import, y_hat_import)

print_metrics(imp_metrics,'Importation')

Exportation model metrics

MAE = 18757.246813260594

MSE = 559465941.0911782

RMSE = 23653.032386803563

---------------------------

Importation model metrics

MAE = 19053.69773872082

MSE = 677631064.988573

RMSE = 26031.34773669187

---------------------------
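To make the three metrics concrete, the sketch below computes them by hand with NumPy on four hypothetical observations. The numbers are made up so that the arithmetic can be followed line by line.

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])       # hypothetical observed values
y_hat = np.array([2.0, 5.0, 8.0, 11.0])  # hypothetical predictions

errors = y - y_hat                 # [1, 0, -1, -2]
mae = np.mean(np.abs(errors))      # (1 + 0 + 1 + 2) / 4 = 1.0
mse = np.mean(errors ** 2)         # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                # sqrt(1.5), about 1.2247
print(mae, mse, rmse)
```

MAE and RMSE are expressed in the same unit as the output variable, so they can be compared directly with the container volumes. MSE is in squared units, which is why its values are much larger.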

The Python code with all the steps is summarized in this Google Colab (click on the link):

https://colab.research.google.com/drive/1xoehKr6_dXfM_0g8Sk_aGBTKA0w5_bU2?usp=sharing

References

[1] https://www.sciencedirect.com/science/article/pii/S0895717707002105

[2] https://www.whistleblowerllc.com/what-we-do/financial-fraud/customs-fraud/
