2.3. Trading indicators to predict Easy of Doing Business Score

1. Concepts & Definitions

1.1. Linear regression: Concepts and equations

1.2. Linear regression: Numerical example

1.3. Correlation is no causation

1.4. Dummy and categorical variables

1.5. Multiple linear regression

1.6. Dummy multiple linear regression

2. Problem & Solution

2.1. Predicting Exportation & Importation Volume

2.2. Cumulative Probability Predictions

2.3. Trading indicators to estimate EDBS

What is the Ease of Doing Business Score (EDBS) from World Bank?

According to [1]: "The ease of doing business score helps assess the absolute level of regulatory performance over time. It captures the gap of each economy from the best regulatory performance observed on each of the indicators across all economies in the Doing Business sample since 2005. One can both see the gap between a particular economy’s performance and the best performance at any point in time and assess the absolute change in the economy’s regulatory environment over time as measured by Doing Business. An economy’s ease of doing business score is reflected on a scale from 0 to 100, where 0 represents the lowest and 100 represents the best performance. For example, an ease of doing business score of 75 in Doing Business 2019 means an economy was 25 percentage points away from the best regulatory performance constructed across all economies and across time."

The Ease of doing business score also considers some countries's trading indicators as pointed in [1]: "Doing Business recorded the time and cost associated with the logistical process of exporting and importing goods. Doing Business measured the time and cost (excluding tariffs) associated with three sets of procedures—documentary compliance, border compliance and domestic transport—within the overall process of exporting or importing a shipment of goods. The most recent round of data collection for the project was completed in May 2019." .

Estimating EDBS score using trading indicators

All the data from EDBS is available at the site:

https://archive.doingbusiness.org/content/dam/doingBusiness/excel/db2020/DB20_score_calculator.xlsx

Let's employ only trading indicators, from the online data, to build a multiple linear regression to predict EDBS score. But, first it is necessary to read and clean the data using the following Python commands. The next code declares which columns, from the Excel Table, will be employed.

# Selected columns

cols = [0]

cols.extend(list(range(99,119,1)))

cols.append(141)

cols

[0, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 141]

Estimating EDBS score using trading indicators

The next code employs the values in the previously defined variable cols to read only the columns related with trading performance.

import pandas as pd

url = "https://archive.doingbusiness.org/content/dam/doingBusiness/excel/db2020/DB20_score_calculator.xlsx"

df = pd.read_excel(url, skiprows=11, usecols=cols)

Removing columns (scores)

Some columns does not have a name that clearly relate it with some indicator. So, it is necessary to remove them using their index.

# Removing columns with scores using their index

df.drop(df.columns[[2, 4, 6, 8, 10, 12, 14, 16, 17, 18]], axis=1, inplace=True)

Removing empty rows (with NaN values)

Some rows does not have a number for some indicators (columns) and will be removed.

# First Removing all rows with columns with NaN.

df.dropna(axis = 0, how = 'all', inplace = True)

Removing also the last row

Since the last row does not have data of a specific country data, it should be removed.

df.drop(df.tail(1).index,inplace=True) # drop last 1 rows

Verifying the data type of every column

The data preparation also includes the verification on type of each column to enable the application on a linear regression model.

# get the data types of each column

print("\nData types of each column:")

print(df.dtypes)

Eliminating null values

Eliminating rows that have null values on the columns about 'Time to export'.

df = df[

pd.to_numeric(df['Time to export: Border compliance (hours)'], errors='coerce').notnull()

]

Converting all 'object' data into float

Although, some columns does not have float values and should be properly converted to float format to turn them ready to be employed in a linear regression model.

for column in df.columns[1:9]:

df[column] = df[column].astype(str).astype(float)

Converting all 'object' data into float

Although, some columns does not have float values and should be properly converted to float format to turn them ready to be employed in a linear regression model.

import seaborn as sns

# Selection: interest_rate unemployment_rate index_price

X = df[list(df.columns)[1:]]

sns.pairplot(X);

Selecting input factors for the multiple linear regression

Using the previous graphics, it could be said that the most promissing input factors are 1, 9, and 10. And exactly these factors have been selected to build a multiple linear regression model.

import pandas as pd

from sklearn import linear_model

def linear_regression_report(x, y):

regr_ = linear_model.LinearRegression()

regr_.fit(x, y)

return regr_

def model_report(model, name='None '):

print('-------------------------------')

print(name,' model')

print('-------------------------------')

print('Intercept: \n', model.intercept_)

print('Coefficients: \n', model.coef_)

# Defining independent and dependent variables

x = df[[list(df.columns)[1], list(df.columns)[9], list(df.columns)[10]]]

y = df[list(df.columns)[-1]]

# Linear regression with sklearn to predict score using simple average

model = linear_regression_report(x, y)

# Report about the linear regression model using score as an output

model_report(model, 'EDB Indicator using Factors 1, 9, and 10')

-------------------------------

EDB Indicator using Factors 1, 9, and 10 model

-------------------------------

Intercept:

56.19551585085501

Coefficients:

[-0.01300811 0.22942901 -0.08419487]

A more detailed multiple linear regression

It is possible to employ a more detailed multiple linear regression using the following Python commands.

import statsmodels.api as sm

def get_model(x, y):

# Obtaining multiple linear regression using statsmodels: exportation is the output

xm = sm.add_constant(x) # adding a constant

model = sm.OLS(y, xm).fit()

return model, xm

def print_model(model, name='None'):

summary = model.summary()

print('--------------------------------------')

print(name,'model')

print(summary)

print('--------------------------------------')

# Obtaining multiple linear regression using statsmodels: sa is the input

model, xm = get_model(x, y)

print_model(model,'Factors 1, 9, and 10')

Building model predictions for EDBS

Since the multiple linear regression model had been obtained, now it is necessary to obtain its predictions on EDBS.

import matplotlib.pyplot as plt

def model_predict(model, xm):

y_hat = model.predict(xm)

return y_hat

def draw_model(model, y, yname, y_hat):

xp = list(range(1,len(x)+1))

plt.plot(xp,y,'ob',label='Data')

plt.plot(xp,y,'-r')

plt.plot(xp,y_hat,'--g',label='Prediction')

plt.title('Prediction on ' + str(yname))

plt.xlabel('Time')

plt.ylabel(yname)

plt.legend()

plt.grid()

plt.show()

# Predict the values used to build the model coefficients.

y_hat = model_predict(model, xm)

print('%20s | %5s | %5s'%('Country','EDB Indicator','Predicted'))

for name,i,j in zip(df['Economy'],y, y_hat):

print(" %20s | %5.2f | %5.2f" % (name,i,j))

Country | EDB Indicator | Predicted

Afghanistan | 44.10 | 47.69

Albania | 67.70 | 76.07

Algeria | 48.60 | 49.48

Angola | 41.30 | 47.72

...

Vietnam | 69.80 | 62.97

West Bank and Gaza | 60.00 | 71.46

Zambia | 66.90 | 54.64

Zimbabwe | 54.50 | 54.12

Storing EDBS prediction

Now, let's store the predited EDBS into the original data table as a new column.

df['Predicted EDB'] = y_hat

Loading data to build a map

Plotly library offers an interesting feature: the possibility to draw a colored countour of each country according to the value of a specific metric from a dataframe column. The next code helps to build a choropleth map according to life expectancy for each country.

import plotly.express as px

df_map = px.data.gapminder().query("year==2007")

fig = px.choropleth(df_map, locations="iso_alpha",

color="lifeExp", # lifeExp is a column of gapminder

hover_name="country", # column to add to hover information

color_continuous_scale=px.colors.sequential.Plasma)

fig.show()