1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is no causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
2.2. Cumulative Probability Predictions
2.3. Trading indicators to estimate EDBS
According to [1]: "The ease of doing business score helps assess the absolute level of regulatory performance over time. It captures the gap of each economy from the best regulatory performance observed on each of the indicators across all economies in the Doing Business sample since 2005. One can both see the gap between a particular economy’s performance and the best performance at any point in time and assess the absolute change in the economy’s regulatory environment over time as measured by Doing Business. An economy’s ease of doing business score is reflected on a scale from 0 to 100, where 0 represents the lowest and 100 represents the best performance. For example, an ease of doing business score of 75 in Doing Business 2019 means an economy was 25 percentage points away from the best regulatory performance constructed across all economies and across time."
The Ease of doing business score also considers some countries's trading indicators as pointed in [1]: "Doing Business recorded the time and cost associated with the logistical process of exporting and importing goods. Doing Business measured the time and cost (excluding tariffs) associated with three sets of procedures—documentary compliance, border compliance and domestic transport—within the overall process of exporting or importing a shipment of goods. The most recent round of data collection for the project was completed in May 2019." .
All the data from EDBS is available at the site:
https://archive.doingbusiness.org/content/dam/doingBusiness/excel/db2020/DB20_score_calculator.xlsx
Let's employ only trading indicators, from the online data, to build a multiple linear regression to predict EDBS score. But, first it is necessary to read and clean the data using the following Python commands. The next code declares which columns, from the Excel Table, will be employed.
# Selected columns
cols = [0]
cols.extend(list(range(99,119,1)))
cols.append(141)
cols
[0, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 141]
The next code employs the values in the previously defined variable cols to read only the columns related with trading performance.
import pandas as pd
url = "https://archive.doingbusiness.org/content/dam/doingBusiness/excel/db2020/DB20_score_calculator.xlsx"
df = pd.read_excel(url, skiprows=11, usecols=cols)
df
Some columns does not have a name that clearly relate it with some indicator. So, it is necessary to remove them using their index.
# Removing columns with scores using their index
df.drop(df.columns[[2, 4, 6, 8, 10, 12, 14, 16, 17, 18]], axis=1, inplace=True)
df
Some rows does not have a number for some indicators (columns) and will be removed.
# First Removing all rows with columns with NaN.
df.dropna(axis = 0, how = 'all', inplace = True)
df
Since the last row does not have data of a specific country data, it should be removed.
df.drop(df.tail(1).index,inplace=True) # drop last 1 rows
df
The data preparation also includes the verification on type of each column to enable the application on a linear regression model.
# get the data types of each column
print("\nData types of each column:")
print(df.dtypes)
Eliminating rows that have null values on the columns about 'Time to export'.
df = df[
pd.to_numeric(df['Time to export: Border compliance (hours)'], errors='coerce').notnull()
]
df
Although, some columns does not have float values and should be properly converted to float format to turn them ready to be employed in a linear regression model.
for column in df.columns[1:9]:
df[column] = df[column].astype(str).astype(float)
df
Although, some columns does not have float values and should be properly converted to float format to turn them ready to be employed in a linear regression model.
import seaborn as sns
# Selection: interest_rate unemployment_rate index_price
X = df[list(df.columns)[1:]]
sns.pairplot(X);
Using the previous graphics, it could be said that the most promissing input factors are 1, 9, and 10. And exactly these factors have been selected to build a multiple linear regression model.
import pandas as pd
from sklearn import linear_model
def linear_regression_report(x, y):
regr_ = linear_model.LinearRegression()
regr_.fit(x, y)
return regr_
def model_report(model, name='None '):
print('-------------------------------')
print(name,' model')
print('-------------------------------')
print('Intercept: \n', model.intercept_)
print('Coefficients: \n', model.coef_)
# Defining independent and dependent variables
x = df[[list(df.columns)[1], list(df.columns)[9], list(df.columns)[10]]]
y = df[list(df.columns)[-1]]
# Linear regression with sklearn to predict score using simple average
model = linear_regression_report(x, y)
# Report about the linear regression model using score as an output
model_report(model, 'EDB Indicator using Factors 1, 9, and 10')
-------------------------------
EDB Indicator using Factors 1, 9, and 10 model
-------------------------------
Intercept:
56.19551585085501
Coefficients:
[-0.01300811 0.22942901 -0.08419487]
It is possible to employ a more detailed multiple linear regression using the following Python commands.
import statsmodels.api as sm
def get_model(x, y):
# Obtaining multiple linear regression using statsmodels: exportation is the output
xm = sm.add_constant(x) # adding a constant
model = sm.OLS(y, xm).fit()
return model, xm
def print_model(model, name='None'):
summary = model.summary()
print('--------------------------------------')
print(name,'model')
print(summary)
print('--------------------------------------')
# Obtaining multiple linear regression using statsmodels: sa is the input
model, xm = get_model(x, y)
print_model(model,'Factors 1, 9, and 10')
Since the multiple linear regression model had been obtained, now it is necessary to obtain its predictions on EDBS.
import matplotlib.pyplot as plt
def model_predict(model, xm):
y_hat = model.predict(xm)
return y_hat
def draw_model(model, y, yname, y_hat):
xp = list(range(1,len(x)+1))
plt.plot(xp,y,'ob',label='Data')
plt.plot(xp,y,'-r')
plt.plot(xp,y_hat,'--g',label='Prediction')
plt.title('Prediction on ' + str(yname))
plt.xlabel('Time')
plt.ylabel(yname)
plt.legend()
plt.grid()
plt.show()
# Predict the values used to build the model coefficients.
y_hat = model_predict(model, xm)
print('%20s | %5s | %5s'%('Country','EDB Indicator','Predicted'))
for name,i,j in zip(df['Economy'],y, y_hat):
print(" %20s | %5.2f | %5.2f" % (name,i,j))
Country | EDB Indicator | Predicted
Afghanistan | 44.10 | 47.69
Albania | 67.70 | 76.07
Algeria | 48.60 | 49.48
Angola | 41.30 | 47.72
...
Vietnam | 69.80 | 62.97
West Bank and Gaza | 60.00 | 71.46
Zambia | 66.90 | 54.64
Zimbabwe | 54.50 | 54.12
Now, let's store the predited EDBS into the original data table as a new column.
df['Predicted EDB'] = y_hat
df
Plotly library offers an interesting feature: the possibility to draw a colored countour of each country according to the value of a specific metric from a dataframe column. The next code helps to build a choropleth map according to life expectancy for each country.
import plotly.express as px
df_map = px.data.gapminder().query("year==2007")
fig = px.choropleth(df_map, locations="iso_alpha",
color="lifeExp", # lifeExp is a column of gapminder
hover_name="country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
It is important to stress that the data to generate the map had been loaded from Plotly library in the variable df_map as shown in the next code.
df_map
The next code helps to merge data from Easy of Doing Business and Plotly maps dataframes into a single dataframe.
result = pd.merge(df.iloc[:,[0,1,9,10,11,12]], df_map, left_on='Economy', right_on='country', how="inner")
result
With the new dataframe it is possible to create a choropleth map using EDBS to color the countries.
import plotly.express as px
fig = px.choropleth(result, locations="iso_alpha",
color="Overall ease of doing business score as of current data (0-100)", # lifeExp is a column of gapminder
hover_name="country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
With the new dataframe it is also possible to create a choropleth map using the predicted EDBS to color the countries.
import plotly.express as px
fig = px.choropleth(result, locations="iso_alpha",
color="Predicted EDB", # lifeExp is a column of gapminder
hover_name="country", # column to add to hover information
color_continuous_scale=px.colors.sequential.Plasma)
fig.show()
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1or0efvA0ZR14j5eiCJMgZdF4-DhErYLq?usp=sharing