1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
This case study presents a simplified methodology for the problem and data described in the article "Modelling the Philippines’ Revenue Collection Performance: A Study Using Multiple Linear Regression Analysis", available at:
https://ijrpr.com/uploads/V3ISSUE11/IJRPR8001.pdf
To build the linear regression model, the dataset provides five independent variables: inflation rate, forex rate, import value, export value, and stock price index. These variables can be used to predict the Philippines’ revenue.
Let's employ these trade and economic indicators from the online data to build a multiple linear regression model for predicting the Philippines’ revenue. But first, it is necessary to read and clean the data using the following Python commands. The next code reads the Excel table directly from its shared Google Drive link.
import pandas as pd
# Original shared link: https://docs.google.com/spreadsheets/d/1m9bOb2Kw-kq6j_6Ji75h6cJK5-THhSL1/edit?usp=sharing&ouid=106640872116257813737&rtpof=true&sd=true
url = "https://drive.google.com/file/d/1m9bOb2Kw-kq6j_6Ji75h6cJK5-THhSL1/view?usp=sharing"
# Build a direct-download URL from the file id embedded in the shared link
url2 = 'https://drive.google.com/uc?id=' + url.split('/')[-2]
df = pd.read_excel(url2)
df
Now, let's check data validity by inspecting the type of each column.
print(df.dtypes)
Date object
Revenue Collections (LNY) float64
Inflation Rate(x1) float64
Forex Rate Average (x2) float64
Import (FOB Value in Million USD) (x3) float64
Export (FOB Value in Million USD) (x4) float64
Stock Price Index (x5) float64
dtype: object
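The Date column is read as a generic object. It is dropped before modelling, but if it were needed (for instance as a time index), it could be parsed explicitly; this optional step assumes pandas recognizes the date format used in the spreadsheet.
# Optional: convert the Date column to proper datetimes (the column is dropped later anyway)
df['Date'] = pd.to_datetime(df['Date'])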
It is also important to check data validity by verifying whether there are any null values.
df.isnull().values.any()
False
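Had any null values been found, they would have to be handled before modelling. A minimal sketch, not needed for this dataset, would be to drop or fill the affected rows.
# Only needed if the check above had returned True:
# df = df.dropna()                            # drop rows with missing values
# df = df.fillna(df.mean(numeric_only=True))  # or fill them with column means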
Since there are no null values, let's filter the data to keep only the numerical columns.
# Remove the non-numeric column (Date)
df2 = df.drop(df.columns[[0]], axis=1, inplace=False)
# Remove duplicate rows, keeping the result
df2 = df2.drop_duplicates()
df2
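Before moving on, a quick statistical summary of the numeric columns is a useful sanity check; this optional step uses the standard pandas describe method.
# Summary statistics (count, mean, std, min, quartiles, max) of each column
df2.describe()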
Now that the dataset has been cleaned, it is time to extract statistics to better understand how to model the multiple linear regression. The first step is to compute the Pearson correlation between the variables.
pearsoncorr = df2.corr(method='pearson')
pearsoncorr
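To make the comparison easier, the correlation of each predictor with the revenue column can also be listed in decreasing order; the snippet below simply reorders values already contained in pearsoncorr.
# Correlation of each predictor with the revenue column, strongest first
pearsoncorr['Revenue Collections (LNY)'].drop('Revenue Collections (LNY)').sort_values(ascending=False)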
The Pearson correlation matrix can be visualized as a heatmap.
# Import the required library
import seaborn as sns
# Display the correlation matrix as a heatmap
# with the diverging colourmap RdYlGn
sns.heatmap(pearsoncorr, cmap='RdYlGn', linewidths=0.30, annot=True);
A scatter-plot matrix is another way to visualize correlation patterns between the independent and dependent variables, as done in the next commands.
import seaborn as sns
X = df2[list(df2.columns)]
sns.pairplot(X);
First, select the column that will be the output of the multiple linear regression model.
# Y output column, revenue is the output
out_col = 'Revenue Collections (LNY)' # 'Revenue Collections (y)'
y_revenue = df2[out_col]
y_revenue
0 11.36
1 11.14
2 11.23
3 11.67
4 11.46
...
68 11.86
69 11.87
70 12.10
71 11.94
72 12.08
Name: Revenue Collections (LNY), Length: 73, dtype: float64
Then define the columns that will be the independent variables of the model.
# X 5 predictor variables
x = df2[['Inflation Rate(x1)',
'Forex Rate Average (x2)',
'Import (FOB Value in Million USD) (x3)',
'Export (FOB Value in Million USD) (x4)',
'Stock Price Index (x5)']]
x
Another possibility is to select only the columns that showed higher Pearson correlation values or a clear linear pattern in the scatter-plot matrix, i.e., variables x3 and x5, as done in the code below.
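The reduced design matrix x_selected keeps only these two columns; it is the variable used when fitting the second model further down.
# X with only the two predictors selected from the correlation analysis
x_selected = df2[['Import (FOB Value in Million USD) (x3)',
                  'Stock Price Index (x5)']]
x_selected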
Now it is possible to build multiple linear regression models with all variables and with only the selected variables, and to print a detailed report for each.
# Obtain multiple linear regression using statsmodels
import statsmodels.api as sm

# Fit a multiple linear regression and return its results.
def get_model(x, y):
    xm = sm.add_constant(x)  # adding a constant (intercept) term
    model = sm.OLS(y, xm).fit()
    return model, xm

# Print a report about the model.
def print_model(model, name='None'):
    summary = model.summary()
    print('--------------------------------------')
    print(name, 'model')
    print(summary)
    print('--------------------------------------')

# Multiple linear regression using statsmodels
model_revenue, xm_import = get_model(x, y_revenue)
print_model(model_revenue, 'Revenue - All five')
model_revenue_selected, xm_import_selected = get_model(x_selected, y_revenue)
print_model(model_revenue_selected, 'Revenue - x3 and x5')
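Beyond the printed summaries, a few goodness-of-fit statistics can be read directly from the fitted results to compare the two models; this is an optional check using standard statsmodels attributes (rsquared, rsquared_adj, aic).
# Compare goodness of fit of the full and the reduced model
for name, m in [('All five', model_revenue), ('x3 and x5', model_revenue_selected)]:
    print(f"{name}: R2 = {m.rsquared:.3f}, adj. R2 = {m.rsquared_adj:.3f}, AIC = {m.aic:.1f}")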
Finally, it is possible to use both models to predict the dependent variable and plot the results to verify the quality of each model's prediction.
import matplotlib.pyplot as plt

# Predict the output values for a given design matrix.
def model_predict(model, xm):
    y_hat = model.predict(xm)
    return y_hat

# Plot the observed data against the model prediction.
def draw_model(model, y, yname, y_hat):
    xp = list(range(1, len(y) + 1))
    plt.plot(xp, y, 'ob', label='Data')
    plt.plot(xp, y, '-r')
    plt.plot(xp, y_hat, '--g', label='Prediction')
    plt.title('Prediction on ' + str(yname))
    plt.xlabel('Time')
    plt.ylabel(yname)
    plt.legend()
    plt.grid()
    plt.show()
# Predict the values used to build the model with all five predictors.
y_hat = model_predict(model_revenue, xm_import)
#print(y_hat)
# Plot the observed data against the prediction.
draw_model(model_revenue, y_revenue, 'Revenue', y_hat)

# Predict the values used to build the model with only x3 and x5.
y_hat_selected = model_predict(model_revenue_selected, xm_import_selected)
#print(y_hat_selected)
# Plot the observed data against the prediction.
draw_model(model_revenue_selected, y_revenue, 'Selected Revenue', y_hat_selected)
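A numerical error measure complements the plots. As a simple sketch, the root-mean-square error of both predictions can be computed with NumPy; the rmse helper below is introduced here only for illustration.
import numpy as np
# Root-mean-square error between observed and predicted values
def rmse(y, y_hat):
    return np.sqrt(np.mean((np.asarray(y) - np.asarray(y_hat)) ** 2))
print('RMSE (all five):  ', rmse(y_revenue, y_hat))
print('RMSE (x3 and x5): ', rmse(y_revenue, y_hat_selected))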
The Python code with all the steps is available in this Google Colab notebook:
https://colab.research.google.com/drive/1zRkBkIjKUxieS2hR-bXSmv-9Wu_zL9eR?usp=sharing