1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Export & Import Volume
Multiple linear regression is an extension of simple linear regression: instead of a single predictor, it uses two or more independent variables (X) to predict one continuous dependent variable (Y) by fitting the best linear relationship [1].
It is thus an approach for predicting a quantitative response from multiple features through the following linear equation:
Y = β0 + β1X1 + β2X2 + β3X3 + … + βnXn + e
Where:
Y = Dependent variable / Target variable,
β0 = Intercept of the regression line,
β1, β2, β3, …, βn = Slopes of the regression line (one coefficient per predictor); the sign of each tells whether Y increases or decreases with that variable,
X1, X2, X3, …, Xn = Independent variables / Predictor variables,
e = Error.
Example: predicting sales based on the money spent on TV, Radio, and YouTube for marketing. In this case, there are three independent variables (money spent on TV, Radio, and YouTube) and one dependent variable (sales), which is the value to be predicted.
Before you execute a linear regression model, it is advisable to validate that certain assumptions are met. As noted earlier, you may want to check that a linear relationship exists between the dependent variable and each independent variable. To perform a quick linearity check, you can use scatter diagrams (utilizing the Seaborn library) [2].
The next numerical example will explore whether a linear relationship exists between [3]:
index_price (dependent variable) and interest_rate (independent variable),
index_price (dependent variable) and unemployment_rate (independent variable).
Let's define the data for the numerical example.
import pandas as pd
data = {'year': [2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016],
'month': [12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1],
'interest_rate': [2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75],
'unemployment_rate': [5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1],
'index_price': [1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719]
}
df = pd.DataFrame(data)
df
The next code automatically produces scatter plots among the selected variables, organized as a matrix (a pairs plot).
import seaborn as sns
# Select the columns interest_rate, unemployment_rate, and index_price
X = df[['interest_rate', 'unemployment_rate', 'index_price']]
sns.pairplot(X);
From the last row of graphics, it is possible to extract the following conclusions:
A linear relationship exists between the index_price and the interest_rate. Specifically, when interest rates go up, the index price also goes up.
A linear relationship also exists between the index_price and the unemployment_rate – when the unemployment rates go up, the index price goes down (here we still have a linear relationship but with a negative slope).
The values of index_price are spread fairly evenly across their range (the diagonal histogram suggests a roughly uniform distribution).
The next step is to obtain the model coefficients and give an interpretation of them. This is done in the next code.
import pandas as pd
from sklearn import linear_model
import statsmodels.api as sm
x = df[['interest_rate','unemployment_rate']]
y = df['index_price']
# with sklearn
regr = linear_model.LinearRegression()
regr.fit(x, y)
print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)
Intercept: 1798.4039776258544
Coefficients: [ 345.54008701 -250.14657137]
This output includes the intercept and coefficients. You can use this information to build the multiple linear regression equation as follows:
index_price = (intercept) + (interest_rate coef)*X1 + (unemployment_rate coef)*X2
And once you plug the numbers:
index_price = (1798.4040) + (345.5401)*X1 + (-250.1466)*X2
Before adopting the model coefficients to construct a price predictor, it is necessary to check their related statistics: in particular, how good the fit is (R-squared and Adjusted R-squared) and how trustworthy each coefficient is (the p-value associated with each coefficient). All this can be achieved through the following code.
# with statsmodels
xm = sm.add_constant(x) # adding a constant
model = sm.OLS(y, xm).fit()
print_model = model.summary()
print(print_model)
Now the model can be employed to predict index_price from interest_rate and unemployment_rate, and to show the prediction together with the real data in a graphic.
import matplotlib.pyplot as plt
y_hat = model.predict(xm)        # in-sample predictions
xp = list(range(1, len(x) + 1))  # simple time index for the x-axis
plt.plot(xp, y, 'ob', label='Data')             # observed values as blue dots
plt.plot(xp, y, '-r')                           # red line connecting the observations
plt.plot(xp, y_hat, '--g', label='Prediction')  # predictions as a dashed green line
plt.xlabel('Time')
plt.ylabel('index_price')
plt.legend()
plt.grid()
plt.show()
For linear regression models, there are three common metrics for performance evaluation [4]:
Mean Absolute Error (MAE): the average of the absolute differences between the actual values and the predicted values. The lower the value, the better the model's performance. An MAE of 0 means the model is a perfect predictor of the outputs.
Mean Square Error (MSE): the average of the squared differences between the actual and predicted values. The lower the value, the better the model's performance.
Root Mean Square Error (RMSE): the square root of the Mean Square Error. It expresses the typical size of the prediction errors in the same units as the dependent variable. The lower the value, the better the model's performance.
In mathematical terms, with n observations, actual values yi, and predicted values ŷi, we have [5]:
MAE = (1/n) Σ |yi - ŷi|
MSE = (1/n) Σ (yi - ŷi)²
RMSE = √MSE
The next code computes these metrics for the fitted model.
#Model Evaluation
from sklearn import metrics
import numpy as np
meanAbErr = metrics.mean_absolute_error(y, y_hat)
meanSqErr = metrics.mean_squared_error(y, y_hat)
rootMeanSqErr = np.sqrt(metrics.mean_squared_error(y, y_hat))
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
Mean Absolute Error: 51.77251657344119
Mean Square Error: 4356.611357123128
Root Mean Square Error: 66.00463133086289
All three metrics are useful for comparing different linear regression models, for example, to verify whether including an additional variable improves the model's performance. A quick sketch of such a comparison follows.
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1UCEYruS4At3frlJPxYhAXQxrN5CBNuls?usp=sharing