1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is no causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
Variables within a dataset can be related for lots of reasons[1]. For example:
One variable could cause or depend on the values of another variable.
One variable could be weakly associated with another variable.
Two variables could both depend on a third, unobserved variable.
It can be useful in data analysis and modeling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.
Correlation is a statistical measure of the extent to which two variables are linearly related. In statistics, it is defined by the Pearson correlation formula [2]:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )

Where: r is the correlation coefficient, xᵢ is the ith value of dataset X, x̄ is the mean of dataset X, yᵢ is the ith value of dataset Y, and ȳ is the mean of dataset Y.
A correlation can be positive, meaning both variables move in the same direction, or negative, meaning that when one variable's value increases, the other variable's value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated. In summary:
Positive Correlation: Both variables change in the same direction.
Neutral Correlation: No relationship in the change of the variables.
Negative Correlation: Variables change in opposite directions.
The next figure provides a graphical illustration of the three previous possible cases.
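The three cases can also be illustrated numerically. The following is a minimal sketch using synthetic data (the variable names and the noise scale are illustrative choices, not from the original text): a variable built to move with a base series, one built to move against it, and an independent noise series.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=n)

pos = 2 * base + rng.normal(scale=0.1, size=n)   # moves with base
neg = -2 * base + rng.normal(scale=0.1, size=n)  # moves against base
noise = rng.normal(size=n)                        # unrelated to base

print(np.corrcoef(base, pos)[0, 1])    # close to +1 (positive correlation)
print(np.corrcoef(base, neg)[0, 1])    # close to -1 (negative correlation)
print(np.corrcoef(base, noise)[0, 1])  # close to 0 (neutral correlation)
```

The coefficients land near +1, -1, and 0 respectively, matching the three cases listed above.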
The next code shows how to compute correlation using three techniques: a step-by-step function, and built-in functions from the numpy and pandas libraries.
First, let's define the data that will be analyzed.
import numpy as np
# Define the dataset
x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])
print(x)
print(y)
[ 1 3 5 7 8 9 10 15]
[10 20 30 40 50 60 70 80]
The next code creates the related graphic.
import matplotlib.pyplot as plt
plt.plot(x,y,'ob',x,y,'-r')
plt.xlabel('X')
plt.ylabel('Y')
plt.grid()
plt.show()
The next code creates and tests a function that computes the correlation between x & y, and between x & x.
def Pearson_correlation(X, Y):
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean())**2)
        Sum_y_squared = sum((Y - Y.mean())**2)
        corr = Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)
        return corr
print(Pearson_correlation(x,y))
print(Pearson_correlation(x,x))
0.974894414261588
1.0
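A useful property to verify with this function: the Pearson coefficient is invariant under shifting and positive rescaling of either variable, since the formula centers each variable by its mean and normalizes by its spread. The following sketch checks this (the function and data are repeated here so the snippet runs on its own):

```python
import numpy as np

def Pearson_correlation(X, Y):
    if len(X) == len(Y):
        Sum_xy = sum((X - X.mean()) * (Y - Y.mean()))
        Sum_x_squared = sum((X - X.mean())**2)
        Sum_y_squared = sum((Y - Y.mean())**2)
        return Sum_xy / np.sqrt(Sum_x_squared * Sum_y_squared)

x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Shifting and positively scaling y leaves the coefficient unchanged
r1 = Pearson_correlation(x, y)
r2 = Pearson_correlation(x, 3 * y + 100)
print(r1)
print(r2)  # same value as r1
```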
It is also possible to employ a numpy built-in function that computes the correlation for all variable pairs at once, returning a matrix in which the first row holds corr(x, x) and corr(x, y), and the second row holds corr(y, x) and corr(y, y).
print(np.corrcoef(x, y))
[[1. 0.97489441]
[0.97489441 1. ]]
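When only the single x–y coefficient is needed, it can be read off the matrix directly: the diagonal entries are always 1 (each variable with itself), and the off-diagonal entry holds the coefficient between the two variables. A short sketch:

```python
import numpy as np

x = np.array([1, 3, 5, 7, 8, 9, 10, 15])
y = np.array([10, 20, 30, 40, 50, 60, 70, 80])

# Index [0, 1] picks the off-diagonal entry: corr(x, y)
r = np.corrcoef(x, y)[0, 1]
print(r)
```

This prints the same 0.9748... value computed by the step-by-step function above.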
Finally, to employ the pandas command, it is first necessary to convert the data into a DataFrame.
import numpy as np
import pandas as pd
data = np.array([x,y])
df = pd.DataFrame({'X': data[0, :], 'Y': data[1, :]})
print(df)
X Y
0 1 10
1 3 20
2 5 30
3 7 40
4 8 50
5 9 60
6 10 70
7 15 80
Then, employ the command that computes the correlation matrix, in the same way as done by the numpy command.
# Compute the Pearson correlation matrix
corr = df.corr(method='pearson')
print(corr)
X Y
X 1.000000 0.974894
Y 0.974894 1.000000
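As with numpy, pandas can also return the single coefficient directly: the Series-level corr method compares two columns without building the full matrix (Pearson is its default method). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 3, 5, 7, 8, 9, 10, 15],
                   'Y': [10, 20, 30, 40, 50, 60, 70, 80]})

# Series.corr compares two columns and returns a scalar
r = df['X'].corr(df['Y'])
print(r)
```

This is convenient when the DataFrame has many columns and only one pair is of interest.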
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1ewg9kfUYD-ucdpxDy76ipRmRPr8YZUi7?usp=sharing
One final warning is that correlation is not causation. The next figure helps to illustrate this point in a humorous manner [3].
From the figure, it could be said that correlation is a relationship or connection between two variables in which, whenever one changes, the other is likely to change as well [3]. But a change in one variable does not necessarily cause the other to change: that is correlation, not causation.
Finally, five real-world examples can help to understand this aspect [4]:
Example 1: Ice Cream Sales & Shark Attacks,
Example 2: Master’s Degrees vs. Box Office Revenue,
Example 3: Pool Drownings vs. Nuclear Energy Production,
Example 4: Measles Cases vs. Marriage Rate,
Example 5: High School Graduates vs. Pizza Consumption.
Please, see [4] for more details.