1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
Load the notebook with the commands developed in Track 02, section 1.3 (click on the link):
https://colab.research.google.com/drive/1OfDdV5wv_F8fkRpNWlcI8r553u0AsPZ1?usp=sharing
Re-run all the commands for importing and cleaning the immigration data (section 3).
Choose one country's data to predict its future immigration: exclude the country-name column and the immigration-total column.
country_data = list(df.iloc[192, 1:-1])   # row 192: the chosen country, dropping the name and total columns
years = list(df.columns)[1:-1]            # the year columns (1980-2013)
print(country_data)
print(years)
[1, 2, 1, 6, 0, 18, 7, 12, 7, 18, 4, 18, 41, 41, 39, 73, 144, 121, 141, 134, 122, 181, 171, 113, 124, 161, 140, 122, 133, 128, 211, 160, 174, 217]
[1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
Create a graph of the selected data to check whether a linear regression is appropriate.
import matplotlib.pyplot as plt
y = country_data
x = list(range(1, len(country_data) + 1))   # index 1..n standing in for the years 1980-2013
plt.plot(x, y, 'ob', x, y, '-r')            # blue data points plus a red connecting line
plt.xlabel('Years')
plt.ylabel('Immigration')
plt.grid()
plt.show()
Prepare the data in the required format and use a library to find the linear regression coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array(x).reshape((-1, 1))   # scikit-learn expects a 2-D array of predictors
y = np.array(y)
model = LinearRegression().fit(x, y)
Obtain the coefficients of the fitted linear regression model.
b0 = model.intercept_
b1 = model.coef_
print(f"intercept: {b0}")
print(f"slope: {b1}")
intercept: -27.417112299465217
slope: [6.58349885]
Use the predict method to evaluate the function given by the linear regression model at specific points.
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
predicted response: [-20.83361345 -14.25011459 -7.66661574 -1.08311688 5.50038197 12.08388083 18.66737968 25.25087853 31.83437739 38.41787624 45.0013751 51.58487395 58.1683728 64.75187166 71.33537051 77.91886937 84.50236822 91.08586707 97.66936593 104.25286478 110.83636364 117.41986249 124.00336134 130.5868602 137.17035905 143.75385791 150.33735676 156.92085561 163.50435447 170.08785332 176.67135218 183.25485103 189.83834989 196.42184874]
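Since the goal is to predict future immigration, the same predict method can also extrapolate beyond the observed data. A minimal sketch, assuming the index mapping used above (index 1 = 1980, so indices 35-37 correspond to 2014-2016):
# extrapolate to the three years after the data; indices 35-37 are assumed to map to 2014-2016
future_x = np.array([35, 36, 37]).reshape((-1, 1))
future_pred = model.predict(future_x)
print(f"predicted immigration for 2014-2016: {future_pred}")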
The next code obtains R-squared and Adjusted R-squared using the theoretical equations.
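For reference, with n observations and p predictors (here p = 1), those equations are:
R² = 1 − SS_Residual / SS_Total
Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)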
# compute with the formulas from the theory
yhat = y_pred
SS_Residual = sum((y - yhat)**2)
SS_Total = sum((y - np.mean(y))**2)
r_squared = 1 - float(SS_Residual) / SS_Total
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - x.shape[1] - 1)   # x.shape[1] = p, the number of predictors
print('r squared = ', r_squared)
print('Adjusted r squared = ', adjusted_r_squared)
r squared = 0.8361742162984227
Adjusted r squared = 0.8310546605577485
The R-squared can also be obtained through the built-in score method, but for the Adjusted R-squared the theoretical equation is still necessary.
r2 = model.score(x, y)
r2adjusted = 1 - (1-r2)*(len(y)-1)/(len(y)-x.shape[1]-1)
print('r2 = ',r2)
print('Adjusted r2 = ',r2adjusted)
r2 = 0.8361742162984227
Adjusted r2 = 0.8310546605577485
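The same value can also be computed with the r2_score function from sklearn.metrics; a minimal sketch using the y and y_pred arrays defined above:
from sklearn.metrics import r2_score
print('r2 = ', r2_score(y, y_pred))   # same value as model.score(x, y)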
Build the final graph, which also includes the values obtained using the linear regression model.
import matplotlib.pyplot as plt
y = country_data
x = list(range(1, len(country_data) + 1))   # back to a plain list for plotting
plt.plot(x,y,'ob',label='Data')
plt.plot(x,y,'-r')
plt.plot(x,y_pred,'--g',label='Prediction')
plt.xlabel('Years')
plt.ylabel('Immigration')
plt.legend()
plt.grid()
plt.show()
The next code shows how to obtain the P-values of the coefficients from the data [5].
import statsmodels.api as sm
# add a constant (intercept) term to the predictor variables
x = sm.add_constant(x)
# fit the linear regression model
model = sm.OLS(y, x).fit()
# view the model summary
print(model.summary())
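Note that, unlike scikit-learn, the fitted statsmodels results object exposes both metrics directly as attributes, so the theoretical equation is not needed here:
# R-squared and Adjusted R-squared straight from the statsmodels results object
print('r squared = ', model.rsquared)
print('Adjusted r squared = ', model.rsquared_adj)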
To obtain just the P-value of each linear regression coefficient, the next code will help.
# extract the p-values of all the coefficients
for i in range(0, 2):
    print(model.pvalues[i])
0.012319135978706514
4.0970438709442893e-14
Remember
If the P-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation.
So, since both P-values are below 0.05, it can be concluded that both coefficients are statistically significant at the 95% confidence level.
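The same conclusion can be reached programmatically; a small sketch comparing each P-value against the 0.05 significance level:
# compare each coefficient's P-value against the chosen significance level
alpha = 0.05
for name, p in zip(['intercept', 'slope'], model.pvalues):
    print(name, 'is significant' if p < alpha else 'is not significant')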
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1NaKfR6kyaXG8nW7m3xOR4CHnrrFEqm8x?usp=sharing