1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
Load the notebook with the commands developed in Track 02, section 1.3 (click on the link):
https://colab.research.google.com/drive/1OfDdV5wv_F8fkRpNWlcI8r553u0AsPZ1?usp=sharing
Re-run all the commands for importing and cleaning the immigration data (section 3).
Choose one country's data to predict its future immigration: exclude the country-name column and the immigration-total column.
country_data = list(df.iloc[192, 1:-1])   # row 192: the chosen country, dropping the name and total columns
years = list(df.columns)[1:-1]            # the year columns (1980-2013)
print(country_data)
print(years)
[1, 2, 1, 6, 0, 18, 7, 12, 7, 18, 4, 18, 41, 41, 39, 73, 144, 121, 141, 134, 122, 181, 171, 113, 124, 161, 140, 122, 133, 128, 211, 160, 174, 217]
[1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013]
Create a graph of the selected data to check whether a linear regression is appropriate.
import matplotlib.pyplot as plt
y = country_data
x = list(range(1, len(country_data) + 1))   # index 1..n standing in for the years 1980-2013
plt.plot(x, y, 'ob', x, y, '-r')            # blue data points plus a red connecting line
plt.xlabel('Years')
plt.ylabel('Immigration')
plt.grid()
plt.show()
Prepare the data in the required format and use a library to find the linear regression coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression
x = np.array(x).reshape((-1, 1))   # scikit-learn expects a 2-D array of predictors
y = np.array(y)
model = LinearRegression().fit(x, y)
Obtain the coefficients of the fitted linear regression model.
b0 = model.intercept_
b1 = model.coef_
print(f"intercept: {b0}")
print(f"slope: {b1}")
intercept: -27.417112299465217
slope: [6.58349885]
Use the predict method to evaluate the function given by the linear regression model at specific points.
y_pred = model.predict(x)
print(f"predicted response:\n{y_pred}")
predicted response: [-20.83361345 -14.25011459 -7.66661574 -1.08311688 5.50038197 12.08388083 18.66737968 25.25087853 31.83437739 38.41787624 45.0013751 51.58487395 58.1683728 64.75187166 71.33537051 77.91886937 84.50236822 91.08586707 97.66936593 104.25286478 110.83636364 117.41986249 124.00336134 130.5868602 137.17035905 143.75385791 150.33735676 156.92085561 163.50435447 170.08785332 176.67135218 183.25485103 189.83834989 196.42184874]
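Since the goal is to predict future immigration, the same predict method can also extrapolate beyond the observed data. A minimal sketch, assuming the index mapping used above (index 1 = 1980, so indices 35-37 correspond to 2014-2016):
# extrapolate to the three years after the data; indices 35-37 are assumed to map to 2014-2016
future_x = np.array([35, 36, 37]).reshape((-1, 1))
future_pred = model.predict(future_x)
print(f"predicted immigration for 2014-2016: {future_pred}")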
The next code obtains R-squared and Adjusted R-squared using the theoretical equations.
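For reference, with n observations and p predictors (here p = 1), those equations are:
R² = 1 − SS_Residual / SS_Total
Adjusted R² = 1 − (1 − R²) · (n − 1) / (n − p − 1)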
# compute with the formulas from the theory
yhat = y_pred
SS_Residual = sum((y - yhat)**2)
SS_Total = sum((y - np.mean(y))**2)
r_squared = 1 - float(SS_Residual) / SS_Total
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - x.shape[1] - 1)   # x.shape[1] = p, the number of predictors
print('r squared = ', r_squared)
print('Adjusted r squared = ', adjusted_r_squared)
r squared = 0.8361742162984227
Adjusted r squared = 0.8310546605577485
The R-squared can also be obtained through the built-in score method, but for the Adjusted R-squared the theoretical equation is still necessary.
r2 = model.score(x, y)
r2adjusted = 1 - (1-r2)*(len(y)-1)/(len(y)-x.shape[1]-1)
print('r2 = ',r2)
print('Adjusted r2 = ',r2adjusted)
r2 = 0.8361742162984227
Adjusted r2 = 0.8310546605577485
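The same value can also be computed with the r2_score function from sklearn.metrics; a minimal sketch using the y and y_pred arrays defined above:
from sklearn.metrics import r2_score
print('r2 = ', r2_score(y, y_pred))   # same value as model.score(x, y)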
Build the final graph, which also includes the values obtained using the linear regression model.
import matplotlib.pyplot as plt
y = country_data
x = list(range(1, len(country_data) + 1))   # back to a plain list for plotting
plt.plot(x,y,'ob',label='Data')
plt.plot(x,y,'-r')
plt.plot(x,y_pred,'--g',label='Prediction')
plt.xlabel('Years')
plt.ylabel('Immigration')
plt.legend()
plt.grid()
plt.show()
The next code shows how to obtain the P-values of the coefficients from the data [5].
import statsmodels.api as sm
# add a constant (intercept) term to the predictor variables
x = sm.add_constant(x)
# fit the linear regression model
model = sm.OLS(y, x).fit()
# view the model summary
print(model.summary())
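Note that, unlike scikit-learn, the fitted statsmodels results object exposes both metrics directly as attributes, so the theoretical equation is not needed here:
# R-squared and Adjusted R-squared straight from the statsmodels results object
print('r squared = ', model.rsquared)
print('Adjusted r squared = ', model.rsquared_adj)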
To obtain just the P-value of each linear regression coefficient, the next code will help.
# extract the p-values of all the coefficients
for i in range(0, 2):
    print(model.pvalues[i])
0.012319135978706514
4.0970438709442893e-14
Remember
If the P-value for a variable is less than your significance level, your sample data provide enough evidence to reject the null hypothesis for the entire population. Your data favor the hypothesis that there is a non-zero correlation.
So, since both P-values are below 0.05, it can be concluded that both coefficients are statistically significant at the 95% confidence level.
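The same conclusion can be reached programmatically; a small sketch comparing each P-value against the 0.05 significance level:
# compare each coefficient's P-value against the chosen significance level
alpha = 0.05
for name, p in zip(['intercept', 'slope'], model.pvalues):
    print(name, 'is significant' if p < alpha else 'is not significant')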
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1NaKfR6kyaXG8nW7m3xOR4CHnrrFEqm8x?usp=sharing