1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.4. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Startup Profits
We will load the 50 Startups dataset from Kaggle. The dataset is a CSV file with data on 50 business startups from New York, California, and Florida (roughly 17 per state). The variables in the dataset are Profit, R&D Spend, Administration, Marketing Spend, and the categorical variable State. Our main goal is to predict the profit. The file with the data set can be found at the following internet address [1, 2]:
The next code helps to read the data set from an internet address.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
url = 'https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv'
startups_df = pd.read_csv(url)
startups_df.head()
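Before going further, it is worth checking that the file was read correctly. A minimal sanity check (the expected shape and column names are an assumption based on the public 50_Startups.csv file):
# Quick sanity check of the loaded data
print(startups_df.shape)             # expected: (50, 5)
print(startups_df.columns.tolist())  # R&D Spend, Administration, Marketing Spend, State, Profit
print(startups_df.isnull().sum())    # count missing values per column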
The next code helps to visualize the correlation among the quantitative variables using a heatmap and a scatter-plot matrix (pairplot) [3].
# Correlation heatmap of the quantitative variables
# (numeric_only=True excludes the categorical column State)
sns.heatmap(startups_df.corr(numeric_only=True), annot=True)
plt.show()
# Scatter-plot matrix (pairplot) of all the variables
sns.pairplot(startups_df)
plt.show()
From the last row of graphics it is possible to extract the following conclusions:
A linear relationship exists between the profit and the R&D Spend. Specifically, when the R&D Spend goes up, the profit also goes up.
A more diffuse linear relationship also exists between the profit and the Marketing Spend: when the Marketing Spend goes up, the profit goes up (again a linear relationship with a positive slope).
The values of profit appear to follow an approximately normal distribution.
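These visual impressions can be double-checked numerically. A minimal sketch that ranks the Pearson correlation of each quantitative variable with the profit:
# Correlation of each quantitative variable with Profit
print(startups_df.corr(numeric_only=True)['Profit'].sort_values(ascending=False))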
The next code transforms the categorical variable State into dummy variables. Since the original variable can assume three values and it is necessary to avoid the dummy variable trap, only two new columns are created instead of three [4].
# SPLITTING THE DATA INTO INDEPENDENT (X) AND DEPENDENT (y) VARIABLES
X = startups_df.iloc[:, :-1]  # independent variables
y = startups_df.iloc[:, -1]   # dependent variable (Profit)
X = pd.get_dummies(X, columns=['State'], drop_first=True)
X.head()
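Note that OneHotEncoder, imported at the beginning, could produce an equivalent encoding. A minimal sketch, assuming scikit-learn ≥ 1.2 (older versions use sparse=False instead of sparse_output=False):
# Equivalent encoding with scikit-learn, also dropping the first
# category to avoid the dummy variable trap
encoder = OneHotEncoder(drop='first', sparse_output=False)
state_dummies = encoder.fit_transform(startups_df[['State']])
print(encoder.get_feature_names_out(['State']))  # e.g. ['State_Florida' 'State_New York']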
The next code explains how to create a multiple linear regression with these new dummy variables, and obtain its corresponding coefficients.
# FITTING THE MODEL / TRAINING
regressor = LinearRegression()  # instantiate a LinearRegression object
regressor.fit(X, y)             # fit the model on the whole dataset
print('Coefficients: ', regressor.coef_)
print('Intercept: ', regressor.intercept_)
Coefficients: [ 8.06023114e-01 -2.70043196e-02 2.69798610e-02 1.98788793e+02 -4.18870191e+01]
Intercept: 50125.343831604216
This output includes the intercept and the coefficients. You can use this information to build the multiple linear regression equation as follows:
profit = (intercept) + (R&D_Spend coef)*X1 + (Administration coef)*X2 + (Marketing_Spend coef)*X3 + (State_Florida coef)*X4 + (State_New_York coef)*X5
And once you plug in the numbers:
profit = 50125.3438 + 0.8060*X1 - 0.0270*X2 + 0.0270*X3 + 198.79*X4 - 41.88*X5
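The fitted model can also be used directly for prediction. A minimal sketch with hypothetical spending values (the numbers are invented for illustration) for a startup located in Florida (State_Florida = 1, State_New_York = 0):
# Hypothetical startup: R&D = 160000, Administration = 130000,
# Marketing = 300000, located in Florida
new_startup = pd.DataFrame([[160000, 130000, 300000, 1, 0]], columns=X.columns)
print('Predicted profit: ', regressor.predict(new_startup))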
Although we have created a multiple linear regression model and extracted its coefficients, the variables may contribute very differently to the quality of the predictions, and perhaps some of them should be dismissed. In this sense, R-squared, Adjusted R-squared and, more importantly, the P-values are a great tool to investigate this aspect [5].
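One common way to obtain these statistics in Python is the statsmodels library (an assumption here; any tool that reports the OLS summary works). A minimal sketch on the X and y defined above:
import statsmodels.api as sm
# Add the intercept column and fit an ordinary least squares model to obtain
# R-squared, Adjusted R-squared and the P-value of each coefficient
X_sm = sm.add_constant(X.astype(float))  # astype(float) converts boolean dummy columns
ols_model = sm.OLS(y, X_sm).fit()
print(ols_model.summary())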
As indicated by these statistics, the variables 'Administration', 'Marketing Spend', 'State_Florida', and 'State_New_York' should perhaps not be kept in the model, since their P-values are higher than the α associated with a 95% confidence level (α = 0.05).
The next code produces a graphic that helps to understand how well the predictions of the obtained multiple linear regression model match the observed data.
# PLOTTING THE DATA AND THE MODEL PREDICTIONS
y_hat = regressor.predict(X)     # in-sample predictions
xp = list(range(1, len(X) + 1))  # observation index
plt.plot(xp, y, 'ob', label='Data')
plt.plot(xp, y, '-r')
plt.plot(xp, y_hat, '--g', label='Prediction')
plt.xlabel('Observation')
plt.ylabel('Profit')
plt.legend()
plt.grid()
plt.show()
# MODEL EVALUATION (the metrics were already imported from sklearn.metrics)
meanAbErr = mean_absolute_error(y, y_hat)
meanSqErr = mean_squared_error(y, y_hat)
rootMeanSqErr = np.sqrt(meanSqErr)
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
Mean Absolute Error: 6475.500708731112
Mean Square Error: 78406792.88803764
Root Mean Square Error: 8854.761029414494
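The r2_score function imported at the beginning can complement these error metrics with the coefficient of determination:
# Proportion of the variance of the profit explained by the model
print('R squared:', r2_score(y, y_hat))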
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1yq8cwY4sdc1kpjWQMLnzCB3IRRd9XScp?usp=sharing
[1] https://www.machinelearningnuggets.com/python-linear-regression/
[3] https://www.sfu.ca/~mjbrydon/tutorials/BAinPy/10_multiple_regression.html
[4] https://saturncloud.io/blog/linear-regression-with-dummycategorical-variables/
[5] https://timeseriesreasoning.com/contents/dummy-variables-in-a-regression-model/