1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.4. Correlation is not causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Startup Profits
We will load the 50 Startups dataset from Kaggle. The dataset is a CSV file with data on 50 business startups from New York, California, and Florida (roughly 17 per state). The variables in the dataset are Profit, R&D Spend, Administration, Marketing Spend, and the categorical variable State. Our main goal is to predict the profit. The file with the data set can be found at the following internet address [1, 2]:
The next code helps to read the data set from an internet address.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_absolute_error,mean_squared_error, r2_score
url = 'https://gist.githubusercontent.com/GaneshSparkz/b5662effbdae8746f7f7d8ed70c42b2d/raw/faf8b1a0d58e251f48a647d3881e7a960c3f0925/50_Startups.csv'
startups_df = pd.read_csv(url)
startups_df.head()
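Before going further, it is worth checking that the file was read correctly. A minimal sanity check (the expected shape and column names are an assumption based on the public 50_Startups.csv file):
# Quick sanity check of the loaded data
print(startups_df.shape)             # expected: (50, 5)
print(startups_df.columns.tolist())  # R&D Spend, Administration, Marketing Spend, State, Profit
print(startups_df.isnull().sum())    # count missing values per column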
The next code helps to visualize the correlation among the quantitative variables using a heatmap and a scatter-plot matrix (pairplot) [3].
# Correlation heatmap of the quantitative variables
# (numeric_only=True excludes the categorical column State)
sns.heatmap(startups_df.corr(numeric_only=True), annot=True)
plt.show()
# Scatter-plot matrix (pairplot) of all the variables
sns.pairplot(startups_df)
plt.show()
From the last row of graphics it is possible to extract the following conclusions:
A linear relationship exists between the profit and the R&D Spend. Specifically, when the R&D Spend goes up, the profit also goes up.
A more diffuse linear relationship also exists between the profit and the Marketing Spend: when the Marketing Spend goes up, the profit goes up (again a linear relationship with a positive slope).
The values of profit appear to follow an approximately normal distribution.
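These visual impressions can be double-checked numerically. A minimal sketch that ranks the Pearson correlation of each quantitative variable with the profit:
# Correlation of each quantitative variable with Profit
print(startups_df.corr(numeric_only=True)['Profit'].sort_values(ascending=False))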
The next code transforms the categorical variable State into dummy variables. Since the original variable can assume three values and it is necessary to avoid the dummy variable trap, only two new columns are created instead of three [4].
# SPLITTING THE DATA INTO INDEPENDENT (X) AND DEPENDENT (y) VARIABLES
X = startups_df.iloc[:, :-1]  # independent variables
y = startups_df.iloc[:, -1]   # dependent variable (Profit)
X = pd.get_dummies(X, columns=['State'], drop_first=True)
X.head()
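Note that OneHotEncoder, imported at the beginning, could produce an equivalent encoding. A minimal sketch, assuming scikit-learn ≥ 1.2 (older versions use sparse=False instead of sparse_output=False):
# Equivalent encoding with scikit-learn, also dropping the first
# category to avoid the dummy variable trap
encoder = OneHotEncoder(drop='first', sparse_output=False)
state_dummies = encoder.fit_transform(startups_df[['State']])
print(encoder.get_feature_names_out(['State']))  # e.g. ['State_Florida' 'State_New York']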
The next code explains how to create a multiple linear regression with these new dummy variables, and obtain its corresponding coefficients.
# FITTING THE MODEL / TRAINING
regressor = LinearRegression()  # instantiate a LinearRegression object
regressor.fit(X, y)             # fit the model on the whole dataset
print('Coefficients: ', regressor.coef_)
print('Intercept: ', regressor.intercept_)
Coefficients: [ 8.06023114e-01 -2.70043196e-02 2.69798610e-02 1.98788793e+02 -4.18870191e+01]
Intercept: 50125.343831604216
This output includes the intercept and the coefficients. You can use this information to build the multiple linear regression equation as follows:
profit = (intercept) + (R&D_Spend coef)*X1 + (Administration coef)*X2 + (Marketing_Spend coef)*X3 + (State_Florida coef)*X4 + (State_New_York coef)*X5
And once you plug in the numbers:
profit = 50125.3438 + 0.8060*X1 - 0.0270*X2 + 0.0270*X3 + 198.79*X4 - 41.88*X5
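The fitted model can also be used directly for prediction. A minimal sketch with hypothetical spending values (the numbers are invented for illustration) for a startup located in Florida (State_Florida = 1, State_New_York = 0):
# Hypothetical startup: R&D = 160000, Administration = 130000,
# Marketing = 300000, located in Florida
new_startup = pd.DataFrame([[160000, 130000, 300000, 1, 0]], columns=X.columns)
print('Predicted profit: ', regressor.predict(new_startup))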
Although we have created a multiple linear regression model and extracted its coefficients, the variables may contribute very differently to the quality of the predictions, and perhaps some of them should be dismissed. In this sense, R-squared, Adjusted R-squared and, more importantly, the P-values are a great tool to investigate this aspect [5].
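One common way to obtain these statistics in Python is the statsmodels library (an assumption here; any tool that reports the OLS summary works). A minimal sketch on the X and y defined above:
import statsmodels.api as sm
# Add the intercept column and fit an ordinary least squares model to obtain
# R-squared, Adjusted R-squared and the P-value of each coefficient
X_sm = sm.add_constant(X.astype(float))  # astype(float) converts boolean dummy columns
ols_model = sm.OLS(y, X_sm).fit()
print(ols_model.summary())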
As indicated by these statistics, the variables 'Administration', 'Marketing Spend', 'State_Florida', and 'State_New_York' should perhaps not be kept in the model, since their P-values are higher than the α associated with a 95% confidence level (α = 0.05).
The next code produces a graphic that helps to understand how well the predictions of the obtained multiple linear regression model match the observed data.
# PLOTTING THE DATA AND THE MODEL PREDICTIONS
y_hat = regressor.predict(X)     # in-sample predictions
xp = list(range(1, len(X) + 1))  # observation index
plt.plot(xp, y, 'ob', label='Data')
plt.plot(xp, y, '-r')
plt.plot(xp, y_hat, '--g', label='Prediction')
plt.xlabel('Observation')
plt.ylabel('Profit')
plt.legend()
plt.grid()
plt.show()
# MODEL EVALUATION (the metrics were already imported from sklearn.metrics)
meanAbErr = mean_absolute_error(y, y_hat)
meanSqErr = mean_squared_error(y, y_hat)
rootMeanSqErr = np.sqrt(meanSqErr)
print('Mean Absolute Error:', meanAbErr)
print('Mean Square Error:', meanSqErr)
print('Root Mean Square Error:', rootMeanSqErr)
Mean Absolute Error: 6475.500708731112
Mean Square Error: 78406792.88803764
Root Mean Square Error: 8854.761029414494
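The r2_score function imported at the beginning can complement these error metrics with the coefficient of determination:
# Proportion of the variance of the profit explained by the model
print('R squared:', r2_score(y, y_hat))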
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1yq8cwY4sdc1kpjWQMLnzCB3IRRd9XScp?usp=sharing
[1] https://www.machinelearningnuggets.com/python-linear-regression/
[3] https://www.sfu.ca/~mjbrydon/tutorials/BAinPy/10_multiple_regression.html
[4] https://saturncloud.io/blog/linear-regression-with-dummycategorical-variables/
[5] https://timeseriesreasoning.com/contents/dummy-variables-in-a-regression-model/