1.4. Dummy and categorical variables

What is a categorical variable?

In many real-world scenarios, the independent variables are categorical (qualitative) rather than numeric. For example, in a study analyzing the impact of smoking on lung cancer, the variable “smoking status” could have the categories “never smoked,” “former smoker,” and “current smoker” [1].

Categorical variables can be further subdivided into the following types:

Binary (dichotomous) variables can take only two values, such as Win/Lose or On/Off.
Nominal variables represent groups with no particular ranking, such as colors or brands.
Ordinal variables represent groups with a specified ranking order, such as finishing positions in a race or app ratings [2].
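The difference between nominal and ordinal variables can be made concrete in pandas. The following is a minimal sketch, assuming a hypothetical list of app ratings and colors that is not part of the original data; it uses pd.Categorical with ordered=True to attach a ranking to the ordinal variable.

```python
import pandas as pd

# Hypothetical app ratings (ordinal: the categories have a ranking)
ratings = ["Good", "Poor", "Excellent", "Good"]
rating_order = ["Poor", "Good", "Excellent"]  # assumed ranking, lowest to highest
ordinal = pd.Categorical(ratings, categories=rating_order, ordered=True)

# Integer codes follow the stated ranking: Poor=0, Good=1, Excellent=2
print(ordinal.codes)

# Hypothetical colors (nominal: no ranking is attached)
colors = pd.Categorical(["red", "blue", "red"])
print(colors.ordered)
```

Note that the integer codes only record the order of the categories; they say nothing about how far apart the categories are.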

What are dummy variables, and how are they connected to categorical variables?

Dummy variables are a way of representing categorical variables in a linear regression model. A dummy variable takes on the value of 1 or 0 to indicate the presence or absence of a particular category. For example, if we have a categorical variable with two categories, “A” and “B,” we can create a dummy variable “X” that takes on the value 1 if the category is “A” and 0 if the category is “B” [1].

In the case of more than two categories, we create multiple dummy variables. For example, if we have a categorical variable with three categories, “A,” “B,” and “C,” we can create two dummy variables “X” and “Y.” “X” will take on the value 1 if the category is “A” and 0 if the category is “B” or “C,” while “Y” will take on the value 1 if the category is “B” and 0 if the category is “A” or “C.” Category “C” needs no dummy variable of its own: it is identified by X = 0 and Y = 0.
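The three-category case can be sketched by hand before turning to get_dummies(). The snippet below is only an illustration under assumed data: the five observations are hypothetical, and the dummies are built with simple comparisons.

```python
import pandas as pd

# Hypothetical observations of a variable with categories A, B and C
category = pd.Series(["A", "B", "C", "A", "C"], name="category")

# X is 1 for category A, Y is 1 for category B; C is the case X = 0 and Y = 0
dummies = pd.DataFrame({
    "category": category,
    "X": (category == "A").astype(int),
    "Y": (category == "B").astype(int),
})
print(dummies)
```

Every row with category “C” has both dummies equal to 0, so two columns are enough to distinguish the three categories.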

Once we have created the dummy variables, we can include them in the linear regression model along with the continuous variables.
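To illustrate how dummy variables enter a regression, here is a minimal sketch that fits an ordinary least squares model with NumPy. The data are hypothetical, the model contains only an intercept and the two dummies “X” and “Y” (no continuous variable, to keep the arithmetic easy to follow), and np.linalg.lstsq is used as one possible solver rather than the method of any particular reference.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two observations per category
data = pd.DataFrame({
    "category": ["A", "A", "B", "B", "C", "C"],
    "y": [10.0, 12.0, 20.0, 22.0, 30.0, 32.0],
})

# Dummy variables with C as the reference category
X = (data["category"] == "A").astype(float)
Y = (data["category"] == "B").astype(float)

# Design matrix: intercept column, then the two dummies
design = np.column_stack([np.ones(len(data)), X, Y])

# Least squares fit of y = b0 + b1*X + b2*Y
coef, *_ = np.linalg.lstsq(design, data["y"].to_numpy(), rcond=None)
b0, b1, b2 = coef
print("intercept (mean of C):", round(b0, 6))
print("effect of A relative to C:", round(b1, 6))
print("effect of B relative to C:", round(b2, 6))
```

In this setup the intercept equals the mean of the reference category C (31), and each dummy coefficient is the difference between that category's mean and the mean of C (about -20 for A and -10 for B).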

Numerical example in Python

The code below shows how three dummy variables are created for the three categories of a Temperature attribute. Dummy variables can be created with the get_dummies() function of the pandas package [2, 3].

Let's start with one categorical variable with three categories.

# import required modules
import pandas as pd
import numpy as np

# create dataset
df = pd.DataFrame({'Temperature': ['Hot', 'Cold', 'Warm', 'Cold']})

# display dataset
print(df)

# create dummy variables
# (dtype=int gives 0/1 columns; recent pandas versions return True/False by default)
df1 = pd.get_dummies(df, dtype=int)
print(df1)

  Temperature
0         Hot
1        Cold
2        Warm
3        Cold
   Temperature_Cold  Temperature_Hot  Temperature_Warm
0                 0                1                 0
1                 1                0                 0
2                 0                0                 1
3                 1                0                 0

The next code builds a dataset with four categorical variables, each with its own set of categories.

# importing the libraries
import pandas as pd

# creating the dictionary
dictionary = {
    'OUTLOOK': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny',
                'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Sunny',
                'Rainy', 'Overcast', 'Overcast', 'Sunny'],
    'TEMPERATURE': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool',
                    'Cool', 'Cool', 'Mild', 'Cool', 'Mild',
                    'Mild', 'Mild', 'Hot', 'Mild'],
    'HUMIDITY': ['High', 'High', 'High', 'High', 'Normal',
                 'Normal', 'Normal', 'High', 'Normal', 'Normal',
                 'Normal', 'High', 'Normal', 'High'],
    'WINDY': ['No', 'Yes', 'No', 'No', 'No',
              'Yes', 'Yes', 'No', 'No', 'No',
              'Yes', 'Yes', 'No', 'Yes'],
}

# converting the dictionary to DataFrame
df2 = pd.DataFrame(dictionary)
print(df2)

The next code converts only two of the categorical columns, 'WINDY' and 'OUTLOOK', to dummy variables.

# creating a copy of the original data frame
df3 = df2.copy()

# calling get_dummies:
# the first argument is the data frame to convert;
# the columns parameter lists the columns to convert
# (if it is omitted, all categorical columns are converted)
df3 = pd.get_dummies(df3, columns=['WINDY', 'OUTLOOK'])
print(df3)

What is the Dummy Variable Trap?

When creating dummy variables, a problem that can arise is known as the dummy variable trap [4].

This occurs when we create k dummy variables instead of k-1 for a categorical variable with k categories. The k dummies then always sum to 1, which duplicates the intercept column of the regression, so the predictors suffer from perfect multicollinearity: any one dummy can be computed exactly from the others. As a result, the regression coefficients are not uniquely determined, and their estimates and p-values become unreliable.

For example, suppose we converted WINDY into the two dummy variables WINDY_Yes and WINDY_No.

In this case, WINDY_Yes = 1 - WINDY_No, so the two variables are perfectly correlated, with a correlation coefficient of -1. Thus, when we perform multiple linear regression with both of them, the regression coefficients cannot be estimated reliably. The same problem arises for the OUTLOOK categorical variable if all three of its dummy variables are kept.
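The perfect correlation can be checked numerically. The sketch below reuses the WINDY values from the dataset above; get_dummies names the resulting columns WINDY_No and WINDY_Yes. The rank check with np.linalg.matrix_rank is an added illustration, not part of the cited sources.

```python
import numpy as np
import pandas as pd

# WINDY column from the dataset above
windy = pd.Series(['No', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes',
                   'No', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes'], name='WINDY')

# Keep both dummies (k = 2 dummies for k = 2 categories)
d = pd.get_dummies(windy, prefix='WINDY', dtype=int)

# Correlation between the two dummy columns
corr = d['WINDY_Yes'].corr(d['WINDY_No'])
print("correlation:", round(corr, 6))

# Design matrix with an intercept and both dummies
design = np.column_stack([np.ones(len(d)), d['WINDY_No'], d['WINDY_Yes']])
rank = np.linalg.matrix_rank(design)
print("columns:", design.shape[1], "rank:", rank)
```

The design matrix has three columns but rank 2, because the two dummies add up to the intercept column. That rank deficiency is what prevents a unique solution for the coefficients.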

How to avoid the Dummy Variable Trap?

You only need to remember one rule to avoid the dummy variable trap [5]:

If a categorical variable can take on k different values, then you should only create k-1 dummy variables to use in the regression model.

For example, suppose you’d like to convert a categorical variable OUTLOOK into dummy variables. Suppose this variable takes on the following values:

Overcast
Rainy
Sunny

Since this variable can take on 3 different values, we will only create 2 dummy variables. For example, our dummy variables might be:

X1 = 1 if Rainy; 0 otherwise
X2 = 1 if Overcast; 0 otherwise

Since the number of dummy variables is one less than the number of values that OUTLOOK can take on, we can avoid the dummy variable trap and the problem of multicollinearity.
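The X1/X2 coding above, with Sunny as the category left out, can be sketched by creating all three dummies and dropping the Sunny column explicitly. The five observations are hypothetical, and the renaming to X1 and X2 is only there to match the notation in the text.

```python
import pandas as pd

# Hypothetical OUTLOOK observations
outlook = pd.Series(['Rainy', 'Overcast', 'Sunny', 'Sunny', 'Rainy'], name='OUTLOOK')

# All k = 3 dummies, one column per category
all_dummies = pd.get_dummies(outlook, dtype=int)

# Drop the reference category (Sunny) to keep k - 1 = 2 dummies
reduced = all_dummies.drop(columns='Sunny')
reduced = reduced.rename(columns={'Rainy': 'X1', 'Overcast': 'X2'})[['X1', 'X2']]
print(reduced)
```

Dropping a column by name makes the choice of reference category explicit, which matters when interpreting coefficients: each one is measured relative to the omitted category, here Sunny.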

The dummy variable trap can be easily avoided in pandas by calling get_dummies with the parameter drop_first=True, as done in the next code [6].