2.3. Employing standard deviation

Using standard deviation and normal distribution to detect outliers

Fitting a curve to the data provides an alternative way to detect outliers.

For example, if the fitted curve is a normal distribution, we can use the fact that about 99.7% of values following this distribution lie within three standard deviations of the mean, that is, in the interval from the mean minus three standard deviations to the mean plus three standard deviations.

Data outside this range can be treated as extreme values and excluded from further analysis, since they represent rare cases that need special treatment.
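As a quick check of the three-standard-deviation rule, the short sketch below computes the share of a normal distribution that falls within one, two, and three standard deviations of the mean. The helper name `coverage_within` is hypothetical and introduced only for illustration; the calculation relies on `scipy.stats.norm.cdf`.

```python
from scipy.stats import norm


def coverage_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    # The result does not depend on the mean or standard deviation,
    # so the standard normal distribution is used.
    return norm.cdf(k) - norm.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {coverage_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```

Roughly 0.27% of normally distributed values fall outside the three-standard-deviation interval, which is why such values are treated as rare.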

The next figure helps to illustrate this concept.

Before implementing this, let's first outline the approach to handling bad data with the Z-score:

(1) The first step is to set the upper and lower limits. Every data point outside this range will be regarded as an outlier. The formulae for the two limits are:

Upper: Mean + 3 * standard deviation.

Lower: Mean - 3 * standard deviation.

Inference: Zmax is the upper limit and Zmin is the lower limit, so any value outside this range is a bad data point, or outlier.

(2) The second step is to find how many outliers the dataset contains, based on the upper and lower limits just set:

df[(df['data'] > Zmax) | (df['data'] < Zmin)]
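The two steps above can be put together in a minimal, self-contained sketch. The data here are synthetic and purely an assumption for illustration: 1000 normally distributed values plus two planted extremes. The names `df`, `'data'`, `Zmax`, and `Zmin` follow the text; `outliers` and `cleaned` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration):
# 1000 normal values plus two planted extreme values.
rng = np.random.default_rng(seed=0)
values = np.append(rng.normal(loc=50, scale=5, size=1000), [120.0, -30.0])
df = pd.DataFrame({'data': values})

# Step 1: upper and lower limits at mean +/- 3 standard deviations.
mean = df['data'].mean()
std = df['data'].std()  # pandas uses the sample standard deviation (ddof=1)
Zmax = mean + 3 * std
Zmin = mean - 3 * std

# Step 2: rows outside the limits are flagged as outliers.
outliers = df[(df['data'] > Zmax) | (df['data'] < Zmin)]

# Rows inside the limits are kept for further analysis.
cleaned = df[(df['data'] >= Zmin) & (df['data'] <= Zmax)]

print(f"limits: [{Zmin:.2f}, {Zmax:.2f}]")
print(f"outliers found: {len(outliers)} of {len(df)} rows")
```

Note that the extremes themselves inflate the mean and standard deviation used to set the limits, so with heavily contaminated data the limits may be wider than expected.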

Python code to identify classes of occurrences

The following Python code shows how to create the previous figure, which illustrates the use of a normal distribution to identify outliers.

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import norm

# Normal distribution parameters

mu = 0

sigma = 1

# First interval

l1 = -1

u1 = 1

# Second interval

l2 = -2

u2 = 2

# calculate the z-transform for the first interval

zl1 = ( l1 - mu ) / sigma

zu1 = ( u1 - mu ) / sigma

# calculate the z-transform for the second interval

zl2 = ( l2 - mu ) / sigma

zu2 = ( u2 - mu ) / sigma

x1 = np.arange(zl1, zu1, 0.001) # z-values inside the first interval

x2 = np.arange(zl2, zu2, 0.001) # z-values inside the second interval

x_all = np.arange(-10, 10, 0.001) # entire range of x, inside and outside both intervals

# mean = 0, stddev = 1, since the z-transform was applied

y1 = norm.pdf(x1, 0, 1)

y2 = norm.pdf(x2, 0, 1)

y_all = norm.pdf(x_all, 0, 1)

# build the plot

plt.style.use('fivethirtyeight') # set the style before creating the figure so that it applies

fig, ax = plt.subplots(figsize=(9,6))

ax.plot(x_all,y_all)

ax.text(mu, 0.4, 'Usual', fontsize=14)

ax.text(mu+1.4*l2, 0.1, 'Unusual', fontsize=14)

ax.text(mu+u2, 0.1, 'Unusual', fontsize=14)

ax.text(mu-3.5, 0.1, 'Rare', fontsize=14)

ax.text(mu+3, 0.1, 'Rare', fontsize=14)

ax.fill_between(x1,y1,0, alpha=0.7, color='g')

ax.fill_between(x2,y2,0, alpha=0.4, color='y')

ax.fill_between(x_all,y_all,0, alpha=0.1, color='r')

ax.set_xlim([-4,4])

ax.set_xlabel('# of Standard Deviations Outside the Mean')

ax.set_yticklabels([])

ax.set_title('Normal Gaussian Curve')

plt.show()

The complete code is available at the following link:

https://colab.research.google.com/drive/1UVBBMXrVdq7ki7o3wNgK372b3Y9UUVQB?usp=sharing
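The classes labelled in the figure can also be assigned to individual values. The sketch below is one possible way to do this and is not taken from the linked notebook: the function name `classify_by_z` is hypothetical, and the cut-offs of 2 and 3 standard deviations are assumptions read off the positions of the 'Usual', 'Unusual', and 'Rare' labels in the plot.

```python
import numpy as np


def classify_by_z(values, mu, sigma):
    """Label each value 'Usual', 'Unusual', or 'Rare' from its absolute z-score."""
    z = np.abs((np.asarray(values, dtype=float) - mu) / sigma)
    # Assumed cut-offs: up to 2 standard deviations is 'Usual',
    # between 2 and 3 is 'Unusual', beyond 3 is 'Rare'.
    return np.where(z <= 2, 'Usual', np.where(z <= 3, 'Unusual', 'Rare'))


labels = classify_by_z([0.5, -2.5, 3.7], mu=0, sigma=1)
print(list(labels))
```

Values labelled 'Rare' under these assumed cut-offs are the same ones flagged as outliers by the mean plus or minus three standard deviations rule.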

Reference:

https://www.analyticsvidhya.com/blog/2022/08/dealing-with-outliers-using-the-z-score-method/
