1. Concepts & Definitions
1.1. Continuous random distribution of probability
1.2. Normal distribution of probability
1.3. Standard normal distribution of probability
1.4. Inverse standard normal distribution
1.6. Inverse Student's T distribution
2. Problem & Solution
2.1. Weight, dimension, and value per HS6
2.2. How to fit a distribution
2.3. Employing standard deviation
2.4. Total time spent in a system
Fitting a curve to the data provides an alternative way to detect outliers.
For example, if the curve fitted to the data is a normal distribution, then most of the data (about 99.7%) lie within three standard deviations of the mean, that is, in the interval from the mean minus three standard deviations to the mean plus three standard deviations.
Data outside this range can be considered extreme values and may be excluded from further analysis, since they represent rare cases that need special treatment.
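As a quick check of the three-standard-deviation rule, the following sketch computes the share of a normal distribution that falls within one, two, and three standard deviations of the mean, and then fits a normal curve to a sample with `scipy.stats.norm.fit`. The sample is synthetic (drawn with mean 50 and standard deviation 5), and the names `coverage`, `sample`, `lower`, and `upper` are assumptions made for illustration only; they are not part of the original analysis.

```python
import numpy as np
from scipy.stats import norm

def coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return norm.cdf(k) - norm.cdf(-k)

for k in (1, 2, 3):
    # roughly 0.6827, 0.9545 and 0.9973
    print(f"within {k} standard deviation(s): {coverage(k):.4f}")

# Synthetic sample, used here only to illustrate the fitting step (assumption)
rng = np.random.default_rng(0)
sample = rng.normal(loc=50.0, scale=5.0, size=10_000)

# Maximum-likelihood estimates of the mean and standard deviation
mu_hat, sigma_hat = norm.fit(sample)

# Three-standard-deviation limits derived from the fitted curve
lower = mu_hat - 3 * sigma_hat
upper = mu_hat + 3 * sigma_hat

# Fraction of the sample that falls inside the limits
inside = np.mean((sample >= lower) & (sample <= upper))
print(f"fitted mean={mu_hat:.2f}, std={sigma_hat:.2f}")
print(f"limits=[{lower:.2f}, {upper:.2f}], fraction inside={inside:.4f}")
```

If the data are not close to normal, the fraction inside the limits can differ noticeably from 99.7%, so it is worth checking the fit before relying on this rule.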
The next figure helps to illustrate this concept.
Before implementing this idea, let's first discuss the approach to dealing with bad data using the Z-score:
(1) The first step is to set the upper and lower limits. Every data point that falls outside the range they define is regarded as an outlier. The formulas for both limits are:
Upper: Mean + 3 * standard deviation.
Lower: Mean - 3 * standard deviation.
Inference: the upper limit is Zmax and the lower limit is Zmin. Hence any value outside this range is a bad data point, that is, an outlier.
(2) The second step is to detect how many outliers there are in the dataset, based on the upper and lower limits that we just set:
df[(df['data'] > Zmax) | (df['data'] < Zmin)]
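The two steps above can be sketched end to end as follows. The DataFrame and its values are hypothetical and chosen only for illustration; the column name `data` and the names `Zmax` and `Zmin` follow the text, while `outliers` and `cleaned` are assumed names. Note that pandas `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical data: twenty typical values and one extreme value (assumption)
df = pd.DataFrame({'data': [10, 11, 9, 10, 12, 8, 10, 11, 9, 10,
                            10, 12, 9, 11, 10, 8, 11, 10, 9, 10, 100]})

# Step 1: set the upper and lower limits
mean = df['data'].mean()
std = df['data'].std()        # sample standard deviation (ddof=1)
Zmax = mean + 3 * std         # upper limit
Zmin = mean - 3 * std         # lower limit

# Step 2: select the rows that fall outside the limits
outliers = df[(df['data'] > Zmax) | (df['data'] < Zmin)]
print("number of outliers:", len(outliers))

# Optionally keep only the rows inside the limits for further analysis
cleaned = df[(df['data'] >= Zmin) & (df['data'] <= Zmax)]
print("rows kept:", len(cleaned))
```

Because an extreme value also inflates the mean and the standard deviation used to build the limits, this check can miss outliers in very small samples; it is more dependable when the dataset is reasonably large.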
The following Python code shows how to create the previous figure, which illustrates how a normal distribution can be used to identify outliers.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm
# Normal distribution parameters
mu = 0
sigma = 1
# First interval
l1 = -1
u1 = 1
# Second interval
l2 = -2
u2 = 2
# calculate the z-transform for the first interval
zl1 = ( l1 - mu ) / sigma
zu1 = ( u1 - mu ) / sigma
# calculate the z-transform for the second interval
zl2 = ( l2 - mu ) / sigma
zu2 = ( u2 - mu ) / sigma
x1 = np.arange(zl1, zu1, 0.001) # z values within the first interval
x2 = np.arange(zl2, zu2, 0.001) # z values within the second interval
x_all = np.arange(-10, 10, 0.001) # entire range of z, inside and outside both intervals
# mean = 0, stddev = 1, since Z-transform was calculated
y1 = norm.pdf(x1,0,1)
y2 = norm.pdf(x2,0,1)
y_all = norm.pdf(x_all,0,1)
# build the plot
plt.style.use('fivethirtyeight') # set the style before creating the figure so that it takes effect
fig, ax = plt.subplots(figsize=(9,6))
ax.plot(x_all,y_all)
ax.text(mu, 0.4, 'Usual', fontsize=14)
ax.text(mu+1.4*l2, 0.1, 'Unusual', fontsize=14)
ax.text(mu+u2, 0.1, 'Unusual', fontsize=14)
ax.text(mu-3.5, 0.1, 'Rare', fontsize=14)
ax.text(mu+3, 0.1, 'Rare', fontsize=14)
ax.fill_between(x1,y1,0, alpha=0.7, color='g')
ax.fill_between(x2,y2,0, alpha=0.4, color='y')
ax.fill_between(x_all,y_all,0, alpha=0.1, color='r')
ax.set_xlim([-4,4])
ax.set_xlabel('# of Standard Deviations Outside the Mean')
ax.set_yticklabels([])
ax.set_title('Normal Gaussian Curve')
plt.show()
The complete code is available at the following link:
https://colab.research.google.com/drive/1UVBBMXrVdq7ki7o3wNgK372b3Y9UUVQB?usp=sharing