1. Concepts & Definitions
1.1. A Review of Parametric Statistics
1.2. Parametric Tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using the Chi-Square goodness-of-fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
Defining Descriptive and Inferential Statistics
For more detailed information, please see the content at Track 03 - Section 1.1.
Two main kinds of statistics will be studied in this course:
1. Descriptive statistics: Methods for organizing, displaying, and describing data using tables, graphs, and summary measures.
2. Inferential statistics: Methods that use sample results to help make decisions about, or predictions for, a population.
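As a quick illustration of the difference, here is a minimal sketch in Python (the ages below are invented for illustration): descriptive statistics summarize the sample itself, while inferential statistics use the sample to estimate a population quantity.
import numpy as np
import scipy.stats as stats

# Hypothetical sample of clients' ages (invented data)
ages = np.array([22, 25, 31, 28, 35, 40, 23, 29, 33, 27])

# Descriptive: summarize the data at hand
print(f"mean = {ages.mean():.1f}, std = {ages.std(ddof=1):.1f}")

# Inferential: 95% confidence interval for the population mean
ci = stats.t.interval(0.95, df=len(ages) - 1,
                      loc=ages.mean(), scale=stats.sem(ages))
print(f"95% CI for the population mean: ({ci[0]:.1f}, {ci[1]:.1f})")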
Both kinds of statistics can be employed in survey research.
Imagine a snack bar that would like to know whether its clients prefer candy or salty food and carries out a survey. One way to do this is described in detail in the next figure.
From Figure 1 it is possible to extract some important definitions:
• Population or target population: Consists of all elements – individuals, items, or objects – whose characteristics are being studied. The study population is also called the target population.
• Sample: A portion of the population selected for the study. A sample whose characteristics are as close as possible to those of the population (e.g., the same proportion of men and women, and not drawn only from one city) is called representative. If every element of the population has a chance of being selected, the sample is said to be random; if this chance is equal for all elements, it is a simple random sample. If the elements were listed in alphabetical order and then selected from the top, the sample would be non-random, as the sketch after this list illustrates.
• Element or member: Represents a specific subject or object about which the information is obtained. For example, a person, a company, a country, an item, or a state.
• Variable: It corresponds to a characteristic under study that assumes different values.
• Observation or measurement: Value of a variable for an element.
• Data set: Set of measurements on one or more variables.
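To make the difference between a simple random sample and a non-random sample concrete, here is a minimal sketch (the population of names is invented for illustration):
import random

random.seed(0)

# Hypothetical population of ten clients of the snack bar
population = ['Ana', 'Bruno', 'Carla', 'Diego', 'Elisa',
              'Fabio', 'Gina', 'Hugo', 'Iris', 'Jonas']

# Simple random sample: every element has the same chance of selection
simple_random_sample = random.sample(population, k=4)
print(simple_random_sample)

# Non-random sample: simply taking the first four names in alphabetical order
non_random_sample = sorted(population)[:4]
print(non_random_sample)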
The values of a variable can be of two natures: quantitative or qualitative.
A quantitative variable is one whose values can be measured numerically; it is either:
1. Discrete: 0, 1, 2, 3.
2. Continuous: 43.32 or [6.5; 7.8].
A qualitative variable describes data that fit into categories that may or may not be ordered. For example:
1. Quality: Very good, good, bad, very bad.
2. States: Florida, New Jersey, Washington.
As a general rule, if you can apply some kind of math (like addition) to a variable's values, it is a quantitative variable. Otherwise, it is qualitative.
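A minimal sketch of this rule, using a made-up data set in pandas: arithmetic works on the quantitative column, while the qualitative column only supports category-style operations.
import pandas as pd

# Made-up data set: one quantitative and one qualitative variable
df = pd.DataFrame({
    'weight_kg': [70.5, 82.1, 65.0],           # quantitative (continuous)
    'quality': ['good', 'very good', 'bad']    # qualitative (categorical)
})

print(df['weight_kg'].mean())        # addition/averaging works: quantitative
print(df['quality'].value_counts())  # only counting categories: qualitative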
The next figure summarizes the previous definitions for data extracted from an element or a member.
Types of Statistical Distributions
For more detailed information, please see the contents at Track 05 and Track 06:
Track 05
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track05
Track 06
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track06
The complete code in Python to build the discrete or continuous distributions is available at:
https://colab.research.google.com/drive/17angU-HCRwvjKVEuYK4z4qybhRH-NoaG?usp=sharing
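As a flavor of what that notebook covers, here is a minimal sketch (not the notebook's exact code) that builds one discrete and one continuous distribution with scipy.stats:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Discrete: binomial distribution (n trials, success probability p)
n, p = 10, 0.5
k = np.arange(0, n + 1)
plt.bar(k, stats.binom.pmf(k, n, p), label='Binomial PMF (n=10, p=0.5)')

# Continuous: standard normal distribution
x = np.linspace(-4, 4, 500)
plt.plot(x, stats.norm.pdf(x), color='red', label='Normal PDF (0, 1)')

plt.legend()
plt.show()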
Fitting statistical distributions
For more detailed information, please see the content at Track 06 - Section 2.2:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track06/how-to-fit-a-distribution
1. Load the notebook with the commands developed in Section 2.1 (click on the link):
https://colab.research.google.com/drive/1Xo-2dWDgL-gmDJH3QmB6b4YMlntgQqtu?usp=sharing
2. Remember the graph obtained in the previous section:
A very useful library, fitter, automatically searches over a range of candidate probability distributions and finds the one that best fits the data. First, let's install it:
!pip install fitter
The following will appear:
Collecting fitter
  Downloading fitter-1.5.2.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
...
Successfully built fitter
Installing collected packages: fitter
Successfully installed fitter-1.5.2
Now it is possible to call the fit method, just after listing the distributions that should be tested.
from fitter import Fitter, get_common_distributions, get_distributions

# 'weight' is the data loaded in the notebook from Section 2.1
f = Fitter(weight,
           distributions=['gamma',
                          'lognorm',
                          'beta',
                          'burr',
                          'norm'])
f.fit()
f.summary()
The following results will appear:
Fitting 5 distributions: 100%|██████████| 5/5 [00:01<00:00, 4.33it/s]
Although the normal distribution seems to be the best-fitting distribution, the data appear to have two peaks. This aspect will be better tackled in the next sections.
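To retrieve the best-fitting distribution and its estimated parameters programmatically, fitter also provides a get_best method; a minimal sketch, using its default sum-of-squared-errors criterion:
# Best distribution by sum of squared errors between the fitted
# PDF and the histogram of the data
best = f.get_best(method='sumsquare_error')
print(best)  # e.g., {'norm': {'loc': ..., 'scale': ...}}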
The complete code above is available at the following link:
https://colab.research.google.com/drive/1ZYjHH1edDAQWfTnUPqihirbDnPggLRMk?usp=sharing
Central Limit Theorem
For more detailed information, please see the content at Track 07 - Section 1.2:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track07/central-limit-theorem
The central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
The CLT is often used in conjunction with the law of large numbers, which states that, as the sample size grows, the sample mean and standard deviation come closer to the population mean and standard deviation. Together, these results are extremely useful for accurately predicting the characteristics of populations.
To illustrate the application of the CLT, suppose that many samples are obtained, each observation being randomly generated from any population distribution (normal, Poisson, binomial, or any other), and the arithmetic mean of the observed values is computed for each sample. If this procedure is performed many times, the central limit theorem says that the probability distribution of these averages will closely approximate a normal distribution. The next figure helps to understand the procedure for employing the CLT.
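A minimal sketch of this procedure in Python, assuming an exponential population purely for illustration:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Population: exponential distribution (clearly non-normal)
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples and compute the arithmetic mean of each one
n_samples, sample_size = 5_000, 50
means = [rng.choice(population, size=sample_size).mean()
         for _ in range(n_samples)]

# The histogram of the sample means closely approximates a normal distribution
plt.hist(means, bins=50, density=True)
plt.xlabel('Sample mean')
plt.ylabel('Density')
plt.title(f'Distribution of {n_samples} sample means (n={sample_size})')
plt.show()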
Normal vs. Student's t-distribution
For more detailed information, please see the content at Track 07 - Section 1.7.
Now that we’ve seen both the standard normal distribution and a t-distribution with a single degree of freedom, let’s plot them together to see how they compare.
# Library imports
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# Styling - optional (set before plotting so it takes effect)
sns.set_context('notebook')

# Normal distribution
x = np.linspace(-4, 4, 500)
y = stats.norm.pdf(x)

# T distribution with one degree of freedom
df = 1
y_t = stats.t.pdf(x, df)

# Plotting
plt.plot(x, y, color='blue', label='Normal Dist.')
plt.plot(x, y_t, color='green', label=f'T-Dist., df={df}')
plt.ylabel('Probability Density')
plt.xlabel('Standard Deviations')
plt.legend()
sns.despine()
plt.show()
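To see how the t-distribution approaches the normal distribution as the degrees of freedom grow, we can overlay several t-distribution curves on the normal curve: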
# Library imports
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# Styling - optional (set before plotting so it takes effect)
sns.set_context('notebook')

# Normal distribution
x = np.linspace(-5, 5, 500)
y = stats.norm.pdf(x)
plt.plot(x, y, color='blue', label='Normal Dist.')

# Plotting T-distribution curves for different degrees of freedom
degrees_of_freedom = [1, 2, 5, 30]  # example values; adjust as needed
for df in reversed(degrees_of_freedom):
    y_t = stats.t.pdf(x, df)  # default location and scale parameters (0 and 1)
    plt.plot(x, y_t, label=f"Degrees of Freedom = {df}")

plt.ylabel('Probability Density')
plt.xlabel('Standard Deviations')
plt.legend()
sns.despine()
plt.show()
The complete code above is available at the following link:
https://colab.research.google.com/drive/1oaJLYH-3HOWi5kRCqF4wszNVOQiaAAUg?usp=sharing