1. Concepts & Definitions
1.1. A Review of Parametric Statistics
1.2. Parametric Tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using the Chi-Square goodness-of-fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
Defining Descriptive and Inferential Statistics
For more detailed information, please see the content at Track 03 - Section 1.1.
Two main kinds of statistics will be studied in this course:
1. Descriptive statistics: Methods for organizing, displaying, and describing data using tables, graphs, and summary measures.
2. Inferential statistics: Methods that use sample results to help make decisions about, or predictions for, a population.
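As a quick illustration of the difference, here is a minimal sketch in Python (the ages below are invented for illustration): descriptive statistics summarize the sample itself, while inferential statistics use the sample to estimate a population quantity.
import numpy as np
import scipy.stats as stats

# Hypothetical sample of clients' ages (invented data)
ages = np.array([22, 25, 31, 28, 35, 40, 23, 29, 33, 27])

# Descriptive: summarize the data at hand
print(f"mean = {ages.mean():.1f}, std = {ages.std(ddof=1):.1f}")

# Inferential: 95% confidence interval for the population mean
ci = stats.t.interval(0.95, df=len(ages) - 1,
                      loc=ages.mean(), scale=stats.sem(ages))
print(f"95% CI for the population mean: ({ci[0]:.1f}, {ci[1]:.1f})")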
Both kinds of statistics can be employed in survey research.
Imagine a snack bar that would like to know whether its clients prefer candy or salty food and carries out a survey. One way to do this is described in detail in the next figure.
From Figure 1 it is possible to extract some important definitions:
• Population or target population: Consists of all elements – individuals, items, or objects – whose characteristics are being studied. The study population is also called the target population.
• Sample: A portion of the population selected for the study. A sample whose characteristics are as close as possible to those of the population (e.g., the same proportion of men and women, and not drawn only from one city) is called representative. If every element of the population has a chance of being selected, the sample is said to be random; if this chance is equal for all elements, it is a simple random sample. If the elements were listed in alphabetical order and then selected from the top, the sample would be non-random, as the sketch after this list illustrates.
• Element or member: Represents a specific subject or object about which the information is obtained. For example, a person, a company, a country, an item, or a state.
• Variable: It corresponds to a characteristic under study that assumes different values.
• Observation or measurement: Value of a variable for an element.
• Data set: Set of measurements on one or more variables.
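To make the difference between a simple random sample and a non-random sample concrete, here is a minimal sketch (the population of names is invented for illustration):
import random

random.seed(0)

# Hypothetical population of ten clients of the snack bar
population = ['Ana', 'Bruno', 'Carla', 'Diego', 'Elisa',
              'Fabio', 'Gina', 'Hugo', 'Iris', 'Jonas']

# Simple random sample: every element has the same chance of selection
simple_random_sample = random.sample(population, k=4)
print(simple_random_sample)

# Non-random sample: simply taking the first four names in alphabetical order
non_random_sample = sorted(population)[:4]
print(non_random_sample)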
The values of a variable can be of two natures: quantitative or qualitative.
A quantitative variable is one whose values can be measured numerically; it is either:
1. Discrete: 0, 1, 2, 3.
2. Continuous: 43.32 or [6.5; 7.8].
A qualitative variable describes data that fit into categories that may or may not be ordered. For example:
1. Quality: Very good, good, bad, very bad.
2. States: Florida, New Jersey, Washington.
As a general rule, if you can apply some kind of math (like addition) to a variable's values, it is a quantitative variable. Otherwise, it is qualitative.
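A minimal sketch of this rule, using a made-up data set in pandas: arithmetic works on the quantitative column, while the qualitative column only supports category-style operations.
import pandas as pd

# Made-up data set: one quantitative and one qualitative variable
df = pd.DataFrame({
    'weight_kg': [70.5, 82.1, 65.0],           # quantitative (continuous)
    'quality': ['good', 'very good', 'bad']    # qualitative (categorical)
})

print(df['weight_kg'].mean())        # addition/averaging works: quantitative
print(df['quality'].value_counts())  # only counting categories: qualitative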
The next figure summarizes the previous definitions for data extracted from an element or a member.
Types of Statistical Distributions
For more detailed information, please see the contents at Track 05 and Track 06:
Track 05
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track05
Track 06
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track06
The complete code in Python to build the discrete or continuous distributions is available at:
https://colab.research.google.com/drive/17angU-HCRwvjKVEuYK4z4qybhRH-NoaG?usp=sharing
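As a flavor of what that notebook covers, here is a minimal sketch (not the notebook's exact code) that builds one discrete and one continuous distribution with scipy.stats:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Discrete: binomial distribution (n trials, success probability p)
n, p = 10, 0.5
k = np.arange(0, n + 1)
plt.bar(k, stats.binom.pmf(k, n, p), label='Binomial PMF (n=10, p=0.5)')

# Continuous: standard normal distribution
x = np.linspace(-4, 4, 500)
plt.plot(x, stats.norm.pdf(x), color='red', label='Normal PDF (0, 1)')

plt.legend()
plt.show()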
Fitting statistical distributions
For more detailed information, please see the content at Track 06 - Section 2.2:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track06/how-to-fit-a-distribution
1. Load the notebook with the commands developed in Section 2.1 (click on the link):
https://colab.research.google.com/drive/1Xo-2dWDgL-gmDJH3QmB6b4YMlntgQqtu?usp=sharing
2. Remember the graph obtained in the previous section:
A very useful library, fitter, automatically searches over a range of candidate probability distributions and finds the one that best fits the data. First, let's install it:
!pip install fitter
The following will appear:
Collecting fitter
  Downloading fitter-1.5.2.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
...
Successfully built fitter
Installing collected packages: fitter
Successfully installed fitter-1.5.2
Now it is possible to call the fit method, just after listing the distributions that should be tested.
from fitter import Fitter, get_common_distributions, get_distributions

# 'weight' is the data loaded in the notebook from Section 2.1
f = Fitter(weight,
           distributions=['gamma',
                          'lognorm',
                          'beta',
                          'burr',
                          'norm'])
f.fit()
f.summary()
The following results will appear:
Fitting 5 distributions: 100%|██████████| 5/5 [00:01<00:00, 4.33it/s]
Although the normal distribution seems to be the best-fitting distribution, the data appear to have two peaks. This aspect will be better tackled in the next sections.
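To retrieve the best-fitting distribution and its estimated parameters programmatically, fitter also provides a get_best method; a minimal sketch, using its default sum-of-squared-errors criterion:
# Best distribution by sum of squared errors between the fitted
# PDF and the histogram of the data
best = f.get_best(method='sumsquare_error')
print(best)  # e.g., {'norm': {'loc': ..., 'scale': ...}}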
The complete code above is available at the following link:
https://colab.research.google.com/drive/1ZYjHH1edDAQWfTnUPqihirbDnPggLRMk?usp=sharing
Central Limit Theorem
For more detailed information, please see the content at Track 07 - Section 1.2:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track07/central-limit-theorem
The central limit theorem (CLT) establishes that, in many situations, when independent random variables are summed up, their properly normalized sum tends toward a normal distribution even if the original variables themselves are not normally distributed. The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.
The CLT is often used in conjunction with the law of large numbers, which states that, as the sample size grows, the sample mean and standard deviation come closer to the population mean and standard deviation. Together, these results are extremely useful for accurately predicting the characteristics of populations.
To illustrate the application of the CLT, suppose that many samples are obtained, each observation being randomly generated from any population distribution (normal, Poisson, binomial, or any other), and the arithmetic mean of the observed values is computed for each sample. If this procedure is performed many times, the central limit theorem says that the probability distribution of these averages will closely approximate a normal distribution. The next figure helps to understand the procedure for employing the CLT.
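A minimal sketch of this procedure in Python, assuming an exponential population purely for illustration:
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

# Population: exponential distribution (clearly non-normal)
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples and compute the arithmetic mean of each one
n_samples, sample_size = 5_000, 50
means = [rng.choice(population, size=sample_size).mean()
         for _ in range(n_samples)]

# The histogram of the sample means closely approximates a normal distribution
plt.hist(means, bins=50, density=True)
plt.xlabel('Sample mean')
plt.ylabel('Density')
plt.title(f'Distribution of {n_samples} sample means (n={sample_size})')
plt.show()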
Normal vs. Student's t-distribution
For more detailed information, please see the content at Track 07 - Section 1.7.
Now that we’ve seen both the standard normal distribution and a t-distribution with a single degree of freedom, let’s plot them together to see how they compare.
# Library imports
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# Styling - optional (set before plotting so it takes effect)
sns.set_context('notebook')

# Normal distribution
x = np.linspace(-4, 4, 500)
y = stats.norm.pdf(x)

# T distribution with one degree of freedom
df = 1
y_t = stats.t.pdf(x, df)

# Plotting
plt.plot(x, y, color='blue', label='Normal Dist.')
plt.plot(x, y_t, color='green', label=f'T-Dist., df={df}')
plt.ylabel('Probability Density')
plt.xlabel('Standard Deviations')
plt.legend()
sns.despine()
plt.show()
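To see how the t-distribution approaches the normal distribution as the degrees of freedom grow, we can overlay several t-distribution curves on the normal curve: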
# Library imports
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline

# Styling - optional (set before plotting so it takes effect)
sns.set_context('notebook')

# Normal distribution
x = np.linspace(-5, 5, 500)
y = stats.norm.pdf(x)
plt.plot(x, y, color='blue', label='Normal Dist.')

# Plotting T-distribution curves for different degrees of freedom
degrees_of_freedom = [1, 2, 5, 30]  # example values; adjust as needed
for df in reversed(degrees_of_freedom):
    y_t = stats.t.pdf(x, df)  # default location and scale parameters (0 and 1)
    plt.plot(x, y_t, label=f"Degrees of Freedom = {df}")

plt.ylabel('Probability Density')
plt.xlabel('Standard Deviations')
plt.legend()
sns.despine()
plt.show()
The complete code above is available at the following link:
https://colab.research.google.com/drive/1oaJLYH-3HOWi5kRCqF4wszNVOQiaAAUg?usp=sharing