1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One-sample z-test and its relation with the two-sample z-test
1.5. One-sample t-test and its relation with the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using Chi-Square fit to check if Benford-Law holds or not
2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not
Remembering Pareto distribution function
The Pareto probability density function (PDF) is given by f(x) = b * xm^b / x^(b+1) for x ≥ xm (and 0 otherwise), where b is the shape parameter and xm is the minimum (scale) value.
The cumulative distribution function (CDF) is given by F(x) = 1 − (xm / x)^b for x ≥ xm.
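As a quick sanity check (a minimal sketch of ours, not part of the original derivation), these formulas can be verified numerically against scipy.stats.pareto, whose scale argument plays the role of xm when loc = 0:

import numpy as np
from scipy.stats import pareto

b, xm = 2.5, 1.0  # shape (tail index) and minimum value
x = np.linspace(1.0, 5.0, 5)

pdf_formula = b * xm**b / x**(b + 1)  # f(x) = b * xm^b / x^(b+1)
cdf_formula = 1 - (xm / x)**b         # F(x) = 1 - (xm / x)^b

# Both should match scipy's implementation with loc=0 and scale=xm
print(np.allclose(pdf_formula, pareto.pdf(x, b, loc=0, scale=xm)))  # True
print(np.allclose(cdf_formula, pareto.cdf(x, b, loc=0, scale=xm)))  # True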
To show how the value of the b parameter changes the shape of the Pareto curve, its effect on the PDF and CDF is explored in the next code, created based on [1].
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto
# Define a range of shape parameters for the Pareto distribution
loc = 0  # location shift (0 keeps the support starting at xm)
xm = 1 # scale
shape_parameters = [1.5, 2.0, 2.5, 3.0, 3.5]
# Generate x values for the plots
x = np.linspace(1, 5, 1000)
# Create a figure with two subplots side-by-side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Plot PDFs for different shape parameters
for b in shape_parameters:
    pdf = pareto.pdf(x, b=b, loc=loc, scale=xm)
    ax1.plot(x, pdf, label=f'b = {b}')
# Set titles and labels for the PDF plot
ax1.set_title('PDF of Pareto Distribution')
ax1.set_xlabel('x')
ax1.set_ylabel('Density')
ax1.legend()
ax1.grid(True)
# Plot CDFs for different shape parameters
for b in shape_parameters:
    cdf = pareto.cdf(x, b, loc=loc, scale=xm)
    ax2.plot(x, cdf, label=f'b = {b}')
# Set titles and labels for the CDF plot
ax2.set_title('CDF of Pareto Distribution')
ax2.set_xlabel('x')
ax2.set_ylabel('Cumulative Probability')
ax2.legend()
ax2.grid(True)
# Show the plots
plt.show()
To verify whether given data follows a Pareto distribution, the command shape, loc, scale = pareto.fit(sample, floc=0) fits a Pareto distribution to the data sample.
This fitting process estimates the parameters of the Pareto distribution that best describe the given data:
Related parameters: The Pareto distribution is defined by its shape parameter b (also called the tail index) and its minimum value xm. In scipy.stats.pareto, these correspond to three parameters:
shape (b): the shape parameter (tail index).
loc: the location parameter, which shifts the support; with loc=0 the distribution starts at the scale value.
scale: the scale parameter, which plays the role of the minimum value xm when loc=0.
The fitting process involves finding the maximum likelihood estimates (MLE) of these parameters.
The pareto.fit function uses numerical methods to solve the optimization problem and find the parameters that best fit the provided data sample.
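As a minimal sketch of this step (a hypothetical example of ours, using a synthetic sample rather than real data), the snippet below draws values from a Pareto distribution with known parameters and recovers them with pareto.fit, fixing the location at zero via floc=0:

import numpy as np
from scipy.stats import pareto

# Hypothetical example: sample from a Pareto distribution with known shape b = 3.0 and xm = 2.0
np.random.seed(0)
sample = pareto.rvs(3.0, loc=0, scale=2.0, size=5000)

# Maximum likelihood fit, keeping the location fixed at zero
shape, loc, scale = pareto.fit(sample, floc=0)

# With loc fixed at 0, the fitted scale is a direct estimate of the minimum value xm
print(f"shape={shape:.3f}, loc={loc}, scale={scale:.3f}")

The same call appears in Step 2 of the complete code further below.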
A Q-Q plot (Quantile-Quantile plot) can be used to determine whether a continuous random variable follows a Pareto distribution. A Q-Q plot is a scatter plot that compares two probability distributions by plotting their quantiles against each other. Specifically, when checking for normality, the quantiles of the sample data are compared against the quantiles of a standard normal distribution [2].
A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions. Q–Q plots can be used to compare collections of data, or theoretical distributions [3].
While tests like the chi-square and Kolmogorov-Smirnov tests can evaluate overall distribution differences, Q-Q plots provide a nuanced perspective by directly comparing quantiles. This enables analysts to discern specific differences, such as shifts in location or changes in scale, which may not be evident from formal statistical tests alone [4].
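Before using scipy's ready-made routine, it may help to see what a Q-Q plot does under the hood. The following sketch (our own illustration, with a synthetic Pareto sample and hand-picked plotting positions) plots the ordered sample values against the theoretical quantiles returned by pareto.ppf; points near the line y = x indicate good agreement:

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto

# Hypothetical sample from a Pareto distribution with shape b = 2.62
np.random.seed(1)
b = 2.62
sample = np.sort(pareto.rvs(b, size=200))
n = len(sample)

# Probability levels assigned to each ordered observation
probs = (np.arange(1, n + 1) - 0.5) / n

# Theoretical Pareto quantiles at the same probability levels
theoretical_quantiles = pareto.ppf(probs, b)

# If the sample follows the Pareto distribution, the points lie close to the line y = x
plt.scatter(theoretical_quantiles, sample, s=10, label='Sample vs. theoretical quantiles')
plt.plot(theoretical_quantiles, theoretical_quantiles, 'r--', label='y = x')
plt.xlabel('Theoretical Pareto quantiles')
plt.ylabel('Ordered sample values')
plt.title('Hand-made Q-Q plot against a Pareto distribution')
plt.legend()
plt.show()

The next code block combines the histogram, the distribution fit, the KS test, and a probplot-based Q-Q plot into a single workflow.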
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto, kstest
# Suppose 'data' is the list of values you want to test
# data = [your data here]
# For demonstration purposes, let's generate data from the Pareto distribution
b = 2.62 # Shape parameter
data = pareto.rvs(b, size=100)
# Step 1: Initial Visualization
plt.hist(data, bins=30, density=True, alpha=0.6, color='g', label='Empirical Data')
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = pareto.pdf(x, b)
plt.plot(x, p, 'k', linewidth=2, label='Theoretical Pareto Distribution')
plt.title('Histogram and PDF of the Pareto Distribution')
plt.legend()
plt.show()
# Step 2: Fitting the Pareto Distribution
shape, loc, scale = pareto.fit(data, floc=0)
print(f"Fitted parameters: shape={shape}, loc={loc}, scale={scale}")
# Step 3: Kolmogorov-Smirnov (KS) Test
ks_statistic, p_value = kstest(data, 'pareto', args=(shape, loc, scale))
print(f"KS Statistic: {ks_statistic}")
print(f"P-Value: {p_value}")
# Checking the test result
alpha = 0.05
if p_value > alpha:
    print("Fail to reject the null hypothesis: the data follow a Pareto distribution.")
else:
    print("Reject the null hypothesis: the data do not follow a Pareto distribution.")
# Step 4: Q-Q Plot
import scipy.stats as stats
fig, ax = plt.subplots()
res = stats.probplot(data, dist="pareto", sparams=(shape, loc, scale), plot=ax)
plt.title('Q-Q Plot for Pareto Distribution')
plt.show()
Fitted parameters: shape=2.677539872694408, loc=0, scale=1.0085619401300536
KS Statistic: 0.07374600773618062
P-Value: 0.6214275685193635
Fail to reject the null hypothesis: the data follow a Pareto distribution.
An additional graphic is useful to compare the theoretical Pareto cumulative distribution with the empirical cumulative distribution of the given data, as shown in the next code.
# Data
sample = data
sample_size = len(sample)
# Compute the empirical distribution function (EDF) of the sample
sorted_sample = np.sort(sample)
y_vals = np.arange(1, sample_size + 1) / sample_size
# Compute the cumulative distribution function (CDF) of the fitted Pareto distribution
x_vals = np.linspace(min(sample), max(sample), 100)
cdf_vals = pareto.cdf(x_vals, shape, loc, scale)
# Plotting
plt.figure(figsize=(10, 6))
plt.step(sorted_sample, y_vals, where='post', label='Empirical CDF')
plt.plot(x_vals, cdf_vals, label='Fitted Pareto CDF', color='red')
# Highlight the KS statistic on the plot
# Find the point of maximum difference between the empirical and fitted CDFs
d_max_index = np.argmax(np.abs(y_vals - pareto.cdf(sorted_sample, shape, loc, scale)))
d_max = np.abs(y_vals[d_max_index] - pareto.cdf(sorted_sample[d_max_index], shape, loc, scale))
plt.plot([sorted_sample[d_max_index], sorted_sample[d_max_index]],
         [pareto.cdf(sorted_sample[d_max_index], shape, loc, scale), y_vals[d_max_index]],
         'ko', label=f'KS Statistic = {ks_statistic:.3f}', linewidth=3)
# Adding labels and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Kolmogorov-Smirnov Test for Pareto Distribution Adherence')
plt.legend()
plt.grid()
# Show plot
plt.show()
The next code reorganizes the previous steps into several functions for visualizing, fitting, and hypothesis testing of data that may follow a Pareto distribution. Below is a summary of each function's functionality; an example of how they work together comes after the definitions.
Functions and Their Functionalities:
create_pareto_data:
Generates simulated data following a Pareto distribution.
Parameters: b (shape), loc (location), xm (scale).
Returns a sample of data from the Pareto distribution.
visualize_empirical_distribution:
Creates an initial visualization of empirical data as a histogram.
Parameter: data (data to be visualized).
Returns the plt object for further modifications.
visualize_theoretical_distribution:
Adds the theoretical Pareto distribution curve to the empirical histogram.
Parameters: plt (plot object), args (parameters of the distribution: shape, loc, scale).
fit_pareto_distribution:
Fits the data to the Pareto distribution.
Parameter: data (data to be fitted).
Returns the fitted parameters (shape, loc, scale).
test_pareto_distribution:
Performs the Kolmogorov-Smirnov (KS) test to check if the data follow a Pareto distribution.
Parameters: data (data to be tested), args (fitted parameters of the distribution).
plot_qq_plot:
Creates a Q-Q plot to compare the data distribution with the theoretical Pareto distribution.
Parameters: data (data to be compared), args (fitted parameters of the distribution).
plot_cumulative_distribution:
Compares empirical and theoretical cumulative distribution functions (CDFs).
Parameters: data (data to be compared), args (fitted parameters of the distribution).
# Defining several function for visualization, fitting, and hypothesis testing
# for Pareto distribution function
import scipy.stats as stats
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import pareto, kstest
# Suppose 'data' is the list of values you want to test
# data = [your data here]
def create_pareto_data(b=2.62, loc=0, xm=1):
    # For demonstration purposes, generate a reproducible sample from the Pareto distribution
    np.random.seed(42)
    data = pareto.rvs(b, loc=loc, scale=xm, size=10000)
    return data
def visualize_empirical_distribution(data):
    # Step 1: Initial visualization of the empirical data as a histogram
    plt.hist(data, bins=30, density=True, alpha=0.6, color='g', label='Empirical Data')
    return plt
def visualize_theoretical_distribution(plt, args=(2.62, 0, 1)):
    # Overlay the theoretical Pareto PDF on the empirical histogram
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    (b, loc, xm) = args
    p = pareto.pdf(x, b, loc=loc, scale=xm)
    plt.plot(x, p, 'k', linewidth=2, label='Theoretical Pareto Distribution')
    plt.title('Histogram and PDF of the Pareto Distribution')
    plt.legend()
    plt.show()
def fit_pareto_distribution(data):
    # Step 2: Fit the Pareto distribution by maximum likelihood, with loc fixed at 0
    shape, loc, scale = pareto.fit(data, floc=0)
    print(f"Fitted parameters: shape={shape}, loc={loc}, scale={scale}")
    return shape, loc, scale
def test_pareto_distribution(data, args):
    # Step 3: Kolmogorov-Smirnov (KS) test against the fitted Pareto distribution
    ks_statistic, p_value = kstest(data, 'pareto', args=args)
    print(f"KS Statistic: {ks_statistic}")
    print(f"P-Value: {p_value}")
    # Checking the test result
    alpha = 0.05
    if p_value > alpha:
        print("Fail to reject the null hypothesis: the data follow a Pareto distribution.")
    else:
        print("Reject the null hypothesis: the data do not follow a Pareto distribution.")
# Step 4: Q-Q Plot
def plot_qq_plot(data, args):
    # Compare sample quantiles with theoretical Pareto quantiles
    fig, ax = plt.subplots()
    res = stats.probplot(data, dist="pareto", sparams=args, plot=ax)
    plt.title('Q-Q Plot for Pareto Distribution')
    plt.show()
# Step 5: Plot the difference between the theoretical and empirical cumulative distributions
def plot_cumulative_distribution(data, args):
    sample = data
    sample_size = len(sample)
    # Empirical distribution function (EDF) of the sample
    sorted_sample = np.sort(sample)
    y_vals = np.arange(1, sample_size + 1) / sample_size
    # Cumulative distribution function (CDF) of the fitted Pareto distribution
    x_vals = np.linspace(min(sample), max(sample), 100)
    (shape, loc, scale) = args
    cdf_vals = pareto.cdf(x_vals, shape, loc, scale)
    # Plotting
    plt.figure(figsize=(10, 6))
    plt.step(sorted_sample, y_vals, where='post', label='Empirical CDF', linestyle='--')
    plt.plot(x_vals, cdf_vals, label='Fitted Pareto CDF', color='red', linestyle='-')
    # Kolmogorov-Smirnov test against the fitted Pareto distribution
    d_statistic, p_value = stats.kstest(sorted_sample, 'pareto', args=(shape, loc, scale))
    # Highlight the KS statistic on the plot: find the point of maximum difference
    d_max_index = np.argmax(np.abs(y_vals - pareto.cdf(sorted_sample, shape, loc, scale)))
    d_max = np.abs(y_vals[d_max_index] - pareto.cdf(sorted_sample[d_max_index], shape, loc, scale))
    plt.plot([sorted_sample[d_max_index], sorted_sample[d_max_index]],
             [pareto.cdf(sorted_sample[d_max_index], shape, loc, scale), y_vals[d_max_index]],
             'ko', label=f'KS Statistic = {d_statistic:.3f}', linewidth=3)
    # Adding labels and legend
    plt.xlabel('Sample Values')
    plt.ylabel('Cumulative Probability')
    plt.title('Kolmogorov-Smirnov Test for Pareto Distribution Adherence')
    plt.legend()
    plt.grid()
    # Show plot
    plt.show()
Next, we test the impact of different values of the b parameter on the shape of the Pareto distribution, keeping xm equal to one.
# Define a range of shape parameters for the Pareto distribution
loc = 0  # location shift (0 keeps the support starting at xm)
xm = 1  # scale (minimum value xm)
shape_parameters = [1.5, 2.0, 2.62, 2.5, 3.0, 3.5]
for b in shape_parameters:
    loc = 0  # Should try and see results for -1
    data = create_pareto_data(b, loc, xm)
    plt = visualize_empirical_distribution(data)
    shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)
    args = (shape_fit, loc_fit, scale_fit)
    visualize_theoretical_distribution(plt, args)
    test_pareto_distribution(data, args)
    plot_qq_plot(data, args)
    plot_cumulative_distribution(data, args)
Fitted parameters: shape=1.5345467603649448, loc=0, scale=0.10000077565787818
KS Statistic: 0.004787779270193249
P-Value: 0.9751026585607157
Fail to reject the null hypothesis: the data follow a Pareto distribution.
Similar graphics and results are obtained when the parameter b takes the values 2.0, 2.5, 2.62, 3.0, and 3.5.
Next, we repeat the test for different values of the b parameter, this time with xm near zero (0.1; not exactly zero, to avoid numerical errors).
# Define a range of shape parameters for the Pareto distribution
loc = 0  # location shift (0 keeps the support starting at xm)
xm = 0.1  # scale (minimum value xm)
shape_parameters = [1.5, 2.0, 2.62, 2.5, 3.0, 3.5]
for b in shape_parameters:
    loc = 0  # Should try and see results for -1
    data = create_pareto_data(b, loc, xm)
    plt = visualize_empirical_distribution(data)
    shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)
    args = (shape_fit, loc_fit, scale_fit)
    visualize_theoretical_distribution(plt, args)
    test_pareto_distribution(data, args)
    plot_qq_plot(data, args)
    plot_cumulative_distribution(data, args)
Fitted parameters: shape=1.5345467603649448, loc=0, scale=0.10000077565787824
KS Statistic: 0.004787779270193249
P-Value: 0.9751026585607157
Fail to reject the null hypothesis: the data follow a Pareto distribution.
Similar graphics and results are obtained when the parameter b takes the values 2.0, 2.5, 2.62, 3.0, and 3.5.
Now, let's verify whether the data on imports from the UK overseas trade in goods statistics for January 2022 can be said to follow a Pareto distribution [5, 6]. First, let's create and use some code to read and clean the data.
import pandas as pd
def read_data_url(url):
    # Get all tables from all tabs
    dfs = pd.read_excel(url, sheet_name=None)
    # Pick one of the three available tables
    df1 = dfs['EU Imports']
    # Rename the columns using the names found in row index 1 of the sheet
    df1.columns = df1.iloc[1, :]
    # Remove the first two rows, which hold header information
    df1.drop(df1.index[[0, 1]], inplace=True)
    # Reset the index
    df1.reset_index(inplace=True, drop=True)
    # Remove the last line, which holds the totals
    df1.drop(df1.tail(1).index, inplace=True)
    # Total volume of imports in January 2022
    total = df1['January 2022'].sum()
    # New column with the percentage of each product relative to the total volume
    df1['Percentage'] = (df1['January 2022'].astype(float) / total) * 100
    # Order values in decreasing order of percentage
    df1_descendent_order = df1.sort_values(by="Percentage", ascending=False)
    # Reset the index again
    df1_descendent_order.reset_index(inplace=True, drop=True)
    return df1_descendent_order
url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1059771/OTS_IMP_2201.xlsx'
df = read_data_url(url)
df
And, finally, let's apply the previously defined functions and the Kolmogorov-Smirnov test to verify whether the data follows a Pareto distribution.
data = df['January 2022'].astype(float)
plt = visualize_empirical_distribution(data)
shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)
args = (shape_fit, loc_fit, scale_fit)
visualize_theoretical_distribution(plt, args)
test_pareto_distribution(data, args)
plot_qq_plot(data, args)
plot_cumulative_distribution(data, args)
Fitted parameters: shape=0.2219204302302021, loc=0, scale=642.0
KS Statistic: 0.30128091959669245
P-Value: 2.6301013461087913e-08
Reject the null hypothesis: the data do not follow a Pareto distribution.
This time, however, the data does not follow the fitted Pareto distribution.
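As a complementary check (our own sketch, not part of the original analysis; it assumes the dataframe df returned by read_data_url above, with its Percentage column sorted in descending order), the 80/20 reading of the Pareto principle can also be inspected directly by measuring how much of the total import value is concentrated in the top 20% of product groups:

import numpy as np

# Assumes df is the dataframe returned by read_data_url(url), sorted by 'Percentage' in descending order
percentages = df['Percentage'].astype(float).to_numpy()

# Cumulative share of the total import value, starting from the largest product group
cumulative_share = np.cumsum(percentages)

# Number of product groups that make up the top 20%
top_20_count = max(1, int(np.ceil(0.2 * len(percentages))))
share_of_top_20 = cumulative_share[top_20_count - 1]

print(f"The top 20% of product groups ({top_20_count} of {len(percentages)}) "
      f"account for {share_of_top_20:.1f}% of January 2022 imports.")

If this share is close to 80%, the informal 80/20 rule roughly holds even though the KS test rejects the stricter assumption that the values are drawn from a Pareto distribution.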
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1bbCQJ7EXIGh_XGfTmI9iarF-OPS-nnl3?usp=sharing
Important observations about the limitations of the Kolmogorov-Smirnov test:
https://github.com/whylabs/whylogs/blob/mainline/python/examples/benchmarks/KS_Profiling.ipynb
https://scialert.net/fulltext/fulltextpdf.php?pdf=ansinet/jas/2002/488-490.pdf
More about QQ-Plots
References
[1] https://towardsdatascience.com/generating-pareto-distribution-in-python-2c2f77f70dbf
[3] https://en.wikipedia.org/wiki/Q%E2%80%93Q_plot
[4] https://www.geeksforgeeks.org/quantile-quantile-plots/
[5] Data source: https://www.gov.uk/government/statistical-data-sets/uk-overseas-trade-in-goods-statistics-january-2022-import-and-export-data
[6] More UK overseas trade statistics: https://www.gov.uk/government/collections/uk-overseas-trade-statistics-and-regional-trade-statistics