2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not

To show that the value of b parameter could change the shape of the Pareto curve, the effect of changing it on PDF and CDF is tested in the next code created based on [1].

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import pareto

# Define a range of shape parameters for the Pareto distribution

loc = 0 # deslocation from 1

xm = 1 # scale

shape_parameters = [1.5, 2.0, 2.5, 3.0, 3.5]

# Generate x values for the plots

x = np.linspace(1, 5, 1000)

# Create a figure with two subplots side-by-side

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Plot PDFs for different shape parameters

for b in shape_parameters:

pdf = pareto.pdf(x, b= b, loc = loc, scale = xm)

ax1.plot(x, pdf, label=f'b = {b}')

# Set titles and labels for the PDF plot

ax1.set_title('PDF of Pareto Distribution')

ax1.set_xlabel('x')

ax1.set_ylabel('Density')

ax1.legend()

ax1.grid(True)

# Plot CDFs for different shape parameters

for b in shape_parameters:

cdf = pareto.cdf(x, b)

ax2.plot(x, cdf, label=f'b = {b}')

# Set titles and labels for the CDF plot

ax2.set_title('CDF of Pareto Distribution')

ax2.set_xlabel('x')

ax2.set_ylabel('Cumulative Probability')

ax2.legend()

ax2.grid(True)

# Show the plots

plt.show()

To verify if certain data fits in a Pareto distribution function, the command shape, loc, scale = pareto.fit(sample, floc=0) fits a Pareto distribution to the given data sample.

This fitting process estimates the parameters of the Pareto distribution that best describe the given data:

Related Equations: The Pareto distribution is defined by its shape parameter b (also called the tail index), location parameter xm (minimum value or scale), and scale parameter s. In scipy.stats.pareto, these parameters are denoted as:
shape (b): The shape parameter.
loc (loc): The location parameter (often the minimum value xm).
scale (s): The scale parameter. The fitting process involves finding the maximum likelihood estimates (MLE) for these parameters.

The pareto.fit function uses numerical methods to solve the optimization problem and find the parameters that best fit the provided data sample.

Q-Q plot(Quantile-Quantile plot) could be used to determine whether the continuous random variables follow Pareto distribution. A Q-Q (Quantile-Quantile) plot is a scatter plot that compares two probability distributions by plotting their quantiles against each other. Specifically, when checking for normality, the quantiles of the sample data are compared against the quantiles of a standard normal distribution [2].

A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how properties such as location, scale, and skewness are similar or different in the two distributions. Q–Q plots can be used to compare collections of data, or theoretical distributions [3].

While tests like the chi-square and Kolmogorov-Smirnov tests can evaluate overall distribution differences, Q-Q plots provide a nuanced perspective by directly comparing quantiles. This enables analysts to discern specific differences, such as shifts in location or changes in scale, which may not be evident from formal statistical tests alone [4].

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import pareto, kstest

# Suppose 'data' is the list of values you want to test

# data = [your data here]

# For demonstration purposes, let's generate data from the Pareto distribution

b = 2.62 # Shape parameter

data = pareto.rvs(b, size=100)

# Step 1: Initial Visualization

plt.hist(data, bins=30, density=True, alpha=0.6, color='g', label='Empirical Data')

xmin, xmax = plt.xlim()

x = np.linspace(xmin, xmax, 100)

p = pareto.pdf(x, b)

plt.plot(x, p, 'k', linewidth=2, label='Theoretical Pareto Distribution')

plt.title('Histogram and PDF of the Pareto Distribution')

plt.legend()

plt.show()

# Step 2: Fitting the Pareto Distribution

shape, loc, scale = pareto.fit(data, floc=0)

print(f"Fitted parameters: shape={shape}, loc={loc}, scale={scale}")

# Step 3: Kolmogorov-Smirnov (KS) Test

ks_statistic, p_value = kstest(data, 'pareto', args=(shape, loc, scale))

print(f"KS Statistic: {ks_statistic}")

print(f"P-Value: {p_value}")

# Checking the test result

alpha = 0.05

if p_value > alpha:

print("Fail to reject the null hypothesis: the data follow a Pareto distribution.")

else:

print("Reject the null hypothesis: the data do not follow a Pareto distribution.")

# Step 4: Q-Q Plot

import scipy.stats as stats

fig, ax = plt.subplots()

res = stats.probplot(data, dist="pareto", sparams=(shape, loc, scale), plot=ax)

plt.title('Q-Q Plot for Pareto Distribution')

plt.show()

Fitted parameters: shape=2.677539872694408, loc=0, scale=1.0085619401300536

KS Statistic: 0.07374600773618062

P-Value: 0.6214275685193635

Fail to reject the null hypothesis: the data follow a Pareto distribution.

An additional graphic is useful to compare cumulative theoretical Pareto distribution and the distribution from the given data which is given in the next code.

# Data

sample = data

sample_size = len(sample)

# Plot the empirical distribution function (EDF) of the sample

sorted_sample = np.sort(sample)

y_vals = np.arange(1, sample_size + 1) / sample_size

# Plot the cumulative distribution function (CDF) of the reference normal distribution

x_vals = np.linspace(min(sample), max(sample), 100)

#cdf_vals = stats.norm.cdf(x_vals)

cdf_vals = pareto.cdf(x_vals, shape, loc, scale)

# Plotting

plt.figure(figsize=(10, 6))

plt.step(sorted_sample, y_vals, where='post', label='Empirical CDF')

plt.plot(x_vals, cdf_vals, label='Reference Normal CDF', color='red')

# Highlight the KS statistic on the plot

# Find the point of maximum difference

#d_max_index = np.argmax(np.abs(y_vals - stats.norm.cdf(sorted_sample)))

#d_max = np.abs(y_vals[d_max_index] - stats.norm.cdf(sorted_sample[d_max_index]))

d_max_index = np.argmax(np.abs(y_vals - pareto.cdf(sorted_sample, shape, loc, scale)))

d_max = np.abs(y_vals[d_max_index] - pareto.cdf(sorted_sample[d_max_index], shape, loc, scale))

plt.plot([sorted_sample[d_max_index], sorted_sample[d_max_index]],

[pareto.cdf(sorted_sample[d_max_index], shape, loc, scale), y_vals[d_max_index]],

'ko', label=f'KS Statistic = {d_statistic:.3f}', linewidth=3)

# Adding labels and legend

plt.xlabel('Sample Values')

plt.ylabel('Cumulative Probability')

plt.title('Kolmogorov-Smirnov Test for Pareto Distribution Adherence')

plt.legend()

plt.grid()

# Show plot

plt.show()

The code consists of several functions designed to visualize, fit, and perform hypothesis testing for data following a Pareto distribution. Below is a summary of each function's functionality, along with an example of how they work together.

Functions and Their Functionalities:

create_pareto_data:

Generates simulated data following a Pareto distribution.
Parameters: b (shape), loc (location), xm (scale).
Returns a sample of data from the Pareto distribution.

visualize_empirical_distribution:

Creates an initial visualization of empirical data as a histogram.
Parameter: data (data to be visualized).
Returns the plt object for further modifications.

visualize_theoretical_distribution:

Adds the theoretical Pareto distribution curve to the empirical histogram.
Parameters: plt (plot object), args (parameters of the distribution: shape, loc, scale).

fit_pareto_distribution:

Fits the data to the Pareto distribution.
Parameter: data (data to be fitted).
Returns the fitted parameters (shape, loc, scale).

test_pareto_distribution:

Performs the Kolmogorov-Smirnov (KS) test to check if the data follow a Pareto distribution.
Parameters: data (data to be tested), args (fitted parameters of the distribution).

plot_qq_plot:

Creates a Q-Q plot to compare the data distribution with the theoretical Pareto distribution.
Parameters: data (data to be compared), args (fitted parameters of the distribution).

7. plot_cumulative_distribution:

Compares empirical and theoretical cumulative distribution functions (CDFs).
Parameters: data (data to be compared), args (fitted parameters of the distribution).

# Defining several function for visualization, fitting, and hypothesis testing

# for Pareto distribution function

import scipy.stats as stats

import numpy as np

import matplotlib.pyplot as plt

from scipy.stats import pareto, kstest

# Suppose 'data' is the list of values you want to test

# data = [your data here]

def create_pareto_data(b = 2.62, loc = 0, xm = 1):

# For demonstration purposes, let's generate data from the Pareto distribution

#b = 2.62 # Shape parameter

#data = pareto.rvs(b, size=100)

np.random.seed(42)

data = pareto.rvs(b, loc = loc, scale = xm, size=10000)

return data

def visualize_empirical_distribution(data):

# Step 1: Initial Visualization

plt.hist(data, bins=30, density=True, alpha=0.6, color='g', label='Empirical Data')

return plt

def visualize_theoretical_distribution(plt, args = (2.62, 0, 1)):

xmin, xmax = plt.xlim()

x = np.linspace(xmin, xmax, 100)

(b, loc, xm) = args

p = pareto.pdf(x, b, loc=loc, scale = xm)

#p = pareto.pdf(x, args=args)

plt.plot(x, p, 'k', linewidth=2, label='Theoretical Pareto Distribution')

plt.title('Histogram and PDF of the Pareto Distribution')

plt.legend()

plt.show()

def fit_pareto_distribution(data):

# Step 2: Fitting the Pareto Distribution

shape, loc, scale = pareto.fit(data, floc=0)

#args = (shape, loc, scale)

print(f"Fitted parameters: shape={shape}, loc={loc}, scale={scale}")

return shape, loc, scale

def test_pareto_distribution(data, args):

# Step 3: Kolmogorov-Smirnov (KS) Test

# ks_statistic, p_value = kstest(data, 'pareto', args=(shape, loc, scale))

ks_statistic, p_value = kstest(data, 'pareto', args=args)

print(f"KS Statistic: {ks_statistic}")

print(f"P-Value: {p_value}")

# Checking the test result

alpha = 0.05

if p_value > alpha:

print("Fail to reject the null hypothesis: the data follow a Pareto distribution.")

else:

print("Reject the null hypothesis: the data do not follow a Pareto distribution.")

# Step 4: Q-Q Plot

def plot_qq_plot(data, args):

fig, ax = plt.subplots()

#res = stats.probplot(data, dist="pareto", sparams=(shape, loc, scale), plot=ax)

res = stats.probplot(data, dist="pareto", sparams=args, plot=ax)

plt.title('Q-Q Plot for Pareto Distribution')

plt.show()

# Step 5: plot the difference between theoretical and empirical cummulative distribution

def plot_cumulative_distribution(data, args):

# Data

sample = data

sample_size = len(sample)

# Plot the empirical distribution function (EDF) of the sample

sorted_sample = np.sort(sample)

y_vals = np.arange(1, sample_size + 1) / sample_size

# Plot the cumulative distribution function (CDF) of the reference normal distribution

x_vals = np.linspace(min(sample), max(sample), 100)

(shape, loc, scale) = args

cdf_vals = pareto.cdf(x_vals, shape, loc, scale)

# Plotting

plt.figure(figsize=(10, 6))

plt.step(sorted_sample, y_vals, where='post', label='Empirical CDF', linestyle='--')

plt.plot(x_vals, cdf_vals, label='Reference Normal CDF', color='red', linestyle='-')

# Extracting parameters from Kolmogorov-Smirnov test to Pareto distribution

d_statistic, p_value = stats.kstest(sorted_sample, 'pareto', args=(shape, loc, scale))

# Highlight the KS statistic on the plot

# Find the point of maximum difference

d_max_index = np.argmax(np.abs(y_vals - pareto.cdf(sorted_sample, shape, loc, scale)))

d_max = np.abs(y_vals[d_max_index] - pareto.cdf(sorted_sample[d_max_index], shape, loc, scale))

plt.plot([sorted_sample[d_max_index], sorted_sample[d_max_index]],

[pareto.cdf(sorted_sample[d_max_index], shape, loc, scale), y_vals[d_max_index]],

'ko', label=f'KS Statistic = {d_statistic:.3f}', linewidth=3)

# Adding labels and legend

plt.xlabel('Sample Values')

plt.ylabel('Cumulative Probability')

plt.title('Kolmogorov-Smirnov Test for Pareto Distribution Adherence')

plt.legend()

plt.grid()

# Show plot

plt.show()

To test the impact of different values of b parameter in Pareto distribution format, but always starting from xm as one.

# Define a range of shape parameters for the Pareto distribution

loc = 0 # deslocation from 1

xm = 1 %0.1 # scale

shape_parameters = [1.5, 2.0, 2.62, 2.5, 3.0, 3.5]

for b in shape_parameters:

loc = 0 # Should try and see results for -1

data = create_pareto_data(b, loc, xm)

plt = visualize_empirical_distribution(data)

shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)

args = (shape_fit, loc_fit, scale_fit)

visualize_theoretical_distribution(plt, args)

test_pareto_distribution(data, args)

plot_qq_plot(data, args)

plot_cumulative_distribution(data, args)

Fitted parameters: shape=1.5345467603649448, loc=0, scale=0.10000077565787818

KS Statistic: 0.004787779270193249

P-Value: 0.9751026585607157

Fail to reject the null hypothesis: the data follow a Pareto distribution.

Similar graphics and results are obtained for parameter b assuming values as 2.0, 2.62, 2.5, 3.0, 3.5.

To test the impact of different values of b parameter in Pareto distribution format, but always starting from xm as near zero (but not equal zero to avoid error).

# Define a range of shape parameters for the Pareto distribution

loc = 0 # deslocation from 1

xm = 0.1 # scale

shape_parameters = [1.5, 2.0, 2.62, 2.5, 3.0, 3.5]

for b in shape_parameters:

loc = 0 # Should try and see results for -1

data = create_pareto_data(b, loc, xm)

plt = visualize_empirical_distribution(data)

shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)

args = (shape_fit, loc_fit, scale_fit)

visualize_theoretical_distribution(plt, args)

test_pareto_distribution(data, args)

plot_qq_plot(data, args)

plot_cumulative_distribution(data, args)

Fitted parameters: shape=1.5345467603649448, loc=0, scale=0.10000077565787824

KS Statistic: 0.004787779270193249

P-Value: 0.9751026585607157

Fail to reject the null hypothesis: the data follow a Pareto distribution.

Similar graphics and results are obtained for parameter b assuming values as 2.0, 2.62, 2.5, 3.0, 3.5.

Now, let's verify if the data from UK overseas trade in goods statistics January 2022 about imports could be said to follow a Pareto distribution function [5, 6]. First, let's create and employ a code to read and clean data.

import pandas as pd

def read_data_url(url):

# To get all tables in all tabs.

dfs = pd.read_excel(url, sheet_name=None)

# Picking one of three available tables.

df1 = dfs['EU Imports']

# Changing the names of the columns as the names from the first line of the table

df1.columns = df1.iloc[1,:]

# Removing unnecessary columns

df1.drop(df1.index[[0,1]], inplace=True)

# Reseting index

df1.reset_index(inplace=True, drop=True)

# Reseting the last line with totals

df1.drop(df1.tail(1).index,inplace=True)

# Ordering values in decreasing order

df1_descendent_order = df1.sort_values(by="January 2022", ascending=False)

# Considering the total volume of importation in January 2022

total = df1['January 2022'].sum()

# Creating a new column about percentage of the product considering the total volume

df1['Percentage'] = (df1['January 2022'].astype(float)/total)*100

df1_descendent_order = df1.sort_values(by="Percentage", ascending=False)

# Reseting index again

df1_descendent_order.reset_index(inplace=True, drop=True)

return df1_descendent_order

url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/1059771/OTS_IMP_2201.xlsx'

df = read_data_url(url)

And, finally employ the previously defined functions to employ Kolmogorov-Smirnov test to verify if the data follows or not Pareto distribution function.

data = df['January 2022'].astype(float)

plt = visualize_empirical_distribution(data)

shape_fit, loc_fit, scale_fit = fit_pareto_distribution(data)

args = (shape_fit, loc_fit, scale_fit)

visualize_theoretical_distribution(plt, args)

test_pareto_distribution(data, args)

plot_qq_plot(data, args)

plot_cumulative_distribution(data, args)

Fitted parameters: shape=0.2219204302302021, loc=0, scale=642.0