1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One sample z-test and their relation with two-sample z-test
1.5. One sample t-test and their relation with two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric for comparing machine learning
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using Chi-Square fit to check if Benford-Law holds or not
2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not
What is Benford's Law?
Benford’s law (also called the first-digit law) states that the leading digits in many collections of data are more likely to be small than large. For example, about 30% of the numbers in such a set have a leading digit of 1, whereas a uniform distribution would give 11.1% (i.e. one out of nine digits). About 17.6% start with the digit 2, and the share keeps falling for larger digits. This is an unexpected phenomenon: if all leading digits (1 through 9) were equally likely, each would occur 11.1% of the time. To put it simply, Benford’s law is a probability distribution for the first digit in a set of numbers [1].
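For reference, the distribution behind these percentages can be written compactly. This is the standard base-10 statement of Benford's Law, and it corresponds to the math.log10(1 + 1/d) expression used in the code later in this article:

```latex
P(d) = \log_{10}\!\left(1 + \frac{1}{d}\right), \qquad d \in \{1, 2, \dots, 9\}
```

Evaluating it gives P(1) ≈ 0.301, P(2) ≈ 0.176 and P(9) ≈ 0.046, and the nine probabilities sum to log10(10) = 1.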
Benford’s law doesn’t apply to every set of numbers, but it usually applies to large sets of naturally occurring numbers that share some connection, such as:
Companies’ stock market values,
Data found in texts, like the Reader’s Digest or a copy of Newsweek,
Demographic data, including state and city populations,
Income tax data,
Mathematical tables, like logarithms,
River drainage rates,
Scientific data.
Results using Turkish data suggest that deviations from Benford’s Law are consistent with higher rates of tax evasion [2].
What is De Minimis?
De Minimis is a legal term that has been applied in many ways, including to copyright law, business law, and income tax law. “De Minimis” comes from the Latin phrase ‘de minimis non curat lex’ which translates to “The law does not concern itself with trifles.”
Today, the term is used across a variety of contexts to describe matters that are too small or trivial to be deemed worthy of consideration by a regulating authority. In international trade commerce, according to [3]: “De Minimis Value as the threshold is known, varies from country to country.
Items imported into the United States are subject to duty when the value is over USD 800. In Australia, duty and taxes kick in after the first USD 1,000. In Canada, it’s USD 20; in some other countries, it’s USD 5. In Europe the average is about USD 190, however, it may vary considerably from country to country. Worldwide, 56 percent of individuals surveyed said they would buy more if the duties were reduced or eliminated. That opinion is shared by 80 percent of consumers surveyed in Latin America.
Knowing which value applies to which country can help you estimate the landed cost—the full cost the customer will pay—and you can communicate this price to the buyer.
A brief view of the code to extract data to verify Benford's Law
1. getFirstDigitNumber(number):
Purpose: This function extracts the first digit from a given number.
Parameters:
number: The number from which to extract the first digit. It can be of any numeric type.
Process: Converts the number to a string (string_number). Extracts the first character of the string and converts it back to an integer (digit).
Output: Returns the first digit of the number.
2. getListDigits(numbers_list):
Purpose: This function extracts the first digits from a list of numbers, ignoring zeros.
Parameters:
numbers_list: A list of numbers from which to extract the first digits.
Process: Initializes an empty list list_digits. Loops through each number in the input list. Uses getFirstDigitNumber to get the first digit of each number. Appends the first digit to list_digits if it is not zero.
Output: Returns a list of first digits.
3. getFreq(list_digits):
Purpose: This function calculates the frequency of each digit in a list of digits.
Parameters:
list_digits: A list of digits for which to calculate the frequency.
Process: Uses Counter from the collections module to count the frequency of each digit in the list.
Output: Returns a Counter object representing the frequency of each digit.
4. read_data_DeMinimis():
Purpose: This function reads data from an Excel file hosted on Google Drive.
Parameters: None.
Process: Defines the URL of the Excel file to download. Uses pd.ExcelFile to load the Excel file from the URL. Retrieves the sheet names from the Excel file. Reads the first sheet of the Excel file into a DataFrame (df).
Output: Returns the DataFrame containing the data from the first sheet of the Excel file.
5. extract_freq_digits_DeMinimis():
Purpose: This function extracts the frequency of the first digits from two columns ('Shipment weight' and 'Declared value') of the De Minimis dataset.
Parameters: None.
Process: Calls read_data_DeMinimis to get the DataFrame with the data. Extracts the 'Shipment weight' column into a list (sw). Extracts the 'Declared value' column into a list (dv). Combines the two lists into a single list (sw_dv). Calls getListDigits to get the list of first digits from the combined list. Calls getFreq to calculate the frequency of the first digits.
Output: Returns a Counter object representing the frequency of the first digits from the combined columns.
6. Usage:
Purpose: This part of the code calls the extract_freq_digits_DeMinimis function to extract the frequency of the first digits from the dataset and prints the resulting frequency distribution.
Output: Displays the frequency of first digits extracted from the 'Shipment weight' and 'Declared value' columns of the De Minimis dataset.
import pandas as pd
from collections import Counter

# Function to get the first digit of a number
def getFirstDigitNumber(number):
    string_number = str(number)
    digit = int(string_number[0])
    return digit

# Function to extract the first digit of each number in a list
def getListDigits(numbers_list):
    list_digits = []
    # Loop over the list and keep the first digit of each number,
    # skipping numbers whose first character is zero.
    for number in numbers_list:
        first_digit = getFirstDigitNumber(number)
        if first_digit != 0:
            list_digits.append(first_digit)
    return list_digits

# Function to get the frequency of digits
def getFreq(list_digits):
    freq = Counter(list_digits)
    return freq

# Function to read the data from an Excel file hosted on Google Drive
def read_data_DeMinimis():
    url = 'https://drive.google.com/uc?export=download&id=1rtx9x0U-bXki8x0NHx1IqweY83mU5xXJ'
    xls = pd.ExcelFile(url)
    sheet_names = xls.sheet_names
    df = pd.read_excel(xls, sheet_names[0])
    return df

# Function to extract the frequency of first digits from two columns of the De Minimis data
def extract_freq_digits_DeMinimis():
    df = read_data_DeMinimis()
    sw = list(df['Shipment weight'])
    dv = list(df['Declared value'])
    sw_dv = sw + dv
    list_digits = getListDigits(sw_dv)
    freq = getFreq(list_digits)
    return freq

freq = extract_freq_digits_DeMinimis()
freq
Counter({5: 163, 7: 87, 1: 600, 4: 168, 3: 251, 6: 121, 2: 316, 8: 86, 9: 62})
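The extraction above takes the first character of str(number), so a value such as 0.052 is dropped (its first character is 0) and a negative value or a missing entry would raise an error. A minimal sketch of a more tolerant variant is shown below; first_significant_digit is a hypothetical helper that is not part of the original code, and it assumes that the first non-zero digit is what should be counted.

```python
from decimal import Decimal, InvalidOperation

def first_significant_digit(value):
    """Return the first non-zero digit of a number, or None if there is none.

    Hypothetical helper for illustration; not part of the original code.
    """
    try:
        number = Decimal(str(value))
    except (InvalidOperation, ValueError, TypeError):
        return None  # not parseable as a number (e.g. text or None)
    if not number.is_finite() or number == 0:
        return None  # zero, NaN and infinity have no leading digit
    # The coefficient of a Decimal carries no leading zeros and no sign,
    # so its first digit is the first significant digit.
    return number.as_tuple().digits[0]

samples = [123, 0.052, -47.5, 0, float('nan'), 'n/a']
print([first_significant_digit(s) for s in samples])
# prints [1, 5, 4, None, None, None]
```

Swapping this helper into getListDigits may change the counts shown above, because values below 1 would then contribute their first significant digit instead of being skipped.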
A brief view of the code to verify and visualize if Benford's Law holds
Now, let's plot the empirical and theoretical probability distributions to see how far apart they are. First, it is important to have a general view of the functions involved:
1. compute_empirical_prob(freq): This function calculates the empirical probabilities from a given frequency dictionary.
Input:
freq: A dictionary where the keys are digits and the values are their observed frequencies.
Process:
Converts the values of the dictionary to a list (list_freq).
Sums the list to get the total frequency (total).
Divides each frequency by the total to get the probability (probability).
Creates a new dictionary (emp_dict) mapping each digit to its calculated probability.
Output:
A dictionary mapping each digit to its empirical probability.
2. benfordEquation(d): This function calculates the theoretical probability of a digit occurring as the first digit according to Benford's Law.
Input:
d: A digit (1 through 9).
Process:
Uses the Benford's Law formula to calculate the probability.
math.log10(1 + 1/d): Computes the logarithm base 10 of (1 + 1/d).
Output:
The theoretical probability that the digit d is the first digit.
3. compute_theoretical_prob(): This function calculates the theoretical probabilities for digits 1 through 9 according to Benford's Law.
Input:
None directly (implicitly uses digits 1 through 9).
Process:
Iterates through digits 1 to 9.
Uses benfordEquation(d) to calculate the theoretical probability for each digit.
Appends each calculated probability to a list (y).
Output:
A list of theoretical probabilities for digits 1 through 9.
4. plot_frequencies(emp_dict, theory_prob): This function plots the empirical and theoretical probabilities of the first digits.
Input:
emp_dict: A dictionary of empirical probabilities.
theory_prob: A list of theoretical probabilities.
Process:
Creates a list x of digits from 1 to the length of theory_prob.
Plots a bar chart of empirical probabilities.
Plots a line chart of theoretical probabilities.
Sets x-axis labels to show all digits.
Adds labels, title, legend, and grid to the plot.
Displays the plot using plt.show().
Output:
None directly (displays a plot).
5. Overall script execution, step by step:
Calls compute_empirical_prob(freq) to compute empirical probabilities and stores it in emp_dict.
Calls compute_theoretical_prob() to compute theoretical probabilities and stores it in theory_prob.
Sorts emp_dict by keys using OrderedDict.
Prints the sorted emp_dict and theory_prob.
Calls plot_frequencies(emp_dict, theory_prob) to plot the frequencies.
import matplotlib.pyplot as plt
import math
# OrderedDict is used to build a dictionary sorted by key
from collections import OrderedDict

def compute_empirical_prob(freq):
    list_freq = list(freq.values())
    total = sum(list_freq)
    # Treating relative frequency as an approximated probability
    probability = [x / total for x in list_freq]
    emp_dict = dict(zip(freq.keys(), probability))
    return emp_dict

# Compute the theoretical probability of occurrence of each digit
# according to the Benford equation
def benfordEquation(d):
    freq = math.log10(1 + 1 / d)
    return freq

# Compute the probability distribution using the Benford's Law equation
def compute_theoretical_prob():
    x = range(1, 10)
    y = []
    for elem in x:
        y.append(benfordEquation(elem))
    return y

# Plot the empirical and theoretical probabilities
def plot_frequencies(emp_dict, theory_prob):
    x = list(range(1, len(theory_prob) + 1))
    plt.bar(emp_dict.keys(), emp_dict.values(), label='Observed', alpha=0.5)
    plt.plot(x, theory_prob, '-r', label='Expected (Benford\'s Law)')
    # Set x-axis labels to show all digits
    plt.xticks(ticks=x, labels=x)
    plt.xlabel('First Digit')
    plt.ylabel('Probability')
    plt.title('Frequency of First Digits')
    plt.legend()
    plt.grid(True)
    plt.show()

emp_dict = compute_empirical_prob(freq)
theory_prob = compute_theoretical_prob()
emp_dict = OrderedDict(sorted(emp_dict.items()))
print(emp_dict)
print(theory_prob)
plot_frequencies(emp_dict, theory_prob)
OrderedDict([(1, 0.32362459546925565), (2, 0.1704422869471413), (3, 0.1353829557713053), (4, 0.09061488673139159), (5, 0.08791801510248112), (6, 0.06526429341963323), (7, 0.04692556634304207), (8, 0.04638619201725998), (9, 0.03344120819848975)])
[0.3010299956639812, 0.17609125905568124, 0.12493873660829992, 0.09691001300805642, 0.07918124604762482, 0.06694678963061322, 0.05799194697768673, 0.05115252244738129, 0.04575749056067514]
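The two printed structures line up position by position only because every digit from 1 to 9 occurs in the data and the dictionary was sorted by key. As a hedged sketch, the helper below builds the empirical probabilities in digit order and fills in 0.0 for any digit that never occurs; aligned_probabilities is a hypothetical name that is not part of the original code, and the counts are the ones reported earlier for the De Minimis data.

```python
from collections import Counter

def aligned_probabilities(freq, digits=range(1, 10)):
    """Return probabilities ordered by digit, using 0.0 for missing digits.

    Hypothetical helper for illustration; not part of the original code.
    """
    total = sum(freq.get(d, 0) for d in digits)
    if total == 0:
        raise ValueError("no first digits to count")
    return [freq.get(d, 0) / total for d in digits]

# First-digit counts reported earlier for the De Minimis data
freq = Counter({5: 163, 7: 87, 1: 600, 4: 168, 3: 251, 6: 121, 2: 316, 8: 86, 9: 62})
probs = aligned_probabilities(freq)
for digit, p in zip(range(1, 10), probs):
    print(digit, round(p, 4))
```

With a list built this way, position i always refers to digit i + 1, so it can be compared element by element with the list of theoretical probabilities even on smaller datasets where some digit is absent.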
Applying the Chi-Square test to verify Benford law distribution through comparison with Chi-Square statistic and critical values
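The next code applies the Chi-Square goodness-of-fit test to check whether the data follow Benford's Law, by comparing the Chi-Square statistic with the critical value at the 5% significance level. Note that stats.chisquare is designed for observed and expected counts; the code passes proportions, so the statistic it reports equals the count-based statistic divided by the sample size and should be read with that in mind.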
import scipy.stats as stats
import numpy as np

# Calculate Chi-Square statistic.
# Caution: observed and expected hold proportions here, not counts, so the
# statistic is the count-based value divided by the sample size.
observed = list(emp_dict.values())
expected = theory_prob
chi_square_statistic, p_value = stats.chisquare(observed, expected)

# Degrees of freedom
df = len(observed) - 1

# Critical value at 5% significance level
critical_value = stats.chi2.ppf(0.95, df)

# Print the results
print(f"Chi-Square Statistic: {chi_square_statistic}")
print(f"Degrees of Freedom: {df}")
print(f"Critical Value: {critical_value}")
print(f"P-Value: {p_value}")
if chi_square_statistic < critical_value:
    print("Fail to reject the null hypothesis: Digits are consistent with Benford distribution.")
else:
    print("Reject the null hypothesis: Digits do not follow Benford distribution.")
Chi-Square Statistic: 0.010036387200615643
Degrees of Freedom: 8
Critical Value: 15.50731305586545
P-Value: 0.999999999973683
Fail to reject the null hypothesis: Digits are consistent with Benford distribution.
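Because scipy.stats.chisquare treats its inputs as observed and expected frequencies, it is worth seeing the same test on raw counts. The sketch below is an illustration under stated assumptions rather than the article's original method: it hard-codes the first-digit counts reported earlier for the De Minimis data and takes the expected count for each digit to be n * log10(1 + 1/d).

```python
import math
from scipy import stats

# First-digit counts reported earlier for the De Minimis data
counts = {1: 600, 2: 316, 3: 251, 4: 168, 5: 163, 6: 121, 7: 87, 8: 86, 9: 62}
digits = range(1, 10)
n = sum(counts.values())

observed_counts = [counts[d] for d in digits]
# Expected counts under Benford's Law: n * log10(1 + 1/d)
expected_counts = [n * math.log10(1 + 1 / d) for d in digits]

chi2_stat, p_value = stats.chisquare(observed_counts, f_exp=expected_counts)
critical_value = stats.chi2.ppf(0.95, df=len(observed_counts) - 1)

print(f"n = {n}")
print(f"Chi-Square statistic (counts): {chi2_stat:.3f}")
print(f"Critical value (5%): {critical_value:.3f}")
print(f"p-value: {p_value:.4f}")
```

On these counts (n = 1854) the statistic should come out near 18.6, above the 5% critical value of about 15.5, with a p-value close to 0.017. The count-based test would therefore reject the null hypothesis at the 5% level, which differs from the conclusion obtained from proportions and shows why the choice of input matters.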
Applying the Chi-Square test to verify Benford law distribution through comparison with p-value
The next code performs the same test but bases the decision on the p-value instead of the critical value.
from scipy.stats import chisquare
# Goodness-of-fit test of the observed distribution against Benford's Law.
# chi2_contingency is a test of independence for contingency tables, so the
# goodness-of-fit function chisquare is used here instead.
# observed and expected (defined above) hold proportions rather than counts.
chi2, p = chisquare(observed, expected)
# Print Chi-Square test result
if p < 0.05:
    print(f'Reject the null hypothesis. The data does not follow Benford\'s Law (p-value: {p:.4f}).')
else:
    print(f'Fail to reject the null hypothesis. The data follows Benford\'s Law (p-value: {p:.4f}).')
Fail to reject the null hypothesis. The data follows Benford's Law (p-value: 1.0000).
Creating a Q-Q plot to compare the distribution of the theoretical and given data
The next code uses the De Minimis dataset to check whether the entries follow Benford's Law in terms of first-digit frequency distribution. It is composed of the following parts:
1. Functions that employ Benford Equation and compute_theoretical_prob:
benfordEquation(d): Calculates the theoretical frequency for a digit d using Benford's Law.
compute_theoretical_prob(): Calculates the theoretical probabilities for digits from 1 to 9.
2. Function plot_qq_benford:
Calculates the theoretical CDF of the digits using Benford's Law.
Converts these cumulative probabilities into theoretical quantiles using NumPy's percentile function.
Calculates the quantiles of the observed data in the same way.
Plots the observed quantiles versus the theoretical quantiles.
3. Example Usage:
Reads the De Minimis data entries and checks whether they follow Benford's theoretical distribution.
Calls the plot_qq_benford function to plot the Q-Q plot of this data. This way, the Q-Q plot will show if the observed data follows Benford's theoretical distribution. A Q–Q plot is used to compare the shapes of distributions, providing a graphical view of how shapes are similar or different in the two distributions. Q–Q plots can be used to compare collections of data, or theoretical distributions.
A more detailed description of functionalities of function plot_qq_benford has the following components:
Input:
data: Observed data.
title: Title of the Q-Q plot (default is 'Q-Q Plot').
Process:
Calculates the theoretical probabilities using compute_theoretical_prob().
Computes the cumulative distribution function (CDF) of the theoretical probabilities.
Generates theoretical quantiles from the theoretical CDF.
Computes the CDF of the observed data.
Generates observed quantiles from the observed CDF.
Creates a Q-Q plot comparing observed quantiles with theoretical quantiles.
Output:
Displays a Q-Q plot with observed quantiles versus theoretical quantiles.
Usage: Calls plot_qq_benford(observed) to plot the Q-Q plot for the observed data.
# Function to generate a Q-Q plot considering the theoretical distribution
def plot_qq_benford(data, title='Q-Q Plot'):
    # Calculate the theoretical distribution
    theoretical_prob = compute_theoretical_prob()
    theoretical_cdf = np.cumsum(theoretical_prob) / np.sum(theoretical_prob)
    # Generate theoretical quantiles
    theoretical_quantiles = np.percentile(theoretical_cdf, np.linspace(0, 100, len(theoretical_cdf)))
    print('Theoretical quantiles = ', theoretical_quantiles)
    # Calculate observed data quantiles
    observed_cdf = np.cumsum(data) / np.sum(data)
    observed_quantiles = np.percentile(observed_cdf, np.linspace(0, 100, len(observed_cdf)))
    print('Observed quantiles = ', observed_quantiles)
    # Generate the Q-Q plot
    fig, ax = plt.subplots(figsize=(8, 6))
    # Plot observed quantiles vs theoretical quantiles
    ax.scatter(theoretical_quantiles, observed_quantiles, color='blue', edgecolor='black')
    ax.plot(theoretical_quantiles, theoretical_quantiles, color='red', linestyle='--')  # Reference line
    # Additional plot settings
    ax.set_title(title, fontsize=15)
    ax.set_xlabel('Theoretical Quantiles', fontsize=12)
    ax.set_ylabel('Observed Quantiles', fontsize=12)
    plt.grid(True)
    plt.show()

# Plotting the Q-Q plot
plot_qq_benford(observed)
Theoretical quantiles = [0.30103 0.47712125 0.60205999 0.69897 0.77815125 0.84509804
0.90308999 0.95424251 1. ]
Observed quantiles = [0.3236246 0.49406688 0.62944984 0.72006472 0.80798274 0.87324703
0.9201726 0.96655879 1. ]
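The printed values are cumulative probabilities for digits 1 through 9, so the plot can be complemented with a single number: the largest vertical gap between the two cumulative curves. The sketch below is only a descriptive summary under stated assumptions, not a formal test; it hard-codes the first-digit counts reported earlier, and the variable names are illustrative.

```python
import numpy as np

# First-digit counts reported earlier for the De Minimis data
counts = {1: 600, 2: 316, 3: 251, 4: 168, 5: 163, 6: 121, 7: 87, 8: 86, 9: 62}
digits = np.arange(1, 10)

observed = np.array([counts[int(d)] for d in digits], dtype=float)
benford = np.log10(1 + 1 / digits)  # theoretical probabilities

# Cumulative distributions over digits 1..9
observed_cdf = np.cumsum(observed) / observed.sum()
benford_cdf = np.cumsum(benford) / benford.sum()

# Absolute gap between the two curves at each digit
gaps = np.abs(observed_cdf - benford_cdf)
worst = int(np.argmax(gaps))
print(f"Largest CDF gap: {gaps[worst]:.4f} at digit {digits[worst]}")
```

For these counts the largest gap should be about 0.03, reached at digit 5, which matches the small but consistent offset between the two rows of values printed above.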
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1OApTikifL-sIXPuJkLckq71c4bKqfPV6?usp=sharing
References
[2] https://cepr.org/voxeu/columns/using-benfords-law-detect-tax-fraud-international-trade
[3] International Trade Administration. De Minimis Value - Express shipment exemptions. https://www.trade.gov/de-minimis-value.