1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine it with hypothesis testing?
2.4. Using the Chi-Square goodness-of-fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
What is the Northwind Traders (NT) dataset?
The Northwind Traders (NT) dataset is a toy database created by Microsoft for educational purposes. This fictitious company sells specialty foods wholesale to retail outlets worldwide. The database contains data for 801 orders placed by 85 different customers over 23 months, complete with supplementary information on customers, employees, and suppliers. For this study, the goal was to extract insights from the data that could help the company generate more revenue [1]. Two things that are generally good for business are customers:
1. buying more products,
2. buying more often.
Since the customers of Northwind Traders are retailers, a discount on their orders means a higher profit margin for them, potentially generating buyer incentives as well as customer loyalty. This study investigated two questions related to best reaching these goals: how much discount should be offered, and how frequently should it be offered?
Since offering larger or more frequent discounts reduces the profit NT can make on each order, it is important to determine whether there is a threshold beyond which this loss of revenue ceases to be justified by the effects it generates.
Reading and cleaning values from the NT dataset
Let's start by reading the NT dataset from GitHub using the URL of the raw CSV file [2].
import pandas as pd
url = 'https://raw.githubusercontent.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000/master/clean_data.csv'
df = pd.read_csv(url)
df.head()
Checking for null values.
# Checking dataframe for null values
# We can see that the only column that has missing values is EmployeeSuper
df.info()
Now, we extract descriptive statistics for each column.
# Getting descriptive statistics for all columns
df.describe()
The next step is to count the frequency of each discount class and obtain the discount distribution.
print("Number of unique discount values:", df.Discount.nunique())
print(df.Discount.value_counts())
df.Discount.hist(figsize=(10,4));
Number of unique discount values: 6
Discount
0.00 434
0.05 124
0.10 95
0.15 57
0.20 55
0.25 36
Name: count, dtype: int64
We can see from the above cell that the discounts are given as percentages, in increments of 5% from 0 to 25%. The averaged WeightedDiscount values, however, include a few oddball values, so it may be useful to create categorical bins later. Now, let's count the continuous values and check that they are correct by extracting descriptive statistics.
# We can see here that the GROUP BY has produced a continuous variable, WeightedDiscount
df.WeightedDiscount.value_counts()
WeightedDiscount
0.000000 434
0.050000 51
0.200000 39
0.100000 37
0.250000 35
...
0.023684 1
0.043478 1
0.066667 1
0.094949 1
0.078611 1
Name: count, Length: 148, dtype: int64
# Checking the stats and making sure the math in the SQL query has produced proper min and max values
df.WeightedDiscount.describe()
count 801.000000
mean 0.054130
std 0.075352
min 0.000000
25% 0.000000
50% 0.000000
75% 0.100000
max 0.250000
Name: WeightedDiscount, dtype: float64
Since the averaging of discount amounts has produced many distinct values, the solution is to bin them into increments of 5% for ease of comparison, with bin edges set to round data points to their nearest increment.
# Bin the averaged discounts into increments of 5%, with bin edges set to round
# data points to their nearest increment
df['Discount'] = pd.cut(x=df.WeightedDiscount,
                        bins=[-1, 0, 0.075, 0.125, 0.175, 0.22, 0.25],
                        labels=[0, 0.05, 0.10, 0.15, 0.20, 0.25])
Applying non-parametric analysis
The next analysis compares the distributions of the OrderTotal values across different discount groups using both visual and statistical methods: visualizing the data, conducting pairwise comparisons with the Kolmogorov-Smirnov test, and creating comparative plots with probability density lines. The steps of the analysis are:
1. Data Preparation: The WeightedDiscount column is binned into discrete discount groups, stored in the Discount column.
2. Data Visualization: Histograms with KDE lines for OrderTotal and WeightedDiscount are plotted, showing the distribution of these values across different discount groups.
3. Kolmogorov-Smirnov Test: This test is used to compare the OrderTotal distributions between successive discount groups. The test results include the KS statistic and p-value for each comparison.
4. Comparative Plots: Histograms and KDE lines for OrderTotal values of successive discount groups are plotted. Vertical lines indicate the means of each group, providing a clear visual comparison.
The following statistical concepts will be necessary:
Binning: Binning is the process of dividing continuous data into discrete intervals (bins). In this analysis, WeightedDiscount values are binned into specified intervals to create the Discount groups.
Histogram: A histogram is a graphical representation of the distribution of numerical data. It shows the frequency of data points falling within specified ranges (bins).
Kernel Density Estimation (KDE): KDE is a non-parametric way to estimate the probability density function of a random variable. It smooths the data to provide a continuous estimate of the distribution, adding more detail than a histogram alone.
Kolmogorov-Smirnov Test: The Kolmogorov-Smirnov (KS) test is a non-parametric test that compares the distributions of two independent samples. It calculates the maximum difference between the empirical cumulative distribution functions (ECDF) of the two samples (a worked sketch follows these definitions).
KS Statistic: Measures the maximum distance between the ECDFs of the two samples.
P-value: Indicates the probability of observing a KS statistic as extreme as the one calculated, under the null hypothesis that the two samples are from the same distribution. A low p-value (< 0.05) suggests significant differences between the distributions.
Mean: The mean is the average value of a dataset. It is a measure of central tendency, providing a single value that summarizes the central location of the data.
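To make the KS mechanics concrete, here is a minimal, self-contained sketch (on synthetic data, not the NT dataset) that computes the two ECDFs by hand, takes their maximum distance as the KS statistic, and checks the result against scipy.stats.ks_2samp:
import numpy as np
from scipy.stats import ks_2samp

# Two synthetic samples for illustration only (not the NT data)
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.5, scale=1.0, size=100)

# Evaluate each ECDF on the pooled, sorted values;
# searchsorted(..., side='right') counts points <= x
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
ecdf_b = np.searchsorted(np.sort(b), pooled, side='right') / len(b)

# KS statistic = maximum vertical distance between the two ECDFs
ks_manual = np.max(np.abs(ecdf_a - ecdf_b))

# Cross-check against scipy's implementation
ks_stat, p_value = ks_2samp(a, b)
print(f"Manual KS statistic: {ks_manual:.4f}")
print(f"scipy KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")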
The following elements will be helpful to enable the interpretation of results:
Data Preparation and Visualization: The initial step of binning WeightedDiscount and visualizing OrderTotal and WeightedDiscount distributions helps to understand the overall distribution and spread of these values across different discount groups.
Kolmogorov-Smirnov Test: By comparing the OrderTotal distributions between successive discount groups, the KS test identifies if there are significant differences in how order totals are distributed across different discount levels. Significant differences (p-value < 0.05) indicate that the discount level affects the distribution of order totals.
Comparative Plots: The comparative plots provide a visual representation of how OrderTotal values vary between successive discount groups. The addition of KDE lines and mean lines allows for a detailed comparison of the central tendency and distribution shape between groups.
In summary:
This analysis helps in understanding the impact of different discount levels on the order totals. The combination of statistical testing and visual representation provides a comprehensive view of the data. By identifying significant differences and visualizing distribution patterns, the analysis can inform decisions on discount strategies and their effectiveness in influencing order totals. Furthermore, some code from [3, 4, 5] served as inspiration for the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

# Assuming the DataFrame df is already loaded and contains the columns 'OrderTotal' and 'WeightedDiscount'
# Create the 'Discount' column as described
df['Discount'] = pd.cut(x=df.WeightedDiscount,
                        bins=[-1, 0, 0.075, 0.125, 0.175, 0.22, 0.25],
                        labels=[0, 0.05, 0.10, 0.15, 0.20, 0.25])

# Data visualization: one row with two columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Columns to iterate through
columns_to_plot = ['OrderTotal', 'WeightedDiscount']  # Add other columns if necessary

for w, column in enumerate(columns_to_plot):
    sns.histplot(data=df,
                 x=column,
                 hue='Discount',
                 alpha=0.5,
                 kde=True,
                 ax=axes[w],
                 palette='viridis')
    axes[w].grid(True)

plt.tight_layout()
plt.show()

# Preparing the data for the Kolmogorov-Smirnov test:
# grouping 'OrderTotal' by 'Discount'
discount_groups = df.groupby('Discount')['OrderTotal']

# Pairwise Kolmogorov-Smirnov comparison between successive discount levels
discount_levels = df['Discount'].cat.categories
ks_results = []
for i in range(len(discount_levels) - 1):
    group1 = discount_groups.get_group(discount_levels[i])
    group2 = discount_groups.get_group(discount_levels[i + 1])
    ks_stat, p_value = ks_2samp(group1, group2)
    ks_results.append((discount_levels[i], discount_levels[i + 1], ks_stat, p_value))

# Displaying the results of the Kolmogorov-Smirnov test
print("Kolmogorov-Smirnov Test Results:")
for result in ks_results:
    print(f'Comparison {result[0]} vs {result[1]}: KS Statistic = {result[2]:.4f}, p-value = {result[3]:.4f}')

# Creating comparative plots between successive discount levels
for i in range(len(discount_levels) - 1):
    group1 = discount_groups.get_group(discount_levels[i])
    group2 = discount_groups.get_group(discount_levels[i + 1])
    mean1 = np.mean(group1)
    mean2 = np.mean(group2)
    plt.figure(figsize=(10, 6))
    sns.histplot(group1, alpha=0.5, label=f'Discount {discount_levels[i]}', kde=True)
    sns.histplot(group2, alpha=0.5, label=f'Discount {discount_levels[i + 1]}', kde=True)
    plt.axvline(mean1, color='r', linestyle='dashed', linewidth=1, label=f'Mean {discount_levels[i]}')
    plt.axvline(mean2, color='b', linestyle='dashed', linewidth=1, label=f'Mean {discount_levels[i + 1]}')
    plt.legend(loc='upper right')
    plt.title(f'Comparison of OrderTotal: Discount {discount_levels[i]} vs {discount_levels[i + 1]}')
    plt.xlabel('OrderTotal')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()
Kolmogorov-Smirnov Test Results:
Comparison 0.0 vs 0.05: KS Statistic = 0.2051, p-value = 0.0005
Comparison 0.05 vs 0.1: KS Statistic = 0.0859, p-value = 0.7800
Comparison 0.1 vs 0.15: KS Statistic = 0.0807, p-value = 0.9613
Comparison 0.15 vs 0.2: KS Statistic = 0.2045, p-value = 0.1608
Comparison 0.2 vs 0.25: KS Statistic = 0.1566, p-value = 0.5923
The comparison between the zero and 5% discount groups shows a significant difference (p = 0.0005 < 0.05), while the other pairwise comparisons between successive discount levels show no significant difference (p > 0.05). This suggests that offering any discount changes the distribution of order totals, but once a discount is offered, its exact size matters less.
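For convenience, here is a small sketch (assuming the ks_results list built above) that flags which successive-group comparisons are significant at the usual 5% level:
# Flag significant comparisons at alpha = 0.05
alpha = 0.05
for level_a, level_b, stat, p in ks_results:
    verdict = "distributions differ" if p < alpha else "no significant difference"
    print(f"{level_a} vs {level_b}: KS = {stat:.4f}, p = {p:.4f} -> {verdict}")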
The Python code with all the steps is summarized in this Google Colab notebook:
https://colab.research.google.com/drive/1WYnJgHz88KT2ePn6KlZl4cxOZtrSjyqW?usp=sharing
References
[1] Discount versus No Discount - The Northwind Traders Dataset: Studying the effects of discount on customer behavior:
https://towardsdatascience.com/the-northwind-traders-dataset-5513bd7b63b0
[2] GitHub repository with the dataset in CSV format:
https://github.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000
[3] GitHub repository with several code examples:
https://github.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000
[4] Clustering and testing whether the clusters follow the same distribution:
[5] Comparing the Kolmogorov-Smirnov and Shapiro-Wilk tests to verify the normal distribution of the errors: