1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine it with hypothesis testing?
2.4. Using the Chi-Square goodness-of-fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
What is the Northwind Traders (NT) dataset?
The Northwind Traders (NT) dataset is a toy database created by Microsoft for educational purposes. This fictitious company sells specialty foods wholesale to retail outlets worldwide. The database contains data for 801 orders placed by 85 different customers over 23 months, complete with supplementary information on customers, employees, and suppliers. For this study, the goal was to extract insights from the data that could help the company generate more revenue [1]. Two things that are generally good for business are customers:
1. buying more products,
2. buying more often.
Since the customers of Northwind Traders are retailers, a discount on their orders means a higher profit margin for them, potentially generating buyer incentives as well as customer loyalty. This study investigated two questions related to best reaching these goals: how much discount should be offered, and how frequently should it be offered?
Since offering larger or more frequent discounts reduces the profit NT can make on each order, it is important to determine whether there is a threshold beyond which this loss of revenue ceases to be justified by the effects it generates.
Reading and cleaning values from the NT dataset
Let's start by reading the NT dataset from GitHub using the URL of the raw CSV file [2].
import pandas as pd
url = 'https://raw.githubusercontent.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000/master/clean_data.csv'
df = pd.read_csv(url)
df.head()
Checking for null values.
# Checking dataframe for null values
# We can see that the only column that has missing values is EmployeeSuper
df.info()
Now, we extract descriptive statistics for each column.
# Getting descriptive statistics for all columns
df.describe()
The next step is to count the frequency of each discount class and obtain the discount distribution.
print("Number of unique discount values:", df.Discount.nunique())
print(df.Discount.value_counts())
df.Discount.hist(figsize=(10,4));
Number of unique discount values: 6
Discount
0.00 434
0.05 124
0.10 95
0.15 57
0.20 55
0.25 36
Name: count, dtype: int64
We can see from the above cell that the discounts are given as percentages, in increments of 5% from 0 to 25%. The averaged WeightedDiscount values, however, include a few oddball values, so it may be useful to create categorical bins later. Now, let's count the continuous values and check that they are correct by extracting descriptive statistics.
# We can see here that the GROUP BY has produced a continuous variable, WeightedDiscount
df.WeightedDiscount.value_counts()
WeightedDiscount
0.000000 434
0.050000 51
0.200000 39
0.100000 37
0.250000 35
...
0.023684 1
0.043478 1
0.066667 1
0.094949 1
0.078611 1
Name: count, Length: 148, dtype: int64
# Checking the stats and making sure the math in the SQL query has produced proper min and max values
df.WeightedDiscount.describe()
count 801.000000
mean 0.054130
std 0.075352
min 0.000000
25% 0.000000
50% 0.000000
75% 0.100000
max 0.250000
Name: WeightedDiscount, dtype: float64
Since the averaging of discount amounts has produced many distinct values, the solution is to bin them into increments of 5% for ease of comparison, with bin edges set to round data points to their nearest increment.
# Bin the averaged discounts into increments of 5%, with bin edges set to round
# data points to their nearest increment
df['Discount'] = pd.cut(x=df.WeightedDiscount,
                        bins=[-1, 0, 0.075, 0.125, 0.175, 0.22, 0.25],
                        labels=[0, 0.05, 0.10, 0.15, 0.20, 0.25])
Applying non-parametric analysis
The next analysis compares the distributions of the OrderTotal values across different discount groups using both visual and statistical methods: visualizing the data, conducting pairwise comparisons with the Kolmogorov-Smirnov test, and creating comparative plots with probability density lines. The steps of the analysis are:
1. Data Preparation: The WeightedDiscount column is binned into discrete discount groups, stored in the Discount column.
2. Data Visualization: Histograms with KDE lines for OrderTotal and WeightedDiscount are plotted, showing the distribution of these values across different discount groups.
3. Kolmogorov-Smirnov Test: This test is used to compare the OrderTotal distributions between successive discount groups. The test results include the KS statistic and p-value for each comparison.
4. Comparative Plots: Histograms and KDE lines for OrderTotal values of successive discount groups are plotted. Vertical lines indicate the means of each group, providing a clear visual comparison.
The following statistical concepts will be necessary:
Binning: Binning is the process of dividing continuous data into discrete intervals (bins). In this analysis, WeightedDiscount values are binned into specified intervals to create the Discount groups.
Histogram: A histogram is a graphical representation of the distribution of numerical data. It shows the frequency of data points falling within specified ranges (bins).
Kernel Density Estimation (KDE): KDE is a non-parametric way to estimate the probability density function of a random variable. It smooths the data to provide a continuous estimate of the distribution, adding more detail than a histogram alone.
Kolmogorov-Smirnov Test: The Kolmogorov-Smirnov (KS) test is a non-parametric test that compares the distributions of two independent samples. It calculates the maximum difference between the empirical cumulative distribution functions (ECDF) of the two samples (a worked sketch follows these definitions).
KS Statistic: Measures the maximum distance between the ECDFs of the two samples.
P-value: Indicates the probability of observing a KS statistic as extreme as the one calculated, under the null hypothesis that the two samples are from the same distribution. A low p-value (< 0.05) suggests significant differences between the distributions.
Mean: The mean is the average value of a dataset. It is a measure of central tendency, providing a single value that summarizes the central location of the data.
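To make the KS mechanics concrete, here is a minimal, self-contained sketch (on synthetic data, not the NT dataset) that computes the two ECDFs by hand, takes their maximum distance as the KS statistic, and checks the result against scipy.stats.ks_2samp:
import numpy as np
from scipy.stats import ks_2samp

# Two synthetic samples for illustration only (not the NT data)
rng = np.random.default_rng(42)
a = rng.normal(loc=0.0, scale=1.0, size=100)
b = rng.normal(loc=0.5, scale=1.0, size=100)

# Evaluate each ECDF on the pooled, sorted values;
# searchsorted(..., side='right') counts points <= x
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
ecdf_b = np.searchsorted(np.sort(b), pooled, side='right') / len(b)

# KS statistic = maximum vertical distance between the two ECDFs
ks_manual = np.max(np.abs(ecdf_a - ecdf_b))

# Cross-check against scipy's implementation
ks_stat, p_value = ks_2samp(a, b)
print(f"Manual KS statistic: {ks_manual:.4f}")
print(f"scipy KS statistic: {ks_stat:.4f}, p-value: {p_value:.4f}")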
The following elements will be helpful to enable the interpretation of results:
Data Preparation and Visualization: The initial step of binning WeightedDiscount and visualizing OrderTotal and WeightedDiscount distributions helps to understand the overall distribution and spread of these values across different discount groups.
Kolmogorov-Smirnov Test: By comparing the OrderTotal distributions between successive discount groups, the KS test identifies if there are significant differences in how order totals are distributed across different discount levels. Significant differences (p-value < 0.05) indicate that the discount level affects the distribution of order totals.
Comparative Plots: The comparative plots provide a visual representation of how OrderTotal values vary between successive discount groups. The addition of KDE lines and mean lines allows for a detailed comparison of the central tendency and distribution shape between groups.
In summary:
This analysis helps in understanding the impact of different discount levels on the order totals. The combination of statistical testing and visual representation provides a comprehensive view of the data. By identifying significant differences and visualizing distribution patterns, the analysis can inform decisions on discount strategies and their effectiveness in influencing order totals. Furthermore, some code from [3, 4, 5] served as inspiration for the code below.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp

# Assuming the DataFrame df is already loaded and contains the columns 'OrderTotal' and 'WeightedDiscount'
# Create the 'Discount' column as described
df['Discount'] = pd.cut(x=df.WeightedDiscount,
                        bins=[-1, 0, 0.075, 0.125, 0.175, 0.22, 0.25],
                        labels=[0, 0.05, 0.10, 0.15, 0.20, 0.25])

# Data visualization: one row with two columns of subplots
fig, axes = plt.subplots(1, 2, figsize=(12, 6))

# Columns to iterate through
columns_to_plot = ['OrderTotal', 'WeightedDiscount']  # Add other columns if necessary

for w, column in enumerate(columns_to_plot):
    sns.histplot(data=df,
                 x=column,
                 hue='Discount',
                 alpha=0.5,
                 kde=True,
                 ax=axes[w],
                 palette='viridis')
    axes[w].grid(True)

plt.tight_layout()
plt.show()

# Preparing the data for the Kolmogorov-Smirnov test:
# grouping 'OrderTotal' by 'Discount'
discount_groups = df.groupby('Discount')['OrderTotal']

# Pairwise Kolmogorov-Smirnov comparison between successive discount levels
discount_levels = df['Discount'].cat.categories
ks_results = []
for i in range(len(discount_levels) - 1):
    group1 = discount_groups.get_group(discount_levels[i])
    group2 = discount_groups.get_group(discount_levels[i + 1])
    ks_stat, p_value = ks_2samp(group1, group2)
    ks_results.append((discount_levels[i], discount_levels[i + 1], ks_stat, p_value))

# Displaying the results of the Kolmogorov-Smirnov test
print("Kolmogorov-Smirnov Test Results:")
for result in ks_results:
    print(f'Comparison {result[0]} vs {result[1]}: KS Statistic = {result[2]:.4f}, p-value = {result[3]:.4f}')

# Creating comparative plots between successive discount levels
for i in range(len(discount_levels) - 1):
    group1 = discount_groups.get_group(discount_levels[i])
    group2 = discount_groups.get_group(discount_levels[i + 1])
    mean1 = np.mean(group1)
    mean2 = np.mean(group2)
    plt.figure(figsize=(10, 6))
    sns.histplot(group1, alpha=0.5, label=f'Discount {discount_levels[i]}', kde=True)
    sns.histplot(group2, alpha=0.5, label=f'Discount {discount_levels[i + 1]}', kde=True)
    plt.axvline(mean1, color='r', linestyle='dashed', linewidth=1, label=f'Mean {discount_levels[i]}')
    plt.axvline(mean2, color='b', linestyle='dashed', linewidth=1, label=f'Mean {discount_levels[i + 1]}')
    plt.legend(loc='upper right')
    plt.title(f'Comparison of OrderTotal: Discount {discount_levels[i]} vs {discount_levels[i + 1]}')
    plt.xlabel('OrderTotal')
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.show()
Kolmogorov-Smirnov Test Results:
Comparison 0.0 vs 0.05: KS Statistic = 0.2051, p-value = 0.0005
Comparison 0.05 vs 0.1: KS Statistic = 0.0859, p-value = 0.7800
Comparison 0.1 vs 0.15: KS Statistic = 0.0807, p-value = 0.9613
Comparison 0.15 vs 0.2: KS Statistic = 0.2045, p-value = 0.1608
Comparison 0.2 vs 0.25: KS Statistic = 0.1566, p-value = 0.5923
The comparison between the zero and 5% discount groups shows a significant difference (p = 0.0005 < 0.05), while the other pairwise comparisons between successive discount levels show no significant difference (p > 0.05). This suggests that offering any discount changes the distribution of order totals, but once a discount is offered, its exact size matters less.
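For convenience, here is a small sketch (assuming the ks_results list built above) that flags which successive-group comparisons are significant at the usual 5% level:
# Flag significant comparisons at alpha = 0.05
alpha = 0.05
for level_a, level_b, stat, p in ks_results:
    verdict = "distributions differ" if p < alpha else "no significant difference"
    print(f"{level_a} vs {level_b}: KS = {stat:.4f}, p = {p:.4f} -> {verdict}")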
The Python code with all the steps is summarized in this Google Colab notebook:
https://colab.research.google.com/drive/1WYnJgHz88KT2ePn6KlZl4cxOZtrSjyqW?usp=sharing
References
[1] Discount versus No Discount - The Northwind Traders Dataset: Studying the effects of discount on customer behavior:
https://towardsdatascience.com/the-northwind-traders-dataset-5513bd7b63b0
[2] GitHub repository with the dataset in CSV format:
https://github.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000
[3] GitHub repository with several code examples:
https://github.com/FoamoftheSea/dsc-mod-3-project-online-ds-sp-000
[4] Clustering and testing whether the clusters follow the same distribution:
[5] Comparing the Kolmogorov-Smirnov and Shapiro-Wilk tests to verify the normal distribution of the errors: