1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One sample z-test and their relation with two-sample z-test
1.5. One sample t-test and their relation with two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using Chi-Square fit to check if Benford-Law holds or not
2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not
What is A/B testing?
A/B testing, also known as split testing, is a widely used experimentation technique in digital marketing and data analysis. It helps determine which version of an element, such as a landing page or an ad, is more effective in achieving specific goals, like increasing conversions or website dwell time.
In data science, A/B testing is a fundamental method for comparing two versions of a product, method, or service to identify which one performs better. It is commonly applied to enhance website design, app design, advertising, pricing, and other aspects of user experience.
To conduct an A/B test, you create two versions of an element and randomly show them to users. User interactions with each version are then monitored and compared to determine which version is more effective [1].
The basic concept of A/B testing involves randomly dividing a sample of users into two groups: a control group and a test group, each seeing different versions. The results from these groups are then compared to determine which version leads to better outcomes [2].
Because it is based on direct observation of user behavior, this method has been widely adopted by companies such as Netflix, Facebook, and Google. Google famously ran an A/B test across 41 different shades of blue for hyperlinks to determine which shade increased their click-through (conversion) rate. Netflix heavily relies on A/B testing to improve its user experience and conversion rate [3].
In fact, A/B experiments can be considered a form of split testing, featuring a hypothesis, a control group, a variation, and statistically calculated results. For example, in a simple A/B test, traffic is evenly split between the original version (control) and the new version (variation), with each receiving 50% of the users. The goal is to compare the performance of the two versions to determine which one is more effective [4]. The next figure illustrates this aspect:
What is the relation between A/B testing and test of hypothesis?
Understanding how to develop and test a hypothesis is crucial in A/B testing. This process starts with forming a clear hypothesis, which is then divided into a null hypothesis and an alternative hypothesis. From this hypothesis, we design an experiment, which becomes our A/B test, to validate or test our hypothesis.
A/B testing could employ the following tests [3]:
Parametric Tests
Non-Parametric Tests
Resampling Tests
The next figure helps summarize the possible families of hypothesis tests:
Since an A/B test consists in extracting two independent samples from the same population, these are the tests that could be applied, depending on the population's distribution (a helper automating this choice is sketched after this list):
Normal or Student's t distribution: apply a parametric test, e.g., a two-sample t-test or z-test.
Non-normal distribution: apply a non-parametric test, e.g., Chi-Square or Mann-Whitney.
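As a minimal sketch of this decision (the helper name compare_groups and the synthetic samples are our own illustrative assumptions; a recent SciPy is assumed so that shapiro returns a result with a .pvalue attribute), the choice could be automated like this:
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_groups(a, b, alpha=0.05):
    # Shapiro-Wilk: p > alpha means we cannot reject normality
    normal_a = shapiro(a).pvalue > alpha
    normal_b = shapiro(b).pvalue > alpha
    if normal_a and normal_b:
        # Both samples look approximately normal: parametric two-sample t-test
        stat, p = ttest_ind(a, b)
        return 't-test', stat, p
    # Otherwise fall back to the non-parametric Mann-Whitney U test
    stat, p = mannwhitneyu(a, b)
    return 'mann-whitney', stat, p

# Illustrative usage with synthetic samples
a = np.random.normal(loc=10, scale=2, size=200)
b = np.random.normal(loc=11, scale=2, size=200)
print(compare_groups(a, b))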
For both groups of methods, it is important to remember how to perform a hypothesis test.
Remembering the hypothesis testing steps
Whichever group of methods is chosen from the possibilities shown above, it is necessary to follow these steps [5]:
Define the Null and Alternative Hypotheses: First, clearly define the null and alternative hypotheses. The null hypothesis, denoted H0, is the default position that there is no effect or no difference. The alternative hypothesis, denoted H1, is the claim being made, that there is an effect or a difference.
For example:
H0: The mean click-through rate on version A = mean click-through rate on version B.
H1: The mean click-through rate on version A ≠ mean click-through rate on version B.
Choose a Significance Level: The significance level, denoted α, indicates how rare the observed results need to be under the null hypothesis to reject H0. Typical values for α are 0.01, 0.05 or 0.10.
For example, α = 0.05 means you will reject H0 if the test results would occur by chance with probability ≤ 0.05 (or 5%) under H0.
Calculate a Test Statistic: Use Python and the appropriate statistical test to calculate a test statistic and p-value based on your sample data. Common tests include t-tests and chi-square tests (see the previous figure).
For example, use SciPy's ttest_ind() function to run a two-sample t-test.
Make a Decision Using the p-Value: If the p-value is less than the significance level α, reject H0 in favor of H1. Otherwise, fail to reject H0.
For example, if α = 0.05 and p-value = 0.03, reject H0. But if p-value = 0.30, fail to reject H0.
Interpret Results: Finally, interpret what the results mean in context of the problem. Be careful not to definitively "accept" H0, only fail to reject it. Also assess if assumptions of the statistical test were met.
To better illustrate these five steps, let's solve some numerical examples using Python code.
Employing a two-sample T-test
import numpy as np
from scipy.stats import ttest_ind
# Generating the data
version_A = np.random.normal(loc=10, scale=2, size=1000)
version_B = np.random.normal(loc=12, scale=2, size=1000)
# Performing the t-test
t, p = ttest_ind(version_A, version_B)
# Printing the result
print(f"t = {t:.3f}")
print(f"p = {p:.3f}")
# Conclusion
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis. There is a significant difference in the means.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in the means.")
t = -22.871
p = 0.000
Reject the null hypothesis. There is a significant difference in the means.
The p-value of 0.000 (i.e., below the printed precision) indicates that the observed difference between the two versions is very unlikely to be due to chance: the difference between the means is statistically significant.
The negative t value indicates that the mean of version A is smaller than the mean of version B, and its large absolute value indicates that the difference between the sample means is large relative to their variability.
In summary, we can conclude that version B is statistically superior to version A based on the results of the A/B test. However, it is important to remember that the result of the A/B test is just a tool to aid decision-making and that other considerations, such as context and target audience, should also be taken into account.
import matplotlib.pyplot as plt
# Calculating the means
mean_A = np.mean(version_A)
mean_B = np.mean(version_B)
# Plotting the data
plt.hist(version_A, alpha=0.5, label='Version A')
plt.hist(version_B, alpha=0.5, label='Version B')
plt.axvline(mean_A, color='r', linestyle='dashed', linewidth=1)
plt.axvline(mean_B, color='b', linestyle='dashed', linewidth=1)
plt.legend(loc='upper right')
plt.show()
The result is a chart that clearly shows where the means of each version lie, allowing for a more precise visualization of the differences between them.
Comparing the two-sample t-test and Welch's t-test on real data
The next Python code employs a dataset and visualization concepts extracted from [6, 7, 8, 9].
It employs a dataset that contains four columns: Impression, Click, Purchase, Earning, and Group. Each row in the dataset represents a specific observation, and the Group column indicates whether the observation belongs to the control or test group. The columns are described as follows:
Impression: The number of times an advertisement was displayed to users.
Click: The number of times users clicked on the advertisement.
Purchase: The number of purchases made by users after clicking on the advertisement.
Earning: The total revenue generated from the purchases.
Group: The group to which the observation belongs, either "control" or "test".
This data can be used for various analyses, such as evaluating the effectiveness of advertising campaigns, understanding user behavior, and calculating conversion rates and return on investment (ROI).
Let's start to read the data.
import pandas as pd
url='https://docs.google.com/spreadsheets/d/1dMyzwYvSGQVwFETaooYvfdwAJOOPyoy3/export?format=xlsx'
df = pd.read_excel(url)
df
Let's add a Group column to the dataframe, labeling the first half of the rows as the control group and the second half as the test group.
# Create labels
n = len(df)
labels = ['control'] * (n // 2) + ['test'] * (n - n // 2)
# Add labels to DataFrame
df['Group'] = labels
df
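Since the dataset allows computing conversion metrics (as mentioned above), here is a minimal illustrative sketch of derived rates per group; the metric definitions below (CTR = Click/Impression, conversion rate = Purchase/Click) are our own assumptions about how this dataset would typically be used:
# Aggregate raw counts per group, then derive illustrative rate metrics
grouped = df.groupby("Group")[["Impression", "Click", "Purchase", "Earning"]].sum()
grouped["CTR"] = grouped["Click"] / grouped["Impression"]
grouped["ConversionRate"] = grouped["Purchase"] / grouped["Click"]
print(grouped[["CTR", "ConversionRate", "Earning"]])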
Data visualization
The next code creates visualizations for each column in the dataset using histograms with kernel density estimation (KDE) and color coding based on the "Group" column. The code creates a 2x2 grid of histograms with KDE curves for each column in the dataset (Impression, Click, Purchase, and Earning).
Each histogram is color-coded based on the "Group" column (control or test), allowing for a visual comparison between the two groups. The transparency (alpha = 0.5) is used to make overlapping bars distinguishable. The KDE curve provides a smoothed estimate of the data distribution. Finally, the use of a grid improves the readability of each subplot.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
w = 0
for i in range(2):
    for j in range(2):
        sns.histplot(data=df,
                     x=df.columns[w],
                     hue="Group",
                     alpha=0.5,
                     kde=True,
                     ax=axes[i, j],
                     palette="viridis")
        axes[i, j].grid(True)
        w += 1
plt.show()
It is also interesting to compute averages for the control and test groups.
# Analyze purchase averages for the control and test groups.
df.groupby("Group").agg({"Purchase": "mean"})
Checking normality and homogeneity of variance assumptions
Now, it is important to check the following assumptions for both groups:
Normality assumption
Homogeneity of Variance assumption
Let's first check the normality assumption: the purchase data of both the control and test groups should follow a normal distribution. The Shapiro-Wilk test checks whether the distribution of a variable is normal. The two hypotheses are:
H0: Normal distribution assumption is met.
H1: Normal distribution assumption is not met.
from scipy.stats import shapiro, levene, ttest_ind
test_stat, pvalue = shapiro(df.loc[df["Group"] == "control", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
test_stat, pvalue = shapiro(df.loc[df["Group"] == "test", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = 0.9430, p-value = 0.2726
Test Stat = 0.9758, p-value = 0.8697
Since the p-values for both the Control and Test groups are > 0.05 (alpha), H0 cannot be rejected: the normal distribution assumption is met.
Now, it is time to verify whether the variances of the purchase data are equal between the two groups (homogeneity of variance). For this purpose, the Levene test will be employed [10]. The hypotheses to be tested are:
H0: Variances are Homogeneous.
H1: Variances are not Homogeneous.
test_stat, pvalue = levene(df.loc[df["Group"] == "control", "Purchase"],
df.loc[df["Group"] == "test", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = 5.3628, p-value = 0.0261
Since the p-value is < 0.05 (alpha), H0 is rejected: the variances are not homogeneous.
Since both independent samples satisfy the normality assumption, one of the following parametric tests could be applied:
Two-sample Z-test.
Two-sample T-test.
Two-sample Welch's t-test.
Given that the variance homogeneity assumption was not met, Welch's t-test should be applied.
Applying Welch's two-sample t-test to the 'Purchase' variable
test_stat, pvalue = ttest_ind(df.loc[df["Group"] == "control", "Purchase"],
df.loc[df["Group"] == "test", "Purchase"],
equal_var=False)
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = -0.7060, p-value = 0.4854
Interpretation of Results based on p-value
Remember that:
If the p-value is < alpha: we reject the null hypothesis H0; there is a significant difference between the two versions.
If the p-value is > alpha: we cannot reject the null hypothesis H0; there is no significant difference between the two versions. The p-value obtained from the t-test was 0.4854, which is larger than alpha (0.05).
With p-value = 0.4854, we cannot reject the null hypothesis H0:
Null Hypothesis (H0): There is no statistically significant difference between the purchasing averages of the Control group (Maximum Bidding) and the Test group (Average Bidding).
So, there is no significant performance difference observed between these two groups.
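As a purely illustrative helper (the function name interpret_pvalue is our own), this decision rule can be written once and reused across all the tests in this article:
def interpret_pvalue(p_value, alpha=0.05):
    # Reject H0 only when the p-value falls below the significance level
    if p_value < alpha:
        return "Reject H0: statistically significant difference."
    return "Fail to reject H0: no statistically significant difference."

print(interpret_pvalue(0.4854))  # the Welch's t-test p-value above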
But what could be done if both assumptions do not hold? The next subsections will tackle this issue.
Applying the Mann-Whitney U test to numerical data
The Mann-Whitney U test is a non-parametric test used to determine whether there is a difference between two independent groups. It is particularly useful when:
The data do not follow a normal distribution.
The sample sizes are small.
The data are ordinal or the assumptions of the t-test (normality and homogeneity of variances) are violated.
Example Situations for Mann-Whitney U Test:
Comparing the conversion rates between two configurations of a website where the data are not normally distributed.
Comparing user satisfaction ratings (ordinal data) between two versions of an app.
The next example assumes non-normally distributed earnings data in two groups (Control and Test):
import pandas as pd
import scipy.stats as stats
# Sample data
data = {
    'Earning': [2311.27, 1742.81, 1797.83, 1696.23, 1543.72, 2081.85, 1815.01, 1965.10, 1651.66, 2456.30],
    'Label': ['control', 'control', 'control', 'control', 'control', 'test', 'test', 'test', 'test', 'test']
}
# Create DataFrame
df = pd.DataFrame(data)
# Split data into Control and Test groups
control_group = df[df['Label'] == 'control']['Earning']
test_group = df[df['Label'] == 'test']['Earning']
# Perform Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(control_group, test_group)
# Print results
print(f"U-statistic: {u_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in distributions.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in distributions.")
U-statistic: 7.0000
P-value: 0.3095
Fail to reject the null hypothesis. There is no significant difference in distributions.
Applying the Chi-square test to categorical data
The Chi-square test is used to test the association between two categorical variables. It is particularly useful when:
You have categorical data.
You want to see if there is a significant association between two categories.
Example Situations for Chi-Square Test:
Testing whether the conversion rate (converted vs. not converted) is independent of the version of the website (A vs. B).
Comparing the distribution of user demographics (e.g., gender) between two different user groups.
Assume we have categorical data for conversions in two groups (Control and Test):
import pandas as pd
import scipy.stats as stats
# Sample data
data = {
    'Converted': [50, 70, 45, 65, 55, 80, 85, 75, 60, 90],
    'Not_Converted': [950, 930, 955, 935, 945, 920, 915, 925, 940, 910],
    'Label': ['control', 'control', 'control', 'control', 'control', 'test', 'test', 'test', 'test', 'test']
}
# Create DataFrame
df = pd.DataFrame(data)
# Build the contingency table by summing the observed counts within each group
contingency_table = df.groupby('Label')[['Converted', 'Not_Converted']].sum()
# Perform Chi-square test
chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)
# Print results
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between the groups and conversion rates.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between the groups and conversion rates.")
Chi-squared: 17.1836
P-value: 0.0000
Reject the null hypothesis. There is a significant association between the groups and conversion rates.
Summing the counts within each group, the test group accumulates 390 conversions versus 285 for the control group (out of 5000 observations each), which explains the significant association.
Using Chi-square, contingency tables and the PICOT method
The PICOT method, although originally used in clinical research, can be adapted to structure A/B testing by formulating precise and focused questions. In this context, PICOT stands for Population (users being tested), Intervention (new feature or change being tested), Comparison (current version or control group), Outcome (desired user behavior or metric), and Time (duration of the test). For example, an A/B test might use the PICOT framework to investigate the effect of a new website layout on user engagement over one month compared to the current layout. The PICOT method ensures that all critical elements of the A/B test are addressed, promoting clarity and focus in test design and analysis. This method could be employed to create contingency tables.
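As a purely illustrative sketch (all keys and descriptions below are our own assumptions, not taken from a real study), the PICOT elements of the layout test mentioned above could be recorded explicitly before the experiment starts:
# Hypothetical PICOT description of the layout A/B test mentioned above
picot = {
    "Population": "visitors to the website during the test window",
    "Intervention": "new website layout (test group)",
    "Comparison": "current website layout (control group)",
    "Outcome": "user engagement, e.g., click-through rate",
    "Time": "one month",
}
for element, description in picot.items():
    print(f"{element}: {description}")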
Contingency tables, also known as cross-tabulations, organize the frequency distribution of two or more categorical variables in a matrix format, making them ideal for A/B testing analysis. In this context, rows might represent different outcomes or user actions, while columns represent the control and test groups. The cells contain the frequency counts of these combinations. These tables are essential for calculating the Chi-square statistic and provide a clear and efficient way to summarize and analyze the data from A/B tests. Contingency tables help visualize the relationship between group assignments and outcomes, facilitating the comparison of user behaviors between the two groups.
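As a minimal sketch built from hypothetical user-level records (the group and outcome values below are invented for illustration), pandas.crosstab produces exactly this kind of table from raw observations:
import pandas as pd

# Hypothetical user-level records: one row per user
records = pd.DataFrame({
    "Group": ["control", "control", "control", "test", "test", "test"],
    "Converted": ["yes", "no", "no", "yes", "yes", "no"],
})
# Rows = group assignment, columns = outcome, cells = frequency counts
print(pd.crosstab(records["Group"], records["Converted"]))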
It is important to remember that the Chi-square test is a statistical method used to assess whether there is a significant association between two categorical variables, often employed in A/B testing to compare user behaviors or outcomes between two groups. It compares the observed frequencies in the data with the expected frequencies if the groups were independent of each other. The Chi-square statistic is calculated by summing the squared differences between observed and expected frequencies, divided by the expected frequencies. This test helps determine whether differences between the control and test groups are statistically significant.
The next Python code summarizes the three previous concepts: the PICOT method, contingency tables, and the Chi-square test.
import numpy as np
import scipy.stats as stats
# Data
visitors_A = 10000
conversions_A = 500
visitors_B = 10000
conversions_B = 550
# Conversion rates
conversion_rate_A = conversions_A / visitors_A
conversion_rate_B = conversions_B / visitors_B
# Print conversion rates
print(f"Conversion rate A: {conversion_rate_A:.2%}")
print(f"Conversion rate B: {conversion_rate_B:.2%}")
# Contingency table
contingency_table = np.array([
    [conversions_A, visitors_A - conversions_A],
    [conversions_B, visitors_B - conversions_B]
])
# Chi-squared test
chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)
# Print results
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in conversion rates.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in conversion rates.")
Conversion rate A: 5.00%
Conversion rate B: 5.50%
Chi-squared: 2.4134
P-value: 0.1203
Fail to reject the null hypothesis. There is no significant difference in conversion rates.
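To connect this result to the formula described earlier, the statistic can be recomputed by hand; this sketch reuses contingency_table and np from the code above, and includes Yates' continuity correction, which chi2_contingency applies by default on 2x2 tables:
# Expected frequencies under independence, from the table's margins
observed = contingency_table
expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0, keepdims=True) / observed.sum()
# Chi-square statistic with Yates' continuity correction (2x2 case)
chi2_manual = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
print(f"Manual chi-squared: {chi2_manual:.4f}")  # matches the scipy value above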
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1-Me-CyYWfKBGyc6tMgv61zpAvnz6N3eu?usp=sharing
References
[1] https://medium.com/@panData/learn-how-to-perform-a-b-tests-in-python-6e3cdc00f6a9
[2] https://python.plainenglish.io/a-b-testing-comparison-of-bidding-methods-8e87cbcb2762
[3] https://medium.com/@david.joy1588/a-b-testing-clearing-perspective-bb28a9a4d5c7
[4] https://cxl.com/blog/ab-testing-guide/amp/
[6] https://python.plainenglish.io/a-b-testing-comparison-of-bidding-methods-8e87cbcb2762
[7] https://github.com/sadicesur/A-B-Testing/blob/main/AB-TESTING.ipynb
[9] https://medium.com/ogi-on-ds/the-logic-behind-a-b-testing-with-sample-python-code-a4ed76e99f93
Additional References
Practical example to obtain data from landing pages
https://lukeclarke12.medium.com/a-practical-guide-to-ab-testing-in-python-1629a56d854e
Practical manual about creating experiments with web pages
https://cxl.com/blog/ab-testing-guide/amp/
Additional material with more detailed mathematical treatment of A/B tests
Chi-square
https://medium.com/analytics-vidhya/a-b-testing-and-how-to-implement-it-in-python-70b21697efcc
Several distributions including the Chi-Square test and conversion rate
PICOT, contingency tables and Chi-square testing
https://medium.com/@david.joy1588/a-b-testing-clearing-perspective-bb28a9a4d5c7
https://en.wikipedia.org/wiki/Chi-squared_test
https://en.wikipedia.org/wiki/Contingency_table
http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm
https://www.graphpad.com/quickcalcs/contingency1/
https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15