1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One sample z-test and their relation with two-sample z-test
1.5. One sample t-test and their relation with two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using Chi-Square fit to check if Benford-Law holds or not
2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not
What is A/B testing?
A/B testing, also known as split testing, is a widely used experimentation technique in digital marketing and data analysis. It helps determine which version of an element, such as a landing page or an ad, is more effective in achieving specific goals, like increasing conversions or website dwell time.
In data science, A/B testing is a fundamental method for comparing two versions of a product, method, or service to identify which one performs better. It is commonly applied to enhance website design, app design, advertising, pricing, and other aspects of user experience.
To conduct an A/B test, you create two versions of an element and randomly show them to users. User interactions with each version are then monitored and compared to determine which version is more effective [1].
The basic concept of A/B testing involves randomly dividing a sample of users into two groups: a control group and a test group, each seeing different versions. The results from these groups are then compared to determine which version leads to better outcomes [2].
Because it is based on direct observation of user behavior, this method has been widely adopted by companies such as Netflix, Facebook, and Google. Google famously ran an A/B test across 41 different shades of blue for hyperlinks to determine which shade increased their click-through (conversion) rate. Netflix heavily relies on A/B testing to improve its user experience and conversion rate [3].
In fact, A/B experiments can be considered a form of split testing, featuring a hypothesis, a control group, a variation, and statistically calculated results. For example, in a simple A/B test, traffic is evenly split between the original version (control) and the new version (variation), with each receiving 50% of the users. The goal is to compare the performance of the two versions to determine which one is more effective [4]. The next figure illustrates this aspect:
What is the relation between A/B testing and test of hypothesis?
Understanding how to develop and test a hypothesis is crucial in A/B testing. This process starts with forming a clear hypothesis, which is then divided into a null hypothesis and an alternative hypothesis. From this hypothesis, we design an experiment, which becomes our A/B test, to validate or test our hypothesis.
A/B testing could employ the following tests [3]:
Parametric Tests
Non-Parametric Tests
Resampling Tests
The next figure helps summarize the possible families of hypothesis tests:
Since an A/B test consists in extracting two independent samples from the same population, these are the tests that could be applied, depending on the population's distribution (a helper automating this choice is sketched after this list):
Normal or Student's t distribution: apply a parametric test, e.g., a two-sample t-test or z-test.
Non-normal distribution: apply a non-parametric test, e.g., Chi-Square or Mann-Whitney.
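As a minimal sketch of this decision (the helper name compare_groups and the synthetic samples are our own illustrative assumptions; a recent SciPy is assumed so that shapiro returns a result with a .pvalue attribute), the choice could be automated like this:
import numpy as np
from scipy.stats import shapiro, ttest_ind, mannwhitneyu

def compare_groups(a, b, alpha=0.05):
    # Shapiro-Wilk: p > alpha means we cannot reject normality
    normal_a = shapiro(a).pvalue > alpha
    normal_b = shapiro(b).pvalue > alpha
    if normal_a and normal_b:
        # Both samples look approximately normal: parametric two-sample t-test
        stat, p = ttest_ind(a, b)
        return 't-test', stat, p
    # Otherwise fall back to the non-parametric Mann-Whitney U test
    stat, p = mannwhitneyu(a, b)
    return 'mann-whitney', stat, p

# Illustrative usage with synthetic samples
a = np.random.normal(loc=10, scale=2, size=200)
b = np.random.normal(loc=11, scale=2, size=200)
print(compare_groups(a, b))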
For both groups of methods, it is important to remember how to perform a hypothesis test.
Remembering the hypothesis testing steps
Whichever group of methods is chosen from the possibilities shown above, it is necessary to follow these steps [5]:
Define the Null and Alternative Hypotheses: First, clearly define the null and alternative hypotheses. The null hypothesis, denoted H0, is the default position that there is no effect or no difference. The alternative hypothesis, denoted H1, is the claim being made, that there is an effect or a difference.
For example:
H0: The mean click-through rate on version A = mean click-through rate on version B.
H1: The mean click-through rate on version A ≠ mean click-through rate on version B.
Choose a Significance Level: The significance level, denoted α, indicates how rare the observed results need to be under the null hypothesis to reject H0. Typical values for α are 0.01, 0.05 or 0.10.
For example, α = 0.05 means you will reject H0 if the test results would occur by chance with probability ≤ 0.05 (or 5%) under H0.
Calculate a Test Statistic: Use Python and the appropriate statistical test to calculate a test statistic and p-value based on your sample data. Common tests include t-tests and chi-square tests (see the previous figure).
For example, use SciPy's ttest_ind() function to run a two-sample t-test.
Make a Decision Using the p-Value: If the p-value is less than the significance level α, reject H0 in favor of H1. Otherwise, fail to reject H0.
For example, if α = 0.05 and p-value = 0.03, reject H0. But if p-value = 0.30, fail to reject H0.
Interpret Results: Finally, interpret what the results mean in context of the problem. Be careful not to definitively "accept" H0, only fail to reject it. Also assess if assumptions of the statistical test were met.
To better illustrate these five steps, let's solve some numerical examples using Python code.
Employing a two-sample T-test
import numpy as np
from scipy.stats import ttest_ind
# Generating the data
version_A = np.random.normal(loc=10, scale=2, size=1000)
version_B = np.random.normal(loc=12, scale=2, size=1000)
# Performing the t-test
t, p = ttest_ind(version_A, version_B)
# Printing the result
print(f"t = {t:.3f}")
print(f"p = {p:.3f}")
# Conclusion
alpha = 0.05
if p < alpha:
    print("Reject the null hypothesis. There is a significant difference in the means.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in the means.")
t = -22.871
p = 0.000
Reject the null hypothesis. There is a significant difference in the means.
The p-value of 0.000 (i.e., below the printed precision) indicates that the observed difference between the two versions is very unlikely to be due to chance: the difference between the means is statistically significant.
The negative t value indicates that the mean of version A is smaller than the mean of version B, and its large absolute value indicates that the difference between the sample means is large relative to their variability.
In summary, we can conclude that version B is statistically superior to version A based on the results of the A/B test. However, it is important to remember that the result of the A/B test is just a tool to aid decision-making and that other considerations, such as context and target audience, should also be taken into account.
import matplotlib.pyplot as plt
# Calculating the means
mean_A = np.mean(version_A)
mean_B = np.mean(version_B)
# Plotting the data
plt.hist(version_A, alpha=0.5, label='Version A')
plt.hist(version_B, alpha=0.5, label='Version B')
plt.axvline(mean_A, color='r', linestyle='dashed', linewidth=1)
plt.axvline(mean_B, color='b', linestyle='dashed', linewidth=1)
plt.legend(loc='upper right')
plt.show()
The result is a chart that clearly shows where the means of each version lie, allowing for a more precise visualization of the differences between them.
Comparing the two-sample t-test and Welch's t-test on real data
The next Python code employs a dataset and visualization concepts extracted from [6, 7, 8, 9].
It employs a dataset that contains four columns: Impression, Click, Purchase, Earning, and Group. Each row in the dataset represents a specific observation, and the Group column indicates whether the observation belongs to the control or test group. The columns are described as follows:
Impression: The number of times an advertisement was displayed to users.
Click: The number of times users clicked on the advertisement.
Purchase: The number of purchases made by users after clicking on the advertisement.
Earning: The total revenue generated from the purchases.
Group: The group to which the observation belongs, either "control" or "test".
This data can be used for various analyses, such as evaluating the effectiveness of advertising campaigns, understanding user behavior, and calculating conversion rates and return on investment (ROI).
Let's start to read the data.
import pandas as pd
url='https://docs.google.com/spreadsheets/d/1dMyzwYvSGQVwFETaooYvfdwAJOOPyoy3/export?format=xlsx'
df = pd.read_excel(url)
df
Let's add a Group column to the dataframe, labeling the first half of the rows as the control group and the second half as the test group.
# Create labels
n = len(df)
labels = ['control'] * (n // 2) + ['test'] * (n - n // 2)
# Add labels to DataFrame
df['Group'] = labels
df
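Since the dataset allows computing conversion metrics (as mentioned above), here is a minimal illustrative sketch of derived rates per group; the metric definitions below (CTR = Click/Impression, conversion rate = Purchase/Click) are our own assumptions about how this dataset would typically be used:
# Aggregate raw counts per group, then derive illustrative rate metrics
grouped = df.groupby("Group")[["Impression", "Click", "Purchase", "Earning"]].sum()
grouped["CTR"] = grouped["Click"] / grouped["Impression"]
grouped["ConversionRate"] = grouped["Purchase"] / grouped["Click"]
print(grouped[["CTR", "ConversionRate", "Earning"]])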
Data visualization
The next code creates visualizations for each column in the dataset using histograms with kernel density estimation (KDE) and color coding based on the "Group" column. The code creates a 2x2 grid of histograms with KDE curves for each column in the dataset (Impression, Click, Purchase, and Earning).
Each histogram is color-coded based on the "Group" column (control or test), allowing for a visual comparison between the two groups. The transparency (alpha = 0.5) is used to make overlapping bars distinguishable. The KDE curve provides a smoothed estimate of the data distribution. Finally, the use of a grid improves the readability of each subplot.
import matplotlib.pyplot as plt
import seaborn as sns
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
w = 0
for i in range(2):
    for j in range(2):
        sns.histplot(data=df,
                     x=df.columns[w],
                     hue="Group",
                     alpha=0.5,
                     kde=True,
                     ax=axes[i, j],
                     palette="viridis")
        axes[i, j].grid(True)
        w += 1
plt.show()
It is also interesting to compute averages for the control and test groups.
# Analyze purchase averages for the control and test groups.
df.groupby("Group").agg({"Purchase": "mean"})
Checking normality and homogeneity of variance assumptions
Now, it is important to check the following assumptions for both groups:
Normality assumption
Homogeneity of Variance assumption
Let's first check the normality assumption: the purchase data of both the control and test groups should follow a normal distribution. The Shapiro-Wilk test checks whether the distribution of a variable is normal. The two hypotheses are:
H0: Normal distribution assumption is met.
H1: Normal distribution assumption is not met.
from scipy.stats import shapiro, levene, ttest_ind
test_stat, pvalue = shapiro(df.loc[df["Group"] == "control", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
test_stat, pvalue = shapiro(df.loc[df["Group"] == "test", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = 0.9430, p-value = 0.2726
Test Stat = 0.9758, p-value = 0.8697
Since the p-values for both the Control and Test groups are > 0.05 (alpha), H0 cannot be rejected: the normal distribution assumption is met.
Now, it is time to verify whether the variances of the purchase data are equal between the two groups (homogeneity of variance). For this purpose, the Levene test will be employed [10]. The hypotheses to be tested are:
H0: Variances are Homogeneous.
H1: Variances are not Homogeneous.
test_stat, pvalue = levene(df.loc[df["Group"] == "control", "Purchase"],
df.loc[df["Group"] == "test", "Purchase"])
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = 5.3628, p-value = 0.0261
Since the p-value is < 0.05 (alpha), H0 is rejected: the variances are not homogeneous.
Since both independent samples satisfy the normality assumption, one of the following parametric tests could be applied:
Two-sample Z-test.
Two-sample T-test.
Two-sample Welch's t-test.
Given that the variance homogeneity assumption was not met, Welch's t-test should be applied.
Applying Welch's two-sample t-test to the 'Purchase' variable
test_stat, pvalue = ttest_ind(df.loc[df["Group"] == "control", "Purchase"],
df.loc[df["Group"] == "test", "Purchase"],
equal_var=False)
print('Test Stat = %.4f, p-value = %.4f' % (test_stat, pvalue))
Test Stat = -0.7060, p-value = 0.4854
Interpretation of Results based on p-value
Remember that:
If the p-value is < alpha: we reject the null hypothesis H0; there is a significant difference between the two versions.
If the p-value is > alpha: we cannot reject the null hypothesis H0; there is no significant difference between the two versions. The p-value obtained from the t-test was 0.4854, which is larger than alpha (0.05).
With p-value = 0.4854, we cannot reject the null hypothesis H0:
Null Hypothesis (H0): There is no statistically significant difference between the purchasing averages of the Control group (Maximum Bidding) and the Test group (Average Bidding).
So, there is no significant performance difference observed between these two groups.
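As a purely illustrative helper (the function name interpret_pvalue is our own), this decision rule can be written once and reused across all the tests in this article:
def interpret_pvalue(p_value, alpha=0.05):
    # Reject H0 only when the p-value falls below the significance level
    if p_value < alpha:
        return "Reject H0: statistically significant difference."
    return "Fail to reject H0: no statistically significant difference."

print(interpret_pvalue(0.4854))  # the Welch's t-test p-value above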
But what could be done if both assumptions do not hold? The next subsections will tackle this issue.
Applying the Mann-Whitney U test to numerical data
The Mann-Whitney U test is a non-parametric test used to determine whether there is a difference between two independent groups. It is particularly useful when:
The data do not follow a normal distribution.
The sample sizes are small.
The data are ordinal or the assumptions of the t-test (normality and homogeneity of variances) are violated.
Example Situations for Mann-Whitney U Test:
Comparing the conversion rates between two configurations of a website where the data are not normally distributed.
Comparing user satisfaction ratings (ordinal data) between two versions of an app.
The next example assumes non-normally distributed earnings data in two groups (Control and Test):
import pandas as pd
import scipy.stats as stats
# Sample data
data = {
    'Earning': [2311.27, 1742.81, 1797.83, 1696.23, 1543.72, 2081.85, 1815.01, 1965.10, 1651.66, 2456.30],
    'Label': ['control', 'control', 'control', 'control', 'control', 'test', 'test', 'test', 'test', 'test']
}
# Create DataFrame
df = pd.DataFrame(data)
# Split data into Control and Test groups
control_group = df[df['Label'] == 'control']['Earning']
test_group = df[df['Label'] == 'test']['Earning']
# Perform Mann-Whitney U test
u_stat, p_value = stats.mannwhitneyu(control_group, test_group)
# Print results
print(f"U-statistic: {u_stat:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in distributions.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in distributions.")
U-statistic: 7.0000
P-value: 0.3095
Fail to reject the null hypothesis. There is no significant difference in distributions.
Applying the Chi-square test to categorical data
The Chi-square test is used to test the association between two categorical variables. It is particularly useful when:
You have categorical data.
You want to see if there is a significant association between two categories.
Example Situations for Chi-Square Test:
Testing whether the conversion rate (converted vs. not converted) is independent of the version of the website (A vs. B).
Comparing the distribution of user demographics (e.g., gender) between two different user groups.
Assume we have categorical data for conversions in two groups (Control and Test):
import pandas as pd
import scipy.stats as stats
# Sample data
data = {
    'Converted': [50, 70, 45, 65, 55, 80, 85, 75, 60, 90],
    'Not_Converted': [950, 930, 955, 935, 945, 920, 915, 925, 940, 910],
    'Label': ['control', 'control', 'control', 'control', 'control', 'test', 'test', 'test', 'test', 'test']
}
# Create DataFrame
df = pd.DataFrame(data)
# Build the contingency table by summing the observed counts within each group
contingency_table = df.groupby('Label')[['Converted', 'Not_Converted']].sum()
# Perform Chi-square test
chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)
# Print results
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant association between the groups and conversion rates.")
else:
    print("Fail to reject the null hypothesis. There is no significant association between the groups and conversion rates.")
Chi-squared: 17.1836
P-value: 0.0000
Reject the null hypothesis. There is a significant association between the groups and conversion rates.
Summing the counts within each group, the test group accumulates 390 conversions versus 285 for the control group (out of 5000 observations each), which explains the significant association.
Using Chi-square, contingency tables and the PICOT method
The PICOT method, although originally used in clinical research, can be adapted to structure A/B testing by formulating precise and focused questions. In this context, PICOT stands for Population (users being tested), Intervention (new feature or change being tested), Comparison (current version or control group), Outcome (desired user behavior or metric), and Time (duration of the test). For example, an A/B test might use the PICOT framework to investigate the effect of a new website layout on user engagement over one month compared to the current layout. The PICOT method ensures that all critical elements of the A/B test are addressed, promoting clarity and focus in test design and analysis. This method could be employed to create contingency tables.
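As a purely illustrative sketch (all keys and descriptions below are our own assumptions, not taken from a real study), the PICOT elements of the layout test mentioned above could be recorded explicitly before the experiment starts:
# Hypothetical PICOT description of the layout A/B test mentioned above
picot = {
    "Population": "visitors to the website during the test window",
    "Intervention": "new website layout (test group)",
    "Comparison": "current website layout (control group)",
    "Outcome": "user engagement, e.g., click-through rate",
    "Time": "one month",
}
for element, description in picot.items():
    print(f"{element}: {description}")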
Contingency tables, also known as cross-tabulations, organize the frequency distribution of two or more categorical variables in a matrix format, making them ideal for A/B testing analysis. In this context, rows might represent different outcomes or user actions, while columns represent the control and test groups. The cells contain the frequency counts of these combinations. These tables are essential for calculating the Chi-square statistic and provide a clear and efficient way to summarize and analyze the data from A/B tests. Contingency tables help visualize the relationship between group assignments and outcomes, facilitating the comparison of user behaviors between the two groups.
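As a minimal sketch built from hypothetical user-level records (the group and outcome values below are invented for illustration), pandas.crosstab produces exactly this kind of table from raw observations:
import pandas as pd

# Hypothetical user-level records: one row per user
records = pd.DataFrame({
    "Group": ["control", "control", "control", "test", "test", "test"],
    "Converted": ["yes", "no", "no", "yes", "yes", "no"],
})
# Rows = group assignment, columns = outcome, cells = frequency counts
print(pd.crosstab(records["Group"], records["Converted"]))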
It is important to remember that the Chi-square test is a statistical method used to assess whether there is a significant association between two categorical variables, often employed in A/B testing to compare user behaviors or outcomes between two groups. It compares the observed frequencies in the data with the expected frequencies if the groups were independent of each other. The Chi-square statistic is calculated by summing the squared differences between observed and expected frequencies, divided by the expected frequencies. This test helps determine whether differences between the control and test groups are statistically significant.
The next Python code summarizes the three previous concepts: the PICOT method, contingency tables, and the Chi-square test.
import numpy as np
import scipy.stats as stats
# Data
visitors_A = 10000
conversions_A = 500
visitors_B = 10000
conversions_B = 550
# Conversion rates
conversion_rate_A = conversions_A / visitors_A
conversion_rate_B = conversions_B / visitors_B
# Print conversion rates
print(f"Conversion rate A: {conversion_rate_A:.2%}")
print(f"Conversion rate B: {conversion_rate_B:.2%}")
# Contingency table
contingency_table = np.array([
    [conversions_A, visitors_A - conversions_A],
    [conversions_B, visitors_B - conversions_B]
])
# Chi-squared test
chi2, p_value, _, _ = stats.chi2_contingency(contingency_table)
# Print results
print(f"Chi-squared: {chi2:.4f}")
print(f"P-value: {p_value:.4f}")
# Conclusion
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. There is a significant difference in conversion rates.")
else:
    print("Fail to reject the null hypothesis. There is no significant difference in conversion rates.")
Conversion rate A: 5.00%
Conversion rate B: 5.50%
Chi-squared: 2.4134
P-value: 0.1203
Fail to reject the null hypothesis. There is no significant difference in conversion rates.
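To connect this result to the formula described earlier, the statistic can be recomputed by hand; this sketch reuses contingency_table and np from the code above, and includes Yates' continuity correction, which chi2_contingency applies by default on 2x2 tables:
# Expected frequencies under independence, from the table's margins
observed = contingency_table
expected = observed.sum(axis=1, keepdims=True) * observed.sum(axis=0, keepdims=True) / observed.sum()
# Chi-square statistic with Yates' continuity correction (2x2 case)
chi2_manual = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()
print(f"Manual chi-squared: {chi2_manual:.4f}")  # matches the scipy value above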
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1-Me-CyYWfKBGyc6tMgv61zpAvnz6N3eu?usp=sharing
References
[1] https://medium.com/@panData/learn-how-to-perform-a-b-tests-in-python-6e3cdc00f6a9
[2] https://python.plainenglish.io/a-b-testing-comparison-of-bidding-methods-8e87cbcb2762
[3] https://medium.com/@david.joy1588/a-b-testing-clearing-perspective-bb28a9a4d5c7
[4] https://cxl.com/blog/ab-testing-guide/amp/
[6] https://python.plainenglish.io/a-b-testing-comparison-of-bidding-methods-8e87cbcb2762
[7] https://github.com/sadicesur/A-B-Testing/blob/main/AB-TESTING.ipynb
[9] https://medium.com/ogi-on-ds/the-logic-behind-a-b-testing-with-sample-python-code-a4ed76e99f93
Additional References
Practical example to obtain data from landing pages
https://lukeclarke12.medium.com/a-practical-guide-to-ab-testing-in-python-1629a56d854e
Practical manual about creating experiments with web pages
https://cxl.com/blog/ab-testing-guide/amp/
Additional material with more detailed mathematical treatment of A/B tests
Chi-square
https://medium.com/analytics-vidhya/a-b-testing-and-how-to-implement-it-in-python-70b21697efcc
Several distributions including the Chi-Square test and conversion rate
PICOT, contingency tables and Chi-square testing
https://medium.com/@david.joy1588/a-b-testing-clearing-perspective-bb28a9a4d5c7
https://en.wikipedia.org/wiki/Chi-squared_test
https://en.wikipedia.org/wiki/Contingency_table
http://www.stat.yale.edu/Courses/1997-98/101/chisq.htm
https://www.graphpad.com/quickcalcs/contingency1/
https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15