1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test

1. Concepts & Definitions

2. Problem & Solution

2.1. Using Wilcoxon Sign Test to compare clustering methods

2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods

2.3. What is A/B testing and how to combine with hypothesis testing?

2.4. Using Chi-Square fit to check if Benford-Law holds or not

2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not

2.6. Discount vs. No Discount: non-parametric tests

Wilcoxon Sign-Rank Test

Before turning to the Wilcoxon Signed-Rank Test, recall the closely related Mann-Whitney U Test, also referred to as the Wilcoxon Rank Sum Test, which is a non-parametric statistical method used to compare two independent samples or groups.

This test evaluates whether the two sampled groups are likely to come from the same population, essentially questioning if these two populations have the same data distribution. In other words, it seeks evidence to determine whether the groups originate from populations with different levels of a variable of interest. Consequently, the hypotheses in a Mann-Whitney U Test are as follows [1]:

The null hypothesis (H0) posits that the two populations are equal.
The alternative hypothesis (H1) suggests that the two populations are not equal.

Some researchers view this as a comparison of the medians between the two populations, while parametric tests compare the means between two independent groups. In specific cases, where the data have similar shapes (as per the assumptions), this interpretation is valid. However, it's important to note that medians are not directly involved in the calculation of the Mann-Whitney U test statistic. Two groups could have the same median and still show significant differences according to the Mann-Whitney U test.

Wilcoxon Sign-Rank Test assumptions

The Wilcoxon Signed-Rank Test is used to compare two related samples, matched samples, or repeated measurements on a single sample to assess whether their population mean ranks differ.

It assumes that the differences between paired observations are symmetrically distributed around the median.

It is used for paired data, such as pre-test and post-test scores for the same subjects.

The difference between Mann-Whitney U Test and Wilcoxon Signed-Rank Test is:

The Mann-Whitney U Test is used for comparing two independent samples to determine if they come from the same population.
The Wilcoxon Signed-Rank Test is used for comparing two related samples to determine if their population mean ranks differ.

In essence, the Mann-Whitney U Test is for independent groups, while the Wilcoxon Signed-Rank Test is for related or paired groups.
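The distinction above can be sketched with scipy.stats, which provides mannwhitneyu for independent groups and wilcoxon for paired groups. All numbers below are hypothetical values invented for illustration; they do not come from the examples in this section.

```python
from scipy.stats import mannwhitneyu, wilcoxon

# Hypothetical data, invented purely for illustration.
# Independent samples: two unrelated groups of subjects.
group_a = [12.1, 14.3, 11.8, 15.2, 13.7, 16.0, 12.9]
group_b = [10.4, 11.1, 13.0, 9.8, 12.2, 10.9, 11.5]
u_stat, p_independent = mannwhitneyu(group_a, group_b, alternative="two-sided")

# Paired samples: the same seven subjects measured twice.
before = [72, 65, 80, 58, 77, 69, 84]
after = [75, 71, 79, 66, 81, 79, 89]
w_stat, p_paired = wilcoxon(before, after)  # two-sided by default

print(f"Mann-Whitney U = {u_stat}, p-value = {p_independent}")
print(f"Wilcoxon W = {w_stat}, p-value = {p_paired}")
```

The choice between the two functions depends only on the study design: if each value in one sample has a natural partner in the other (same subject, same dataset), the paired test applies.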

Wilcoxon Sign-Rank Test detailed numerical example for integer values

A teacher taught a new topic in the class and decided to take a surprise test on the next day [1]. The marks out of 10 scored by 6 students were marked as Test 1. Now, the teacher decided to take the test again after a week of self-practice. The scores were marked as Test 2.

Assume that the following data, compiled in the next table, do not satisfy the normality assumption.

Student   Test 1   Test 2   Difference (Test 2 - Test 1)
1         8        6        -2
2         6        8         2
3         4        8         4
4         2        9         7
5         5        4        -1
6         6        10        4

In the table above, there are some cases where the students scored less than they scored before and in some cases, the improvement is relatively high (Student 4). This could be due to random effects. We will analyse if the difference is systematic or due to chance using this test.

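The next step is to assign ranks to the absolute values of the differences. Note that this can only be done after arranging the absolute differences in ascending order; tied values receive the average of the ranks they occupy. The Wilcoxon sign-rank test needs signed ranks, that is, each rank carries the sign of its difference, as shown below.

Student   Difference   |Difference|   Rank   Signed rank
1         -2           2              2.5    -2.5
2          2           2              2.5     2.5
3          4           4              4.5     4.5
4          7           7              6.0     6.0
5         -1           1              1.0    -1.0
6          4           4              4.5     4.5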
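The ranking step can be sketched in a few lines. This is a minimal illustration using scipy.stats.rankdata, whose default method assigns average ranks to ties; the differences are the six Test 2 - Test 1 values of this example.

```python
import numpy as np
from scipy.stats import rankdata

# Differences (Test 2 - Test 1) for the six students
differences = np.array([-2, 2, 4, 7, -1, 4])

# Rank the absolute differences; tied values receive the average rank
ranks = rankdata(np.abs(differences))

# Attach the sign of each difference to its rank
signed_ranks = np.sign(differences) * ranks

print(ranks)         # [2.5 2.5 4.5 6.  1.  4.5]
print(signed_ranks)  # [-2.5  2.5  4.5  6.  -1.   4.5]
```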

Now, what is the hypothesis here?

The null hypothesis (H0) posits that the median difference is zero.
The alternative hypothesis (H1) suggests that the difference is positive.
Since n = n1 = n2 = 6 and the significance level is α = 0.05, the critical value for this one-tailed test is 2 (obtained from a table of critical values or from Python code).

2. The test statistic W for this test is the smaller of W1 and W2, defined below:

W1: Sum of positive ranks

W2: Sum of negative ranks (in absolute value)

3. Using these definitions, compute W1, W2, and W:

W1 = 2.5 + 4.5 + 6 + 4.5 = 17.5

W2 = 2.5 + 1 = 3.5

W = min(W1, W2) = 3.5

Here, if W1 is similar to W2, then we do not reject the null hypothesis. Otherwise, if the differences reflect a systematic improvement in the marks scored by the students, we reject the null hypothesis. The critical value of W can be looked up in a table. The criteria to reject or not reject the null hypothesis are:

Reject H0: W <= critical value

Do not reject H0: W > critical value

Here, W = 3.5 > critical value = 2; therefore, we do not reject the null hypothesis and conclude that there is no significant difference between the marks of the two tests.
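The critical value of 2 used above can also be obtained without a table. The sketch below is one possible approach, not the method of the cited source: it enumerates all 2^n equally likely sign assignments under H0 (assuming no ties and no zero differences) and returns the largest w whose lower-tail probability does not exceed the chosen level. The function name wilcoxon_critical_value and its parameters are hypothetical names introduced here.

```python
from itertools import product

def wilcoxon_critical_value(n, alpha=0.05, two_sided=False):
    """Largest w with P(W <= w) <= alpha (or alpha / 2 if two_sided) under H0.

    Assumes ranks 1..n without ties. Returns None if no such w exists.
    """
    max_w = n * (n + 1) // 2
    counts = [0] * (max_w + 1)
    # Under H0, each rank is positive or negative with probability 1/2
    for signs in product((0, 1), repeat=n):
        w_plus = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        counts[w_plus] += 1

    tail = alpha / 2 if two_sided else alpha
    total = 2 ** n
    cumulative, critical = 0, None
    for w, count in enumerate(counts):
        cumulative += count
        if cumulative / total <= tail:
            critical = w
        else:
            break
    return critical

print(wilcoxon_critical_value(6, alpha=0.05))                  # 2 (one-tailed)
print(wilcoxon_critical_value(9, alpha=0.05, two_sided=True))  # 5 (two-tailed)
```

Full enumeration is only practical for small n, which is also the range where tabulated critical values are normally used.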

Wilcoxon Sign-Rank Test detailed numerical example for integer values - Python Code

The following Python code automates the manual calculations and also shows how to use the wilcoxon function from the scipy.stats library, which avoids the need for tabulated values in the Wilcoxon Sign-Rank Test.

import pandas as pd
import numpy as np
from scipy.stats import wilcoxon

# Provided data
data = {
    'Student': [1, 2, 3, 4, 5, 6],
    'Test 1': [8, 6, 4, 2, 5, 6],
    'Test 2': [6, 8, 8, 9, 4, 10]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Calculating the difference
df['Difference (Test2 - Test1)'] = df['Test 2'] - df['Test 1']

# Calculating the sign of the difference
df['sign(Difference)'] = np.sign(df['Difference (Test2 - Test1)'])

# Calculating the absolute value of the difference
df['|Difference|'] = df['Difference (Test2 - Test1)'].abs()

# Ranking the absolute values of the differences (ties get the average rank)
df['Rank'] = df['|Difference|'].rank()

# Calculating the product of the sign of the difference by the rank
df['Sign(Difference)*Rank'] = df['sign(Difference)'] * df['Rank']

# Displaying the DataFrame
print(df)

# Calculating W+ and W-
W_plus = df[df['Sign(Difference)*Rank'] > 0]['Rank'].sum()
W_minus = df[df['Sign(Difference)*Rank'] < 0]['Rank'].sum()

# Displaying W+ and W-
print(f"W+ = {W_plus}")
print(f"W- = {W_minus}")

# Calculating the Test Statistic (T)
T = min(abs(W_plus), abs(W_minus))

# Displaying T
print(f"Test Statistic (T) = {T}")

# Performing the Wilcoxon test (two-sided by default)
stat, p_value = wilcoxon(df['Test 1'], df['Test 2'])

# Defining the significance level
alpha = 0.05

# Displaying the p-value
print(f"p-value = {p_value}")

# Hypothesis test
if p_value < alpha:
    print("We reject the null hypothesis (H0). There is a significant difference between the tests.")
else:
    print("We do not reject the null hypothesis (H0). There is no significant difference between the tests.")

   Student  Test 1  Test 2  Difference (Test2 - Test1)  sign(Difference)  \
0        1       8       6                          -2                -1
1        2       6       8                           2                 1
2        3       4       8                           4                 1
3        4       2       9                           7                 1
4        5       5       4                          -1                -1
5        6       6      10                           4                 1

   |Difference|  Rank  Sign(Difference)*Rank
0             2   2.5                   -2.5
1             2   2.5                    2.5
2             4   4.5                    4.5
3             7   6.0                    6.0
4             1   1.0                   -1.0
5             4   4.5                    4.5

W+ = 17.5
W- = 3.5
Test Statistic (T) = 3.5
p-value = 0.15625
We do not reject the null hypothesis (H0). There is no significant difference between the tests.

The Python code with the data, graphic, and detailed computation to obtain Wilcoxon Sign-Rank Test is given at:

https://colab.research.google.com/drive/1nWrj_Rq8cCge1Kfi5Mq3Z8VqynLAmydA?usp=sharing
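The alternative hypothesis stated for this example is one-sided (the difference is positive), whereas calling wilcoxon with its default settings runs a two-sided test. The sketch below shows how the one-sided version could be requested through the alternative parameter of scipy.stats.wilcoxon. The samples are passed as (Test 2, Test 1) so that the differences are Test 2 - Test 1. No expected p-values are shown because, with tied ranks as in this data, the method scipy selects can vary between versions.

```python
from scipy.stats import wilcoxon

test_1 = [8, 6, 4, 2, 5, 6]
test_2 = [6, 8, 8, 9, 4, 10]

# Two-sided test: H1 is "the median difference is not zero"
res_two_sided = wilcoxon(test_2, test_1, alternative="two-sided")

# One-sided test: H1 is "the median of Test 2 - Test 1 is positive"
res_greater = wilcoxon(test_2, test_1, alternative="greater")

print(f"two-sided p-value = {res_two_sided.pvalue}")
print(f"one-sided p-value = {res_greater.pvalue}")
```

Because most of the observed differences are positive, the one-sided p-value is smaller than the two-sided one.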

Wilcoxon Sign-Rank Test detailed numerical example for decimal values

Suppose we want to choose between two machine learning models, Model A and Model B, according to their classification accuracy (%) on several test datasets (benchmark sets) [2]. The objective is to decide which one will be deployed in the production environment. First, we state our null and alternative hypotheses:

H0: There is no difference between the two models A and B.

H1: There is a difference between the two models A and B (the median change was non-zero).

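Here, we give the table of results (classification accuracy, %) for each model on each test dataset.

Datasets     Model A   Model B
Test set 1   99.82     98.62
Test set 2   89.04     86.61
Test set 3   82.04     81.25
Test set 4   79.00     76.07
Test set 5   75.96     74.79
Test set 6   72.43     96.46
Test set 7   78.50     76.57
Test set 8   79.36     73.29
Test set 9   73.43     90.66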

Now, we start the Wilcoxon Signed-Ranks Test process.

Step 1: Compute the differences between the two methods:

Diff = Model A - Model B.

Step 2: Compute the sign of the differences (negative -1 or positive 1): Sign(Diff).

Step 3: Compute the absolute value of the differences: |Diff|.

Step 4: Rank the absolute values of the differences. The lowest value gets rank 1, the second lowest gets rank 2, the third lowest gets rank 3, and so on. In case of ties, average ranks are assigned.

Step 5: Compute the signed ranks as Rank * Sign(Diff).

Step 6: Compute W+ as the sum of the positive ranks, W- as the sum of the negative ranks, and the Test Statistic (T) as the minimum of |W+| and |W-|.

|W+| = |28| = 28

|W-| = |-17| = 17

T = min(|W+|, |W-|) = min(28, 17) = 17

Step 7: Extract the Test Critical value Tcrit, for a significance level alpha = 0.05 and n=9, from Signed Ranks Table.

According to the table we get Tcrit=5.

Step 8: Compare the Test Statistic (T) with the Test Critical value (Tcrit). The criteria to reject or not reject the null hypothesis are:

Reject H0: T <= Tcrit

Do not reject H0: T > Tcrit

For the given data:

T=17 > Tcrit=5 : We fail to reject H0.

Step 9: Conclude with the test results:

We conclude that there is not sufficient evidence to suggest a difference between the two models in terms of classification accuracy.
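A quick way to catch arithmetic slips in Step 6 is to check that the two rank sums add up to n(n + 1)/2, the sum of the ranks 1 to n. This identity holds when no difference is exactly zero, which is the case in both examples of this section. The helper name rank_sums_consistent is a hypothetical name used only for this sketch.

```python
def rank_sums_consistent(w_plus, w_minus, n):
    """Check that |W+| + |W-| equals the sum of the ranks 1..n."""
    return abs(w_plus) + abs(w_minus) == n * (n + 1) / 2

# Integer-valued example: 17.5 + 3.5 = 21 = 6 * 7 / 2
print(rank_sums_consistent(17.5, 3.5, 6))  # True

# Decimal-valued example: 28 + 17 = 45 = 9 * 10 / 2
print(rank_sums_consistent(28, -17, 9))    # True
```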

Wilcoxon Sign-Rank Test detailed numerical example for decimal values - Python code

The following Python code automates the manual calculations and also shows how to use the wilcoxon function from the scipy.stats library, which avoids the need for tabulated values in the Wilcoxon Sign-Rank Test. First, let's see the code for the manual computations.

import numpy as np
import pandas as pd
from scipy.stats import wilcoxon

# Data
datasets = ['Test set 1', 'Test set 2', 'Test set 3', 'Test set 4', 'Test set 5',
            'Test set 6', 'Test set 7', 'Test set 8', 'Test set 9']
model_a = [99.82, 89.04, 82.04, 79.00, 75.96, 72.43, 78.50, 79.36, 73.43]
model_b = [98.62, 86.61, 81.25, 76.07, 74.79, 96.46, 76.57, 73.29, 90.66]

# Create DataFrame
df = pd.DataFrame({
    'Datasets': datasets,
    'Model A': model_a,
    'Model B': model_b
})

# Compute Diff, Sign(Diff), |Diff|
df['Diff'] = df['Model A'] - df['Model B']
df['Sign(Diff)'] = np.sign(df['Diff'])
df['|Diff|'] = df['Diff'].abs()

# Rank the absolute differences (ties get the average rank)
df['Rank'] = df['|Diff|'].rank()
df['Sign(Diff)*Rank'] = df['Sign(Diff)'] * df['Rank']

# Compute W+, W-, and Test Statistic (T)
w_plus = df[df['Sign(Diff)'] > 0]['Rank'].sum()
w_minus = df[df['Sign(Diff)'] < 0]['Rank'].sum()
test_statistic = min(w_plus, abs(w_minus))

# Display the DataFrame
print(df)

# Display W+, W-, and Test Statistic (T)
print(f"W+ = {w_plus}")
print(f"W- = {w_minus}")
print(f"Test Statistic (T) = {test_statistic}")

# Number of pairs and significance level
n = len(datasets)
alpha = 0.05

# For n = 9 and alpha = 0.05 (two-tailed), the critical value is Tcrit = 5
# (taken from a standard Wilcoxon signed-rank test table, not computed here)
t_crit = 5

# Compare the Test Statistic (T) with the Test Critical value (Tcrit)
print(f"Test Critical value (Tcrit) = {t_crit}")
if test_statistic <= t_crit:
    print("Reject the null hypothesis: There is a significant difference between the two models.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two models.")

     Datasets  Model A  Model B   Diff  Sign(Diff)  |Diff|  Rank  \
0  Test set 1    99.82    98.62   1.20         1.0    1.20   3.0
1  Test set 2    89.04    86.61   2.43         1.0    2.43   5.0
2  Test set 3    82.04    81.25   0.79         1.0    0.79   1.0
3  Test set 4    79.00    76.07   2.93         1.0    2.93   6.0
4  Test set 5    75.96    74.79   1.17         1.0    1.17   2.0
5  Test set 6    72.43    96.46 -24.03        -1.0   24.03   9.0
6  Test set 7    78.50    76.57   1.93         1.0    1.93   4.0
7  Test set 8    79.36    73.29   6.07         1.0    6.07   7.0
8  Test set 9    73.43    90.66 -17.23        -1.0   17.23   8.0

   Sign(Diff)*Rank
0              3.0
1              5.0
2              1.0
3              6.0
4              2.0
5             -9.0
6              4.0
7              7.0
8             -8.0

W+ = 28.0
W- = 17.0
Test Statistic (T) = 17.0
Test Critical value (Tcrit) = 5
Fail to reject the null hypothesis: There is no significant difference between the two models.

Now, let's see how to use the wilcoxon function from the scipy.stats library to avoid the use of tabulated values.

# Performing the Wilcoxon test (two-sided by default)
stat, p_value = wilcoxon(df['Model A'], df['Model B'])

# Defining the significance level
alpha = 0.05

# Displaying the p-value
print(f"p-value = {p_value}")

# Hypothesis test
if p_value < alpha:
    print("We reject the null hypothesis (H0). There is a significant difference between the tests.")
else:
    print("We do not reject the null hypothesis (H0). There is no significant difference between the tests.")

p-value = 0.5703125

We do not reject the null hypothesis (H0). There is no significant difference between the tests.

The Python code with the data, graphic, and detailed computation to obtain Wilcoxon Sign-Rank Test is given at:

https://colab.research.google.com/drive/1nWrj_Rq8cCge1Kfi5Mq3Z8VqynLAmydA?usp=sharing

References:

[1] https://www.analyticsvidhya.com/blog/2017/11/a-guide-to-conduct-analysis-using-non-parametric-tests/

[2] https://rachidbenouini.medium.com/non-parametric-hypothesis-testing-for-comparing-machine-learning-algorithms-65b2c783cfbe