1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test

1. Concepts & Definitions

2. Problem & Solution

2.1. Using Wilcoxon Sign Test to compare clustering methods

2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods

2.3. What is A/B testing and how to combine with hypothesis testing?

2.4. Using Chi-Square fit to check if Benford-Law holds or not

2.5. Using Kolmogorov-Smirnov fit to check if Pareto principle holds or not

2.6. Discount vs. No Discount: non-parametric tests

Wilcoxon Sign Test

The Mann-Whitney U test compares two independent groups on an ordinal or continuous dependent variable. The Wilcoxon Signed-Rank Test compares two related samples and takes into account both the magnitude and the direction of each difference. The Sign Test, in contrast, ignores the magnitude and considers only the direction of the difference.

Because the Sign Test does not consider magnitude, it does not use ranks. The hypotheses are the same as before:

The null hypothesis (H0) posits that the median difference is zero.
The alternative hypothesis (H1) suggests that the difference is positive.

Here, if we see a similar number of positive and negative differences, the data are consistent with the null hypothesis. If we see clearly more positive signs, we have evidence against the null hypothesis.

Wilcoxon Sign Test assumptions

1. Paired Data: The data should consist of paired observations. Each pair should come from the same subject or related subjects.

2. Ordinal Data: The differences between paired observations can be ordinal. Unlike the Wilcoxon Signed-Rank Test, it does not require the differences to be on a continuous scale.

3. Symmetry: The distribution of the differences does not need to be symmetric. This makes the Wilcoxon Sign Test less restrictive than the Signed-Rank Test in terms of distribution shape.

4. Independence: The pairs of observations should be independent of each other.

5. Non-zero Differences: The test assumes that the differences between pairs are not all zero. Similar to the Signed-Rank Test, a large number of zero differences can affect the test's validity.
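The assumptions above can be turned into a small helper. The following is a minimal sketch, not a reference implementation: the function name sign_test, its alternative parameter, and the before/after scores are all hypothetical and chosen only for illustration. It drops zero differences (assumption 5) and computes an exact binomial p-value with the standard library.

```python
from math import comb


def sign_test(before, after, alternative="two-sided"):
    """Exact sign test on paired samples (illustrative sketch).

    Zero differences are dropped because they carry no sign information.
    """
    diffs = [a - b for b, a in zip(before, after)]
    n_pos = sum(d > 0 for d in diffs)
    n_neg = sum(d < 0 for d in diffs)
    n = n_pos + n_neg
    if n == 0:
        raise ValueError("all differences are zero; the sign test is undefined")

    def lower_tail(k):
        # P(X <= k) for X ~ Binomial(n, 0.5)
        return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

    if alternative == "greater":
        # H1: median difference > 0, so few negative signs support H1
        p_value = lower_tail(n_neg)
    elif alternative == "two-sided":
        # H1: median difference != 0, so double the smaller tail
        p_value = min(1.0, 2 * lower_tail(min(n_pos, n_neg)))
    else:
        raise ValueError("alternative must be 'two-sided' or 'greater'")
    return n_pos, n_neg, p_value


# Hypothetical paired scores; the third pair has a zero difference.
before = [5, 7, 6, 8, 4, 9, 6, 5]
after = [6, 9, 6, 7, 6, 10, 8, 7]

print(sign_test(before, after))             # two-sided alternative
print(sign_test(before, after, "greater"))  # one-sided alternative
```

With these made-up scores there are 6 positive signs, 1 negative sign, and 1 dropped pair, so the binomial distribution uses n = 7 rather than 8.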

Comparison with Wilcoxon Sign-Rank Test

Data Type: The Wilcoxon Signed-Rank Test requires ordinal or continuous data, whereas the Wilcoxon Sign Test can work with purely ordinal data.

Symmetry Assumption: The Wilcoxon Signed-Rank Test assumes that the distribution of differences is symmetric, while the Wilcoxon Sign Test does not.

Usage: The Wilcoxon Signed-Rank Test is more powerful when the symmetry assumption holds, while the Wilcoxon Sign Test is useful when this assumption is not met.
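To make the comparison concrete, the sketch below computes both statistics on the same set of paired differences. The differences are hypothetical (many small gains and two large losses, with no ties and no zeros), and the exact null distribution of the Signed-Rank statistic is obtained by brute-force enumeration, which is only practical for small n. It is an illustration under these assumptions, not a replacement for a library routine.

```python
from itertools import product
from math import comb

# Hypothetical paired differences: seven small gains, two large losses.
diffs = [1.2, 2.4, 0.8, 2.9, 1.1, -24.0, 1.9, 6.1, -17.2]
n = len(diffs)

# Sign Test: only the direction of each difference is used.
n_pos = sum(d > 0 for d in diffs)
n_neg = sum(d < 0 for d in diffs)
k = min(n_pos, n_neg)
p_sign = min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Signed-Rank Test: rank |d| (no ties here), then sum the ranks by sign.
order = sorted(range(n), key=lambda i: abs(diffs[i]))
ranks = [0] * n
for r, i in enumerate(order, start=1):
    ranks[i] = r
w_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
w_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
w = min(w_pos, w_neg)

# Exact null distribution: under H0 each rank carries + or - with
# probability 0.5, so all 2**n sign assignments are equally likely.
total = n * (n + 1) // 2
count = 0
for signs in product((0, 1), repeat=n):
    s = sum(r for r, keep in zip(range(1, n + 1), signs) if keep)
    if min(s, total - s) <= w:
        count += 1
p_rank = count / 2 ** n

print(f"Sign Test:        #+ = {n_pos}, #- = {n_neg}, p = {p_sign}")
print(f"Signed-Rank Test: W+ = {w_pos}, W- = {w_neg}, p = {p_rank}")
```

The two large losses count as only 2 of 9 signs in the Sign Test, but they take the two highest ranks (8 and 9) in the Signed-Rank Test, so the two tests weigh the same data differently.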

Wilcoxon Sign Test detailed numerical example for integer values

A teacher taught a new topic in the class and decided to take a surprise test on the next day [1]. The marks out of 10 scored by 6 students were marked as Test 1. Now, the teacher decided to take the test again after a week of self-practice. The scores were marked as Test 2.

Assume that the data, compiled in the next table, violate the normality assumption.

Student   Test 1   Test 2   Difference (Test 2 - Test 1)
1         8        6        -2
2         6        8         2
3         4        8         4
4         2        9         7
5         5        4        -1
6         6        10        4

In the table above, there are some cases where the students scored less than they scored before and in some cases, the improvement is relatively high (Student 4). This could be due to random effects. We will analyse if the difference is systematic or due to chance using this test.

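The next step is to assign ranks to the absolute values of the differences, after arranging them in ascending order; tied values share the average of their ranks. The signed rank is the rank multiplied by the sign of the difference, as shown in the output of the Python code below. Note that the Sign Test itself uses only the signs; the signed ranks are what the Signed-Rank Test would use, and they are listed here for comparison.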
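The ranking step can be sketched in plain Python as follows. The scores are the ones used in this example; the helper name average_ranks is hypothetical. Tied absolute differences receive the average of the positions they occupy, which is also what the pandas rank() call in the code further below does by default.

```python
def average_ranks(values):
    """Rank values in ascending order; ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend the group while the next sorted value equals the current one.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Average of the 1-based positions start+1 .. end+1.
        avg = (start + 1 + end + 1) / 2
        for j in range(start, end + 1):
            ranks[order[j]] = avg
        start = end + 1
    return ranks


test_1 = [8, 6, 4, 2, 5, 6]
test_2 = [6, 8, 8, 9, 4, 10]

diffs = [b - a for a, b in zip(test_1, test_2)]
abs_diffs = [abs(d) for d in diffs]
ranks = average_ranks(abs_diffs)
signed_ranks = [r if d > 0 else -r for r, d in zip(ranks, diffs)]

print(diffs)         # [-2, 2, 4, 7, -1, 4]
print(ranks)         # [2.5, 2.5, 4.5, 6.0, 1.0, 4.5]
print(signed_ranks)  # [-2.5, 2.5, 4.5, 6.0, -1.0, 4.5]
```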

Now, what is the hypothesis here?

The null hypothesis (H0) posits that number of positive and negative signs are equal, i.e., the median difference is zero.
The alternative hypothesis (H1) suggests that the number of positive and negative signs are not equal.
We want the probability, under the null hypothesis, of observing #W or fewer differences of the less frequent sign, i.e., P(X ≤ #W), where X ~ B(n, 0.5), n = 6, the significance level is α = 0.05, and #W = min(#W1, #W2). This leads to the decision rule [3]:
1. Reject H0: P(X ≤ #W) ≤ α
2. Do not reject H0: P(X ≤ #W) > α

The test statistic W is the smaller of W1 and W2, defined below:

W1: Number of positive differences

W2: Number of negative differences

For the data above, their values are:

W1 = 4

W2 = 2

W = min(4, 2) = 2

Here, if W1 is similar to W2, we do not reject the null hypothesis. Otherwise, in this example, if the differences reflect a systematic improvement in the marks scored by the students, we reject it. The decision compares the binomial probability with the significance level α:

Reject H0: P(X ≤ #W) ≤ α

Do not reject H0: P(X ≤ #W) > α

Here, X ~ B(6, 0.5) and P(X ≤ 2) = 0.34375, using [4] to obtain P(X ≤ 2). Since 0.34375 > α = 0.05, we do not reject the null hypothesis and conclude that there is no significant difference between the marks of the two tests. For the two-sided alternative stated above, the p-value is 2 × 0.34375 = 0.6875, which leads to the same conclusion.
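As a cross-check on the calculator value, the probability P(X ≤ 2) can be worked out by hand from the binomial distribution with n = 6 and p = 0.5:

```latex
P(X \le 2) = \sum_{k=0}^{2} \binom{6}{k} \left(\tfrac{1}{2}\right)^{6}
           = \frac{1 + 6 + 15}{64}
           = \frac{22}{64}
           = 0.34375
```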

Wilcoxon Sign Test detailed numerical example for integer values - Python Code

The next Python code automates the manual calculations and uses binom.cdf from the scipy.stats library, instead of a binomial calculator, to obtain the probability needed for the Sign Test.

import numpy as np
import pandas as pd
from scipy.stats import binom

# Provided data
data = {
    'Student': [1, 2, 3, 4, 5, 6],
    'Test 1': [8, 6, 4, 2, 5, 6],
    'Test 2': [6, 8, 8, 9, 4, 10]
}

# Creating the DataFrame
df = pd.DataFrame(data)

# Calculating the difference
df['Difference (Test2 - Test1)'] = df['Test 2'] - df['Test 1']

# Calculating the sign of the difference
df['sign(Difference)'] = np.sign(df['Difference (Test2 - Test1)'])

# Calculating the absolute value of the difference
df['|Difference|'] = df['Difference (Test2 - Test1)'].abs()

# Ranking the absolute values of the differences (ties get the average rank);
# the ranks are shown for comparison only, the Sign Test does not use them
df['Rank'] = df['|Difference|'].rank()

# Calculating the product of the sign of the difference by the rank
df['Sign(Difference)*Rank'] = df['sign(Difference)'] * df['Rank']

# Displaying the DataFrame
print(df)

# Perform the sign test
# Count the number of positive and negative differences
positive_diff_count = (df['sign(Difference)'] > 0).sum()
negative_diff_count = (df['sign(Difference)'] < 0).sum()

# Number of non-zero differences and test statistic
n = positive_diff_count + negative_diff_count
test_statistic = min(positive_diff_count, negative_diff_count)

# One-tailed binomial probability P(X <= test_statistic), X ~ B(n, 0.5)
test_critical = binom.cdf(k=test_statistic, n=n, p=0.5)

# Display results
print(f"Number of positive differences: {positive_diff_count}")
print(f"Number of negative differences: {negative_diff_count}")
print(f"Test Statistic (minimum of positive and negative differences): {test_statistic}")
print(f"Test critical: {test_critical}")

# Interpretation of results
alpha = 0.05
if test_critical < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two models.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two models.")

   Student  Test 1  Test 2  Difference (Test2 - Test1)  sign(Difference)  \
0        1       8       6                          -2                -1
1        2       6       8                           2                 1
2        3       4       8                           4                 1
3        4       2       9                           7                 1
4        5       5       4                          -1                -1
5        6       6      10                           4                 1

   |Difference|  Rank  Sign(Difference)*Rank
0             2   2.5                   -2.5
1             2   2.5                    2.5
2             4   4.5                    4.5
3             7   6.0                    6.0
4             1   1.0                   -1.0
5             4   4.5                    4.5

Number of positive differences: 4
Number of negative differences: 2
Test Statistic (minimum of positive and negative differences): 2
Test critical: 0.34375
Fail to reject the null hypothesis: There is no significant difference between the two models.

The Python code with the data, graphic, and detailed computation to obtain Wilcoxon Sign Test is given at:

https://colab.research.google.com/drive/1vBSj1L9zD4g1quNOaC7zTO3WHOFfi4V-?usp=sharing

Wilcoxon Sign Test detailed numerical example for decimal values

Let’s consider that we want to choose between two machine learning models, Model A and Model B, according to their classification accuracy (%) on several test databases (benchmark sets) [2]. The objective is to choose which one is going to be deployed and used in the production environment. First, we state our null hypothesis and alternative hypothesis as:

H0: There is no difference between the two models A and B.

H1: There is a difference between the two models A and B (the median change was non-zero).

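Here, we give the table of results (classification accuracy, %) for each model on each test dataset.

Datasets     Model A   Model B
Test set 1   99.82     98.62
Test set 2   89.04     86.61
Test set 3   82.04     81.25
Test set 4   79.00     76.07
Test set 5   75.96     74.79
Test set 6   72.43     96.46
Test set 7   78.50     76.57
Test set 8   79.36     73.29
Test set 9   73.43     90.66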

Now, we start the Wilcoxon Sign Test process.

Step 1: Compute the differences between the two methods:

Diff = Model A - Model B.

Step 2: Compute the sign of the differences (negative -1 or positive 1): Sign(Diff).

Step 3: Compute #W+ as the number of positive differences, #W- as the number of negative differences, and the Test Statistic (T) as the minimum of #W+ and #W-.

#W+ = 7

#W- = 2

T = min(#W+, #W-) = min(7, 2) = 2

Step 4: Compute the binomial probability P(X ≤ T), where X ~ B(n, 0.5), for n = 9 and T = 2, using a Binomial calculator [4]. This probability is the value called Test Critical (Tcrit) here and in the code.

According to the calculator we get Tcrit = P(X ≤ 2) = 0.08984.

Step 5: Compare Tcrit with the significance level alpha = 0.05. Tcrit is a probability, so it is compared with alpha, not with the count T. The criteria to reject or not reject the null hypothesis are:

Reject H0: Tcrit <= alpha

Do not reject H0: Tcrit > alpha

For the given data:

Tcrit = 0.08984 > alpha = 0.05 : We do not reject H0.

Step 6: Conclude with the test results:

We can conclude that there is not sufficient evidence to suggest that there is a difference between the two methods in terms of classification accuracy.
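The calculator value used in Step 4 can be reproduced by hand from the binomial distribution with n = 9 and p = 0.5. Doubling it gives the two-sided p-value, which is the quantity reported as P-value in the alpha-level version of the code further below:

```latex
P(X \le 2) = \sum_{k=0}^{2} \binom{9}{k} \left(\tfrac{1}{2}\right)^{9}
           = \frac{1 + 9 + 36}{512}
           = \frac{46}{512}
           \approx 0.08984
\\[4pt]
2 \, P(X \le 2) = \frac{92}{512} \approx 0.17969
```

Both values exceed 0.05, so the one-tailed and the two-sided versions lead to the same decision for these data.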

Wilcoxon Sign Test detailed numerical example for decimal values - Python code using critical value

The next Python code automates the manual calculations and uses binom.cdf from the scipy.stats library, instead of a binomial calculator, to obtain the probability needed for the Sign Test.

import numpy as np
import pandas as pd
from scipy.stats import binom

# Data
datasets = ['Test set 1', 'Test set 2', 'Test set 3', 'Test set 4', 'Test set 5',
            'Test set 6', 'Test set 7', 'Test set 8', 'Test set 9']
model_a = [99.82, 89.04, 82.04, 79.00, 75.96, 72.43, 78.50, 79.36, 73.43]
model_b = [98.62, 86.61, 81.25, 76.07, 74.79, 96.46, 76.57, 73.29, 90.66]

# Create DataFrame
df = pd.DataFrame({
    'Datasets': datasets,
    'Model A': model_a,
    'Model B': model_b
})

# Compute Diff, Sign(Diff), |Diff|
df['Diff'] = df['Model A'] - df['Model B']
df['Sign(Diff)'] = np.sign(df['Diff'])
df['|Diff|'] = df['Diff'].abs()

# Rank the absolute differences (shown for comparison only;
# the Sign Test does not use the ranks)
df['Rank'] = df['|Diff|'].rank()
df['Sign(Diff)*Rank'] = df['Sign(Diff)'] * df['Rank']

# Display the DataFrame
print(df)

# Perform the sign test
# Count the number of positive and negative differences
positive_diff_count = (df['Sign(Diff)'] > 0).sum()
negative_diff_count = (df['Sign(Diff)'] < 0).sum()

# Number of non-zero differences and test statistic
n = positive_diff_count + negative_diff_count
test_statistic = min(positive_diff_count, negative_diff_count)

# One-tailed binomial probability P(X <= test_statistic), X ~ B(n, 0.5)
test_critical = binom.cdf(k=test_statistic, n=n, p=0.5)

# Display results
print(f"Number of positive differences: {positive_diff_count}")
print(f"Number of negative differences: {negative_diff_count}")
print(f"Test Statistic (minimum of positive and negative differences): {test_statistic}")
print(f"Test critical: {test_critical}")

# Interpretation of results
alpha = 0.05
if test_critical < alpha:
    print("Reject the null hypothesis: There is a significant difference between the two models.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two models.")

     Datasets  Model A  Model B   Diff  Sign(Diff)  |Diff|  Rank  \
0  Test set 1    99.82    98.62   1.20         1.0    1.20   3.0
1  Test set 2    89.04    86.61   2.43         1.0    2.43   5.0
2  Test set 3    82.04    81.25   0.79         1.0    0.79   1.0
3  Test set 4    79.00    76.07   2.93         1.0    2.93   6.0
4  Test set 5    75.96    74.79   1.17         1.0    1.17   2.0
5  Test set 6    72.43    96.46 -24.03        -1.0   24.03   9.0
6  Test set 7    78.50    76.57   1.93         1.0    1.93   4.0
7  Test set 8    79.36    73.29   6.07         1.0    6.07   7.0
8  Test set 9    73.43    90.66 -17.23        -1.0   17.23   8.0

   Sign(Diff)*Rank
0              3.0
1              5.0
2              1.0
3              6.0
4              2.0
5             -9.0
6              4.0
7              7.0
8             -8.0

Number of positive differences: 7
Number of negative differences: 2
Test Statistic (minimum of positive and negative differences): 2
Test critical: 0.08984375
Fail to reject the null hypothesis: There is no significant difference between the two models.


Wilcoxon Sign Test detailed numerical example for decimal values - Python code using alpha level

import numpy as np
import pandas as pd
from scipy.stats import binomtest

# Data
datasets = ['Test set 1', 'Test set 2', 'Test set 3', 'Test set 4', 'Test set 5',
            'Test set 6', 'Test set 7', 'Test set 8', 'Test set 9']
model_a = [99.82, 89.04, 82.04, 79.00, 75.96, 72.43, 78.50, 79.36, 73.43]
model_b = [98.62, 86.61, 81.25, 76.07, 74.79, 96.46, 76.57, 73.29, 90.66]

# Create DataFrame
df = pd.DataFrame({
    'Datasets': datasets,
    'Model A': model_a,
    'Model B': model_b
})

# Compute Diff, Sign(Diff)
df['Diff'] = df['Model A'] - df['Model B']
df['Sign(Diff)'] = np.sign(df['Diff'])

# Count the number of positive and negative differences
positive_diff_count = (df['Sign(Diff)'] > 0).sum()
negative_diff_count = (df['Sign(Diff)'] < 0).sum()

# Perform binomial test (two-sided)
n = int(positive_diff_count + negative_diff_count)
test_statistic = int(min(positive_diff_count, negative_diff_count))

# binomtest replaces binom_test, which is no longer available in recent SciPy releases
p_value = binomtest(k=test_statistic, n=n, p=0.5, alternative='two-sided').pvalue

# Display the DataFrame
print(df)

# Display results
print(f"Number of positive differences: {positive_diff_count}")
print(f"Number of negative differences: {negative_diff_count}")
print(f"Test Statistic (minimum of positive and negative differences): {test_statistic}")
print(f"P-value: {p_value}")

# Interpretation of results
alpha = 0.05
if p_value < alpha:
    print("We reject the null hypothesis (H0). There is a significant difference between the tests.")
else:
    print("We do not reject the null hypothesis (H0). There is no significant difference between the tests.")

     Datasets  Model A  Model B   Diff  Sign(Diff)
0  Test set 1    99.82    98.62   1.20         1.0
1  Test set 2    89.04    86.61   2.43         1.0
2  Test set 3    82.04    81.25   0.79         1.0
3  Test set 4    79.00    76.07   2.93         1.0
4  Test set 5    75.96    74.79   1.17         1.0
5  Test set 6    72.43    96.46 -24.03        -1.0
6  Test set 7    78.50    76.57   1.93         1.0
7  Test set 8    79.36    73.29   6.07         1.0
8  Test set 9    73.43    90.66 -17.23        -1.0

Number of positive differences: 7
Number of negative differences: 2
Test Statistic (minimum of positive and negative differences): 2
P-value: 0.1796875
We do not reject the null hypothesis (H0). There is no significant difference between the tests.


References:

[1] https://www.analyticsvidhya.com/blog/2017/11/a-guide-to-conduct-analysis-using-non-parametric-tests/

[2] https://rachidbenouini.medium.com/non-parametric-hypothesis-testing-for-comparing-machine-learning-algorithms-65b2c783cfbe

[3] https://www.ncl.ac.uk/webtemplate/ask-assets/external/maths-resources/statistics/hypothesis-testing/hypothesis-testing-with-the-binomial-distribution.html

[4] https://homepage.divms.uiowa.edu/~mbognar/applets/bin.html

[5] https://vitalflux.com/sign-test-hypothesis-python-examples/