1. Concepts & Definitions
1.1. A Review of Parametric Statistics
1.2. Parametric Tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric Tests for Comparing Machine Learning Models
2. Problem & Solution
2.1. Using the Wilcoxon Sign Test to compare clustering methods
2.2. Using the Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine it with hypothesis testing?
2.4. Using the Chi-Square Goodness-of-Fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
How to combine classification methods and the Wilcoxon Signed-Rank Test?
In Track 10, section 1.9, a Gaussian Mixture model was compared with Logistic Regression, and with an ensemble of both models, on the task of classifying a generated data set into two classes. The models' performance was compared using confusion matrices and ROC curve analysis:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track10/ensembles
On the data set employed in Track 10, section 1.9, Gaussian Mixture and Logistic Regression achieved the same ROC AUC score of 0.988. However, for each entry of the test data set, the predicted probability of belonging to class one differs depending on which method is applied. This can be observed in the result presented next.
P = logistic gaussian
0 0.000201 0.000127
1 0.562759 0.321583
2 0.451088 0.337373
3 0.999179 0.999934
4 0.215037 0.036274
5 0.750645 0.650172
6 0.002165 0.003317
7 0.010276 0.001585
8 0.022628 0.010160
9 0.906035 0.933811
10 0.048193 0.006970
11 0.997605 0.999668
12 0.113842 0.011478
13 0.982316 0.989557
14 0.998790 0.999860
15 0.999846 0.999998
16 0.804899 0.622029
17 0.038477 0.013044
18 0.749700 0.691259
19 0.999823 0.999996
20 0.927950 0.889530
21 0.001407 0.000671
22 0.007142 0.001306
23 0.822709 0.661517
24 0.999711 0.999993
25 0.997846 0.999722
26 0.819469 0.803962
27 0.069383 0.038095
28 0.002692 0.001934
29 0.999166 0.999947
30 0.999767 0.999995
31 0.002917 0.016437
32 0.018732 0.003600
Scoring models.
logistic : 0.988
gaussian : 0.988
Loading the train and test data sets
Now that the objectives of the experiment are clear, let's load the data using the following Google Colab (click on the link):
https://colab.research.google.com/drive/1guqdIBa7PmeOIxFWL9twwE_rX51v2gdx?usp=sharing
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def get_train_test(test_size=0.33, SEED=1):
    # generate a 2D classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=SEED, cluster_std=5)
    # split into train and test sets; fixing random_state makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=SEED)
    return X_train, X_test, y_train, y_test

SEED = 1
# pass SEED by keyword, otherwise it would be taken as test_size
X_train, X_test, y_train, y_test = get_train_test(SEED=SEED)
# A look at the data
print('X_test = ', X_test)
print('y_test = ', y_test)
X_test = [[ 3.99858703 12.00557395]
[ -5.11786366 2.42272223]
[ -2.61873767 -0.03165495]
[-15.99405267 0.36337804]
[ -5.01287134 6.2943088 ]
[ -5.09542341 0.18046166]
[ 4.33502949 5.33227196]
[ -0.85937456 8.78733447]
[ 0.45791186 4.79319021]
[ -3.34042694 -5.38988786]
[ -2.70403107 7.33960582]
[-10.67993622 -4.54861949]
[ -4.75956413 7.89665004]
[ -9.60860686 -0.86144724]
[ -6.70246352 -12.09054025]
[-12.58318479 -8.93848269]
[ -7.3721509 2.65977626]
[ -0.66806131 5.0015331 ]
[ -4.30041867 -0.95835324]
[-10.19119005 -12.03221032]
[ -8.88012893 1.88416055]
[ 2.53535716 9.06200028]
[ -0.23162328 8.83219569]
[ -7.48178844 2.51278087]
[-15.47227173 -3.10643638]
[-10.93056136 -4.46207791]
[ -4.39062396 -1.90884586]
[ -0.15870831 2.64524064]
[ 2.8483937 6.91896156]
[-10.90231401 -6.97295169]
[-16.36050628 -2.38560995]
[ 7.06449892 0.60045536]
[ -1.05876514 7.49250542]]
y_test = [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0]
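Before applying any classifier, it can help to visualize the two overlapping blobs. The following is a minimal sketch, assuming matplotlib is available in the Colab environment (this plot is not part of the original notebook):
import matplotlib.pyplot as plt

# scatter the training points, colored by class label
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', label='train')
# overlay the test points with a different marker
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', marker='x', label='test')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.legend()
plt.show()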
Applying classification methods
Now, let's apply the Logistic Regression and Gaussian Mixture methods.
import numpy as np
import pandas as pd
# ROC and AUC score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

def get_models():
    """Generate a library of base learners."""
    lr = LogisticRegression(C=100, random_state=SEED)
    gm = GaussianMixture(n_components=2, random_state=SEED)
    models = {'logistic': lr,
              'gaussian': gm}
    return models

def train_predict(model_list):
    """Fit models in list on training set and return predictions."""
    P = np.zeros((y_test.shape[0], len(model_list)))
    P = pd.DataFrame(P)
    print("Fitting models.")
    cols = list()
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end=" ", flush=False)
        # note: GaussianMixture.fit ignores the y argument (it is unsupervised)
        m.fit(X_train, y_train)
        P.iloc[:, i] = m.predict_proba(X_test)[:, 1]
        cols.append(name)
        print("done")
    P.columns = cols
    print("Done.\n")
    return P

def score_models(P, y):
    """Score each model in the prediction DataFrame with ROC AUC."""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")

models = get_models()
P = train_predict(models)
print('P = ', P)
score_models(P, y_test)
Fitting models.
logistic... done
gaussian... done
Done.
P = logistic gaussian
0 0.000201 0.000127
1 0.562759 0.321583
2 0.451088 0.337373
3 0.999179 0.999934
4 0.215037 0.036274
5 0.750645 0.650172
6 0.002165 0.003317
7 0.010276 0.001585
8 0.022628 0.010160
9 0.906035 0.933811
10 0.048193 0.006970
11 0.997605 0.999668
12 0.113842 0.011478
13 0.982316 0.989557
14 0.998790 0.999860
15 0.999846 0.999998
16 0.804899 0.622029
17 0.038477 0.013044
18 0.749700 0.691259
19 0.999823 0.999996
20 0.927950 0.889530
21 0.001407 0.000671
22 0.007142 0.001306
23 0.822709 0.661517
24 0.999711 0.999993
25 0.997846 0.999722
26 0.819469 0.803962
27 0.069383 0.038095
28 0.002692 0.001934
29 0.999166 0.999947
30 0.999767 0.999995
31 0.002917 0.016437
32 0.018732 0.003600
Scoring models.
logistic : 0.988
gaussian : 0.988
Done.
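As an aside, the ensemble of both models mentioned in Track 10 can be reproduced here in its simplest form, by averaging the two probability columns of P. A minimal sketch, reusing P and y_test from the cells above:
# simple soft-voting ensemble: average the class-one probabilities of both models
p_ensemble = P.loc[:, ['logistic', 'gaussian']].mean(axis=1)
print("ensemble ROC AUC: %.3f" % roc_auc_score(y_test, p_ensemble))
Whether the average beats 0.988 depends on how much the two models' errors overlap.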
Let's recall how the Wilcoxon Signed-Rank Test works
Now, the idea is to revisit this experiment and its data set, applying the Wilcoxon Signed-Rank Test to verify whether the two models' results differ.
For this purpose, the material available at Track 11, section 1.8 is a useful reference. It discussed the problem of choosing between two machine learning models, Model A and Model B, according to their classification accuracy (%) on several test databases (benchmark sets) [2].
The objective is to choose which one will be deployed and used in the production environment. First, we state our null and alternative hypotheses as:
H0: There is no difference between the two models A and B.
H1: There is a difference between the two models A and B (the median difference is non-zero).
Now is a good time to revisit the Python code with the data, graphics, and the detailed computation of the Wilcoxon Signed-Rank Test, which is given at:
https://colab.research.google.com/drive/1nWrj_Rq8cCge1Kfi5Mq3Z8VqynLAmydA?usp=sharing
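As a quick reminder of the mechanics, the sketch below runs the test on hypothetical accuracies for Models A and B (the actual benchmark data and step-by-step computation are in the Colab linked above; scipy.stats.wilcoxon is assumed to be available):
from scipy.stats import wilcoxon

# hypothetical accuracies (%) of Models A and B on ten benchmark sets
acc_a = [85.2, 86.1, 84.7, 88.0, 83.5, 87.2, 85.9, 84.1, 86.6, 85.0]
acc_b = [86.0, 85.8, 85.9, 88.4, 84.9, 87.1, 86.8, 85.5, 86.9, 86.2]

# the test ranks the absolute paired differences and sums the ranks by sign
stat, p_value = wilcoxon(acc_a, acc_b)
print("T = %.1f, p-value = %.4f" % (stat, p_value))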
Applying the Wilcoxon Signed-Rank Test to the P matrix with the probabilities from each model
Recall that the variable P holds the following responses from the logistic and Gaussian Mixture models.
P
# Compute Diff, Sign(Diff), |Diff|
df = P
df['Diff'] = df['logistic'] - df['gaussian']
df['Sign(Diff)'] = np.sign(df['Diff'])
df['|Diff|'] = df['Diff'].abs()
# Rank the absolute differences
df['Rank'] = df['|Diff|'].rank()
df['Sign(Diff)*Rank'] = df['Sign(Diff)'] * df['Rank']
# Compute W+, W-, and Test Statistic (T)
w_plus = df[df['Sign(Diff)'] > 0]['Rank'].sum()
w_minus = df[df['Sign(Diff)'] < 0]['Rank'].sum()
test_statistic = min(w_plus, w_minus)
# Display the DataFrame
print(df)
# Display W+, W-, and Test Statistic (T)
print(f"W+ = {w_plus}")
print(f"W- = {w_minus}")
print(f"Test Statistic (T) = {test_statistic}")
# Compute the critical value for n non-zero paired differences, alpha = 0.05
n = len(df)
alpha = 0.05
# Here all n = 33 differences are non-zero. From a standard two-tailed Wilcoxon
# signed-rank table, Tcrit = 170 for n = 33 at alpha = 0.05; the normal
# approximation n(n+1)/4 - 1.96*sqrt(n(n+1)(2n+1)/24) ≈ 170.8 agrees.
t_crit = 170
# Compare the Test Statistic (T) with the critical value (Tcrit)
print(f"Test Critical value (Tcrit) = {t_crit}")
if test_statistic <= t_crit:
    print("Reject the null hypothesis: There is a significant difference between the two models.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two models.")
logistic gaussian Diff Sign(Diff) |Diff| Rank Sign(Diff)*Rank
0 0.000201 0.000127 0.000074 1.0 0.000074 1.0 1.0
1 0.562759 0.321583 0.241176 1.0 0.241176 33.0 33.0
2 0.451088 0.337373 0.113715 1.0 0.113715 29.0 29.0
3 0.999179 0.999934 -0.000756 -1.0 0.000756 7.0 -7.0
4 0.215037 0.036274 0.178763 1.0 0.178763 31.0 31.0
5 0.750645 0.650172 0.100473 1.0 0.100473 27.0 27.0
6 0.002165 0.003317 -0.001152 -1.0 0.001152 11.0 -11.0
7 0.010276 0.001585 0.008692 1.0 0.008692 16.0 16.0
8 0.022628 0.010160 0.012468 1.0 0.012468 17.0 17.0
9 0.906035 0.933811 -0.027776 -1.0 0.027776 22.0 -22.0
10 0.048193 0.006970 0.041223 1.0 0.041223 25.0 25.0
11 0.997605 0.999668 -0.002063 -1.0 0.002063 13.0 -13.0
12 0.113842 0.011478 0.102363 1.0 0.102363 28.0 28.0
13 0.982316 0.989557 -0.007242 -1.0 0.007242 15.0 -15.0
14 0.998790 0.999860 -0.001070 -1.0 0.001070 10.0 -10.0
15 0.999846 0.999998 -0.000151 -1.0 0.000151 2.0 -2.0
16 0.804899 0.622029 0.182871 1.0 0.182871 32.0 32.0
17 0.038477 0.013044 0.025433 1.0 0.025433 21.0 21.0
18 0.749700 0.691259 0.058440 1.0 0.058440 26.0 26.0
19 0.999823 0.999996 -0.000173 -1.0 0.000173 3.0 -3.0
20 0.927950 0.889530 0.038420 1.0 0.038420 24.0 24.0
21 0.001407 0.000671 0.000736 1.0 0.000736 6.0 6.0
22 0.007142 0.001306 0.005836 1.0 0.005836 14.0 14.0
23 0.822709 0.661517 0.161192 1.0 0.161192 30.0 30.0
24 0.999711 0.999993 -0.000282 -1.0 0.000282 5.0 -5.0
25 0.997846 0.999722 -0.001876 -1.0 0.001876 12.0 -12.0
26 0.819469 0.803962 0.015506 1.0 0.015506 20.0 20.0
27 0.069383 0.038095 0.031288 1.0 0.031288 23.0 23.0
28 0.002692 0.001934 0.000758 1.0 0.000758 8.0 8.0
29 0.999166 0.999947 -0.000781 -1.0 0.000781 9.0 -9.0
30 0.999767 0.999995 -0.000228 -1.0 0.000228 4.0 -4.0
31 0.002917 0.016437 -0.013520 -1.0 0.013520 18.0 -18.0
32 0.018732 0.003600 0.015132 1.0 0.015132 19.0 19.0
W+ = 430.0
W- = 131.0
Test Statistic (T) = 131.0
Test Critical value (Tcrit) = 170
Reject the null hypothesis: There is a significant difference between the two models.
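This decision can be cross-checked with scipy (a sketch; for a sample this large, scipy.stats.wilcoxon computes the same T via the normal approximation):
from scipy.stats import wilcoxon

# paired test on the two probability columns; the statistic should match T = 131
stat, p_value = wilcoxon(df['logistic'], df['gaussian'])
print("T = %.1f, p-value = %.4f" % (stat, p_value))
Under the normal approximation, the p-value for T = 131 with n = 33 comes out well below 0.05, consistent with rejecting H0 above.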
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1RP_5SrDgVscgl4pxJAcOmNWiWUd68OyO?usp=sharing