1. Concepts & Definitions
1.1. A Review of Parametric Statistics
1.2. Parametric Tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Tests
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Signed-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric Tests for Comparing Machine Learning Models
2. Problem & Solution
2.1. Using the Wilcoxon Sign Test to compare clustering methods
2.2. Using the Wilcoxon Signed-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine it with hypothesis testing?
2.4. Using the Chi-Square Goodness-of-Fit test to check whether Benford's Law holds
2.5. Using the Kolmogorov-Smirnov test to check whether the Pareto principle holds
How to combine classification methods and the Wilcoxon Signed-Rank Test?
In Track 10, section 1.9, a Gaussian Mixture model was compared with Logistic Regression, and with an ensemble of both models, on the task of classifying a generated data set into two classes. The models' performance was compared using confusion matrices and ROC curve analysis:
https://sites.google.com/view/statistics-on-customs/in%C3%ADcio/track10/ensembles
On the data set employed in Track 10, section 1.9, Gaussian Mixture and Logistic Regression achieved the same ROC AUC score of 0.988. However, for each entry of the test data set, the predicted probability of belonging to class one differs depending on which method is applied. This can be observed in the result presented next.
P = logistic gaussian
0 0.000201 0.000127
1 0.562759 0.321583
2 0.451088 0.337373
3 0.999179 0.999934
4 0.215037 0.036274
5 0.750645 0.650172
6 0.002165 0.003317
7 0.010276 0.001585
8 0.022628 0.010160
9 0.906035 0.933811
10 0.048193 0.006970
11 0.997605 0.999668
12 0.113842 0.011478
13 0.982316 0.989557
14 0.998790 0.999860
15 0.999846 0.999998
16 0.804899 0.622029
17 0.038477 0.013044
18 0.749700 0.691259
19 0.999823 0.999996
20 0.927950 0.889530
21 0.001407 0.000671
22 0.007142 0.001306
23 0.822709 0.661517
24 0.999711 0.999993
25 0.997846 0.999722
26 0.819469 0.803962
27 0.069383 0.038095
28 0.002692 0.001934
29 0.999166 0.999947
30 0.999767 0.999995
31 0.002917 0.016437
32 0.018732 0.003600
Scoring models.
logistic : 0.988
gaussian : 0.988
Loading the train and test data sets
Now that the objectives of the experiment are clear, let's load the data using the following Google Colab (click on the link):
https://colab.research.google.com/drive/1guqdIBa7PmeOIxFWL9twwE_rX51v2gdx?usp=sharing
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def get_train_test(test_size=0.33, SEED=1):
    # generate a 2D classification dataset
    X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=SEED, cluster_std=5)
    # split into train and test sets; fixing random_state makes the split repeatable
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=SEED)
    return X_train, X_test, y_train, y_test

SEED = 1
# pass SEED by keyword, otherwise it would be taken as test_size
X_train, X_test, y_train, y_test = get_train_test(SEED=SEED)
# A look at the data
print('X_test = ', X_test)
print('y_test = ', y_test)
X_test = [[ 3.99858703 12.00557395]
[ -5.11786366 2.42272223]
[ -2.61873767 -0.03165495]
[-15.99405267 0.36337804]
[ -5.01287134 6.2943088 ]
[ -5.09542341 0.18046166]
[ 4.33502949 5.33227196]
[ -0.85937456 8.78733447]
[ 0.45791186 4.79319021]
[ -3.34042694 -5.38988786]
[ -2.70403107 7.33960582]
[-10.67993622 -4.54861949]
[ -4.75956413 7.89665004]
[ -9.60860686 -0.86144724]
[ -6.70246352 -12.09054025]
[-12.58318479 -8.93848269]
[ -7.3721509 2.65977626]
[ -0.66806131 5.0015331 ]
[ -4.30041867 -0.95835324]
[-10.19119005 -12.03221032]
[ -8.88012893 1.88416055]
[ 2.53535716 9.06200028]
[ -0.23162328 8.83219569]
[ -7.48178844 2.51278087]
[-15.47227173 -3.10643638]
[-10.93056136 -4.46207791]
[ -4.39062396 -1.90884586]
[ -0.15870831 2.64524064]
[ 2.8483937 6.91896156]
[-10.90231401 -6.97295169]
[-16.36050628 -2.38560995]
[ 7.06449892 0.60045536]
[ -1.05876514 7.49250542]]
y_test = [0 0 0 1 0 0 0 0 0 1 0 1 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 1 1 0 0]
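Before applying any classifier, it can help to visualize the two overlapping blobs. The following is a minimal sketch, assuming matplotlib is available in the Colab environment (this plot is not part of the original notebook):
import matplotlib.pyplot as plt

# scatter the training points, colored by class label
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='coolwarm', label='train')
# overlay the test points with a different marker
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap='coolwarm', marker='x', label='test')
plt.xlabel('feature 1')
plt.ylabel('feature 2')
plt.legend()
plt.show()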
Applying classification methods
Now, let's apply the Logistic Regression and Gaussian Mixture methods.
import numpy as np
import pandas as pd
# ROC and AUC score
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

def get_models():
    """Generate a library of base learners."""
    lr = LogisticRegression(C=100, random_state=SEED)
    gm = GaussianMixture(n_components=2, random_state=SEED)
    models = {'logistic': lr,
              'gaussian': gm}
    return models

def train_predict(model_list):
    """Fit models in list on training set and return predictions."""
    P = np.zeros((y_test.shape[0], len(model_list)))
    P = pd.DataFrame(P)
    print("Fitting models.")
    cols = list()
    for i, (name, m) in enumerate(model_list.items()):
        print("%s..." % name, end=" ", flush=False)
        # note: GaussianMixture.fit ignores the y argument (it is unsupervised)
        m.fit(X_train, y_train)
        P.iloc[:, i] = m.predict_proba(X_test)[:, 1]
        cols.append(name)
        print("done")
    P.columns = cols
    print("Done.\n")
    return P

def score_models(P, y):
    """Score each model in the prediction DataFrame with ROC AUC."""
    print("Scoring models.")
    for m in P.columns:
        score = roc_auc_score(y, P.loc[:, m])
        print("%-26s: %.3f" % (m, score))
    print("Done.\n")

models = get_models()
P = train_predict(models)
print('P = ', P)
score_models(P, y_test)
Fitting models.
logistic... done
gaussian... done
Done.
P = logistic gaussian
0 0.000201 0.000127
1 0.562759 0.321583
2 0.451088 0.337373
3 0.999179 0.999934
4 0.215037 0.036274
5 0.750645 0.650172
6 0.002165 0.003317
7 0.010276 0.001585
8 0.022628 0.010160
9 0.906035 0.933811
10 0.048193 0.006970
11 0.997605 0.999668
12 0.113842 0.011478
13 0.982316 0.989557
14 0.998790 0.999860
15 0.999846 0.999998
16 0.804899 0.622029
17 0.038477 0.013044
18 0.749700 0.691259
19 0.999823 0.999996
20 0.927950 0.889530
21 0.001407 0.000671
22 0.007142 0.001306
23 0.822709 0.661517
24 0.999711 0.999993
25 0.997846 0.999722
26 0.819469 0.803962
27 0.069383 0.038095
28 0.002692 0.001934
29 0.999166 0.999947
30 0.999767 0.999995
31 0.002917 0.016437
32 0.018732 0.003600
Scoring models.
logistic : 0.988
gaussian : 0.988
Done.
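As an aside, the ensemble of both models mentioned in Track 10 can be reproduced here in its simplest form, by averaging the two probability columns of P. A minimal sketch, reusing P and y_test from the cells above:
# simple soft-voting ensemble: average the class-one probabilities of both models
p_ensemble = P.loc[:, ['logistic', 'gaussian']].mean(axis=1)
print("ensemble ROC AUC: %.3f" % roc_auc_score(y_test, p_ensemble))
Whether the average beats 0.988 depends on how much the two models' errors overlap.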
Let's recall how the Wilcoxon Signed-Rank Test works
Now, the idea is to revisit this experiment and its data set, applying the Wilcoxon Signed-Rank Test to verify whether the two models' results differ.
For this purpose, the material available at Track 11, section 1.8 is a useful reference. It discussed the problem of choosing between two machine learning models, Model A and Model B, according to their classification accuracy (%) on several test databases (benchmark sets) [2].
The objective is to choose which one will be deployed and used in the production environment. First, we state our null and alternative hypotheses as:
H0: There is no difference between the two models A and B.
H1: There is a difference between the two models A and B (the median difference is non-zero).
Now is a good time to revisit the Python code with the data, graphics, and the detailed computation of the Wilcoxon Signed-Rank Test, which is given at:
https://colab.research.google.com/drive/1nWrj_Rq8cCge1Kfi5Mq3Z8VqynLAmydA?usp=sharing
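As a quick reminder of the mechanics, the sketch below runs the test on hypothetical accuracies for Models A and B (the actual benchmark data and step-by-step computation are in the Colab linked above; scipy.stats.wilcoxon is assumed to be available):
from scipy.stats import wilcoxon

# hypothetical accuracies (%) of Models A and B on ten benchmark sets
acc_a = [85.2, 86.1, 84.7, 88.0, 83.5, 87.2, 85.9, 84.1, 86.6, 85.0]
acc_b = [86.0, 85.8, 85.9, 88.4, 84.9, 87.1, 86.8, 85.5, 86.9, 86.2]

# the test ranks the absolute paired differences and sums the ranks by sign
stat, p_value = wilcoxon(acc_a, acc_b)
print("T = %.1f, p-value = %.4f" % (stat, p_value))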
Applying the Wilcoxon Signed-Rank Test to the P matrix with the probabilities from each model
Recall that the variable P holds the following responses from the logistic and Gaussian Mixture models.
P
# Compute Diff, Sign(Diff), |Diff|
df = P
df['Diff'] = df['logistic'] - df['gaussian']
df['Sign(Diff)'] = np.sign(df['Diff'])
df['|Diff|'] = df['Diff'].abs()
# Rank the absolute differences
df['Rank'] = df['|Diff|'].rank()
df['Sign(Diff)*Rank'] = df['Sign(Diff)'] * df['Rank']
# Compute W+, W-, and Test Statistic (T)
w_plus = df[df['Sign(Diff)'] > 0]['Rank'].sum()
w_minus = df[df['Sign(Diff)'] < 0]['Rank'].sum()
test_statistic = min(w_plus, w_minus)
# Display the DataFrame
print(df)
# Display W+, W-, and Test Statistic (T)
print(f"W+ = {w_plus}")
print(f"W- = {w_minus}")
print(f"Test Statistic (T) = {test_statistic}")
# Compute the critical value for n non-zero paired differences, alpha = 0.05
n = len(df)
alpha = 0.05
# Here all n = 33 differences are non-zero. From a standard two-tailed Wilcoxon
# signed-rank table, Tcrit = 170 for n = 33 at alpha = 0.05; the normal
# approximation n(n+1)/4 - 1.96*sqrt(n(n+1)(2n+1)/24) ≈ 170.8 agrees.
t_crit = 170
# Compare the Test Statistic (T) with the critical value (Tcrit)
print(f"Test Critical value (Tcrit) = {t_crit}")
if test_statistic <= t_crit:
    print("Reject the null hypothesis: There is a significant difference between the two models.")
else:
    print("Fail to reject the null hypothesis: There is no significant difference between the two models.")
logistic gaussian Diff Sign(Diff) |Diff| Rank Sign(Diff)*Rank
0 0.000201 0.000127 0.000074 1.0 0.000074 1.0 1.0
1 0.562759 0.321583 0.241176 1.0 0.241176 33.0 33.0
2 0.451088 0.337373 0.113715 1.0 0.113715 29.0 29.0
3 0.999179 0.999934 -0.000756 -1.0 0.000756 7.0 -7.0
4 0.215037 0.036274 0.178763 1.0 0.178763 31.0 31.0
5 0.750645 0.650172 0.100473 1.0 0.100473 27.0 27.0
6 0.002165 0.003317 -0.001152 -1.0 0.001152 11.0 -11.0
7 0.010276 0.001585 0.008692 1.0 0.008692 16.0 16.0
8 0.022628 0.010160 0.012468 1.0 0.012468 17.0 17.0
9 0.906035 0.933811 -0.027776 -1.0 0.027776 22.0 -22.0
10 0.048193 0.006970 0.041223 1.0 0.041223 25.0 25.0
11 0.997605 0.999668 -0.002063 -1.0 0.002063 13.0 -13.0
12 0.113842 0.011478 0.102363 1.0 0.102363 28.0 28.0
13 0.982316 0.989557 -0.007242 -1.0 0.007242 15.0 -15.0
14 0.998790 0.999860 -0.001070 -1.0 0.001070 10.0 -10.0
15 0.999846 0.999998 -0.000151 -1.0 0.000151 2.0 -2.0
16 0.804899 0.622029 0.182871 1.0 0.182871 32.0 32.0
17 0.038477 0.013044 0.025433 1.0 0.025433 21.0 21.0
18 0.749700 0.691259 0.058440 1.0 0.058440 26.0 26.0
19 0.999823 0.999996 -0.000173 -1.0 0.000173 3.0 -3.0
20 0.927950 0.889530 0.038420 1.0 0.038420 24.0 24.0
21 0.001407 0.000671 0.000736 1.0 0.000736 6.0 6.0
22 0.007142 0.001306 0.005836 1.0 0.005836 14.0 14.0
23 0.822709 0.661517 0.161192 1.0 0.161192 30.0 30.0
24 0.999711 0.999993 -0.000282 -1.0 0.000282 5.0 -5.0
25 0.997846 0.999722 -0.001876 -1.0 0.001876 12.0 -12.0
26 0.819469 0.803962 0.015506 1.0 0.015506 20.0 20.0
27 0.069383 0.038095 0.031288 1.0 0.031288 23.0 23.0
28 0.002692 0.001934 0.000758 1.0 0.000758 8.0 8.0
29 0.999166 0.999947 -0.000781 -1.0 0.000781 9.0 -9.0
30 0.999767 0.999995 -0.000228 -1.0 0.000228 4.0 -4.0
31 0.002917 0.016437 -0.013520 -1.0 0.013520 18.0 -18.0
32 0.018732 0.003600 0.015132 1.0 0.015132 19.0 19.0
W+ = 430.0
W- = 131.0
Test Statistic (T) = 131.0
Test Critical value (Tcrit) = 170
Reject the null hypothesis: There is a significant difference between the two models.
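This decision can be cross-checked with scipy (a sketch; for a sample this large, scipy.stats.wilcoxon computes the same T via the normal approximation):
from scipy.stats import wilcoxon

# paired test on the two probability columns; the statistic should match T = 131
stat, p_value = wilcoxon(df['logistic'], df['gaussian'])
print("T = %.1f, p-value = %.4f" % (stat, p_value))
Under the normal approximation, the p-value for T = 131 with n = 33 comes out well below 0.05, consistent with rejecting H0 above.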
The Python code with all the steps is summarized in this Google Colab (click on the link):
https://colab.research.google.com/drive/1RP_5SrDgVscgl4pxJAcOmNWiWUd68OyO?usp=sharing