1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning methods
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using the Chi-Square Goodness-of-Fit test to check if Benford's Law holds or not
2.5. Using the Kolmogorov-Smirnov test to check if the Pareto principle holds or not
How to combine machine learning and non-parametric tests?
The article [1] presents an interesting idea for combining non-parametric statistical tests with clustering structures by introducing a novel nonparametric statistical test called analysis of cluster structure variability (ANOCVA). ANOCVA is based on two well-established ideas: the silhouette statistic, used to measure the variability of clustering structures, and the analysis of variance.
Another article [2] employed ANOCVA to compare the clustering structure of multiple groups simultaneously and also to identify features that contribute to the differential clustering.
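Since ANOCVA in [1] and [2] builds on the silhouette statistic, a minimal sketch of computing that building block with scikit-learn may help fix ideas. This is only an illustration under assumed synthetic data and KMeans labels, not the authors' ANOCVA implementation:
# Sketch: the silhouette building block behind ANOCVA [1][2], illustrated with scikit-learn.
# This is NOT the ANOCVA test itself; the data, clustering method, and names are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two synthetic groups of observations drawn around different centers
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-observation silhouette values measure how well each point fits its cluster;
# ANOCVA compares the variability of such clustering structures across groups.
s_values = silhouette_samples(X, labels)
print("Mean silhouette:", silhouette_score(X, labels))
print("Silhouette variability (std):", s_values.std())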
The article [3] proposes nonparametric bagging clustering methods to identify latent structures from a sequence of dependent categorical data observed along a one-dimensional (discrete) time domain.
The post [4], together with the material presented in [5] and [6], shows how to evaluate the Kolmogorov-Smirnov statistic for each classifier and how to compare the data of one cluster against the other clusters through the Kolmogorov-Smirnov test, respectively.
Recalling the classification problems from the Track 06 and Track 10 content
Track 06 introduced the concept of data separation and classification through a Gaussian Mixture, which enables the creation of two classes from a randomly generated data set, as described in its section 2.5.
Let's recall how to generate this data set with the following code:
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
from scipy.optimize import curve_fit

# Generate a bimodal data set: two Gaussian clusters with different sizes
data = np.concatenate((normal(1, .2, 2500), normal(2, .2, 5000)))

# Histogram of the data; keep the bin centers so that len(x) == len(y)
y, x, _ = plt.hist(data, 100, alpha=.3, label='data')
x = (x[1:] + x[:-1]) / 2

def gauss(x, mu, sigma, A):
    return A * np.exp(-(x - mu)**2 / 2 / sigma**2)

def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)

# Initial guesses for the two Gaussian components and the curve fit
expected = (1, .2, 250, 2, .2, 125)
params, cov = curve_fit(bimodal, x, y, expected)
sigma = np.sqrt(np.diag(cov))

# Plot the fitted bimodal model over the histogram
x_fit = np.linspace(x.min(), x.max(), 500)
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.legend()
plt.show()
The data set can then be separated into two distinct distributions of values with a Gaussian Mixture:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(x.reshape(-1, 1))
target_class = gmm.predict(x.reshape(-1, 1))
target_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
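As a quick side check (not part of the original notebook), the parameters estimated by the Gaussian Mixture can also be inspected; the exact values depend on the generated data:
# Side check: inspect the fitted mixture parameters (illustrative; values vary with the data)
print("Weights:  ", gmm.weights_)                       # mixing proportions of the two components
print("Means:    ", gmm.means_.ravel())                 # component means
print("Std devs: ", np.sqrt(gmm.covariances_).ravel())  # component standard deviations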
The next code helps to visualize the two distinct classes of values:
x1 = []
y1 = []
x2 = []
y2 = []
k = 0
# Split the bin centers (x) and the counts (y) according to the predicted class
for elem in target_class:
    if (elem == 0):
        x1.append(x[k])
        y1.append(y[k])
    else:
        x2.append(x[k])
        y2.append(y[k])
    k = k + 1
#plt.figure(figsize=(10,6))
plt.scatter(x1, y1, color='red', label='N_1')
plt.scatter(x2, y2, color='blue', label='N_2')
plt.legend()
plt.show()
How can the one-sample Kolmogorov-Smirnov Test be applied to the previous data?
Now it is interesting to check the normality of each data set through a non-parametric test, i.e., to verify whether each of the clusters found previously can be identified with a normal distribution.
For this task it is valuable to recover the code produced in Track 11, section 1.11, where the Kolmogorov-Smirnov test was applied to identify whether two data sets follow the same distribution: https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
However, two important observations should be made before applying this content:
Observation 1
The one-sample Kolmogorov-Smirnov (KS) test will be applied to the raw data, i.e., to x1 and x2, not to the frequency counts given by y1 and y2.
Observation 2
Remember the meaning of the hypothesis formulation used in the one-sample Kolmogorov-Smirnov test and how to verify it:
Using p-value and significance level alpha:
Reject H0: p-value < alpha
Do not reject H0: p-value >= alpha
where the meanings of the Null and Alternative Hypotheses are:
Null Hypothesis H0: The sample follows the specified distribution.
Alternative Hypothesis Ha: The sample does not follow the specified distribution.
In summary, rejecting or not rejecting the Null Hypothesis means:
Reject H0 with p-value < alpha: The sample does not follow the specified distribution.
Do not reject H0 with p-value >= alpha: There is not enough evidence to conclude that the sample does not follow the specified distribution.
The next Python code uses the previous understanding to create two functions: check_normal(data, alpha=0.05), which checks whether a data set follows a normal distribution, and check_normal_all(list_data), which performs this check for a list of data sets, here x1 and x2.
from scipy.stats import kstest, norm

# Defining a function to check whether a data set follows the normal distribution
def check_normal(data, alpha=0.05):
    # Perform the Kolmogorov-Smirnov test against a normal distribution
    # (with no extra arguments, kstest compares the data against the standard normal N(0, 1))
    ks_statistic, ks_p_value = kstest(data, 'norm')
    print('Alpha = ', alpha)
    print('Kolmogorov-Smirnov statistic = ', ks_statistic)
    print('Kolmogorov-Smirnov p-value = ', ks_p_value)
    # Return True when H0 is rejected (p-value below the significance level)
    if ks_p_value < alpha:
        return True
    else:
        return False

# Applying the check to every data set in a list
def check_normal_all(list_data):
    for data in list_data:
        if check_normal(data):
            print("Reject the null hypothesis. The sample does not come from the normal distribution.")
        else:
            print("Fail to reject the null hypothesis. The sample comes from the normal distribution.")

list_data = [x1, x2]
check_normal_all(list_data)
Alpha = 0.05
Kolmogorov-Smirnov statistic = 0.6250825540765997
Kolmogorov-Smirnov p-value = 2.09700013337507e-19
Reject the null hypothesis. The sample does not come from the normal distribution.
Alpha = 0.05
Kolmogorov-Smirnov statistic = 0.9317980164004254
Kolmogorov-Smirnov p-value = 9.79247550812199e-59
Reject the null hypothesis. The sample does not come from the normal distribution.
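One caveat worth noting: kstest(data, 'norm') compares the sample against the standard normal N(0, 1), so samples centered away from zero are rejected regardless of their shape. Below is a minimal sketch, not part of the original notebook, of testing against a normal whose mean and standard deviation are estimated from the data (strictly speaking, estimating the parameters from the same sample calls for a Lilliefors-type correction, so the p-value is only approximate):
import numpy as np
from scipy.stats import kstest

# Sketch (hypothetical helper): KS test against a normal fitted to the data itself
def check_normal_fitted(data, alpha=0.05):
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std(ddof=1)
    # args passes (loc, scale) to the reference normal distribution
    ks_statistic, ks_p_value = kstest(data, 'norm', args=(mu, sigma))
    print('Kolmogorov-Smirnov statistic = ', ks_statistic)
    print('Kolmogorov-Smirnov p-value = ', ks_p_value)
    return ks_p_value < alpha  # True means H0 (normality) is rejected

# Usage, e.g.: check_normal_fitted(x1) and check_normal_fitted(x2)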
How can the two-sample Kolmogorov-Smirnov Test be applied to the previous data?
The next Python code verifies whether the two data sets follow the same distribution through the two-sample Kolmogorov-Smirnov test.
In summary, rejecting or not rejecting the Null Hypothesis means:
Reject H0 with p-value < alpha: The two samples do not follow the same distribution.
Do not reject H0 with p-value >= alpha: There is not enough evidence to conclude that the two samples follow different distributions.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp
# Defining data sets
sample1 = x1
sample2 = x2
# Perform the Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(sample1, sample2)
# Print the results
print(f"Kolmogorov–Smirnov Statistic: {ks_statistic}")
print(f"P-value: {p_value}")
# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The two samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest different distributions.")
# Plot the histograms with KDE
plt.figure(figsize=(12, 8))
sns.histplot(sample1, bins=20, kde=True, color='b', label='Sample 1')
sns.histplot(sample2, bins=20, kde=True, color='g', label='Sample 2')
plt.legend()
plt.title('Histogram and KDE of Sample Distributions')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Calculate ECDF for both samples
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Get ECDFs
x1, y1 = ecdf(sample1)
x2, y2 = ecdf(sample2)
# Plot the ECDFs
plt.figure(figsize=(12, 8))
plt.step(x1, y1, where='post', label='ECDF Sample 1', color='b')
plt.step(x2, y2, where='post', label='ECDF Sample 2', color='g')
# Highlight the KS statistic: the maximum vertical distance between the two ECDFs
diff = np.abs(np.interp(x1, x2, y2) - y1)
idx = np.argmax(diff)
d_max = diff[idx]
plt.plot([x1[idx], x1[idx]],
         [y1[idx], np.interp(x1, x2, y2)[idx]],
         'k--', label=f'KS Statistic = {ks_statistic:.3f}')
# Adding labels, title, and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Functions (ECDF)')
plt.legend()
plt.grid()
plt.show()
Kolmogorov–Smirnov Statistic: 1.0
P-value: 1.9823306042836678e-29
Reject the null hypothesis. The two samples come from different distributions.
The Python code with the data and the detailed computation for applying the Kolmogorov–Smirnov statistic, verifying whether the two classes follow the same distribution, is given at:
https://colab.research.google.com/drive/1z61sMOIRvedRpD26LFDVU2OqXI-YtP5c?usp=sharing
References
[1] Patriota, Alexandre Galvão et al. “ANOCVA: A Nonparametric Statistical Test to Compare Clustering Structures.” (2018). Site: https://www.semanticscholar.org/paper/ANOCVA%3A-A-Nonparametric-Statistical-Test-to-Compare-Patriota-Vidal/0cfc5889a640dbe2844e7305adf18fbed2aa3f59.
[2] Fujita A, Takahashi DY, Patriota AG, Sato JR. A non-parametric statistical test to compare clusters with applications in functional magnetic resonance imaging data. Stat Med. 2014 Dec 10;33(28):4949-62. doi: 10.1002/sim.6292. Epub 2014 Sep 3. PMID: 25185759. Site: https://pubmed.ncbi.nlm.nih.gov/25185759/
[3] Konrad Abramowicz, Sara Sjöstedt de Luna, Johan Strandberg, Nonparametric bagging clustering methods to identify latent structures from a sequence of dependent categorical data, Computational Statistics & Data Analysis, Volume 177, 107583, 2023. Site: https://www.sciencedirect.com/science/article/pii/S0167947322001633
[4] Comparing sample distributions with the Kolmogorov-Smirnov (KS) test: How to compare samples and understand if they come from the same distribution using python. Site: https://towardsdatascience.com/comparing-sample-distributions-with-the-kolmogorov-smirnov-ks-test-a2292ad6fee5