1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One-sample z-test and its relation to the two-sample z-test
1.5. One-sample t-test and its relation to the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning methods
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using the Chi-Square Goodness-of-Fit test to check if Benford's Law holds or not
2.5. Using the Kolmogorov-Smirnov test to check if the Pareto principle holds or not
How to combine machine learning and non-parametric tests?
The article [1] presents an interesting idea for combining non-parametric statistical tests with clustering structures by introducing a novel nonparametric statistical test called analysis of cluster structure variability (ANOCVA). ANOCVA is based on two well-established ideas: the silhouette statistic, used to measure the variability of clustering structures, and the analysis of variance.
Another article [2] employed ANOCVA to compare the clustering structure of multiple groups simultaneously and also to identify features that contribute to the differential clustering.
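Since ANOCVA in [1] and [2] builds on the silhouette statistic, a minimal sketch of computing that building block with scikit-learn may help fix ideas. This is only an illustration under assumed synthetic data and KMeans labels, not the authors' ANOCVA implementation:
# Sketch: the silhouette building block behind ANOCVA [1][2], illustrated with scikit-learn.
# This is NOT the ANOCVA test itself; the data, clustering method, and names are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(0)
# Two synthetic groups of observations drawn around different centers
X = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(3, 1, size=(50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Per-observation silhouette values measure how well each point fits its cluster;
# ANOCVA compares the variability of such clustering structures across groups.
s_values = silhouette_samples(X, labels)
print("Mean silhouette:", silhouette_score(X, labels))
print("Silhouette variability (std):", s_values.std())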
The article [3] proposes nonparametric bagging clustering methods to identify latent structures from a sequence of dependent categorical data observed along a one-dimensional (discrete) time domain.
The post [4], together with the material presented in [5] and [6], shows how to evaluate the Kolmogorov-Smirnov statistic for each classifier and how to compare the data of one cluster against the other clusters through the Kolmogorov-Smirnov test, respectively.
Recalling the classification problems from the Track 06 and Track 10 content
Track 06 introduced the concept of data separation and classification through a Gaussian Mixture, which enables the creation of two classes from a randomly generated data set, as described in its section 2.5.
Let's recall how to generate this data set with the following code:
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import normal
from scipy.optimize import curve_fit

# Generate a bimodal data set: two Gaussian clusters with different sizes
data = np.concatenate((normal(1, .2, 2500), normal(2, .2, 5000)))

# Histogram of the data; keep the bin centers so that len(x) == len(y)
y, x, _ = plt.hist(data, 100, alpha=.3, label='data')
x = (x[1:] + x[:-1]) / 2

def gauss(x, mu, sigma, A):
    return A * np.exp(-(x - mu)**2 / 2 / sigma**2)

def bimodal(x, mu1, sigma1, A1, mu2, sigma2, A2):
    return gauss(x, mu1, sigma1, A1) + gauss(x, mu2, sigma2, A2)

# Initial guesses for the two Gaussian components and the curve fit
expected = (1, .2, 250, 2, .2, 125)
params, cov = curve_fit(bimodal, x, y, expected)
sigma = np.sqrt(np.diag(cov))

# Plot the fitted bimodal model over the histogram
x_fit = np.linspace(x.min(), x.max(), 500)
plt.plot(x_fit, bimodal(x_fit, *params), color='red', lw=3, label='model')
plt.legend()
plt.show()
The data set can then be separated into two distinct distributions of values with a Gaussian Mixture:
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(n_components=2, random_state=42)
gmm.fit(x.reshape(-1, 1))
target_class = gmm.predict(x.reshape(-1, 1))
target_class
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
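As a quick side check (not part of the original notebook), the parameters estimated by the Gaussian Mixture can also be inspected; the exact values depend on the generated data:
# Side check: inspect the fitted mixture parameters (illustrative; values vary with the data)
print("Weights:  ", gmm.weights_)                       # mixing proportions of the two components
print("Means:    ", gmm.means_.ravel())                 # component means
print("Std devs: ", np.sqrt(gmm.covariances_).ravel())  # component standard deviations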
The next code helps to visualize the two distinct classes of values:
x1 = []
y1 = []
x2 = []
y2 = []
k = 0
# Split the bin centers (x) and the counts (y) according to the predicted class
for elem in target_class:
    if (elem == 0):
        x1.append(x[k])
        y1.append(y[k])
    else:
        x2.append(x[k])
        y2.append(y[k])
    k = k + 1
#plt.figure(figsize=(10,6))
plt.scatter(x1, y1, color='red', label='N_1')
plt.scatter(x2, y2, color='blue', label='N_2')
plt.legend()
plt.show()
How can the one-sample Kolmogorov-Smirnov Test be applied to the previous data?
Now it is interesting to check the normality of each data set through a non-parametric test, i.e., to verify whether each of the clusters found previously can be identified with a normal distribution.
For this task it is valuable to recover the code produced in Track 11, section 1.11, where the Kolmogorov-Smirnov test was applied to identify whether two data sets follow the same distribution: https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
However, two important observations should be made before applying this content:
Observation 1
The one-sample Kolmogorov-Smirnov (KS) test will be applied to the raw data, i.e., to x1 and x2, not to the frequency counts given by y1 and y2.
Observation 2
Remember the meaning of the hypothesis formulation used in the one-sample Kolmogorov-Smirnov test and how to verify it:
Using p-value and significance level alpha:
Reject H0: p-value < alpha
Do not reject H0: p-value >= alpha
where the meanings of the Null and Alternative Hypotheses are:
Null Hypothesis H0: The sample follows the specified distribution.
Alternative Hypothesis Ha: The sample does not follow the specified distribution.
In summary, rejecting or not rejecting the Null Hypothesis means:
Reject H0 with p-value < alpha: The sample does not follow the specified distribution.
Do not reject H0 with p-value >= alpha: There is not enough evidence to conclude that the sample does not follow the specified distribution.
The next Python code uses the previous understanding to create two functions: check_normal(data, alpha=0.05), which checks whether a data set follows a normal distribution, and check_normal_all(list_data), which performs this check for a list of data sets, here x1 and x2.
from scipy.stats import kstest, norm

# Defining a function to check whether a data set follows the normal distribution
def check_normal(data, alpha=0.05):
    # Perform the Kolmogorov-Smirnov test against a normal distribution
    # (with no extra arguments, kstest compares the data against the standard normal N(0, 1))
    ks_statistic, ks_p_value = kstest(data, 'norm')
    print('Alpha = ', alpha)
    print('Kolmogorov-Smirnov statistic = ', ks_statistic)
    print('Kolmogorov-Smirnov p-value = ', ks_p_value)
    # Return True when H0 is rejected (p-value below the significance level)
    if ks_p_value < alpha:
        return True
    else:
        return False

# Applying the check to every data set in a list
def check_normal_all(list_data):
    for data in list_data:
        if check_normal(data):
            print("Reject the null hypothesis. The sample does not come from the normal distribution.")
        else:
            print("Fail to reject the null hypothesis. The sample comes from the normal distribution.")

list_data = [x1, x2]
check_normal_all(list_data)
Alpha = 0.05
Kolmogorov-Smirnov statistic = 0.6250825540765997
Kolmogorov-Smirnov p-value = 2.09700013337507e-19
Reject the null hypothesis. The sample does not come from the normal distribution.
Alpha = 0.05
Kolmogorov-Smirnov statistic = 0.9317980164004254
Kolmogorov-Smirnov p-value = 9.79247550812199e-59
Reject the null hypothesis. The sample does not come from the normal distribution.
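One caveat worth noting: kstest(data, 'norm') compares the sample against the standard normal N(0, 1), so samples centered away from zero are rejected regardless of their shape. Below is a minimal sketch, not part of the original notebook, of testing against a normal whose mean and standard deviation are estimated from the data (strictly speaking, estimating the parameters from the same sample calls for a Lilliefors-type correction, so the p-value is only approximate):
import numpy as np
from scipy.stats import kstest

# Sketch (hypothetical helper): KS test against a normal fitted to the data itself
def check_normal_fitted(data, alpha=0.05):
    data = np.asarray(data, dtype=float)
    mu, sigma = data.mean(), data.std(ddof=1)
    # args passes (loc, scale) to the reference normal distribution
    ks_statistic, ks_p_value = kstest(data, 'norm', args=(mu, sigma))
    print('Kolmogorov-Smirnov statistic = ', ks_statistic)
    print('Kolmogorov-Smirnov p-value = ', ks_p_value)
    return ks_p_value < alpha  # True means H0 (normality) is rejected

# Usage, e.g.: check_normal_fitted(x1) and check_normal_fitted(x2)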
How can the two-sample Kolmogorov-Smirnov Test be applied to the previous data?
The next Python code verifies whether the two data sets follow the same distribution through the two-sample Kolmogorov-Smirnov test.
In summary, rejecting or not rejecting the Null Hypothesis means:
Reject H0 with p-value < alpha: The two samples do not follow the same distribution.
Do not reject H0 with p-value >= alpha: There is not enough evidence to conclude that the two samples follow different distributions.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp
# Defining data sets
sample1 = x1
sample2 = x2
# Perform the Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(sample1, sample2)
# Print the results
print(f"Kolmogorov–Smirnov Statistic: {ks_statistic}")
print(f"P-value: {p_value}")
# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The two samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest different distributions.")
# Plot the histograms with KDE
plt.figure(figsize=(12, 8))
sns.histplot(sample1, bins=20, kde=True, color='b', label='Sample 1')
sns.histplot(sample2, bins=20, kde=True, color='g', label='Sample 2')
plt.legend()
plt.title('Histogram and KDE of Sample Distributions')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Calculate ECDF for both samples
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Get ECDFs
x1, y1 = ecdf(sample1)
x2, y2 = ecdf(sample2)
# Plot the ECDFs
plt.figure(figsize=(12, 8))
plt.step(x1, y1, where='post', label='ECDF Sample 1', color='b')
plt.step(x2, y2, where='post', label='ECDF Sample 2', color='g')
# Highlight the KS statistic: the maximum vertical distance between the two ECDFs
diff = np.abs(np.interp(x1, x2, y2) - y1)
idx = np.argmax(diff)
d_max = diff[idx]
plt.plot([x1[idx], x1[idx]],
         [y1[idx], np.interp(x1, x2, y2)[idx]],
         'k--', label=f'KS Statistic = {ks_statistic:.3f}')
# Adding labels, title, and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Functions (ECDF)')
plt.legend()
plt.grid()
plt.show()
Kolmogorov–Smirnov Statistic: 1.0
P-value: 1.9823306042836678e-29
Reject the null hypothesis. The two samples come from different distributions.
The Python code with the data and the detailed computation for applying the Kolmogorov–Smirnov statistic, verifying whether the two classes follow the same distribution, is given at:
https://colab.research.google.com/drive/1z61sMOIRvedRpD26LFDVU2OqXI-YtP5c?usp=sharing
References
[1] Patriota, Alexandre Galvão et al. “ANOCVA: A Nonparametric Statistical Test to Compare Clustering Structures.” (2018). Site: https://www.semanticscholar.org/paper/ANOCVA%3A-A-Nonparametric-Statistical-Test-to-Compare-Patriota-Vidal/0cfc5889a640dbe2844e7305adf18fbed2aa3f59.
[2] Fujita A, Takahashi DY, Patriota AG, Sato JR. A non-parametric statistical test to compare clusters with applications in functional magnetic resonance imaging data. Stat Med. 2014 Dec 10;33(28):4949-62. doi: 10.1002/sim.6292. Epub 2014 Sep 3. PMID: 25185759. Site: https://pubmed.ncbi.nlm.nih.gov/25185759/
[3] Konrad Abramowicz, Sara Sjöstedt de Luna, Johan Strandberg, Nonparametric bagging clustering methods to identify latent structures from a sequence of dependent categorical data, Computational Statistics & Data Analysis, Volume 177, 107583, 2023. Site: https://www.sciencedirect.com/science/article/pii/S0167947322001633
[4] Comparing sample distributions with the Kolmogorov-Smirnov (KS) test: How to compare samples and understand if they come from the same distribution using python. Site: https://towardsdatascience.com/comparing-sample-distributions-with-the-kolmogorov-smirnov-ks-test-a2292ad6fee5