1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One sample z-test and its relation with the two-sample z-test
1.5. One sample t-test and its relation with the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine with hypothesis testing?
2.4. Using Chi-Square fit to check if Benford's Law holds or not
2.5. Using Kolmogorov-Smirnov fit to check if the Pareto principle holds or not
What is the Kolmogorov-Smirnov Test?
The Kolmogorov–Smirnov test is an efficient way to determine whether two samples differ significantly from each other. It is commonly used to check the uniformity of random numbers. Uniformity is one of the most important properties of any random number generator, and the Kolmogorov–Smirnov test can be used to verify it [1].
The Kolmogorov–Smirnov test is versatile and can be employed to evaluate whether two underlying one-dimensional probability distributions vary. It serves as an effective tool to determine the statistical significance of differences between two sets of data.
Kolmogorov Distribution
The Kolmogorov distribution describes the behavior of the test statistic, often denoted as D: the maximum absolute difference between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution.
The probability density function (PDF) of the Kolmogorov distribution is not expressed in a simple analytical form; tables or statistical software are commonly used to obtain critical values for the test. The distribution is influenced by the sample size, and the critical values depend on the significance level chosen for the test. Asymptotically, its cumulative distribution function can be written as the series

P(\sqrt{n}\, D_n \leq x) \approx 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2 k^2 x^2}

where:
n is the sample size.
x is the normalized Kolmogorov-Smirnov statistic.
k is the index of summation in the series.
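As a quick sanity check, the series above can be evaluated numerically and compared with scipy.stats.kstwobign, which implements this limiting distribution of sqrt(n)·Dn; the helper kolmogorov_cdf below is a minimal sketch, not a library function.
import numpy as np
from scipy.stats import kstwobign

def kolmogorov_cdf(x, terms=100):
    """Partial sum of the asymptotic Kolmogorov series CDF."""
    k = np.arange(1, terms + 1)
    return 1 - 2 * np.sum((-1) ** (k - 1) * np.exp(-2 * k**2 * x**2))

for x in [0.5, 1.0, 1.5]:
    # Both values should agree closely for moderate x
    print(f"x={x}: series={kolmogorov_cdf(x):.6f}, kstwobign={kstwobign.cdf(x):.6f}")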
The next code generates and plots the probability density function (PDF) of the Kolmogorov-Smirnov distribution for a specified sample size n. The key points of the code are:
1. Import Libraries: The code imports necessary libraries: numpy for numerical operations, matplotlib.pyplot and seaborn for plotting, and scipy.stats for statistical functions.
2. Set Sample Size: A variable n is set to 10, indicating the sample size for which the Kolmogorov-Smirnov distribution will be evaluated.
3. Generate Uniformly Distributed Random Values: An array x of 1000 random values is generated from a uniform distribution between 0 and 1 using np.random.uniform.
4. Calculate PDF of Kolmogorov-Smirnov Distribution: The PDF of the Kolmogorov-Smirnov distribution for sample size n is calculated at the points specified in x using stats.kstwo.pdf.
5. Plot the Results:
A line plot is created using seaborn to visualize the PDF of the Kolmogorov-Smirnov distribution. The x-axis represents the uniformly distributed random values, and the y-axis represents the corresponding PDF values.
The plot includes a title indicating the sample size n, and labels for the x and y axes.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Calculate samples
n = 10
x = np.random.uniform(0, 1, 1000)
y = stats.kstwo.pdf(x, n=n)
plt.figure(figsize=(8, 5))
sns.lineplot(x=x, y=y)
plt.title(f"Kolmogorov-Smirnov Distribution for n={n}")
plt.xlabel('x')
plt.ylabel('PDF')
This next code will create a plot that shows the cumulative distribution function (CDF) of the Kolmogorov-Smirnov distribution.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
# Calculate samples
n = 10
x = np.random.uniform(0, 1, 1000)
y_cdf = stats.kstwo.cdf(x, n=n)
# Set seaborn style
sns.set(style="whitegrid")
# Create the figure and axes
plt.figure(figsize=(10, 6))
# Plot the CDF line
sns.lineplot(x=x, y=y_cdf, color='green', label='CDF')
# Title and labels
plt.title(f"Kolmogorov-Smirnov Cumulative Distribution for n={n}", fontsize=16)
plt.xlabel('x', fontsize=14)
plt.ylabel('Cumulative Probability', fontsize=14)
# Show the legend
plt.legend()
# Show the plot
plt.show()
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
When to use the Kolmogorov-Smirnov Test?
The main idea behind the Kolmogorov-Smirnov test is to check whether [1]:
One Sample Kolmogorov-Smirnov Test: determine whether a sample comes from a specific distribution.
Two-Sample Kolmogorov–Smirnov Test: compare two independent samples to assess whether they come from the same distribution.
In short, the one-sample test compares a sample's empirical CDF against a theoretical reference CDF, while the two-sample test compares the empirical CDFs of two independent samples.
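As a minimal, hedged illustration of the two variants (both functions are from scipy.stats; the samples here are synthetic):
import numpy as np
from scipy.stats import kstest, ks_2samp

rng = np.random.default_rng(1)
x = rng.normal(0, 1, 50)
y = rng.normal(0, 1, 60)

# One-sample: compare x against a fully specified reference distribution
print(kstest(x, 'norm'))
# Two-sample: compare the two samples against each other
print(ks_2samp(x, y))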
Let’s detail the situations in which the Kolmogorov-Smirnov test can be applied and the expected output:
Comparison of Probability Distributions: The test is used to evaluate whether two samples exhibit the same probability distribution.
Compare the shape of the distributions: If we assume that the shapes of the probability distributions of the two samples are similar, the test assesses the maximum absolute difference between the cumulative distributions of the two samples.
Check Distributional Differences: The test quantifies the maximum difference between the cumulative probability distributions, and a higher value indicates greater dissimilarity in the shape of the distributions.
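To make the "maximum difference" idea concrete, here is a minimal sketch, assuming two synthetic samples, that computes the KS distance directly from the empirical CDFs:
import numpy as np

# Evaluate both right-continuous ECDFs over the pooled sample values and
# take the largest absolute gap between them
rng = np.random.default_rng(0)
a = rng.normal(0, 1, 200)
b = rng.normal(0.3, 1, 200)
pooled = np.sort(np.concatenate([a, b]))
ecdf_a = np.searchsorted(np.sort(a), pooled, side='right') / len(a)
ecdf_b = np.searchsorted(np.sort(b), pooled, side='right') / len(b)
print("KS distance:", np.max(np.abs(ecdf_a - ecdf_b)))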
How does the one-sample Kolmogorov-Smirnov (KS) Test work?
Below are the steps for how the Kolmogorov-Smirnov (KS) test works [1]:
Hypotheses Formulation:
Null Hypothesis : The sample follows a specified distribution.
Alternative Hypothesis: The sample does not follow the specified distribution.
Selection of a Reference Distribution: A theoretical distribution (e.g., normal, exponential) is decided against which you want to test the sample distribution. This distribution is usually based on theoretical expectations or prior knowledge.
Calculation of the Test Statistic (D): For a one-sample Kolmogorov-Smirnov test, the test statistic (Dn) is the maximum vertical deviation between the empirical distribution function (EDF) of the sample and the cumulative distribution function (CDF) of the reference distribution: Dn = sup_x |Fn(x) - F(x)|. For a two-sample Kolmogorov-Smirnov test, the test statistic compares the EDFs of the two independent samples.
Determination of Critical Value or P-value: The test statistic (D) is compared to a critical value from the Kolmogorov-Smirnov distribution table or, more commonly, a p-value is calculated. If the p-value is less than the significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the sample distribution does not match the specified distribution.
Interpretation of Results: If the null hypothesis is rejected, it indicates that there is evidence to suggest that the sample does not follow the specified distribution. The alternative hypothesis, suggesting a difference, is accepted.
The next Python code provides a visual understanding of how the one-sample Kolmogorov-Smirnov test can be applied to verify whether a given data set follows a normal distribution.
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
# Generate a small sample size (e.g., 25 samples) from a normal distribution
np.random.seed(0) # For reproducibility
sample_size = 25
sample = np.random.normal(loc=0, scale=1, size=sample_size)
# Perform the Kolmogorov-Smirnov test for normality
d_statistic, p_value = stats.kstest(sample, 'norm')
# Print the KS test result
print(f"KS Statistic: {d_statistic}")
print(f"P-Value: {p_value}")
# Plot the empirical distribution function (EDF) of the sample
sorted_sample = np.sort(sample)
y_vals = np.arange(1, sample_size + 1) / sample_size
# Plot the cumulative distribution function (CDF) of the reference normal distribution
x_vals = np.linspace(min(sample), max(sample), 100)
cdf_vals = stats.norm.cdf(x_vals)
# Plotting
plt.figure(figsize=(10, 6))
plt.step(sorted_sample, y_vals, where='post', label='Empirical CDF')
plt.plot(x_vals, cdf_vals, label='Reference Normal CDF', color='red')
# Highlight the KS statistic on the plot
# Find the point of maximum difference
d_max_index = np.argmax(np.abs(y_vals - stats.norm.cdf(sorted_sample)))
d_max = np.abs(y_vals[d_max_index] - stats.norm.cdf(sorted_sample[d_max_index]))
plt.plot([sorted_sample[d_max_index], sorted_sample[d_max_index]],
         [stats.norm.cdf(sorted_sample[d_max_index]), y_vals[d_max_index]],
         'k--', label=f'KS Statistic = {d_statistic:.3f}')
# Adding labels and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Kolmogorov-Smirnov Test for Normality')
plt.legend()
plt.grid()
# Show plot
plt.show()
KS Statistic: 0.26842179992563575
P-Value: 0.044235532757121
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
One Sample Kolmogorov-Smirnov Test applied to sample data
The next Python code is useful to better understand step 4, which is about:
4. Determination of Critical Value or P-value: The test statistic (Dn) is compared to a critical value from the Kolmogorov-Smirnov distribution table or, more commonly, a p-value is calculated. If the p-value is less than the significance level (commonly 0.05), the null hypothesis is rejected, suggesting that the sample distribution does not match the specified distribution. This could be translated as the following rules:
4.1. Using the determination of critical value (CV):
Reject H0: Dn > CV
Do not reject H0: Dn <= CV
4.2. Using p-value and significance level alpha:
Reject H0: p-value < alpha
Do not reject H0: p-value >= alpha
It is important to remember the meaning of the Null and Alternative Hypothesis:
Null Hypothesis H0: The sample follows a specified distribution.
Alternative Hypothesis Ha: The sample does not follow the specified distribution.
And the meaning of rejecting or not the Null Hypothesis:
Reject H0: The sample does not follow the specified distribution.
Do not reject H0: There is no evidence that the sample deviates from the specified distribution.
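The rules above can be condensed into a small helper; this is an illustrative sketch (the name ks_decision is not from any library):
# Hedged helper translating rules 4.1 and 4.2 into code
def ks_decision(d_stat, critical_value=None, p_value=None, alpha=0.05):
    if critical_value is not None and d_stat > critical_value:
        return "Reject H0"
    if p_value is not None and p_value < alpha:
        return "Reject H0"
    return "Do not reject H0"

print(ks_decision(0.30, critical_value=0.24))  # Reject H0
print(ks_decision(0.10, p_value=0.40))         # Do not reject H0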
The next Python code illustrates the previous rules by verifying whether a given sample follows a normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from numpy.random import seed, poisson
from scipy.stats import kstest, kstwobign, norm
# Set seed (e.g., make this example reproducible)
seed(0)
# Generate a sample dataset of 100 values that follow a Poisson distribution with mean=5
sample = poisson(5, 100)
# Perform the Kolmogorov-Smirnov test against a normal distribution
ks_statistic, ks_p_value = kstest(sample, 'norm')
# Step 5: Comparing
alpha = 0.05
# Obtain the critical value: kstwobign is the limiting distribution of
# sqrt(n) * Dn, so the critical value for Dn itself is scaled by sqrt(n)
critical_value = kstwobign.ppf(1 - alpha) / np.sqrt(len(sample))
print(f"Kolmogorov-Smirnov Statistic: {ks_statistic}")
print(f"Critical value: {critical_value}")
print(f"Alpha: {alpha}")
print(f"P-value: {ks_p_value}")
if ks_statistic > critical_value or ks_p_value < alpha:
    print("Reject the null hypothesis. The sample does not come from the specified distribution.")
else:
    print("Fail to reject the null hypothesis. The sample comes from the specified distribution.")
Kolmogorov-Smirnov Statistic: 0.9072498680518208
Critical value: 0.13580986393225505
Alpha: 0.05
P-value: 1.0908062873170218e-103
Reject the null hypothesis. The sample does not come from the specified distribution.
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
One Sample Kolmogorov-Smirnov Test applied to raw data
Suppose we have a data set of the demand for a product over 30 days. The objective is to determine whether the data follow a normal distribution with a mean of 50 and a standard deviation of 10 [3]:
data = [67, 63, 33, 69, 53, 51, 49, 78, 48, 42, 72, 52, 47, 66, 58, 44, 44, 56, 28, 25, 36, 32, 61, 57, 38, 35, 76, 58, 48, 59]
The next structured approach ensures a thorough examination of the data's adherence to normality, leveraging statistical methods and visual aids to elucidate the results:
1. Data Initialization: The provided data, representing the demand of a product over 30 days, is encapsulated into a pandas DataFrame.
2. Data Ordering: The data is sorted in ascending order based on the demand values to facilitate subsequent calculations.
3. Frequency Calculation: A frequency column is appended to the DataFrame, indicating the count of occurrences for each demand value using the cumcount() method.
4. Observed Relative Cumulative Frequency: An observed relative cumulative frequency column is computed by calculating the cumulative count of occurrences and normalizing it by the total number of observations.
5. Expected Relative Cumulative Frequency: The expected cumulative frequency for each demand value is derived using the cumulative distribution function (CDF) of the normal distribution N(50,10). This calculation leverages the statistics.NormalDist class.
6. Difference Calculation: A new column is created to store the absolute differences between the observed and expected cumulative frequencies, quantifying the deviation for each demand value.
7. Kolmogorov-Smirnov Statistic (Dn): The maximum difference (Dn) is identified from the difference column, representing the Kolmogorov-Smirnov statistic.
8. Critical Value Comparison: The calculated Dn is compared against a predefined critical value (0.24 for a sample size of 30; see the short check after this list). Based on this comparison, the script concludes whether the data conform to the hypothesized normal distribution N(50,10).
9. Visualization: A graphical representation is generated to visually juxtapose the empirical cumulative distribution function (CDF) against the theoretical normal CDF. Additionally, the maximum difference is highlighted on the plot to illustrate the KS statistic.
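Before the full script, a brief hedged check on where the 0.24 critical value comes from: for moderate-to-large n, the 5% critical value for Dn is approximately the asymptotic value kstwobign.ppf(0.95) / sqrt(n); exact tables give a slightly smaller number for n = 30.
import numpy as np
from scipy.stats import kstwobign

# Asymptotic 5% critical value for Dn at n = 30 (tables give ~0.24)
n = 30
print(round(kstwobign.ppf(0.95) / np.sqrt(n), 3))  # ~0.248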
import pandas as pd
import numpy as np
from scipy import stats
import statistics
import matplotlib.pyplot as plt
# Given data
data = [67, 63, 33, 69, 53, 51, 49, 78, 48, 42, 72, 52, 47, 66, 58, 44, 44, 56, 28, 25, 36, 32, 61, 57, 38, 35, 76, 58, 48, 59]
# Step 1 - Create DataFrame and Order the data
df = pd.DataFrame(data, columns=["Demand of a Product"])
df_1 = df.sort_values(by="Demand of a Product").reset_index(drop=True)
# Step 2 - Add a frequency column indicating how many times each number appears
df_1["Frequency"] = df_1.groupby('Demand of a Product', sort=False).cumcount() + 1
df_2 = df_1
# Step 3 - Add Observed relative cumulative frequency column
df_2["Count"] = np.arange(1, len(df_2) + 1)
df_2["Obs. % Cum. Freq."] = df_2["Count"] / len(df_2)
df_3 = df_2
print(df_3)
# Step 4 - Add expected relative cumulative frequency column
normal = statistics.NormalDist(50, 10)
df_3["Exp. % Cum. Freq."] = df_3["Demand of a Product"].apply(lambda x: normal.cdf(x))
df_4 = df_3
# Step 5 - Add difference column
df_4["Difference"] = abs(df_4["Obs. % Cum. Freq."] - df_4["Exp. % Cum. Freq."])
df_5 = df_4
print(df_5.head())
# Step 6 - Get the max of the difference
Dn = max(df_5["Difference"])
print(f"Dn: {Dn}")
# Step 7 - Compare Critical Value of K-S vs Dn Value
cv = 0.24 # Critical value for Kolmogorov-Smirnov test with sample size of 30
if Dn <= cv:
    print("Your data fits with a normal distribution N(50,10)")
else:
    print("Your data DO NOT fit with a normal distribution N(50,10)")
# Plotting the results
plt.figure(figsize=(10, 6))
plt.step(df_5["Demand of a Product"], df_5["Obs. % Cum. Freq."], where='post', label='Empirical CDF')
plt.plot(df_5["Demand of a Product"], df_5["Exp. % Cum. Freq."], label='Reference Normal CDF', color='red')
# Highlight the KS statistic on the plot
d_max_index = df_5["Difference"].idxmax()
plt.plot([df_5.at[d_max_index, "Demand of a Product"], df_5.at[d_max_index, "Demand of a Product"]],
         [df_5.at[d_max_index, "Obs. % Cum. Freq."], df_5.at[d_max_index, "Exp. % Cum. Freq."]],
         'k--', label=f'KS Statistic = {Dn:.3f}')
# Adding labels and legend
plt.xlabel('Demand of a Product')
plt.ylabel('Cumulative Probability')
plt.title('Kolmogorov-Smirnov Test for Normality')
plt.legend()
plt.grid()
# Show plot
plt.show()
Demand of a Product Frequency Count Obs. % Cum. Freq.
0 25 1 1 0.033333
1 28 1 2 0.066667
2 32 1 3 0.100000
3 33 1 4 0.133333
4 35 1 5 0.166667
5 36 1 6 0.200000
6 38 1 7 0.233333
7 42 1 8 0.266667
8 44 1 9 0.300000
9 44 2 10 0.333333
10 47 1 11 0.366667
11 48 1 12 0.400000
12 48 2 13 0.433333
13 49 1 14 0.466667
14 51 1 15 0.500000
15 52 1 16 0.533333
16 53 1 17 0.566667
17 56 1 18 0.600000
18 57 1 19 0.633333
19 58 1 20 0.666667
20 58 2 21 0.700000
21 59 1 22 0.733333
22 61 1 23 0.766667
23 63 1 24 0.800000
24 66 1 25 0.833333
25 67 1 26 0.866667
26 69 1 27 0.900000
27 72 1 28 0.933333
28 76 1 29 0.966667
29 78 1 30 1.000000
Demand of a Product Frequency Count Obs. % Cum. Freq. \
0 25 1 1 0.033333
1 28 1 2 0.066667
2 32 1 3 0.100000
3 33 1 4 0.133333
4 35 1 5 0.166667
Exp. % Cum. Freq. Difference
0 0.006210 0.027124
1 0.013903 0.052763
2 0.035930 0.064070
3 0.044565 0.088768
4 0.066807 0.099859
Dn: 0.12574688224992647
Your data fits with a normal distribution N(50,10)
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
Two Sample Kolmogorov-Smirnov Test
This is a two-sided test for the null hypothesis that 2 independent samples are drawn from the same continuous distribution. If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same [2]. This leads to the following modification of the Null and Alternative hypothesis:
Null Hypothesis H0: The two samples come from the same distribution.
Alternative Hypothesis Ha: The two samples do not come from the same distribution.
In terms of numerical comparisons, this translates into:
4.1. Using the determination of critical value (CV):
Reject H0: Dn > CV
Do not reject H0: Dn <= CV
4.2. Using p-value and significance level alpha:
Reject H0: p-value < alpha
Do not reject H0: p-value >= alpha
Remember that the decision could be based on comparing the p-value with a chosen significance level (e.g., 0.05). If the p-value is less than the significance level, reject the null hypothesis, indicating that the two samples come from different distributions.
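For completeness, a small hedged sketch of rule 4.1 in the two-sample case: the asymptotic critical value is c(alpha) * sqrt((n + m) / (n * m)), where c(alpha) comes from the same limiting distribution used earlier.
import numpy as np
from scipy.stats import kstwobign

# Asymptotic two-sample critical value for sample sizes n and m
n, m = 100, 120
alpha = 0.05
c_alpha = kstwobign.ppf(1 - alpha)  # c(0.05) ~ 1.358
d_crit = c_alpha * np.sqrt((n + m) / (n * m))
print(f"Critical value for n={n}, m={m}: {d_crit:.3f}")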
Two Sample Kolmogorov-Smirnov Test - Testing two normal distributions - Python code
The next code shows how the two-sample Kolmogorov-Smirnov test can be applied to compare two normal distributions with different parameters [1].
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp
# Set the seed for reproducibility
np.random.seed(42)
# Generate two sample datasets
sample1 = np.random.normal(0, 1, 100)
sample2 = np.random.normal(0.5, 1.5, 120)
# Perform the Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(sample1, sample2)
# Print the results
print(f"Kolmogorov–Smirnov Statistic: {ks_statistic}")
print(f"P-value: {p_value}")
# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The two samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest different distributions.")
# Plot the histograms with KDE
plt.figure(figsize=(12, 8))
sns.histplot(sample1, bins=20, kde=True, color='b', label='Sample 1')
sns.histplot(sample2, bins=20, kde=True, color='g', label='Sample 2')
plt.legend()
plt.title('Histogram and KDE of Sample Distributions')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Calculate ECDF for both samples
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Get ECDFs
x1, y1 = ecdf(sample1)
x2, y2 = ecdf(sample2)
# Plot the ECDFs
plt.figure(figsize=(12, 8))
plt.step(x1, y1, where='post', label='ECDF Sample 1', color='b')
plt.step(x2, y2, where='post', label='ECDF Sample 2', color='g')
# Highlight the KS statistic: interpolate ECDF 2 onto the x-values of ECDF 1
# and locate the maximum absolute gap
y2_on_x1 = np.interp(x1, x2, y2)
idx = np.argmax(np.abs(y2_on_x1 - y1))
plt.plot([x1[idx], x1[idx]], [y1[idx], y2_on_x1[idx]],
         'k--', label=f'KS Statistic = {ks_statistic:.3f}')
# Adding labels, title, and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Functions (ECDF)')
plt.legend()
plt.grid()
plt.show()
Kolmogorov–Smirnov Statistic: 0.35833333333333334
P-value: 9.93895980740741e-07
Reject the null hypothesis. The two samples come from different distributions.
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
Two Sample Kolmogorov-Smirnov Test - Testing two different distributions - Python code
The next code shows how the two-sample Kolmogorov-Smirnov test can be applied to compare two different distributions with different parameters [4].
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import ks_2samp
# Set the seed for reproducibility
np.random.seed(42)
# Generate two sample datasets
data1 = np.random.normal(7, 2, 100) # Normal distribution
data2 = np.random.lognormal(2, 0.2, 100) # Log-normal distribution
# Perform the Kolmogorov-Smirnov test
ks_statistic, p_value = ks_2samp(data1, data2)
# Print the results
print(f"Kolmogorov–Smirnov Statistic: {ks_statistic}")
print(f"P-value: {p_value}")
# Decision based on p-value
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis. The two samples come from different distributions.")
else:
    print("Fail to reject the null hypothesis. There is not enough evidence to suggest different distributions.")
# Plot the histograms with KDE
plt.figure(figsize=(12, 8))
sns.histplot(data1, bins=20, kde=True, color='b', label='Data 1 (Normal)')
sns.histplot(data2, bins=20, kde=True, color='g', label='Data 2 (Log-normal)')
plt.legend()
plt.title('Histogram and KDE of Sample Distributions')
plt.xlabel('Value')
plt.ylabel('Frequency')
plt.show()
# Calculate ECDF for both samples
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y
# Get ECDFs
x1, y1 = ecdf(data1)
x2, y2 = ecdf(data2)
# Plot the ECDFs
plt.figure(figsize=(12, 8))
plt.step(x1, y1, where='post', label='ECDF Data 1 (Normal)', color='b')
plt.step(x2, y2, where='post', label='ECDF Data 2 (Log-normal)', color='g')
# Highlight the KS statistic: interpolate ECDF 2 onto the x-values of ECDF 1
# and locate the maximum absolute gap
y2_on_x1 = np.interp(x1, x2, y2)
idx = np.argmax(np.abs(y2_on_x1 - y1))
plt.plot([x1[idx], x1[idx]], [y1[idx], y2_on_x1[idx]],
         'k--', label=f'KS Statistic = {ks_statistic:.3f}')
# Adding labels, title, and legend
plt.xlabel('Sample Values')
plt.ylabel('Cumulative Probability')
plt.title('Empirical Cumulative Distribution Functions (ECDF)')
plt.legend()
plt.grid()
plt.show()
Kolmogorov–Smirnov Statistic: 0.2
P-value: 0.03638428787491733
Reject the null hypothesis. The two samples come from different distributions.
The Python code, with the data and detailed computations to apply the Goodness-of-Fit test, is given at:
https://colab.research.google.com/drive/1FHK7ICgAZVQCRd4_e5G76hFNIwzfvf5V?usp=sharing
An alternative for checking the manual computations of the Kolmogorov-Smirnov test is given in [5].
References:
[1] https://www.geeksforgeeks.org/kolmogorov-smirnov-test-ks-test/
[2] https://towardsdatascience.com/non-parametric-tests-in-hypothesis-testing-138d585c3548
[4] https://www.statology.org/kolmogorov-smirnov-test-python/
[5] https://python.plainenglish.io/test-of-normality-kolmogorov-smirnov-test-d047a76f5efe