1. Concepts & Definitions
1.1. A Review on Parametric Statistics
1.2. Parametric tests for Hypothesis Testing
1.3. Parametric vs. Non-Parametric Test
1.4. One-sample z-test and its relation with the two-sample z-test
1.5. One-sample t-test and its relation with the two-sample t-test
1.6. Welch's two-sample t-test: two populations with different variances
1.7. Non-Parametric test for Hypothesis Testing: Mann-Whitney U Test
1.8. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign-Rank Test
1.9. Non-Parametric test for Hypothesis Testing: Wilcoxon Sign Test
1.10. Non-Parametric test for Hypothesis Testing: Chi-Square Goodness-of-Fit
1.11. Non-Parametric test for Hypothesis Testing: Kolmogorov-Smirnov
1.12. Non-Parametric tests for comparing machine learning models
2. Problem & Solution
2.1. Using Wilcoxon Sign Test to compare clustering methods
2.2. Using Wilcoxon Sign-Rank Test to compare clustering methods
2.3. What is A/B testing and how to combine it with hypothesis testing?
2.4. Using the Chi-Square Goodness-of-Fit test to check if Benford's Law holds or not
2.5. Using the Kolmogorov-Smirnov test to check if the Pareto principle holds or not
What is the Chi-Square Test?
The Chi-Square test is a statistical procedure for determining whether observed data differ significantly from expected data. It can also be used to assess whether two categorical variables in our data are associated, i.e. whether a difference between them is due to chance or reflects a real relationship between the variables [1].
There are two main types of Chi-Square tests [2]:
Chi-Square Goodness-of-Fit: use the goodness-of-fit test to decide whether a population with an unknown distribution “fits” a known distribution.
Chi-Square Test of Independence: use the test for independence to decide whether two variables (factors) are independent or dependent, i.e. whether the two variables have a significant association between them or not.
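As a quick orientation (a minimal sketch; the toy counts below are made up purely for illustration), SciPy provides one routine for each of these two tests: stats.chisquare for goodness-of-fit and stats.chi2_contingency for independence. Both are used in the detailed examples later in this section.
import scipy.stats as stats

# Goodness-of-fit: do the observed counts fit a uniform distribution?
observed = [18, 22, 20, 20]  # toy counts, for illustration only
expected = [sum(observed) / 4] * 4  # uniform expectation
stat_gof, p_gof = stats.chisquare(observed, expected)
print(f"Goodness-of-fit: chi2 = {stat_gof:.3f}, p = {p_gof:.3f}")

# Independence: are the row and column factors of a contingency table related?
table = [[10, 20], [20, 10]]  # toy 2x2 contingency table
stat_ind, p_ind, dof, expected_table = stats.chi2_contingency(table)
print(f"Independence: chi2 = {stat_ind:.3f}, p = {p_ind:.3f}, dof = {dof}")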
Chi-Square Goodness-of-Fit Test - Step-by-Step Manual Example
A factory tracks the number of machinery breakdowns each day of the week. The observed counts for the seven days are as follows:
Day           1    2    3    4    5    6    7
Breakdowns   14   22   16   18   12   19   11
We want to determine if the breakdowns are uniformly distributed across the week using the following steps:
1. State the Hypotheses:
Null Hypothesis (H0): Breakdowns are uniformly distributed across the week.
Alternative Hypothesis (Ha): Breakdowns are not uniformly distributed across the week.
2. Calculate the Expected Frequency:
To find the expected frequency if the breakdowns are uniformly distributed, we sum the total number of breakdowns and divide by the number of days.
Total Breakdowns = 14 + 22 + 16 + 18 + 12 + 19 + 11 = 112
Expected Frequency per day = 112/7 = 16
3. Calculate the Chi-Square Statistic:
χ²_obs = Σ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency for each category (day).
χ²_obs = 0.25 + 2.25 + 0 + 0.25 + 1 + 0.5625 + 1.5625 = 5.875
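As a quick check of the arithmetic above (a minimal verification sketch, not part of the original example), the individual contributions (Oᵢ − Eᵢ)²/Eᵢ and their sum can be reproduced in a few lines of Python:
observed = [14, 22, 16, 18, 12, 19, 11]
expected = 16  # 112 breakdowns / 7 days
contributions = [(o - expected) ** 2 / expected for o in observed]
print(contributions)  # [0.25, 2.25, 0.0, 0.25, 1.0, 0.5625, 1.5625]
print(sum(contributions))  # 5.875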
4. Determine the Degrees of Freedom:
Degrees of freedom (df) = 𝑛−1, where: 𝑛 is the number of categories (days).
df = 7 − 1 = 6
5. Find the Critical Value:
At a 5% level of significance (α = 0.05) and 6 degrees of freedom, the critical value from the Chi-Square distribution table is approximately χ²_crit = 12.592.
6. Compare the Chi-Square Statistic to the Critical Value:
The observed statistic χ²_obs is compared with the critical value χ²_crit to decide whether to reject the null hypothesis:
Reject H0: χ²_obs >= χ²_crit
Do not reject H0: χ²_obs < χ²_crit
For the data of this problem:
χ²_obs = 5.875 < χ²_crit = 12.592
7. Conclusion:
Since the calculated Chi-Square value is less than the critical value, we fail to reject the null hypothesis. The data are consistent with breakdowns being uniformly distributed across the week.
Chi-Square Goodness-of-Fit Test - Step-by-Step Manual Example - Python Code
The next Python code automates the manual calculations above and shows how to use the command stats.chi2.ppf from the scipy.stats library to compute the Chi-Square critical value directly, so that no distribution table is needed.
import scipy.stats as stats
# Observed frequencies
observed = [14, 22, 16, 18, 12, 19, 11]
# Expected frequencies (uniform distribution)
total_breakdowns = sum(observed)
expected = [total_breakdowns / 7] * 7
# Calculate Chi-Square statistic
chi_square_statistic, p_value = stats.chisquare(observed, expected)
# Degrees of freedom
df = len(observed) - 1
# Critical value at 5% significance level
critical_value = stats.chi2.ppf(0.95, df)
# Print the results
print(f"Chi-Square Statistic: {chi_square_statistic}")
print(f"Degrees of Freedom: {df}")
print(f"Critical Value: {critical_value}")
print(f"P-Value: {p_value}")
if chi_square_statistic < critical_value:
    print("Fail to reject the null hypothesis: breakdowns are consistent with a uniform distribution.")
else:
    print("Reject the null hypothesis: breakdowns are not uniformly distributed.")
Chi-Square Statistic: 5.875
Degrees of Freedom: 6
Critical Value: 12.591587243743977
P-Value: 0.43733749366050445
Fail to reject the null hypothesis: breakdowns are consistent with a uniform distribution.
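Since stats.chisquare already returns the p-value (0.4373 above), the same decision can equivalently be made by comparing the p-value with the significance level, with no critical value needed. A minimal sketch of this alternative decision rule, using the same data and α = 0.05:
import scipy.stats as stats

observed = [14, 22, 16, 18, 12, 19, 11]
expected = [sum(observed) / 7] * 7
alpha = 0.05  # significance level

# stats.chisquare returns both the statistic and the p-value
statistic, p_value = stats.chisquare(observed, expected)
if p_value < alpha:
    print("Reject the null hypothesis: breakdowns are not uniformly distributed.")
else:
    print("Fail to reject the null hypothesis: breakdowns are consistent with a uniform distribution.")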
The Python code with the data and the detailed computation to apply the Goodness-of-Fit test is given at:
https://colab.research.google.com/drive/19WdSUSXTuuVL1bB1NhtG6xvqJrC2DPSS?usp=sharing
Chi-Square Test of Independence - Step-by-Step Manual Example
The test for independence decides whether two variables (factors) are independent or dependent, i.e. whether the two variables have a significant association between them or not. In this case, there are two qualitative survey questions or experiments, and a contingency table is constructed. The goal is to see if the two variables are unrelated (independent) or related (dependent). The null and alternative hypotheses are:
Null Hypothesis (H0): The two variables (factors) are independent.
Alternative Hypothesis (Ha): The two variables (factors) are dependent.
Let’s take an example. Suppose we want to investigate whether gender and preferred shirt color are independent, i.e. whether a person’s gender influences their color choice. We conducted a survey and organized the data in the following table of observed values [2]:
         Black  White  Red  Blue
Male        48     12   33    57
Female      34     46   42    26
1. State the Hypotheses:
Null Hypothesis (H0): Gender and preferred shirt color are independent.
Alternative Hypothesis (Ha): Gender and preferred shirt color are not independent.
2. Calculate the Expected Frequency:
To calculate the Chi-Square test statistic we first need the expected values. So, add up the rows and columns to obtain the row totals, column totals and overall total:
           Black  White  Red  Blue  Row Total
Male          48     12   33    57        150
Female        34     46   42    26        148
Col Total     82     58   75    83        298
From these totals we can calculate the expected value of each entry using Equation (1) to obtain the Expected Value Table:
Expected value = (row total * column total)/overall total (1)
For example, the expected value for Male and Black is computed as (150 x 82)/298 = 41.27 ≈ 41.3. Repeating this for every cell gives:
           Black  White    Red   Blue
Male       41.28  29.19  37.75  41.78
Female     40.72  28.81  37.25  41.22
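As an illustration of Equation (1) in code (a small sketch, not part of the original example), the whole expected value table can be computed at once from the row and column totals with numpy:
import numpy as np

observed = np.array([[48, 12, 33, 57],
                     [34, 46, 42, 26]])
row_totals = observed.sum(axis=1)  # [150, 148]
col_totals = observed.sum(axis=0)  # [82, 58, 75, 83]
grand_total = observed.sum()  # 298

# Equation (1) applied to every cell: E_ij = (row i total * column j total) / overall total
expected = np.outer(row_totals, col_totals) / grand_total
print(expected.round(2))  # [[41.28 29.19 37.75 41.78] [40.72 28.81 37.25 41.22]]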
3. Calculate the Chi-Square Statistic:
χ²_obs = Σ (Oᵢ − Eᵢ)² / Eᵢ
where Oᵢ is the observed frequency and Eᵢ is the expected frequency of each cell, and the sum runs over all cells of the table.
χ²_obs = 34.9572 (using the rounded expected values; with the full-precision expected values the statistic is ≈ 34.97, which is the value the Python code below reports)
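As a quick numerical check (again a minimal sketch, not part of the original example), the statistic is obtained by summing (O − E)²/E over every cell of the table:
import numpy as np

observed = np.array([[48, 12, 33, 57],
                     [34, 46, 42, 26]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

# Sum the contribution of every cell (all eight cells: both rows and all four columns)
chi_square_statistic = ((observed - expected) ** 2 / expected).sum()
print(round(chi_square_statistic, 4))  # 34.9677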
4. Determine the Degrees of Freedom:
Degrees of freedom (df) = (number of rows − 1)*(number of columns − 1),
df = (2-1) * (4-1) = 3.
5. Find the Critical Value:
At a 5% level of significance (α = 0.05) and 3 degrees of freedom, the critical value from the Chi-Square distribution table is approximately χ²_crit = 7.815.
6. Compare the Chi-Square Statistic to the Critical Value:
The observed statistic χ²_obs is compared with the critical value χ²_crit to decide whether to reject the null hypothesis:
Reject H0: χ²_obs >= χ²_crit
Do not reject H0: χ²_obs < χ²_crit
For the data of this problem:
χ²_obs = 34.9572 >= χ²_crit = 7.815
7. Conclusion:
Since the calculated Chi-Square value is greater than the critical value, we reject the null hypothesis. We conclude that gender and preferred shirt color are not independent.
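The whole manual analysis above can also be reproduced with a single call to scipy.stats.chi2_contingency, which returns the statistic, the p-value, the degrees of freedom and the expected table at once. A minimal sketch (the approximate values in the comment follow from the exact expected frequencies):
from scipy.stats import chi2_contingency

observed = [[48, 12, 33, 57],
            [34, 46, 42, 26]]
statistic, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {statistic:.4f}, p-value = {p_value:.2e}, dof = {dof}")
# chi2 ≈ 34.9677, p-value ≈ 1.24e-07, dof = 3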
Chi-Square Test of Independence - Step-by-Step Manual Example - Python Using the Critical Value
The next Python code automates the manual calculations above and shows how to use the command stats.chi2.ppf from the scipy.stats library to compute the Chi-Square critical value directly, so that no distribution table is needed.
import pandas as pd
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# Given dataset
df_dict = {
'Black': [48, 34],
'White': [12, 46],
'Red': [33, 42],
'Blue': [57, 26]
}
dataset_table = pd.DataFrame(df_dict, index=['Male', 'Female'])
print("Dataset Table:")
print(dataset_table)
print()
# Observed Values
Observed_Values = dataset_table.values
print("Observed Values:")
print(Observed_Values)
print()
# Perform chi-square test
val = chi2_contingency(dataset_table)
Expected_Values = val[3]
print("Expected Values:")
print(Expected_Values)
print()
# Degrees of freedom: (rows - 1) * (columns - 1)
no_of_rows = len(dataset_table.index)
no_of_columns = len(dataset_table.columns)
ddof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:", ddof)
print()
# Chi-square statistic: sum (O - E)^2 / E over every cell of the table
chi_square = sum([(o - e) ** 2. / e for o, e in zip(Observed_Values, Expected_Values)])
chi_square_statistic = chi_square.sum()
print(f"Chi-square statistic: {chi_square_statistic:.4f}")
print()
# Critical value
alpha = 0.05
critical_value = chi2.ppf(q=1-alpha, df=ddof)
print('Critical value:', critical_value)
print()
# Significance level
print('Significance level:', alpha)
print('Degree of Freedom:', ddof)
print()
# Hypothesis testing
if chi_square_statistic >= critical_value:
    print("Reject H0: gender and preferred shirt color are not independent (they are dependent)")
else:
    print("Fail to reject H0: gender and preferred shirt color are independent")
print()
Dataset Table:
Black White Red Blue
Male 48 12 33 57
Female 34 46 42 26
Observed Values:
[[48 12 33 57]
[34 46 42 26]]
Expected Values:
[[41.27516779 29.19463087 37.75167785 41.77852349]
[40.72483221 28.80536913 37.24832215 41.22147651]]
Degree of Freedom: 3
Chi-square statistic: 34.9677
Critical value: 7.814727903251179
Significance level: 0.05
Degree of Freedom: 3
Reject H0: gender and preferred shirt color are not independent (they are dependent)
The Python code with the data and the detailed computation to apply the Test of Independence is given at:
https://colab.research.google.com/drive/19WdSUSXTuuVL1bB1NhtG6xvqJrC2DPSS?usp=sharing
Chi-Square Test of Independence - Step-by-Step Manual Example - Python Using the P-Value
The next Python code automates the manual calculations above and shows how to use chi2.cdf from the scipy.stats library to compute the p-value directly, so that the decision is made by comparing the p-value with the significance level instead of using a critical value.
import pandas as pd
from scipy.stats import chi2_contingency
from scipy.stats import chi2
# Given dataset
df_dict = {
'Black': [48, 34],
'White': [12, 46],
'Red': [33, 42],
'Blue': [57, 26]
}
dataset_table = pd.DataFrame(df_dict, index=['Male', 'Female'])
print("Dataset Table:")
print(dataset_table)
print()
# Observed Values
Observed_Values = dataset_table.values
print("Observed Values:")
print(Observed_Values)
print()
# Perform chi-square test
val = chi2_contingency(dataset_table)
Expected_Values = val[3]
print("Expected Values:")
print(Expected_Values)
print()
# Degrees of freedom: (rows - 1) * (columns - 1)
no_of_rows = len(dataset_table.index)
no_of_columns = len(dataset_table.columns)
ddof = (no_of_rows - 1) * (no_of_columns - 1)
print("Degree of Freedom:", ddof)
print()
# Chi-square statistic: sum (O - E)^2 / E over every cell of the table
chi_square = sum([(o - e) ** 2. / e for o, e in zip(Observed_Values, Expected_Values)])
chi_square_statistic = chi_square.sum()
print(f"Chi-square statistic: {chi_square_statistic:.4f}")
print()
# Significance level
alpha = 0.05
# p-value: probability of a statistic at least this large under H0
p_value = 1 - chi2.cdf(x=chi_square_statistic, df=ddof)
print(f'p-value: {p_value:.2e}')
print()
# Decision inputs: compare the p-value with the significance level
print('Significance level:', alpha)
print(f'p-value: {p_value:.2e}')
if p_value <= alpha:
    print("Reject H0: gender and preferred shirt color are not independent (they are dependent)")
else:
    print("Fail to reject H0: gender and preferred shirt color are independent")
Dataset Table:
Black White Red Blue
Male 48 12 33 57
Female 34 46 42 26
Observed Values:
[[48 12 33 57]
[34 46 42 26]]
Expected Values:
[[41.27516779 29.19463087 37.75167785 41.77852349]
[40.72483221 28.80536913 37.24832215 41.22147651]]
Degree of Freedom: 3
Chi-square statistic: 34.9677
p-value: 1.24e-07
Significance level: 0.05
p-value: 1.24e-07
Reject H0: gender and preferred shirt color are not independent (they are dependent)
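A small side note on the p-value computation: for very small p-values, 1 - chi2.cdf(...) can lose precision to floating-point cancellation, and chi2.sf (the survival function) is the numerically safer equivalent. A minimal sketch using the statistic and degrees of freedom obtained above:
from scipy.stats import chi2

chi_square_statistic = 34.9677  # value computed above
ddof = 3
# sf(x) = 1 - cdf(x), evaluated directly without the subtraction
p_value = chi2.sf(chi_square_statistic, df=ddof)
print(f"p-value: {p_value:.2e}")  # ≈ 1.24e-07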
The Python code with the data and the detailed computation to apply the Test of Independence is given at:
https://colab.research.google.com/drive/19WdSUSXTuuVL1bB1NhtG6xvqJrC2DPSS?usp=sharing