Hypothesis Testing: Part I
Ed Direction Data Fellows Asynchronous Module
August 2022
Welcome to the Hypothesis Testing Part 1 Asynchronous Module!
Where We've Been, Where We're Headed
This module builds on the concepts introduced in the Descriptive and Exploratory Analysis Synchronous Module and Calculating Descriptive Statistics In Google Sheets/Excel Asynchronous Module. In these previous modules, we explored introductory concepts that inform our ability to perform clean data analysis.
Since the Calculating Descriptive Statistics In Google Sheets/Excel Asynchronous Module was released, many of you have begun using Tableau and Power BI. These platforms are capable of calculating many, if not all, of the descriptive statistics we introduced in that module.
In this module, you will learn how to statistically test if a change in your LEA’s data was likely the result of random chance or something more intentional (like an RSSP intervention strategy). Before diving into the content, we want to make a few key requests that are essential to understanding the statistical ideas we will be presenting:
Please approach this module with a growth mindset. As we describe the core statistical underpinnings that make hypothesis testing possible, we ask that you believe in your internal capacity to understand these ideas. If you will commit to working through this module, we will commit to explaining these principles as clearly as we can.
Complete this module at a comfortable pace. Please do not feel like you have to power through the content in one sitting. If you want to break it up into two or three sessions to let the information sink in, please do!
Read the language carefully. The principles we are about to introduce are explained very specifically to provide accurate descriptions of what hypothesis tests enable us to say and not say. We will do our best to accurately describe the essential nuances that make these tests possible without overloading you with too much information.
Please realize that you will not need to do any math by hand in order to perform a hypothesis test. Excel will do that for you. Throughout the module we will show you some of the math that makes hypothesis testing possible in order to build your understanding of how and why these concepts work. But rest assured that if solving math problems by hand is not your thing, you will still be able to conduct these analytical tests on your RSSP data.
Have fun! If statistics is new to you, it can be intimidating. But if you embrace the stress and work to reason through these beautifully intuitive concepts, you’ll be surprised at how much fun deeper data analysis can be.
For the more statistically savvy folks out there, we want to name that we are intentionally emphasizing calculating statistics as opposed to parameters in this module. We know that there is room for debate on whether collecting data for all the students in a school justifies the claim that you have data for a population. But in an effort to keep this module as streamlined as possible, we have opted to focus on calculating the values of samples.
Click on the button to the left to open the note-catcher, which is mirrored to follow the content as it is presented on the Learning Space. As you navigate through this module, you are welcome to use this optional tool to capture your notes.
Case Study: Cache Valley ISD
During the summer of 2021, the Cache Valley ISD RSSP Team identified 8th grade math as its focus area and chose Just-in-Time Intervention as its implementation strategy. Throughout the 2022 school year, the district’s RSSP team, school administrators, and 8th grade math teachers worked hard to implement JIT with fidelity in order to improve student test scores. When STAAR Scores for the 2021-2022 school year were finally released, the district’s Data Fellow was tasked with determining if the change in average 8th grade math test scores was large enough to be statistically significant -- meaning that the change in test scores was likely caused by something other than random chance. With a 2020-2021 mean score of 1316 and a 2021-2022 mean score of 1389, the Data Fellow wondered how he would be able to determine if the change in average scores was actually meaningful.
How to Determine Statistical Significance
In a few short months, you will find yourself in the same position as Cache Valley's Data Fellow. When you are tasked with determining if the difference in student performance is statistically significant, you will use hypothesis testing to develop your answer.
Before you can successfully analyze your LEA's student performance data, there are several core statistical concepts you will need to understand. These core concepts are:
Histograms & Density Curves
The Normal Distribution
Sampling Distributions
Once we have addressed these foundational concepts, we will then move onto hypothesis testing by specifically exploring:
The Null Hypothesis vs. The Alternative Hypothesis
Type I and Type II Errors
Significance Levels (Alpha)
Test Statistics
P-Values
When we are working with large quantities of data, there are a variety of ways we can display them. One of the most useful ways is to use a histogram. Histograms are useful because they visually present the distribution of the data. The distribution contains the range and frequencies of observed values. For example, if we had a dataset that tracked the number of days students attended their 8th grade math class for a given month, we could use the histogram to the right to display the data.
You can see here that this histogram shows the range of possible values (the minimum and maximum number of days students attended math class), and the frequency of each value observation (how many students attended each number of days).
Like a histogram, a density curve attempts to display the distribution of data. Unlike a histogram, a density curve does not display the frequencies of observed values. Instead, the density curve attempts to smooth out the jagged edges of the histogram, creating one continuous shape with a total area equal to 1. We can then use the proportions of the area under the density curve to determine the approximate percentages of observed values.
This density curve was created with the same attendance data from which the previous histogram was created. If we look at the histogram, we can see that approximately 3/4ths of the data are to the left of the value 15. If we look at the density curve, we can see that this same proportion exists. These proportions are sometimes called probability densities.
When we convert a histogram to a density curve, we do lose some of the exactness in terms of the proportion of individual value counts relative to the total counts of values. This means that the percentage of students who attended five days of class as displayed in the histogram will not match perfectly with the proportion of area under the density curve at or around the value five. However, the approximations provided by the density curve are good enough for our purposes.
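If you are curious how these two visuals relate in practice, here is a minimal, optional Python sketch. You will never need to write code like this for your RSSP analysis -- Excel and the visualization platforms handle it -- but it shows how a histogram and a density curve can be built from the same data. Every attendance value in it is hypothetical.

```python
# A minimal, optional sketch: plotting a histogram and a density curve for
# hypothetical attendance data -- the number of days each student attended
# 8th grade math in a given month. All values here are made up.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(seed=1)
# Hypothetical attendance counts for 120 students, capped at 20 school days.
days_attended = np.clip(rng.normal(loc=16, scale=3, size=120).round(), 0, 20)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: shows the frequency of each observed value.
ax1.hist(days_attended, bins=range(0, 22))
ax1.set_title("Histogram of days attended")

# Density curve: a smoothed shape whose total area equals 1.
density = gaussian_kde(days_attended)
xs = np.linspace(0, 20, 200)
ax2.plot(xs, density(xs))
ax2.set_title("Density curve of days attended")

plt.show()
```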
The Normal Distribution is a symmetrical density curve with very special properties. These special properties make it one of the most important concepts in statistics. The picture to your right is an example of the normal distribution, and detailed explanations of its important properties are found below.
Special Property #1: Mean = Median = Mode
In a normal distribution, the mean, median, and mode are all the same number. In the example displayed above, 48 is the mean, median, and mode of the distribution.
Special Property #2: Variance Determines Shape
The description of this next property is going to be pretty long, so hang tight with us as we work our way through it.
Within a normal distribution, the variance of the dataset determines the shape of the distribution by describing how spread out the values are.
Variance is a measure of how far a set of numbers is spread out from their average value.
In statistics, variance is determined using the following formula:

$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}$$

where
$x_i$ = the value of one observation
$\bar{x}$ = the mean of all observations
$n$ = the number of observations
If this equation is new to you, you do not need to feel intimidated! It’s a lot simpler than it looks. Let’s walk through it together using an RSSP example. Let’s say we have a list of four STAAR Math scores: 1255, 1321, 1027, 1486.
Step 1. Find the difference between the observed scores and the average score.
We start by looking at the section of the formula that features $x_i$ and $\bar{x}$. This section of the formula instructs us to subtract the mean STAAR Math score from each individual STAAR Math score. So, we would first calculate the average STAAR Math score by doing the following:
(1255 + 1321 + 1027 + 1486) / 4 = 1272.25, which we round to 1272 to keep the arithmetic simple.
We would then subtract the mean score from each individual score, which gives us the difference between our observed scores and the mean score.
1255 - 1272 = -17
1321 - 1272 = 49
1027 - 1272 = -245
1486 - 1272 = 214
Step 2. Square the differences
Having completed those steps, we will now focus our attention on the 2nd section of the formula. This step directs us to square each of the differences we found in the previous step. When we square a difference, we simply multiply the difference by itself:
(-17)^2 = 289
49^2 = 2,401
(-245)^2 = 60,025
214^2 = 45,796
A primary reason why we square these numbers is to produce a positive number that indicates how far away our observed value is from the mean value. This is important for the next step.
Step 3. Add the squared differences
Having squared the differences, we now turn our attention to sigma (the Σ symbol in the formula above), which directs us to add all of the squared differences together. In our example, we would do the following:
289 + 2401 + 60,025 + 45,796 = 108,511
Step 4. Find the average of the summed squared differences
The final step is to divide the sum of our squared differences by the number of our observations minus 1. In our example, we would divide by 3 because 4 - 1 = 3.
108,511 / 3 ≈ 36,170
The final value of 36,170 is our variance! In other words, 36,170 is the average squared difference between student STAAR Math scores and the average student STAAR Math score!
*If you want to explore the intuition behind why we divide by n - 1, check out this StatQuest (one of the best channels on YouTube for learning core concepts in statistics).
Bonus Step 5. Use the Variance to Calculate the Standard Deviation
If you are like most people though, the above definition of variance isn’t very intuitive. Luckily, there is one extra step we can take to make the variance more useful to us. If we take the square root of the variance, we can transform the variance into the standard deviation. When we do this, we change the average squared distance of individual student math scores from the mean of all student math scores into the average distance of individual student math scores from the mean of all student math scores.
In our example, we would do the following to convert our variance into our standard deviation:
√36,170 ≈ 190.2
By calculating the variance and standard deviation, we can discover the spread of a normal distribution.
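If you would like to double-check the arithmetic without doing it by hand, here is a minimal, optional Python sketch that reproduces the worked example above. (In Excel, the VAR.S and STDEV.S functions perform the same calculations.)

```python
# A minimal, optional sketch that reproduces the worked example using Python's
# built-in statistics module. Both functions divide by n - 1, just like the
# formula we walked through above.
import statistics

scores = [1255, 1321, 1027, 1486]

mean = statistics.mean(scores)           # about 1272
variance = statistics.variance(scores)   # about 36,170
std_dev = statistics.stdev(scores)       # about 190 -- the square root of the variance

print(mean, variance, std_dev)
```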
Special Property #3: 68 - 95 - 99.7 Rule
The last special property of the normal distribution we need to know is the 68-95-99.7 rule. According to this rule:
68% of observations fall within one standard deviation of the mean.
95% of observations fall within two standard deviations of the mean.
99.7% of observations fall within three standard deviations of the mean.
The image, taken from a fantastic Towards Data Science article, visually demonstrates this principle.
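If you want to see where the 68, 95, and 99.7 percentages come from, here is a minimal, optional Python sketch that measures the area under the standard normal curve within one, two, and three standard deviations of the mean.

```python
# A minimal, optional sketch that checks the 68-95-99.7 rule by computing the
# area under the standard normal curve within 1, 2, and 3 standard deviations
# of the mean.
from scipy.stats import norm

for k in (1, 2, 3):
    # Area between -k and +k standard deviations.
    area = norm.cdf(k) - norm.cdf(-k)
    print(f"Within {k} standard deviation(s) of the mean: {area:.3f}")

# Prints roughly 0.683, 0.954, and 0.997.
```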
There is an important distinction that needs to be made whenever we analyze data. We need to distinguish if we are analyzing data for a population or if we are analyzing data for a sample. The claims we can make and the equations we use hinge on this distinction.
A population is everyone within a defined group. An example would be all students within Dripping Springs ISD’s boundaries.
A sample is a subset of people within a defined group. An example would be a portion of the students within Dripping Springs ISD’s boundaries.
When we are working with samples in data analysis, we want our samples to be representative, meaning we want our samples to resemble the overall population as closely as possible. The most effective way to ensure our samples are representative is to randomly create them. An example of this would be a portion of students in Dripping Springs ISD who have been randomly selected to be a part of a sample.
When we use random selection to create samples, it is important for us to realize that each sample of students will have its own unique distribution of student test scores. This happens because each sample is composed of different students who have different attributes and come from different circumstances.
By extension, this means that if we created a density curve of each random sample’s STAAR Math scores, each density curve would look different. Below are just a few examples of what the density curves of samples could look like.
As you can see, chances are that many of the distributions of student STAAR Math score samples would not be normal.
However, if we took the mean of each sample and plotted those means using a histogram or density curve, the distribution of the sample means would become normal as the number of sample means increased!
This is one of the most important points in this module. While the distribution of any given sample may not be normal, the distribution of the means of multiple samples will be approximately normal. This logic allows us to apply what we know about the normal distribution to distributions that are not normal. Because hypothesis testing leans heavily on the special properties of the normal distribution (as you will later see), this is incredibly useful.
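If you would like to see this principle in action, here is a minimal, optional Python simulation. It uses a deliberately skewed set of hypothetical scores, draws many random samples from it, and plots the means of those samples -- which come out approximately normal.

```python
# A minimal, optional sketch of this idea: the individual samples come from a
# deliberately skewed (non-normal) collection of hypothetical scores, yet the
# distribution of the sample means is approximately normal.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(seed=42)

# A skewed "population" of 50,000 hypothetical test scores.
population = rng.exponential(scale=200, size=50_000) + 1100

# Draw 2,000 random samples of 50 students each and record every sample's mean.
sample_means = [rng.choice(population, size=50, replace=False).mean()
                for _ in range(2_000)]

# The histogram of the sample means comes out roughly bell-shaped.
plt.hist(sample_means, bins=40)
plt.title("Distribution of 2,000 sample means")
plt.show()
```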
The first step to conducting a hypothesis test is… writing a hypothesis!
A hypothesis is a statement that can be objectively tested, and in the case of the Cache Valley Data Fellow we met at the beginning of this module, we will write a hypothesis about the district's STAAR Math data.
We start by writing two hypotheses: the Null Hypothesis and the Alternative Hypothesis.
Generally speaking, the Null Hypothesis assumes there is no difference between the means of two groups. In the case of the Cache Valley ISD Data Fellow, the Null Hypothesis specifically assumes that there is no statistically significant difference between the 2020-2021 STAAR Math Scores and the 2021-2022 STAAR Math scores of students at Cache Valley ISD.
The Null Hypothesis is the focal point of hypothesis testing. We always assume that the Null Hypothesis is true until we gather enough evidence to reject it.
When we reject the Null Hypothesis, we are saying that we are highly confident that there is a statistically significant difference between the means of two groups. It is essential, however, for us to understand that we can never be 100% sure that the Null Hypothesis is incorrect, a point we will describe in greater detail later.
The Alternative Hypothesis assumes there is a meaningful -- or statistically significant -- difference between the means of two groups.
In the case of Cache Valley ISD, the Alternative Hypothesis states that there is a statistically significant difference between the 2020-2021 STAAR Math Scores and the 2021-2022 STAAR Math scores of students at Cache Valley ISD.
While it is important to state the Alternative Hypothesis, we do not directly test it. We instead always select the Null Hypothesis to be the focal point of our statistical test.
Before we conduct the test, there are two things we should do: understand the language of hypothesis testing and examine the potential outcomes of the test.
Understand the Language
When we conduct hypothesis tests, you’ll notice we use the phrases “Reject the Null Hypothesis” and “Fail to Reject the Null Hypothesis”. Here is what those phrases mean:
Reject the Null Hypothesis: When we say this, we are saying that we have gathered enough evidence to determine that it is highly unlikely that the Null Hypothesis is true.
Fail to reject the Null Hypothesis: When we say this, we are saying that we have not gathered enough evidence to determine that it is highly unlikely that the Null Hypothesis is true.
We recommend committing the meaning of these phrases to memory -- doing so makes reasoning through the test’s potential outcomes much easier.
Potential Outcomes
With the meaning of those phrases clear in our minds, we can now look at the image below and examine the four potential outcomes of our impending test:
Let’s walk through each potential outcome together.
Outcome 1: The Null Hypothesis is true, but we rejected it.
This means that we detected a statistically significant difference between two groups… when no difference actually existed. We made the wrong decision. We can call this outcome a False Positive. This is formally known as a Type I Error.
Outcome 2: The Null Hypothesis is false, and we rejected it.
This means we detected a statistically significant difference between two groups… and there was a difference between them. We made the correct decision! We can call this outcome a True Positive.
Outcome 3: The Null Hypothesis is true, and we failed to reject it.
This means we did not detect a statistically significant difference… and there was no such difference. We made the correct decision! This is called a True Negative.
Outcome 4: The Null Hypothesis is false, but we failed to reject it.
This means that we did not detect a statistically significant difference… even though a difference actually existed. We made the wrong decision. We can call this outcome a False Negative. This is formally known as a Type II Error.
The objective when conducting a hypothesis test is to arrive at outcome two (a true positive) or outcome three (a true negative).
Unfortunately, despite our best efforts, we will never be able to be 100% certain that we have arrived at the right outcome. In this sense, statistics is not like a game show -- the correct answer is not revealed after we have submitted our guess. Instead, the best we can do is determine the threshold of evidence we want to collect before deciding to reject the Null Hypothesis.
We determine our evidence threshold by stating the specific probability associated with how often our particular decision will be correct.
How can we determine the probability of making a correct decision? By using our knowledge of the normal distribution! How the normal distribution connects to our hypothesis test will become clear in a moment.
We first determine our evidence threshold by selecting the probability of committing a Type I Error -- detecting a statistically significant difference in the means of two groups when there is in fact no such difference. The probability of us committing a Type I Error is known as alpha.
The standard convention is to set our alpha at .05. Setting alpha at .05 means that we are willing to accept a 5% chance of committing a Type I Error. Conversely, this also means that we can make our decisions with 95% confidence that we are not committing a Type I Error.
You might be wondering why we don’t set our alpha to 0 and give ourselves a 100% chance of not committing a Type I Error. The reason is that our alpha does not exist in a vacuum -- it is directly tied to our chances of committing a Type II Error (failing to detect a meaningful difference in the means of two groups when there is in fact a meaningful difference). The higher the likelihood we give ourselves of avoiding a false positive, the lower the likelihood we have of avoiding a false negative.
The key here is to select an alpha that most effectively balances the probabilities of committing a Type I or a Type II Error. The excellent work of previous statisticians has indicated that .05 is an effective alpha to select when working in most cases in the social sciences (which includes education).
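For the curious, here is a minimal, optional Python sketch of how the alpha we choose translates into a cutoff on the normal curve (the two-tailed critical value). Notice that alpha = .05 corresponds to the familiar cutoff of roughly two standard deviations.

```python
# A minimal, optional sketch showing how the choice of alpha maps to a cutoff
# on the normal curve. The two-tailed critical value is the number of standard
# deviations beyond which we would reject the Null Hypothesis.
from scipy.stats import norm

for alpha in (0.10, 0.05, 0.01):
    critical_value = norm.ppf(1 - alpha / 2)
    print(f"alpha = {alpha}: reject if the test statistic is beyond {critical_value:.2f}")

# alpha = 0.05 gives the familiar cutoff of about 1.96 standard deviations.
```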
Our second-to-last step is to calculate the observed difference between the two groups we are studying and compare that difference to what we would expect it to be if the Null Hypothesis were true.
In our case study example, we can compare the difference between the 2020-2021 and the 2021-2022 STAAR Math scores to the difference we would expect if the Null Hypothesis were true. We know that the average score during the 2020-2021 school year was 1316 and the average test score during the 2021-2022 school year was 1389. We can find the difference by using subtraction:
1389 - 1316 = 73
So, our observed difference in the means of the two groups of student test scores is 73. Under the Null Hypothesis, we would expect the difference to be 0.
We can compare our observed value and our expected Null Hypothesis value by calculating a test statistic.
The purpose of a test statistic is to determine how many standard deviations (more specifically standard errors -- for the more statistically savvy folks out there) our observed value is away from the expected value as stated by the Null Hypothesis.
The good news is that computers will always calculate test statistics for you -- you will never have to calculate one by hand. It is important, however, that you know how to tell your computer which test statistic you want to calculate. We will teach you how to do this in the next asynchronous module.
For our Cache Valley Data Fellow, who has an observed difference of 73, the test statistic is 3.29. This indicates that his observed value is 3.29 standard deviations away from the value we would expect under the Null Hypothesis.
This test statistic is the key piece of information we need to perform our hypothesis test. When we pair our test statistic with our knowledge of the Normal Distribution, we can determine the probability of us committing a Type I Error.
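To make the idea of a test statistic a little more concrete, here is a minimal, optional Python sketch of the general recipe: divide the observed difference in means by its standard error. The mean scores below come from the case study, but the standard deviations and student counts are hypothetical placeholders, so the result will not exactly match the Data Fellow's 3.29.

```python
# A minimal, optional sketch of the general recipe for this kind of test
# statistic: the observed difference in means divided by its standard error.
# The two mean scores come from the case study; the standard deviations and
# student counts are hypothetical placeholders.
import math

mean_2022, mean_2021 = 1389, 1316   # case-study mean STAAR Math scores
sd_2022, sd_2021 = 150, 155         # hypothetical standard deviations
n_2022, n_2021 = 95, 98             # hypothetical numbers of students

observed_difference = mean_2022 - mean_2021   # 73
standard_error = math.sqrt(sd_2022**2 / n_2022 + sd_2021**2 / n_2021)
test_statistic = observed_difference / standard_error

print(round(test_statistic, 2))   # roughly 3.3 with these made-up inputs
```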
Remember, when we are working with samples, even though the distribution of each sample will likely not be normal, the means of many samples, if collected and displayed on a separate distribution, will create an approximately normal distribution. This is important because it means we can now use the 68-95-99.7 rule in conjunction with our test statistic to determine the probability of committing a Type I Error.
Let’s explain exactly how this works. We know that, according to the 68-95-99.7 rule, 95% of the data are within approximately two standard deviations of the mean (the actual number is 1.96).
We test the Null Hypothesis by imagining a normal distribution with a mean of zero (the expected value under the Null Hypothesis). We then use our test statistic to determine how close our observed value of 73 is to our expected value of zero. In our case, 73 is 3.29 standard deviations away from zero.
This means that because our observed difference is more than two standard deviations away from the Null Hypothesis mean, there is less than a 5% chance of observing a difference at least as large as 73 if the true mean were actually zero.
Think about it. We know that 95% of possible values are going to fall within two standard deviations of the mean. This means that only 5% of potential values will be two or more standard deviations away from the mean. This naturally leads to the conclusion that observing a value as extreme as or more extreme than 73 would happen less than 5% of the time if the mean were actually zero -- if there were actually no difference between the mean of the 2020-2021 test scores and the mean of the 2021-2022 test scores.
You can use the image below to help you understand this concept visually.
The symbol in the middle of the image -- the Greek letter mu (μ) -- represents the mean. The symbol that looks like a circle with a line extending from it -- the Greek letter sigma (σ) -- represents the standard deviation. The red area represents the proportion of values that fall within one standard deviation of the mean in either direction. The blue area plus the red area represents the proportion of values that fall within two standard deviations of the mean. Because our test statistic of 3.29 is more than two standard deviations away from the mean, our value of 73 falls within the green area -- an area only very few values fall within.
This amounts to very strong evidence against the Null Hypothesis. Knowing that our observed outcome is highly unlikely if the Null Hypothesis were true gives us the confidence we need to reject the Null Hypothesis and move forward believing that there is in fact a meaningful difference between the two groups.
We can use our test statistic to calculate the exact probability of observing a value at least as extreme as 73 if the actual value is 0 by finding the p-value. You will always use a computer to calculate the p-value for your data, and we will show you how to do that in the next asynchronous module.
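To illustrate what the computer is doing behind the scenes, here is a minimal, optional Python sketch that converts a test statistic into a two-tailed p-value using the normal distribution. With the Data Fellow's test statistic of 3.29, it produces the p-value of about .001 described below.

```python
# A minimal, optional sketch of converting a test statistic into a two-tailed
# p-value using the normal distribution.
from scipy.stats import norm

test_statistic = 3.29
p_value = 2 * norm.sf(abs(test_statistic))   # area in both tails beyond 3.29

print(round(p_value, 3))   # about 0.001
```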
In the case of our Cache Valley Data Fellow, when he used Microsoft Excel to calculate the p-value for Cache Valley ISD test scores, he found that the p-value was .001. Because .001 is less than his alpha of .05, he was safely able to reject the Null Hypothesis and determine that the difference in student test scores was statistically significant.
Congratulations! You now have a much stronger understanding of the theoretical underpinnings behind hypothesis testing! In the next asynchronous module, we will show you how to use Excel to perform hypothesis tests on your RSSP data.
You may be surprised that it requires this much work to simply state if the difference between two groups was likely due to something more meaningful than random chance. But it’s true. It does take this much work.
Please keep in mind that hypothesis tests do not tell us what caused the difference in the means of the two groups. They only tell us that a meaningful difference exists. More advanced methods are needed to determine causality.
Congratulations on completing the Hypothesis Testing: Part I module. Please complete the Exit Ticket form by clicking on the link below. We will use the information you submit to track your completion.