Non-Parametric Tests

The total length of the videos in this section is approximately 38 minutes, but you will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

Ways to approximate the reference distribution, rather than calculating it exactly

MoreNPTests.2.ReferenceDistributions.mp4

Question 1: If you don't convert the data to ranks, can you exactly anticipate the mean and variance of the reference distribution in a randomization/permutation test?

Answer:

No. You can estimate them, though. The exact mean and variance can be calculated only because the data have been converted to ranks: if we assume there are no ties, we know we are dealing with the numbers from 1 to N.
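To see why ranks make these calculations exact, here is a minimal R sketch (not taken from the course code file). It lists every possible rank sum for a small made-up case, N = 7 observations with 3 of them in the first group, and compares the mean and variance of that exact reference distribution with the standard no-ties formulas n1(N+1)/2 and n1*n2*(N+1)/12. The variable names and group sizes are assumptions chosen only for illustration.

```r
# Hypothetical sizes, chosen only for illustration
N  <- 7   # total number of observations (no ties, so the ranks are 1..N)
n1 <- 3   # number of observations in group 1
n2 <- N - n1

# Rank sum of group 1 for every possible choice of n1 ranks out of 1..N
all_rank_sums <- combn(N, n1, FUN = sum)

# There are choose(N, n1) equally likely rank sums under the null hypothesis
length(all_rank_sums)
choose(N, n1)

# Mean and variance of the exact reference distribution
# (divide by the number of outcomes, not by that number minus 1)
exact_mean <- mean(all_rank_sums)
exact_var  <- mean((all_rank_sums - exact_mean)^2)

# The same quantities from the closed-form formulas
formula_mean <- n1 * (N + 1) / 2
formula_var  <- n1 * n2 * (N + 1) / 12

c(exact_mean, formula_mean)
c(exact_var, formula_var)
```

With raw data instead of ranks, the values being summed differ from one data set to the next, so no formula of this kind can be written down in advance.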

Uniform reference distribution

Try sketching a histogram of the numbers from 1 to 10. It's flat, right? One example of a uniform distribution is the distribution of the number you obtain when you roll a die: each number from 1 to 6 has the same probability of occurring.
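If you would rather let R do the sketching, the following is a minimal example under made-up assumptions: the data are 10 simulated values from a skewed distribution, with no ties. Whatever the original values look like, their ranks are just the numbers 1 to 10, each appearing once.

```r
set.seed(1)

# Made-up skewed data with no ties (illustration only)
x <- rexp(10)

# Convert the data to ranks
r <- rank(x)

# Sorted, the ranks are simply 1, 2, ..., 10
sort(r)

# Each rank occurs exactly once, so the histogram of ranks is flat
table(r)

# Uncomment to draw the two histograms side by side
# par(mfrow = c(1, 2))
# hist(x, main = "Original data")
# hist(r, breaks = 0:10, main = "Ranks")
```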

Question 2: If we don't convert to ranks, does the data necessarily follow any particular distribution?

Answer:

No.

Central Limit Theorem - first mention in this course!

MoreNPTests.3.CLT.mp4

In the video, I said that the Central Limit Theorem works regardless of the distribution of the original data. This is true. But, then, why is it helpful that converting to ranks gives us a uniform distribution?

As we will see, the CLT says that if we repeatedly draw large samples from a data set and record the sum (or mean) of the values in each sample, those sums (or means) will look approximately normal. However, what counts as "large" depends on the distribution of the original data: the weirder the distribution, the larger the sample size we need for the normal approximation to be good. So, the advantage of converting the data to ranks is that we know the uniform distribution is not too weird (no outliers, symmetric), and the normal approximation will work even for a small sample size.

This will make much more sense very soon when you run a CLT simulation in R.
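As a preview, here is a minimal sketch of what such a simulation might look like. It is not the simulation used later in the course. It draws many small samples (size 5) from a uniform set of values and from a skewed distribution, records each sample mean, and compares how symmetric the two collections of means are. The sample size, the number of repetitions, the exponential distribution, and the helper function `skewness()` are all assumptions made for illustration.

```r
set.seed(42)

n_samples   <- 5000  # how many samples we draw (assumed value)
sample_size <- 5     # a deliberately small sample size (assumed value)

# Sample means when the original values are uniform (the numbers 1 to 10)
means_uniform <- replicate(
  n_samples,
  mean(sample(1:10, sample_size, replace = TRUE))
)

# Sample means when the original data are strongly skewed (exponential)
means_skewed <- replicate(
  n_samples,
  mean(rexp(sample_size))
)

# Hypothetical helper: sample skewness, which is 0 for a symmetric distribution
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Means from the uniform values are already close to symmetric;
# means from the skewed data still show a clear right skew
skewness(means_uniform)
skewness(means_skewed)

# Uncomment to compare the two histograms
# par(mfrow = c(1, 2))
# hist(means_uniform, main = "Means, uniform values")
# hist(means_skewed, main = "Means, skewed data")
```

Increasing `sample_size` should make both histograms look more normal, with the skewed case needing a larger value to catch up.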

Question 3: The Central Limit Theorem says that which of the following will be normal, for large sample sizes? Check all that apply.

The data
The sum of the values in a sample from the data, if we drew many samples
The mean of the values in a sample from the data, if we drew many samples

Answer:

The second two options. Note that it wouldn't make sense for the data itself to become normal just because you have a large sample size. Consider the distribution of the heights of adults: it has two modes (bumps), one for males, and one for females. This distribution will have two bumps no matter how many adults you include! However, if you drew a sample of adults and recorded the mean height, and then drew another sample and recorded the mean height again, etc., those recorded sample means will start to look normal if the sample size is big enough.
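The height example can be imitated in R with made-up numbers. In the sketch below, the group averages, spreads, population size, and sample size are all invented for illustration and are not real height data. The full set of heights keeps its two bumps, while the recorded sample means form a single bump whose spread is close to the spread of the data divided by the square root of the sample size.

```r
set.seed(7)

# Made-up population with two bumps (values in cm, illustration only)
heights <- c(
  rnorm(5000, mean = 162, sd = 5),
  rnorm(5000, mean = 176, sd = 5)
)

sample_size <- 40  # assumed sample size

# Draw many samples and record the mean height of each one
sample_means <- replicate(2000, mean(sample(heights, sample_size)))

# Spread of the sample means compared with sd(data) / sqrt(sample size)
sd(sample_means)
sd(heights) / sqrt(sample_size)

# Uncomment to compare the shapes
# par(mfrow = c(1, 2))
# hist(heights, breaks = 40, main = "Heights: two bumps")
# hist(sample_means, breaks = 40, main = "Sample means: one bump")
```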

Example of the rank sum test

Below you can download the code file shown in the video, along with the data set I used, which comes from the New York Times. I am providing this code in case you are interested in looking at it at any point. HOWEVER, this lecture is not about R, and many lines in this code file are beyond what I expect you to be able to do in R at this point. The purpose of the following video is for you to see an example of the rank sum test.

MoreNPTests.4.CovidRankSumExample.mp4

Question 4: What do we conclude about Covid deaths in low-risk vs. high-risk states from the example in the previous video?

Answer:

We cannot rule out the possibility that high-risk and low-risk states have the same distribution of cumulative Covid deaths.
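The mechanics of a rank sum test can be sketched on a tiny made-up data set. The numbers below are invented for illustration; they are not the New York Times data and this is not the code file from the video. The sketch computes the rank sum by hand and then runs `wilcox.test()`. Note that the statistic W reported by `wilcox.test()` is the rank sum of the first group minus n1(n1+1)/2, so it differs from the raw rank sum by a constant.

```r
# Invented values for two groups (illustration only, no ties)
low_risk  <- c(12, 35, 41, 58, 77)
high_risk <- c(20, 49, 63, 90, 104, 131)

# Rank all observations together, ignoring group labels
all_values <- c(low_risk, high_risk)
ranks      <- rank(all_values)

# Rank sum for the first group (the first n1 entries of all_values)
n1           <- length(low_risk)
rank_sum_low <- sum(ranks[seq_len(n1)])
rank_sum_low

# The same test using R's built-in function
result <- wilcox.test(low_risk, high_risk)

# W equals the rank sum of the first group minus n1 * (n1 + 1) / 2
result$statistic
rank_sum_low - n1 * (n1 + 1) / 2

# Two-sided p-value
result$p.value
```

A large p-value here would be read the same way as in the answer above: we could not rule out that the two groups share the same distribution.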

Facts about the rank sum

MoreNPTests.5.Facts.mp4

Question 5: Are rank sum tests appropriate for large sample sizes?

Answer:

Yes. Rank sum tests (and other non-parametric tests) often come to mind for small sample sizes, because small sample sizes are no problem for these tests. However, large sample sizes are also no problem. We may not want to spend the computational time needed to build an exact reference distribution for a large sample, but the approximations discussed in this module solve that problem.
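To make the computational point concrete, here is a minimal sketch with made-up group sizes and a made-up observed rank sum. Listing every possible assignment of ranks is out of reach, so the sketch builds an "approximate exact" reference distribution from random reshuffles and also computes a normal approximation using the no-ties mean and variance of the rank sum. All specific numbers and variable names are assumptions for illustration.

```r
set.seed(123)

# Hypothetical group sizes and observed rank sum (illustration only)
n1 <- 40
n2 <- 60
N  <- n1 + n2
observed_rank_sum <- 1800

# Number of possible assignments: far too many to list one by one
choose(N, n1)

# Mean and standard deviation of the rank sum under the null (no ties)
mu    <- n1 * (N + 1) / 2
sigma <- sqrt(n1 * n2 * (N + 1) / 12)

# Approximate exact: rank sums from 10,000 random choices of n1 ranks out of 1..N
sim_rank_sums <- replicate(10000, sum(sample(N, n1)))
p_sim <- mean(abs(sim_rank_sums - mu) >= abs(observed_rank_sum - mu))

# Normal approximation: two-sided p-value from a z-score
z      <- (observed_rank_sum - mu) / sigma
p_norm <- 2 * pnorm(-abs(z))

# The two approximations should give similar answers
c(simulated = p_sim, normal = p_norm)
```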

Handling ties

MoreNPTests.6.Ties.mp4
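One common convention for ties, and the default behaviour of `rank()` in R, is to give tied values the average of the ranks they would otherwise occupy. The short sketch below uses made-up values to show this; it may or may not match every detail of the approach in the video.

```r
# Made-up data with one tie: the two 5s would occupy ranks 2 and 3
x <- c(3, 5, 5, 8)

# By default, rank() assigns each tied value the average rank (2.5 here)
r <- rank(x)
r

# The ranks still add up to N * (N + 1) / 2, just as they do without ties
sum(r)
length(x) * (length(x) + 1) / 2
```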

Question 6: When should you consider a rank sum test? Check all that apply.

small sample size
outliers
censoring
lots of ties
when you make a histogram of the data in your sample, it is very far from normal
big sample size, no outliers or censoring or ties, normal-looking histogram - is the rank sum test still fine?

Answer:

All of the scenarios except "lots of ties." Though the rank sum test works well for small sample sizes, that's because it works well for any sample size. It works great even if there are no outliers or censoring, and it's fine to have normal data. Basically, the rank sum test releases you from many of the assumptions you are used to making in the statistical tests that are traditionally presented in intro stats.

That's it for this section.

During this tutorial you learned:

More about the rank sum test, which is an example of a randomization/permutation test
Three ways to generate a reference distribution for a non-parametric test (exact, approximate exact, normal approximation)
Why the distribution of ranks is uniform
A little about the Central Limit Theorem
How to perform a rank sum test in R using New York Times COVID-19 data
The benefits of the rank sum test
How to handle ties when performing a rank sum test

Terms and concepts:

rank sum test, randomization/permutation test, reference distribution, exact way, approximate exact, normal approximation, Central Limit Theorem

Functions in review:

rank(), order(), sum(), choose(), wilcox.test()