Missing data

The total length of the videos in this section is approximately 75 minutes. Feel free to do this in multiple sittings. You will also spend time answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

A story

MissingData.1.Part1.mp4

Question 1: Why is the question about family size worded so awkwardly?

Writing clear survey questions is incredibly important, complicated, and difficult. There are entire books on the topic of survey question design. The wording of the question I showed you here is far from ideal. But what is the right way to ask how many kids are in someone's family? You have to write the question such that the participant understands how to answer, so that the participants understand the question the same way as each other, and so that the participants' understanding is the same as your own.

You can't write, "How many siblings do you have?" without adding qualifications about step and half siblings. You can't write, "How many kids lived in your house when you were growing up?" because this is not the same as asking about siblings, and siblings don't necessarily live with you. You can't write, "How many kids do your parents have?" because the answer might be different for each parent, and the definition of "having a kid" can be complicated. Etc. So, the question I wrote on the survey was an imperfect attempt to generate useful information and also give me a chance to talk about question design a little!

The next day of my story

MissingData.2.Part2.mp4

Question 2: Should we guess that the family size is equal to 2 for each student who missed the first day of class?

Nope! For one thing, it is silly to assume that each of the late-joining students would have the same family size as each other and as the overall average. Also, don't we have additional information about these students that would help us make a better guess?

Relationship between variable of interest and another variable

MissingData.3.Part3.mp4

Question 3: If there were 10 Chinese high school students who added the course after the first class meeting, what would the mode of the number of kids be?

We would add 10 more people to the first bar in the bar chart, and 1 would become the most common answer.

Types of missing data

Missing Completely at Random

MissingData.4.Part4.mp4

Question 4: If our data set includes NA values, do we hope that the values are missing completely at random, or would another (yet-to-be described) type of missingness be better?

Missing completely at random is the best we can hope for! If there is no relationship between whether a value is missing and what that value would have been, then the missingness just means that we have a smaller sample size. Our data set will still be representative of the target population (if it started out that way!).
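
Here is a minimal Python sketch of that idea, using simulated numbers rather than any real survey. The variable's mean of 50 and standard deviation of 10, and the 30% missingness rate, are assumptions I picked only for illustration.

```python
import random
import statistics

rng = random.Random(1)

# Simulated variable of interest; the mean of 50 and SD of 10 are assumptions.
values = [rng.gauss(50, 10) for _ in range(10_000)]

# MCAR: every value has the same 30% chance of going missing,
# no matter what the value is.
observed = [v for v in values if rng.random() >= 0.30]

full_mean = statistics.mean(values)
observed_mean = statistics.mean(observed)
print(len(values), round(full_mean, 2))
print(len(observed), round(observed_mean, 2))
```

The observed sample is smaller, but its mean should land close to the mean of the full data, because which values went missing had nothing to do with the values themselves.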

Missing at Random

MissingData.5.Part5.mp4

Question 5: Suppose you want to summarize a variable that records people's incomes, but there is some missingness. Can you think of another variable that, if included in the data set, would allow you to assume missingness at random?

Job category or education come to mind. It would be great if you had both. There is very likely a correlation between education level and income. People tend to refuse to provide their incomes when their incomes are on the high end or on the low end. So, you can't just look at the people who provided income information and pretend you have a good summary of income. But you can rely on the correlation between income and education and job to help you make good guesses at the missing incomes.

Assuming missing at random in this case is like pretending that there is a bucket for each education level, and that within each bucket, the people who refuse to provide their incomes are selected at random. The refusal rate can differ from bucket to bucket. Perhaps only 1% of people who finished college refuse to give their incomes, but perhaps 40% of high school graduates and 50% of medical school graduates refuse to provide income (I am making up those percentages).
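
The bucket picture can be sketched in Python with simulated data. The refusal rates echo the made-up percentages above. Everything else (the share of people in each bucket, the mean incomes, the spread of 10,000 within a bucket) is an assumption invented for this illustration.

```python
import random
import statistics

rng = random.Random(2)

# Hypothetical buckets: (share of sample, mean income, refusal rate).
buckets = {
    "high school":    (0.50,  35_000, 0.40),
    "college":        (0.40,  60_000, 0.01),
    "medical school": (0.10, 200_000, 0.50),
}
levels = list(buckets)
shares = [buckets[level][0] for level in levels]

rows = []  # (education level, true income, reported income or None)
for _ in range(20_000):
    level = rng.choices(levels, weights=shares)[0]
    _, mean_income, p_refuse = buckets[level]
    income = rng.gauss(mean_income, 10_000)
    # Within a bucket, refusal is random and unrelated to income.
    reported = None if rng.random() < p_refuse else income
    rows.append((level, income, reported))

true_mean = statistics.mean(r[1] for r in rows)
naive_mean = statistics.mean(r[2] for r in rows if r[2] is not None)

# Fill each missing income with the mean reported income of the same bucket.
bucket_mean = {
    level: statistics.mean(
        r[2] for r in rows if r[0] == level and r[2] is not None
    )
    for level in levels
}
filled = [r[2] if r[2] is not None else bucket_mean[r[0]] for r in rows]
filled_mean = statistics.mean(filled)

print(round(true_mean), round(naive_mean), round(filled_mean))
```

Under these assumptions, the mean of the reported incomes alone drifts away from the true mean, while the bucket-based fill should land much closer. Filling with bucket means is only one simple option and it still understates the spread of incomes.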

MissingData.6.Part6.mp4

Question 6: If someone refuses to answer the question about cocaine, what do you think their answer is?

I don't think that most of the people who refuse to answer are cocaine users. However, I do think that the proportion of cocaine users among those who refuse to answer will be much higher than the proportion of cocaine users among those who do answer the question. There is a correlation between whether you answer the question and what your answer would be.

Missing Not at Random, so what do we do?

MissingData.7.Part7.mp4

Question 7: At what stage of a study should we try to avoid having missingness not at random?

When you are collecting the data! That is your chance to include multiple variables that are correlated with each other, so that you can hope for missingness at random instead of not at random.

Motivating the strategies for handling missing data

Question 8: Suppose that you are working with the data set shown below, which contains a few rows and columns from infert, a data set about fertility that is available in R. I wrote NA in one cell. Without using a computer or calculator for anything, what is your best guess at the NA value?

The actual value is 2. This isn't the point, though. I was hoping that you would start to understand what the goal is when we are handling missing data and what strategies might be available. As you watch the rest of this lecture, see if you recognize any of the strategies that you considered as you answered this question.

Excerpt from infert data set

How to handle missing data

Create a category called "missing"

MissingData.8.Part8.mp4

Question 9: What is one advantage of handling missing data by adding a "missing" category to a categorical variable?

The main one I can think of is that we don't have to make guesses at what the missing values should have been, as we do in most of the remaining strategies for handling missing data.

A fake data example illustrating the types of missing data

MissingData.9.Example.mp4

Question 10: Why is the variance of the education variable so much smaller for the MNAR data, compared to the MCAR data?

When data is MNAR, there is a correlation between the values themselves and whether they are missing. To create MNAR data for this example, I threw out any value of education above 13. So, the range - and therefore the variance - of education decreased.
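
A minimal Python sketch of this effect, using simulated values rather than the data from the video. The cutoff of 13 comes from the description above. The mean of 13 and standard deviation of 2.5 for education, and the 50% drop rate for the MCAR version, are assumptions chosen only for illustration.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical "years of education" values; mean and spread are assumptions.
education = [rng.gauss(13, 2.5) for _ in range(1_000)]

# MCAR version: drop about half of the values at random.
mcar = [x for x in education if rng.random() < 0.5]

# MNAR version: drop every value above 13.
mnar = [x for x in education if x <= 13]

var_mcar = statistics.variance(mcar)
var_mnar = statistics.variance(mnar)
print(round(var_mcar, 2), round(var_mnar, 2))
```

Cutting off the upper part of the distribution squeezes the remaining values into a narrower range, so the MNAR variance should come out well below the MCAR variance.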

Strategies: Drop NA values

MissingData.10.Dropping Missing Values.mp4

Question 11: Suppose that you have a data set describing 50 people, and each of the 50 people left one of the 10 survey questions blank. If you run a regression that includes the 10 survey questions, how many people will be included in your regression?

Zero. The software will exclude any person who is missing at least one of the variables used in the model. So, we need a better strategy than dropping missing values.
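
To see why the answer is zero, here is a small Python sketch of the complete-case rule. The pattern of which question each person skipped (person i skips question i mod 10) and the placeholder answer value of 1 are assumptions made only to build an example.

```python
# 50 respondents, 10 questions; respondent i leaves question i % 10 blank.
n_people, n_questions = 50, 10
survey = [
    [None if q == person % n_questions else 1 for q in range(n_questions)]
    for person in range(n_people)
]

# Dropping missing values row by row keeps only rows with no blanks.
complete_cases = [row for row in survey if None not in row]

missing_cells = sum(row.count(None) for row in survey)
print(missing_cells, "of", n_people * n_questions, "cells missing")  # 50 of 500 cells missing
print(len(complete_cases), "complete cases")  # 0 complete cases
```

Only 10% of the cells are blank, yet no row survives, because every row contains one of the blanks.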

Strategies: Mean imputation

MissingData.11.MeanImputation.mp4

Question 12: Why does mean imputation underestimate the variance of the variable you are exploring?

The variance is the mean of the squared differences from the mean. Sloppily, it is the average distance from the middle. If you insert a bunch of new data points that are right at the mean, the average distance from the mean decreases.
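
A tiny Python sketch of this shrinkage. The ten data values are made up for illustration, with None standing in for NA.

```python
import statistics

# Made-up data: six observed values and four missing entries.
data = [2, 4, 4, 5, 7, 9, None, None, None, None]

observed = [x for x in data if x is not None]
mean_obs = statistics.mean(observed)

# Mean imputation: replace each missing entry with the observed mean.
imputed = [mean_obs if x is None else x for x in data]

var_before = statistics.variance(observed)  # about 6.17
var_after = statistics.variance(imputed)    # about 3.43
print(round(var_before, 2), round(var_after, 2))
```

The mean is unchanged, and the sum of squared differences from the mean is unchanged too, because the imputed points add zero to it. That same sum is now divided across more data points, so the variance drops.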

Back to the example graphics

MissingData.12.Example mean imputation.mp4

Question 13: What is the advantage of imputing based on the distribution of the observed values?

A histogram of the variable after imputing will look a lot like a histogram of the observed values. Mean, variance, and shape of distribution are preserved. This is desirable only if the distribution of observed values was representative of the population distribution - so, only when data is missing completely at random.
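
One simple way to impute from the distribution of the observed values is to fill each missing entry with a random draw from the observed values. This Python sketch assumes that approach (the video may implement it differently) and uses made-up data, with None standing in for NA.

```python
import random

rng = random.Random(6)

# Made-up data: six observed values and four missing entries.
data = [2, 4, 4, 5, 7, 9, None, None, None, None]
observed = [x for x in data if x is not None]

# Replace each missing entry with a random draw (with replacement)
# from the observed values.
imputed = [rng.choice(observed) if x is None else x for x in data]
print(imputed)
```

Every filled-in value is one that was actually observed, so the imputed data can only reproduce the shape of the observed data. That is exactly why this is safe only when the observed values are representative.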

Strategies: Impute based on the distribution of the observed variable; Impute based on the observed values of other variables

MissingData.13.Strategies.mp4

Question 14: If you fill in missing values based on the relationship between the variable of interest and age, but whether the variable of interest is missing also depends on sex, is this strategy going to work?

Nope! Imputing based on the observed values of another variable only works if you correctly guess which other variables are related to the missingness of the variable of interest; if these other variables are actually observed in the data set; and if you correctly model the relationship between the variables. Obviously, you never guess all of this perfectly. But it is often possible to do a fairly good job if you know the context well or are working closely with collaborators who understand the context.

One more time, back to the example

MissingData.14.Example.mp4

Question 15: Given a data set, can we ever know for sure whether the missingness is completely at random, at random, or not at random?

No. We use our knowledge of the context to figure out the type of missingness. There is some information in the data. For example, you may be able to tell that the rows where a variable is missing are different from the rest of the data on some other variable - this implies that you don't have missingness completely at random, at least.
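
Here is a hedged Python sketch of that kind of check, on simulated data. The variables (age, income), the link between them, and the rule that income goes missing more often for people aged 45 and over are all assumptions invented for the example.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical data: age is always observed; income is missing
# 10% of the time under age 45 and 50% of the time otherwise.
rows = []
for _ in range(5_000):
    age = rng.uniform(20, 70)
    income = rng.gauss(1_000 * age, 8_000)
    p_missing = 0.1 if age < 45 else 0.5
    rows.append((age, None if rng.random() < p_missing else income))

# Compare the fully observed variable between the two groups of rows.
age_when_missing = statistics.mean(a for a, inc in rows if inc is None)
age_when_observed = statistics.mean(a for a, inc in rows if inc is not None)
print(round(age_when_missing, 1), round(age_when_observed, 1))
```

A clear gap between the two average ages is evidence against missingness completely at random. It cannot tell you whether the missingness is at random or not at random, since that depends on the income values you never got to see.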

Multiple imputation

MissingData.15.Multiple Imputation.mp4

Question 16: Why is multiple imputation better than any of the other strategies presented in this lecture?

If you randomly impute the missing values in your data set one time and then analyze the data set as if there had never been any missingness, you are underestimating the uncertainty in your analyses. You are ignoring the fact that if you randomly imputed the missing values again, you'd obtain different imputations. You are pretending that your imputations are the actual observed values and that there is no uncertainty in your imputations. Any confidence intervals you end up calculating will be too narrow. The goal of multiple imputation is to take into account both the uncertainty about your estimator based on any particular imputed data set and the fact that the imputed values are random and differ between possible imputed data sets.
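
The two sources of uncertainty can be sketched in Python. This is an illustration under stated assumptions, not necessarily the procedure from the video: the data are simulated, the missingness is assumed to be completely at random, the imputations come from an approximate Bayesian bootstrap (resample the observed values, then draw the imputations from that resample), and the results are pooled with the standard combining rules known as Rubin's rules. The sample sizes and the choice of 20 imputed data sets are arbitrary.

```python
import random
import statistics

rng = random.Random(16)

# Simulated sample: 60 observed values and 40 missing ones (assumed MCAR).
observed = [rng.gauss(50, 10) for _ in range(60)]
n_missing = 40
m = 20  # number of imputed data sets

estimates, within_vars = [], []
for _ in range(m):
    # Approximate Bayesian bootstrap: resample the observed values first,
    # then draw the imputations from that resample.
    donor_pool = rng.choices(observed, k=len(observed))
    completed = observed + rng.choices(donor_pool, k=n_missing)
    # Estimate the mean and its sampling variance on this completed data set.
    estimates.append(statistics.mean(completed))
    within_vars.append(statistics.variance(completed) / len(completed))

# Rubin's rules: pool the estimates and combine the two kinds of variance.
pooled_estimate = statistics.mean(estimates)
within = statistics.mean(within_vars)     # uncertainty within a data set
between = statistics.variance(estimates)  # uncertainty across imputations
total_variance = within + (1 + 1 / m) * between

print(round(pooled_estimate, 2), round(within, 3), round(total_variance, 3))
```

Analyzing a single imputed data set would report only the within part. The between part is the extra uncertainty that comes from not knowing the missing values, and multiple imputation adds it back in.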

You learned about missing data! Note that many statistics departments offer multiple courses focused only on handling missing data - this topic is not a tangent or a side note but a subfield of statistics. This overview is just a beginning.

During this tutorial you learned:

About missingness using an example student survey with missing observations from students who registered late
About 3 types of missingness: Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR)
How the types of missingness appear when visualized using simulated data, including how each type of missingness affects the mean and variance of the variable of interest
Strategies to handle missingness, including dropping rows, creating a new variable to indicate missingness, randomly sampling from observed values of the variable with missingness, mean/median imputation, imputing conditional on other variables, multiple imputation, etc.
How to impute with non-parametric and parametric strategies
Why multiple imputation improves on the other strategies to deal with missingness

Terms and concepts:

missingness, Missing Completely at Random (MCAR), Missing at Random (MAR), Missing Not at Random (MNAR), imputation, mean/median imputation, multiple imputation