Data Cleaning and Management

The total length of the videos in this section is approximately 70 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions and running R code while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

This set of videos primarily focuses on the BRFSS data set, a national survey focused on risk-taking behaviors. I suggest downloading the data and following along, as prompted by the video. Note that you will download a more recent version of the data than the one shown in the videos, so don't worry if the results look different when you run the code or if the variables included have changed.

Note that the videos are using the 2014 data, but the code has been updated to refer to the 2021 data. There are comments in the code showing what changes needed to be made.

Sometimes students have trouble opening the BRFSS data. The code file contains some troubleshooting ideas (the foreign package is more likely to install than SASxport; if the file won't open, try adding a space to the end of the file name). If you are not able to open the BRFSS data, though, don't let that stop you from watching the videos below.
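Here is a minimal sketch of the trailing-space workaround, in case it helps. The helper find_xpt() and the file name "LLCP2021.XPT" are assumptions for illustration, not part of the course code file, so substitute the name of the file you actually downloaded.

```r
# Hypothetical helper: return the first candidate file name that exists,
# trying the name as typed and then the same name with a trailing space.
find_xpt <- function(path) {
  candidates <- c(path, paste0(path, " "))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) NA_character_ else found[1]
}

# Assumed file name; replace it with the file in your working directory.
xpt_file <- find_xpt("LLCP2021.XPT")

if (!is.na(xpt_file)) {
  # read.xport() comes from the foreign package
  brfss <- foreign::read.xport(xpt_file)
  dim(brfss)
}
```

If find_xpt() returns NA, check getwd() to make sure R is looking in the folder where you saved the data.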

Data cleaning is not the most exciting topic on this site, but it is the most practically useful one. For undergraduates: when you graduate and get a job, in the professional world or as an RA in grad school, you probably won't be hired to consider big-picture ethical questions or build fancy models. It is very likely that your first job will be to clean data. So, let's get cleaning!

Please download the code used in the videos:

Part 1

DataCleaningAndManagement.1.Data Cleaning.mp4

Question 1: What's the best way to make changes to your data set?

Show answer

Writing reproducible R code. Although it can be tempting to make changes directly to your data, especially when you are trying to start working with the data as soon as possible, you will regret the lack of reproducible code as soon as someone asks what you did, or additional data arrives, or you realize you permanently changed something that you needed, etc.
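As a small illustration of what "reproducible" can look like, the sketch below recodes a variable in code and leaves the raw data untouched. The data frame, the column name, and the codes (1, 2, 9) are made up for this example and are not taken from BRFSS.

```r
# Toy stand-in for a raw data set; the column and its codes are hypothetical.
raw <- data.frame(id = 1:5, sex = c(1, 2, 2, 1, 9))

# Work on a copy so the raw data are never overwritten.
clean <- raw

# Recode in code, not by hand: 1 = "Male", 2 = "Female".
# Any value not listed in levels (here, 9) becomes NA.
clean$sex_label <- factor(clean$sex,
                          levels = c(1, 2),
                          labels = c("Male", "Female"))

# Check the result, including how many values became missing.
table(clean$sex_label, useNA = "ifany")
```

Because every change lives in the script, you can rerun it when new data arrive or show someone exactly what you did.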

Part 2

DataCleaningAndManagement.2.Data Cleaning.mp4

Question 2: Suppose that you are cleaning a numeric variable, and you recode "refused to answer" and "not applicable" so that they are both NA. What additional variable should you add to the data set?

Show answer

A yes/no indicator for "not applicable." Although you can't use either "refused to answer" or "not applicable" as part of the numerical variable, these two answers reflect important differences in the respondents, and you don't want to lose track of that information.
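The idea can be sketched as follows. The variable name and the special codes (777 for "refused to answer", 888 for "not applicable") are hypothetical, so check the codebook for the real ones.

```r
# Hypothetical codes: 777 = refused to answer, 888 = not applicable.
survey <- data.frame(drinks_per_week = c(3, 777, 0, 888, 12))

# Record "not applicable" BEFORE overwriting the special codes with NA.
survey$drinks_not_applicable <- survey$drinks_per_week == 888

# Now both special codes can become NA in the numeric variable.
survey$drinks_per_week[survey$drinks_per_week %in% c(777, 888)] <- NA

# Summaries use only the real numeric answers.
mean(survey$drinks_per_week, na.rm = TRUE)
```

The order matters: once both codes are NA, you can no longer tell them apart, so the indicator has to be created first.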

Part 3

DataCleaningAndManagement.3.Data Cleaning.mp4

Question 3: If you run a regression on a predictor that should be categorical but is not stored as a factor variable, what can go wrong?

Show answer

Either of the above.
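One thing that can go wrong is that R treats the category codes as a numeric scale and fits a single slope. The sketch below uses simulated data (the variable names and values are made up) to compare the two versions of the model.

```r
set.seed(1)

# Hypothetical data: region is stored as the numeric codes 1, 2, 3.
d <- data.frame(region = rep(1:3, each = 20))
d$y <- c(10, 4, 7)[d$region] + rnorm(60)

# Stored as numeric: R fits one slope, as if region 3 were "more" than region 1.
fit_numeric <- lm(y ~ region, data = d)

# Stored as a factor: R fits a separate coefficient for each non-reference level.
fit_factor <- lm(y ~ factor(region), data = d)

length(coef(fit_numeric))  # 2: intercept and one slope
length(coef(fit_factor))   # 3: intercept and two level effects
```

Comparing coef(fit_numeric) with coef(fit_factor) shows how differently the two models describe the same data.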

Part 4

DataCleaningAndManagement.4.Data Cleaning.mp4

Question 4: Which of the following are variable types in R?


Show answer

The first four. There are other types as well, such as logical (TRUE or FALSE).
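If you want to check types yourself, class() and the is.*() functions are a quick way to do it. The small vectors below are made up for illustration and cover the types that the functions reviewed at the end of this section deal with.

```r
x_num  <- c(1.5, 2, 3)
x_chr  <- c("a", "b")
x_fac  <- as.factor(c("low", "high", "low"))
x_date <- as.Date("2021-03-15")
x_lgl  <- c(TRUE, FALSE)

class(x_num)   # "numeric"
class(x_chr)   # "character"
class(x_fac)   # "factor"
class(x_date)  # "Date"
class(x_lgl)   # "logical"

# A factor is not numeric, even when its labels look like numbers.
is.numeric(x_fac)  # FALSE
is.factor(x_fac)   # TRUE
```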

Part 5

DataCleaningAndManagement.5.Data Cleaning.mp4

Question 5: When is it appropriate to use the pdf() and dev.off() commands?

Show answer

When you have finalized your graphics
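A minimal sketch of the pattern is shown below. It uses R's built-in mtcars data rather than BRFSS, and the output file name is an assumption; the file is written to a temporary folder so nothing in your working directory changes.

```r
# Hypothetical output file in R's temporary folder.
out_file <- file.path(tempdir(), "histogram.pdf")

# Open a PDF graphics device; plots now go to the file, not the screen.
pdf(out_file, width = 6, height = 4)

hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")

# Close the device so the file is actually written.
dev.off()

file.exists(out_file)
```

While you are still adjusting a plot, it is easier to draw it on screen and wrap it in pdf() and dev.off() only once you are happy with it.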

Missing Data Code (missing data concepts are in a separate module, but it doesn't matter which you watch first)

The following code and videos were created by a former QAI intern, Joanna Harton '16 (now a biostatistician!), so you will hear a different voice.

Please download the code file and the csv file used in the missing data video.

DataCleaningAndManagement.6.Missing Data.mp4

Question 6: What goes in the na.strings argument (used when reading in data, for example with read.csv())?

Show answer

A vector containing the strings (in double quotes) of what qualifies as missing data
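Here is a small sketch of na.strings in use, as an argument to read.csv(). The data and the three "missing" markers are made up for illustration, and the text is read from a string instead of a file so the example runs on its own.

```r
# Made-up data in which missing values appear as "N/A", "-999", or "missing".
csv_text <- "id,height,smoker
1,170,yes
2,N/A,no
3,-999,missing
4,165,yes"

# na.strings takes a character vector of everything that should count as NA.
dat <- read.csv(text = csv_text,
                na.strings = c("N/A", "-999", "missing"))

# Count the missing values in each column.
colSums(is.na(dat))

# height is read as a number because the markers were turned into NA.
is.numeric(dat$height)
```

Without na.strings, the "N/A" entry would force height to be read as character rather than as a number.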

Dates

Please download the code and example file:

The function new_interval has been retired from the package lubridate and replaced by the function interval. Our code file has been updated, but not the video, so you will see that the code and video differ on that line. (The old command new_interval does still work, but a warning pops up suggesting that you use the new command interval instead.)

DataCleaningAndManagement.7.Dates.mp4

Question 7: What is the purpose of converting the interval between the dates to a period?

Show answer

So that you can access the number of years between the two dates
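As a rough sketch, assuming the lubridate package is installed, the steps look like this. The two dates are made up for illustration.

```r
library(lubridate)

# Hypothetical start and end dates.
start_date <- ymd("2000-01-15")
end_date   <- ymd("2021-06-30")

# An interval is the span of time between two specific dates.
span <- interval(start_date, end_date)

# Converting to a period expresses the span in years, months, days, and so on.
per <- as.period(span)

# The number of completed years between the two dates.
per$year  # 21
```

The same approach works for something like a respondent's age: use the date of birth as the start and Sys.Date() as the end.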

Hooray, now you can get a job cleaning data! Maybe?

During this tutorial you learned:


Terms and concepts:

Reproducibility, recoding


Functions in review: 

getwd(), paste(), save(), is.na(), class(), is.numeric(), is.character(), is.factor(), as.factor(), as.Date(), data.frame(), cbind(), rbind(), merge(), read.csv() with its na.strings argument, Sys.Date()


lubridate functions in review:

wday(), interval(), as.period()