Data Cleaning and Management
The total length of the videos in this section is approximately 70 minutes. Feel free to do this in multiple sittings! You will also spend time answering short questions and running R code while completing this section.
You can also view all the videos in this section at the YouTube playlist linked here.
This set of videos primarily focuses on the BRFSS data set, a national survey focused on risk-taking behaviors. I suggest downloading the data and following along, as prompted by the video. Note that you will download a more recent version of the data than the one shown in the videos, so don't worry if the results look different when you run the code or if the variables included have changed.
Note that the videos are using the 2014 data, but the code has been updated to refer to the 2021 data. There are comments in the code showing what changes needed to be made.
Sometimes students have trouble opening the BRFSS data. The code file contains some troubleshooting ideas (foreign package is more likely to install than SASxport; try adding a space to the end of the filename if it won't open). If you are not able to open the BRFSS data, though, don't let that stop you from watching the videos below.
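As a rough sketch of the troubleshooting ideas above, the helper below tries a filename as given and then again with a space added to the end, reading whichever one exists with read.xport() from the foreign package. The helper name read_xpt_with_fallback and the filename in the commented-out call are hypothetical, not taken from the course code file.

```r
library(foreign)  # provides read.xport()

# Hypothetical helper: try the filename as given, then with a trailing space.
read_xpt_with_fallback <- function(path) {
  candidates <- c(path, paste0(path, " "))
  found <- candidates[file.exists(candidates)]
  if (length(found) == 0) {
    stop("File not found: ", path, " (with or without a trailing space)")
  }
  read.xport(found[1])
}

# "your_brfss_file.XPT" is a placeholder; use the name of the file you downloaded.
# brfss <- read_xpt_with_fallback("your_brfss_file.XPT")
```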
Data cleaning is not the most exciting topic that is included on this site. However, this is the most practically useful topic included here. For undergraduates: when you graduate and get a job, in the professional world or as an RA in grad school, you probably won't be hired to consider big-picture ethical questions or build fancy models. It is very likely that your first job will be to clean data. So, let's get cleaning!
Please download the code used in the videos:
Part 1
Question 1: What's the best way to make changes to your data set?
Opening it in Excel and making changes directly
Typing commands into the R console to change the variables in the data set
Writing reproducible R code
Show answer
Writing reproducible R code. Although it can be tempting to make changes directly to your data, especially when you are trying to start working with the data as soon as possible, you will regret the lack of reproducible code as soon as someone asks what you did, or additional data arrives, or you realize you permanently changed something that you needed, etc.
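To make this concrete, here is a minimal sketch of what a reproducible cleaning step might look like in a script. The data frame, the variable names, and the use of 999 as a missing-value code are all made up for illustration; they do not come from the BRFSS data.

```r
# Hypothetical raw data standing in for a file you would normally read in.
raw <- data.frame(
  id     = 1:5,
  height = c(170, 165, 999, 180, 158)  # 999 is an assumed "missing" code
)

# Work on a copy so the raw data stay untouched.
clean <- raw

# Recode the assumed missing-value code to NA in a new variable,
# leaving the original column available for checking.
clean$height_cm <- ifelse(clean$height == 999, NA, clean$height)

summary(clean$height_cm)
```

Because every change lives in the script, rerunning it on new data or explaining what was done only requires the code itself.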
Part 2
Question 2: Suppose that you are cleaning a numeric variable, and you recode "refused to answer" and "not applicable" so that they are both NA. What additional variable should you add to the data set?
A yes/no indicator for "not applicable"
A list of possible reasons that someone might refuse to answer
Show answer
A yes/no indicator for "not applicable." Although you can't use either "refused to answer" or "not applicable" as part of the numerical variable, these two answers reflect important differences in the respondents, and you don't want to lose track of that information.
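One way this recoding might look in R is sketched below. The variable names and the codes 77 ("not applicable") and 99 ("refused to answer") are assumptions for illustration; check the codebook for the actual codes used by each variable.

```r
# Hypothetical survey item with assumed codes:
# 77 = "not applicable", 99 = "refused to answer".
drinks_raw <- c(3, 0, 77, 12, 99, 77, 5)

# Record "not applicable" before the codes are collapsed into NA.
drinks_not_applicable <- drinks_raw == 77

# Recode both special codes to NA in the numeric variable.
drinks <- drinks_raw
drinks[drinks_raw %in% c(77, 99)] <- NA

survey <- data.frame(drinks, drinks_not_applicable)
mean(survey$drinks, na.rm = TRUE)  # 5
```

The indicator keeps the distinction between the two kinds of missingness even though both show up as NA in the numeric variable.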
Part 3
Question 3: If you run a regression on a predictor that should be categorical but is not stored as a factor variable, what can go wrong?
You might incorrectly assume a linear relationship between the outcome variable and a categorical variable, with numbers arbitrarily assigned to the categories
You might get an error
Either of the above
Show answer
Either of the above.
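The first problem can be illustrated with a small simulated example. The data below are invented: region is stored as the numbers 1 to 3, but the numbers are only labels. The sketch fits the same regression twice, once leaving region numeric and once converting it with factor().

```r
set.seed(1)

# Hypothetical data: the region codes 1-3 are arbitrary labels, not quantities.
region  <- rep(1:3, each = 20)
outcome <- c(10, 30, 15)[region] + rnorm(60)

fit_numeric <- lm(outcome ~ region)          # assumes one straight-line trend
fit_factor  <- lm(outcome ~ factor(region))  # separate mean for each category

length(coef(fit_numeric))  # 2: intercept and a single slope
length(coef(fit_factor))   # 3: intercept and two category contrasts
```

The numeric version runs without complaint, which is exactly why this mistake is easy to miss.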
Part 4
Question 4: Which of the following are variable types in R?
Numeric
Character
Factor
Date
BRFSS
Show answer
The first four. There are other types as well, such as logical (TRUE or FALSE).
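A quick way to check a variable's type is class(). The short sketch below builds one made-up object of each type mentioned above; the object names are arbitrary.

```r
x_num  <- c(1.5, 2, 3)
x_chr  <- c("a", "b", "c")
x_fac  <- as.factor(x_chr)
x_date <- as.Date("2021-01-15")
x_lgl  <- c(TRUE, FALSE, TRUE)

class(x_num)   # "numeric"
class(x_chr)   # "character"
class(x_fac)   # "factor"
class(x_date)  # "Date"
class(x_lgl)   # "logical"
```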
Part 5
Question 5: When is it appropriate to use the pdf() and dev.off() commands?
When you are revising a graphic
When you have finalized your graphics
Show answer
When you have finalized your graphics
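As a minimal sketch of the pattern, the code below opens a PDF graphics device, draws a plot, and closes the device so the file is written. The output path and the plotted values are placeholders chosen for illustration.

```r
out_file <- file.path(tempdir(), "example_plot.pdf")  # hypothetical output path

pdf(out_file, width = 6, height = 4)  # open a PDF graphics device
hist(c(2, 3, 3, 4, 4, 4, 5, 5, 6), main = "Example histogram", xlab = "Value")
dev.off()                             # close the device so the file is written

file.exists(out_file)
```

While you are still revising a graphic, it is usually easier to leave out pdf() and dev.off() and view the plot on screen.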
Missing Data Code (missing data concepts are in a separate module, but it doesn't matter which you watch first)
The following code and videos were created by a former QAI intern, Joanna Harton '16 (now a biostatistician!), so you will hear a different voice.
Please download the code file and the csv file used in the missing data video.
Question 6: What goes in the na.strings argument (used when reading in data, for example with read.csv())?
A comma delimited list of the strings that constitute missingness
A vector containing the strings (in double quotes) of what qualifies as missing data
Show answer
A vector containing the strings (in double quotes) of what qualifies as missing data
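For example, na.strings can be passed to read.csv() as a character vector. In the sketch below, the inline data and the markers "unknown" and "-" are invented for illustration; they are not from the csv file used in the video.

```r
# Hypothetical file contents; "unknown" and "-" are assumed missing-value markers.
csv_text <- "id,age,city
1,34,Boston
2,unknown,-
3,51,Denver"

dat <- read.csv(text = csv_text, na.strings = c("unknown", "-"))

is.na(dat)      # TRUE wherever a marker was converted to NA
class(dat$age)  # "integer", because the text marker is no longer in the column
```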
Dates
Please download the code and example file:
The function new_interval has been retired from the package lubridate and replaced by the function interval. Our code file has been updated, but not the video, so you will see that the code and video differ on that line. (The old command new_interval does still work, but a warning pops up suggesting that you use the new command interval instead.)
Question 7: What is the purpose of converting the interval between the dates to a period?
So that you can convert the period to a graph
So that you can access the number of years between the two dates
So that you can tell if you put the dates in backwards (and have a negative interval)
Show answer
So that you can access the number of years between the two dates
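A short sketch of this step with lubridate is shown below. The two dates are made up, and year() is used here as one way to pull the years component out of the period; the course code file may do this differently.

```r
library(lubridate)

start_date <- ymd("1994-03-10")  # hypothetical dates
end_date   <- ymd("2021-09-01")

span <- interval(start_date, end_date)  # time span anchored to two specific dates
per  <- as.period(span)                 # the same span as years, months, and days

year(per)  # 27 whole years between the two dates
```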
Hooray, now you can get a job cleaning data! Maybe?
During this tutorial you learned:
Data cleaning methods, through an example with the CDC Behavioral Risk Factor Surveillance System (BRFSS) data set
How to create reproducible code for data cleaning
Different ways to manage data sets, including a review of workspace in R
How to update a string object with paste()
To directly read in files for specific file formats with packages (e.g., the read.xport() function from the SASxport package to read in the BRFSS file)
Why to save data and R objects as .R files
About variable types in R and recoding variables
How to create a data frame with data.frame() and matrix with cbind() or rbind()
About how R handles missing data and how to manipulate missing data symbols
To work with dates using the lubridate package
Terms and concepts:
Reproducibility, recoding
Functions in review:
getwd(), paste(), save(), is.na(), class(), is.numeric(), is.character(), is.factor(), as.factor(), as.Date(), data.frame(), cbind(), rbind(), merge(), Sys.Date(), and the na.strings argument used when reading in data
lubridate functions in review:
wday(), interval(), as.period()
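As a brief refresher on a few of the base R functions listed above, the sketch below uses small made-up objects; none of the names or values come from the course files.

```r
# Two hypothetical data frames that share an "id" column.
ids    <- data.frame(id = 1:3, name = c("Ana", "Ben", "Cy"))
scores <- data.frame(id = c(1, 3), score = c(88, 92))

# cbind() treats the vectors as columns; rbind() treats them as rows.
m_cols <- cbind(a = 1:2, b = 3:4)
m_rows <- rbind(a = 1:2, b = 3:4)

# merge() joins on the shared column; all.x = TRUE keeps ids with no score.
merged <- merge(ids, scores, by = "id", all.x = TRUE)

# paste() builds a new string, which is then stored back in the same object.
label <- "BRFSS"
label <- paste(label, "2021")

class(merged)         # "data.frame"
is.na(merged$score)   # FALSE TRUE FALSE
label                 # "BRFSS 2021"
```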