Introduction to Data Analysis

Data Mining and Cleaning

Mining Data

There's a bunch of different data file types (.txt, .csv, .dta, .xls(x), . . .). We'll focus on the two most commonly used: .csv and .xls(x).

Most of the time, mining data is as simple as clicking "download". This subsection focuses on part of the data cleaning process for "Stay-at-Home Orders in a Fiscal Union" (with Mario Crucini).

First, go to USAFacts and download county-level case data at a daily frequency here: https://usafacts.org/visualizations/coronavirus-covid-19-spread-map/. There should be a set of buttons on the right-hand side of the webpage:

If you click on the known cases button, it will load a page with the raw data. We don't want that. Instead, right-click the button and click "Download Linked File As...". I already set up a folder called "Economics in R" that looks like this:

If you haven't set up a folder structure like this yet, it may be helpful to do so now, as we'll save files to several of these folders by the end of the chapter. We see that USAFacts named their case data "covid_confirmed_usafacts.csv". Not a bad name, so we'll keep it and save it in the "Raw" subdirectory of the "Data" folder. While we're at it, let's also download the death and population data to that folder as well.

Pretty simple right? So what if the data isn't already compiled as a .csv file?

Now, go to Homebase and download state-level hours worked data at a daily frequency here: https://joinhomebase.com/data/covid-19/. What happened to the download button? There's just a bunch of tables that look like this:

Not quite as simple this time. But we can still do this in just a few steps.

  1. Click anywhere on the table.

  2. Type Command+A (Ctrl+A on Windows). This should select the entire table.

  3. Copy the table.

  4. Open Excel.

  5. Paste the table.

We're almost done. You should have something in Excel that looks like this:

Now we have several extra columns and rows that we don't need. First, delete Columns A-C. Then, delete all of the rows up to the state-level data (Rows 1-73). Row 1 should now be the column names/dates. Row 2 should be entirely blank. Row 3 should start the data entries. We don't need the blank row, so delete that.

Now, cell A1 should be blank; there's no column name for the state names. Simply call it "State". Your file should now look like:

You should check the "edges" of the table to make sure everything is okay. The top edge should all be column names. The left edge should all be state names in this example. In general, the left edge is some sort of identifier, but it doesn't always have to be the state name. Typically, the right-hand edge is fine. There shouldn't be any extra data "hanging" off the edge.

Most problems occur on the bottom edge. This is typically where notes for the table are located. Sometimes you can delete the note and move on. Other times, the note refers to a specific data entry or group of entries that have been revised. It's wise to check these entries and make sure that they are entered correctly.

The Homebase data is pretty clean, so we didn't run into any of the issues above. Go ahead and name this sheet "Hours". Then create two new sheets: "Businesses" and "Employees". Repeat the above process for the businesses open table and the employees working table on the Homebase page. Notes: Make sure to paste the data in the correct sheet. Remember that we need to add the column name "State" for the state names column after we clean the data.

After you've finished the other two Excel sheets, save the file alongside the USAFacts data in the "Raw" subdirectory of the "Data" folder. I chose to call it "EconomyHomebaseState.xlsx". For raw data, I typically try to name files based on three factors: (1) What is the data about? (2) Where is the data from? (3) What level is the data at? Again, use what works for you.

Data Transformation

We have the raw data files that we need saved in the raw data folder. This is great, but we want "Analytic" data. Specifically, in this case, we want to analyze the relationship between Covid-19 and the economy. How do we do that when all of the data is spread across three csv files and an excel file with multiple sheets? We need to merge() all of these files together. We'll start by using RStudio to load in all three USAFacts datasets and merge them. First, create a new R script.

Notice that we have several new packages compared to the last chapter. This time we'll also use readxl, cdlTools, and tictoc. You may need to install these packages. If so, you can do this in the console now.
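If you're not sure whether you already have these packages, you can install all three at once from the console; this is standard R and nothing project-specific:

  install.packages(c('readxl', 'cdlTools', 'tictoc'))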

Also notice that I have a section for "Filenames Needed". Here, I first set the working directory. Then I assign all of the filenames needed to character vectors. I have not always done this, but it seems to be the easiest way for reproducibility. Sometimes I'll have over a hundred lines of code written before I'm finished reading and merging files. If you change a filename on your computer, it's annoying to scroll through all of those lines; this saves you from that hassle. Setting the working directory is mostly for other people. (We may not care about them as much, but it's still nice to do.) They probably don't save their files in the exact same path as you, so this saves them time. Instead of fixing every file path, they can just change the working directory.
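Since the header itself isn't reproduced here, the sketch below shows roughly what mine looks like. The working-directory path and the deaths/population filenames are assumptions (check the names of the files you actually downloaded); the cases and economy filenames match the ones we saved above.

  # Packages
  library(data.table)
  library(readxl)
  library(cdlTools)
  library(tictoc)

  # Filenames Needed
  setwd('~/Projects/Economics in R/Data')  # change this to match your computer

  filename_Cases      = 'Raw/covid_confirmed_usafacts.csv'
  filename_Deaths     = 'Raw/covid_deaths_usafacts.csv'              # assumed USAFacts filename
  filename_Population = 'Raw/covid_county_population_usafacts.csv'   # assumed USAFacts filename
  filename_Economy    = 'Raw/EconomyHomebaseState.xlsx'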

After you've written and run the code above, we'll start by reading in the file for daily cases. We know that the confirmed case data is saved as a .csv file. For .csv files, we'll use the function fread which is part of the data.table package. It stands for friggin' fast reader. It's one of my favorite parts of the data.table package.

I've read in the file as dt_cases = fread(filename_Cases). As we discussed in the previous chapter, you should now see dt_cases in the Environment panel under the Data subheading. Note: If this didn't happen and the computer gave you an error instead, it's most likely from the setwd('...') line. Make sure this matches where the "Economics in R" folder is on your computer. For example, if you made the "Economics in R" folder directly on your desktop, then you need to remove "Projects/". Windows users may have to change a little more.

Now that the data is read in, let's View() it. Remember, we can click on dt_cases in the Environment panel to open the data browser in the Source panel. You should see:

This is called wide data. No one likes wide data. We like long data. What's the difference? First, what are our identifiers? The main two are "countyFIPS" and "Date". For now, we don't care about the state ones.

"But I don't see a 'Date' column." Yep, that's why this is wide data. For Autauga County, we read the data left-to-right (wide). So in Autauga County on January 22nd, 2020 there were no newly confirmed cases of Covid-19. If we move one column to the right, we see there was also no cases on January 23rd. Although this is nice to read, it's difficult to work with. What if we want to add another variable to this dataset? We could do it, but it would be sloppy and unorganized.


melt()

Unlike wide data, long data reads up and down. To make the data easier to work with, we'll melt() it from wide to long. melt() is a complicated function, so let's learn how to use it. In R, if you don't know how a function works, type the function name, without parentheses, preceded by a question mark in the Console panel. (This is almost the only time I use the console.) In this case, we would type ?melt. The Packages/Help panel should now display the help file.

We want to focus on the "Usage" and "Arguments" sections. The Usage section tells you what the function takes as inputs. It also tells you the defaults for inputs if we don't write anything for them (e.g. variable.name = 'variable'). For melt(), there are two main inputs. These are data and id.vars. As intuitive as it seems, this has to be my most viewed help file by far.

We want to change dt_cases from wide to long. We don't care about keeping the wide format in this circumstance, so we'll overwrite the data.table. To do this, start by writing dt_cases = melt(data = dt_cases, . . .). So far, this line tells the computer that we want to melt the old dt_cases to make a new dt_cases. However, we haven't said how to melt the old dt_cases. We don't want to melt the "State" or "stateFIPS" columns, right? That's because these are additional identifier columns.

In the melt function, all identifier columns need to be reported to the id.vars input. So now we have dt_cases = melt(data = dt_cases, id.vars = c("countyFIPS", "County Name", "State", "stateFIPS")). Although we have the bulk of the function done, there are a few things that are easiest to tidy up now. These are the inputs variable.name, variable.factor, and value.name. What's the difference between variable and value? The easiest way to remember is that the variable is what varies over the columns, while the value is what is being recorded. In this case, the variable is the date and the value is the number of newly confirmed Covid-19 cases. By default, variable.name = 'variable'. We don't want that; I'll forget what "variable" is later.

To finish off, write dt_cases = melt(data = dt_cases, id.vars = c("countyFIPS", "County Name", "State", "stateFIPS"), variable.name = 'Date', variable.factor = FALSE, value.name = 'Cases'). Now the column name for the dates is set to "Date" instead of "variable". Note: I haven't described the reasoning for setting variable.factor = FALSE. Why would the data.table package set it to TRUE by default? We'll circle back to this later.

Click the new dt_cases to see what the melted data.table looks like.

There should now be a column for "Date" and a column for "Cases". Although we can't read all of the values for Autauga County as fast now, this long format is much easier to work with. What if we want to add another variable to this new dataset? Now, it's simple. Hopefully you remember from the previous section how to add/update a column in a data.table. Hint: :=
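As a quick, purely hypothetical refresher (this column is just for illustration and isn't part of the dataset we're building):

  # add a new column with := ...
  dt_cases[, LogCases := log(1 + Cases)]
  # ... and remove it again, since we don't actually need it
  dt_cases[, LogCases := NULL]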

Now, read in the Covid-19 deaths data from USAFacts as dt_deaths. After reading the data, melt() it from wide to long. Remember, the value isn't cases anymore. The value is what is being recorded. In this case, value.name = 'Deaths'. Everything else can remain the same. At this point, you should have two nearly identical data.tables loaded in. The first five columns in each should be location and time identifiers and the last column should be either "Cases" or "Deaths".
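A sketch of those two steps, using the filename_Deaths variable assumed in the header above:

  dt_deaths = fread(filename_Deaths)
  dt_deaths = melt(data = dt_deaths,
                   id.vars = c('countyFIPS', 'County Name', 'State', 'stateFIPS'),
                   variable.name = 'Date', variable.factor = FALSE,
                   value.name = 'Deaths')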


merge()

How do we combine these two tables together? We'll use the merge() function. We could create a new data.table called dt_covid by writing dt_covid = merge(dt_cases, dt_deaths, by = c('countyFIPS', 'stateFIPS', 'Date')). Notice, in this case, that we don't have to include the other two identifiers "County Name" and "State". These are redundant and will slow down the merge. Speaking of redundancy, isn't the economic data we downloaded from Homebase at the state-level? Why are we working with county-level data in this case? The answer is: we shouldn't be.

This leaves us with two options. We can do the merge at the county level making dt_covid, and then aggregate the county data to the state level (Similar to what we did in the first chapter with the gender wage gap data). Or we could aggregate dt_cases and dt_deaths to the state level and then merge them to make dt_covid. At first glance, it may seem like it's better to do the merge first since it only requires two steps. Let's check!
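A minimal sketch of that comparison (the grouping columns follow the identifiers above, and each block is wrapped in tic()/toc() so we can time it):

  # Option 1: merge at the county level, then aggregate to the state level
  tic()
  dt_covid = merge(dt_cases, dt_deaths, by = c('countyFIPS', 'stateFIPS', 'Date'))
  dt_covid = dt_covid[, .(Cases = sum(Cases), Deaths = sum(Deaths)), by = .(stateFIPS, Date)]
  toc()

  # Option 2: aggregate each table to the state level, then merge
  tic()
  dt_cases_state  = dt_cases[, .(Cases = sum(Cases)), by = .(stateFIPS, Date)]
  dt_deaths_state = dt_deaths[, .(Deaths = sum(Deaths)), by = .(stateFIPS, Date)]
  dt_covid = merge(dt_cases_state, dt_deaths_state, by = c('stateFIPS', 'Date'))
  toc()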

Type and run the code above. (These run fast, so highlight from tic() to toc() for each time test and then click the Run button.) You should get results in the range below. Note: The functions tic() and toc() are from the tictoc package mentioned earlier.

"Wow, Oscar, you saved less than half a second!" I know this seems inconsequential in this example, but it can make significant differences when working with larger datasets. Merging keeps getting slower the more groups you add. I've wasted a lot of time crashing my computer from poorly-written merges. Don't be me.

Now that I'm done with that rant, let's move on. Read in the population data from USAFacts. Similar to above, aggregate the population data to the state level first, then merge it with dt_covid. If you've been copying my code, you should run into an issue.

Notice that dt_covid has "stateFIPS" with the state FIPS code as the identifier, but dt_population has "State" with the state abbreviation as the identifier. The merge identifiers have to match: a state FIPS code with a state FIPS code, or an abbreviation with an abbreviation. To fix this, we'll use the function fips() in the cdlTools package to first add a "stateFIPS" column to dt_population, then merge the two. Your code should look something like this:
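This sketch assumes the county population column in the raw file is named 'population'; check names(dt_population) and adjust if your file differs.

  dt_population = fread(filename_Population)

  # add the state FIPS code so the identifier matches dt_covid
  dt_population[, stateFIPS := fips(State)]
  dt_population[is.na(stateFIPS)]  # sanity check: this should print an empty data.table

  # aggregate the county populations to the state level
  dt_population = dt_population[, .(Population = sum(population)), by = .(stateFIPS, State)]

  # merge the state-level population into the Covid data
  dt_covid = merge(dt_covid, dt_population, by = 'stateFIPS')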

Last Steps

On to the last step: we now need to read and merge the economic data with the Covid data. Unlike the .csv files from USAFacts, the Homebase data is spread across three sheets of an Excel file, so we need a different file reader. We'll first use the function read_excel() in the readxl package. Write dt_hours = read_excel(filename_Economy, sheet = 1). Notice that I wrote dt_hours, but if we write class(dt_hours) the console will show that it is actually a data.frame. Before we do anything else, we need to convert this to a data.table to stay consistent with our naming structure. Simply write dt_hours = data.table(dt_hours). Just to double-check, write class(dt_hours) in the console again. Note: I normally do these two steps in one line, so that the naming convention is not obscured: dt_hours = data.table(read_excel(...)).

We should know what the data looks like since we input it in excel, but let's look at it in RStudio.

"I thought the columns were dates? Why are there weird numbers as the column names?" Short answer: It's something weird with the Excel dating system. We'll fix this in a moment, but first let's melt() the data from wide-to-long as before. You should have three columns: "State", "Date", and "Hours". To fix the date column, we write dt_hours[, Date := as.Date(as.numeric(Date), origin = '1899-12-30')]. Note: The origin for the excel dating system is December 30th, 1899.

If you View() the melted data.table, you should notice that the dates are all in 2021. However, the data was recorded in 2020. This is because we did not provide a year when we pasted the data into Excel, so Excel filled in the current year. To fix this, we need to sub()stitute "2021" with "2020": dt_hours[, Date := sub('2021', '2020', Date)]. Note: If you do not understand the substitute function, remember you can look at the details by writing ?sub in the console.

We're almost ready to merge the hours worked data with the Covid data, but what's the problem? Our identifier columns do not match. In dt_covid, "Date" is written in m-d-yy format while it's in yyyy-mm-dd format in dt_hours. Don't use m-d-yy format. It's terrible. Always use yyyy-mm-dd. So in this case, we're going to fix the date in dt_covid. We do this by writing dt_covid[, Date := as.Date(Date, format = '%m/%d/%y')].

Lastly, we need to fix the state identifier. dt_covid has both the state FIPS code and the state abbreviation, while dt_hours only has the state name as an identifier. We'll add the FIPS code to dt_hours and merge on that. As before, we can write dt_hours[, stateFIPS := fips(State)]. However, you should notice that for "Delaware" the state FIPS is entered as NA. This is a bug in the cdlTools package. That's why it's important to remember to check packages for bugs! We can fix this pretty simply with the ifelse() function: dt_hours[, stateFIPS := ifelse(State == 'Delaware', 10, fips(State))].

We can finally merge the two datasets as normal, dt_covid = merge(dt_covid, dt_hours, by = c('stateFIPS', 'Date')). For now, skip this step. Why? We also want to add the businesses open and employees working data to dt_covid. Rather than copy and paste the same code three times, we'll write a for loop to do this.
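A sketch of what such a loop might look like; it repeats, for each sheet, the steps we just walked through for dt_hours, plus the two extra lines discussed in the notes below. Treat it as an outline under those assumptions rather than a transcription of my actual script.

  # merge() needs the "Date" columns in both tables to be the same type;
  # after the sub() step the economic dates are characters, so convert dt_covid's too
  dt_covid[, Date := as.character(Date)]

  sheet_names = c('Hours', 'Businesses', 'Employees')
  for (s in 1:3) {
    # read and melt the s-th sheet
    dt_econ = data.table(read_excel(filename_Economy, sheet = s))
    dt_econ = melt(data = dt_econ, id.vars = 'State',
                   variable.name = 'Date', variable.factor = FALSE,
                   value.name = sheet_names[s])

    # fix the Excel dates and the missing year
    dt_econ[, Date := as.Date(as.numeric(Date), origin = '1899-12-30')]
    dt_econ[, Date := sub('2021', '2020', Date)]

    # add the state FIPS code (with the Delaware fix) and drop the state name
    dt_econ[, stateFIPS := ifelse(State == 'Delaware', 10, fips(State))]
    dt_econ[, State := NULL]

    # merge this sheet's variable into the Covid data
    dt_covid = merge(dt_covid, dt_econ, by = c('stateFIPS', 'Date'))
  }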

Notes: You should notice a few extra lines. First, on line 125 of my script, I change the column "Date" in dt_covid from a Date to a character. This is simply for the merge() function to work; the "Date" columns in dt_covid and dt_econ need to be of the same type. Line 152 of my script is also new. Notice that there is already a column "State" in dt_covid before the merge. Writing dt_econ[, State := NULL] removes the column "State" from dt_econ, so that we don't end up with four "State" columns when we are finished merging.

Writing Data

Your final dataset should now look like:

That's great, but we're already over 150 lines of code. We don't want to re-write this, or have to copy and paste it, every time we want to work with our data. Instead, we'll save it to the "Analytic" subdirectory in the "Data" folder for this project. This step is simple: fwrite(dt_covid, file = 'Analytic/CovidEconomyState.csv').
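In my script, that line sits inside an if statement, roughly like this:

  if (FALSE) {
    fwrite(dt_covid, file = 'Analytic/CovidEconomyState.csv')
  }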

I like to put this line of code into an if statement as above. If "CovidEconomyState.csv" already exists in the "Analytic" folder, running fwrite() again will overwrite the previous version, so the if statement is a precaution against overwriting data you wanted to keep. if (FALSE){...} tells the computer not to run the lines inside the brackets. Thus, you won't accidentally overwrite the data if you click the Source button; instead, you have to deliberately run the line of code. Final Note: Don't forget to save the .R file. This should go in the "Data" subdirectory of the "Code" folder. I saved mine as "CreateCovidEconPanel.R".

Data Analysis

We finally have a dataset we can work with, but what are we supposed to do with it? There are a few things that should be done in the background of any project: correlation tables, basic regressions (essentially conditional correlations), and plots of regression residuals. These normally don't make the final cut for the text of a paper, but they can help you understand the data.

Before we get started, create a new .R file called "DataExploration.R" in the "Empirics" subdirectory of the "Code" folder. We're adding several new packages for this section: stargazer, ggplot2, and ggthemes. Again, you'll need to install these first using the console. stargazer is great for making tables ready to export directly to LaTeX. ggplot2 is awesome; it gives you lots of flexibility in making graphs. I don't think many people will push back if you say that ggplot graphs are better than Stata graphs. ggthemes just makes the graphs look nicer.
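A sketch of that header; the working-directory path is an assumption, and filename_panel points at the analytic file we saved at the end of the last section:

  # Packages
  library(data.table)
  library(stargazer)
  library(ggplot2)
  library(ggthemes)

  # Filenames Needed
  setwd('~/Projects/Economics in R/Data')  # change this to match your computer
  filename_panel = 'Analytic/CovidEconomyState.csv'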

Your header should now look like the above. After copying the header, read in the data we just saved. In case you forgot: dt_state = fread(filename_panel).

Correlations

We'll start by checking the cor()relations among the data. To do this, we'll write cor(dt_state[, c('Population', 'Cases', 'Deaths', 'Hours', 'Businesses', 'Employees')]). The console should display:

As expected, Cases and Deaths have a large positive correlation, as do Hours, Businesses, and Employees with one another. We also see that Cases has a slight negative correlation with the economic variables. Population has a slightly positive correlation with the Covid variables, but almost no correlation with the economic variables.

Although this helps describe the overall relationship among the data, remember that we have a daily, state-level panel. Are these correlations being driven by changes over time or differences across states?

Let's take a look. First, we're going to create four subsets of the data. Two subsets look at the cross-section: one on March 1, 2020 (subset_March01 = dt_state$Date == '2020-03-01') and another on May 13, 2020 (subset_May13 = dt_state$Date == '2020-05-13'). This allows us to compare the correlations at the onset of the virus with those later in the pandemic. The other two subsets look at the time series. I chose Ohio (subset_Ohio = dt_state$State == 'OH') and New York as my states, but feel free to choose other ones.

Now, re-run the correlations for each subset. (If you forgot how to perform an operation on only certain rows of a data.table, remember that it is set up as dt[i, j, by]. For March 1st, we would then write cor(dt_state[subset_March01, c(...)]).)
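A sketch that puts the subsets and the correlation calls together. The name subset_NewYork is my own choice, and the ..vars prefix is data.table shorthand for "treat vars as a vector of column names":

  # cross-section subsets
  subset_March01 = dt_state$Date == '2020-03-01'
  subset_May13   = dt_state$Date == '2020-05-13'

  # time-series subsets
  subset_Ohio    = dt_state$State == 'OH'
  subset_NewYork = dt_state$State == 'NY'

  vars = c('Population', 'Cases', 'Deaths', 'Hours', 'Businesses', 'Employees')
  cor(dt_state[subset_March01, ..vars])
  cor(dt_state[subset_May13,   ..vars])
  cor(dt_state[subset_Ohio,    ..vars])
  cor(dt_state[subset_NewYork, ..vars])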

I'll let you go through the correlations in detail yourself, but I'll highlight a few pieces. Comparing the first two correlation tables, we see that Cases has a stronger negative correlation with the economic variables on May 13th than it does on March 1st. Also notice that Cases has a stronger negative correlation with the economic variables in New York than in Ohio. Note: The warning messages appear because the populations for Ohio and New York do not change over time, so Population has zero variance within each state.

Regressions

Correlation tables help describe the unconditional relationship between variables. However, they don't do a great job of capturing conditional relationships. Let's take a further look at the correlations for the May 13th subset. Notice specifically three relationships: (1) Population is negatively correlated with the economic variables, (2) Cases is negatively correlated with the economic variables, and (3) Population and Cases are positively correlated. How, then, do we separate the relationships of Cases and of Population with the economy?

We'll use a regression. In R, the regression function is lm() which stands for linear model. Let's regress hours worked on cases and population, lm_HoursMay = lm(Hours ~ Cases + Population, data = dt_state, subset = subset_May13). Again, if you do not understand the regression function, type ?lm.

To see the results of the linear model, we use summary(lm_HoursMay).

Notice that, after controlling for Cases, Population does not have a statistically significant relationship with hours worked. However, the relationship between Cases and Hours worked is statistically significant. This suggests that, in the May 13th correlation table above, the correlation between Population and Hours worked is driven by the fact that states with larger populations also had more cases. This is why it's important to run regressions to more fully understand the relationships among variables.

Residuals

So you ran a regression. Sure, the coefficient on Cases was significant at the 95% level. But what if we want to know more about the overall model fit? Residual plots help a bunch. To generate these, type plot(lm_HoursMay).
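By default, plot() on an lm object draws these diagnostic plots one at a time; to see all four panels at once in the layout described below, you can set the plotting grid first:

  par(mfrow = c(2, 2))  # 2 x 2 grid of plots
  plot(lm_HoursMay)
  par(mfrow = c(1, 1))  # reset the layout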

What do these plots actually tell us?

  • Top-left Panel (Non-linearity and heteroscedasticity)

  • Top-right Panel (Distribution of residuals; points along the dashed line mean the residuals are approximately normally distributed)

  • Bottom-left Panel (Heteroscedasticity)

  • Bottom-right Panel (High Leverage Points)

For now, we'll focus on the bottom-right panel. We want to look for observations outside of the red, dashed lines. These observations have a large Cook's distance and can significantly affect our results. Let's re-run our regression without observation 2442. To find this observation, type dt_state[2442, ] in the console. This should show you that New York is the observation.

Optional Reading: Residual plots aren't often covered in economics courses. However, they can be important for model comparisons. To gain an introductory understanding of them, go here: https://data.library.virginia.edu/diagnostic-plots/. For a more in-depth analysis, read Section 3.3.3 in An Introduction to Statistical Learning.

Further Exploration: Create another subset vector to exclude New York, exclude_NewYork = dt_state$State != 'NY'. Now, re-run the regression with the extra exclusion, lm_HoursMayNY = lm(Hours ~ Cases + Population, data = dt_state, subset = subset_May13 & exclude_NewYork). Again, type summary(lm_HoursMayNY) to see how this affected the results.