Introduction to R

Why R?

Simply, it's what I use 99% of the time. But also:

R is free. Not all students get licenses through their school for statistical software packages. Coding in R helps increase the reproducibility of research. In a matter of minutes, anyone can run a standard R file on their computer for free. This helps non-academics reproduce your research as well. There's also no paying for updates. If your version of R is outdated, just download the new version. For free.

R is open source. There are over 15,000 packages for R. At some point, someone probably had the same question as you and designed a package for it. Simply plug and go for your analysis. Disclaimer: Some view open source as a negative aspect of R, citing that the packages may not be reliable. This is typically not an issue for commonly used packages. Just remember to be careful.

Downloading R

If you use a Mac, go here: https://cran.r-project.org/bin/macosx/

Windows, go here: https://cran.r-project.org/bin/windows/base/

You also need to download RStudio: https://rstudio.com/products/rstudio/download/#download

Windows users should also install Rtools (Skip for Mac users): https://cran.r-project.org/bin/windows/Rtools/


After you've finished downloading R/RStudio and following their installation instructions, you never actually need to open R. For reference, R is a language. You need to download R, so that your computer learns the language.

RStudio is just an interface. RStudio essentially allows you to have a conversation with your computer in R (not English or Chinese or whatever your native language is).

RStudio Interface

Again, you don't need to open R. You only need to open RStudio. When you open RStudio for the first time, it should look similar to Figure 1. There are three main panels: the Console, the Environment, and the Plot/Help Panel.

Figure 1: RStudio Interface

We'll start with the console. The console allows you to communicate with your computer in R. Typically, your computer will respond back to you. It helps to think of coding as a conversation. You ask your computer a question, and your computer responds with an answer.

Say, for instance, you forgot what one plus one is. How do you find out? In the console, simply type:

1 + 1

This needs to be typed after the blue arrow:

After typing 1+1, hit enter to get an answer from your computer:

The computer should calculate the correct answer, 2. At this point, you're practically a professional in R. You've learned how to talk to your computer, and that's half the battle. The rest is learning how to ask your computer more difficult questions (i.e. learning more of the R language).

Notice that the computer also returned [1]. This is not technically part of the answer. It is the index for the value 2. Why? Because the computer returned a vector where 2 is the first and only element. This leads us to our next section: Data Types and R-objects.
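As a small sketch of the index idea (the object name answer is mine and only for illustration), you can check that the result of 1 + 1 really is a vector with one element:

```r
# Store the result so we can inspect it (answer is a hypothetical name).
answer = 1 + 1

# Printing shows the index of the first element, then the value.
print(answer)    # prints: [1] 2

# The result is a vector with exactly one element.
length(answer)

# Square brackets pull out an element by its index.
answer[1]
```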

Data Types and R-objects

What is a vector? A vector is simply a collection of elements. In the previous example, 2 was the only element in the collection. Let's fix this.

To make a vector with more than one element, use the function c(). For example:

I hit enter. Why didn't the computer say anything back?

Notice that I entered vec_y = c(...). The computer does not read this as a question, so it does not provide an answer. Instead, it reads the line of code as a statement. Specifically, the computer reads "combine the values (24, 28, 27, 28, 22, 25) into a vector object with name vec_y." Remember the Environment panel?

Now there is an object saved to the global environment. Suppose this vector is wages observed from a cross-section. Rather than explicitly type out c(24, 28, . . .), we can simply write vec_y to refer to the wages.
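Written out, the statement and a follow-up question look like this. This is a minimal sketch using the wage values from the text:

```r
# Statement: combine six values into a vector object named vec_y.
# R stores the object and prints nothing.
vec_y = c(24, 28, 27, 28, 22, 25)

# Question: typing the object's name asks R to print it.
vec_y
#> [1] 24 28 27 28 22 25

# The name now stands in for the values, for example when counting them.
length(vec_y)
```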

To help grasp the concept of data types and objects in R, create another vector called vec_x. Assign the first three values of the vector to be "Male" and the last three "Female".
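One way to write this, as a sketch (typing each value out by hand is just the most direct option):

```r
# Strings go in quotation marks; the first three are "Male", the last three "Female".
vec_x = c("Male", "Male", "Male", "Female", "Female", "Female")

# Print the vector to check it.
vec_x
```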

This should have also added vec_x to the global environment. Notice that there are a few differences between vec_x and vec_y. First, Male and Female are in quotation marks whereas the numbers in vec_y are not. That's because Male and Female are strings, and (24, 28, . . .) are numbers. Thus, they are saved as different data types. We'll focus specifically on four data types in R:

  1. Character ("Female", "24", "TRUE")

  2. Numeric (24, 2, 1.68)

  3. Integer (1L, 5L, 20L)

  4. Logical (TRUE, FALSE)


If you forget the data type of a vector you saved, look at the environment panel.

Alternatively, you can use the function typeof() with the object in question as the input. In the example above, all elements in a given vector are of the same type. But what if we want to make a "vector" of two different element types? First, let's combine vec_log (a logical vector) with vec_y, and then combine vec_y with vec_x, saving each result as a new R-object. You can do this by typing vec_log_y = c(vec_log, vec_y) and vec_y_x = c(vec_y, vec_x) in the console.
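Here is a sketch of both steps. The values in vec_log are an assumption for illustration, since the text does not list them. Also note that typeof() reports numeric vectors as "double", which is R's storage name for ordinary numbers.

```r
vec_y   = c(24, 28, 27, 28, 22, 25)
vec_x   = c("Male", "Male", "Male", "Female", "Female", "Female")
vec_log = c(TRUE, FALSE, TRUE)   # assumed values, for illustration only

# Ask for the type of each vector.
typeof(vec_y)     # "double"
typeof(vec_x)     # "character"
typeof(vec_log)   # "logical"
typeof(5L)        # "integer"

# Combining different types forces every element into one common type.
vec_log_y = c(vec_log, vec_y)   # TRUE becomes 1, FALSE becomes 0
vec_y_x   = c(vec_y, vec_x)     # numbers become strings such as "24"

typeof(vec_log_y)   # "double"
typeof(vec_y_x)     # "character"
```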

Notice that vec_log_y is of type numeric and vec_y_x is of type character. The first merge changed TRUE to 1 and FALSE to 0. The second merge put all of the numbers in vec_y in quotes. These changes happened because all elements must be of the same type when using a vector as the R-object. If we want the elements to maintain their original types, we must use a different R-object. We'll focus on 6 R-objects throughout the book:

  1. Vectors (1-dimensional, elements all same type)

  2. Lists (1-dimensional, elements different types)

  3. Matrices (2-dimensional, elements all same type)

  4. Data Tables (2-dimensional, columns different types, elements within column same type)

  5. Factors

  6. Data Frames


We'll skip the details behind each R-object for now. The main takeaways are the "dimensions" of each object and the restrictions for the elements. Lists and vectors are 1-dimensional while matrices and data.tables are 2-dimensional (rows and columns).

Vectors and matrices require all elements to be the same type while lists and data.tables can contain different types of elements. Thus, if we wanted to combine vec_y and vec_x with each element keeping their own type in the example above, we would write list(vec_y, vec_x).
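A minimal sketch of the list version follows. The name list_y_x is hypothetical and chosen to match the naming used so far.

```r
vec_y = c(24, 28, 27, 28, 22, 25)
vec_x = c("Male", "Male", "Male", "Female", "Female", "Female")

# A list keeps each piece as its own element, with its own type.
list_y_x = list(vec_y, vec_x)

length(list_y_x)         # two elements: one per vector

# Double square brackets pull out one element of a list.
typeof(list_y_x[[1]])    # "double"
typeof(list_y_x[[2]])    # "character"
```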

Except for Data Tables, all of the R-objects above are included in base R. This means you do not need to install packages in order to use them. Any time that you open R, you can type c(...) and create a vector, list(...) to create a list or matrix(...) to create a matrix. However, if you wish to create a data.table, you'll first need to install the package.

Packages

What are R packages? Typically, a package includes extra functions to speed up your code, or ask your computer different questions. As mentioned earlier, there are over 15,000 R packages. Most of these are stored on the Comprehensive R Archive Network (CRAN). Packages stored on CRAN are easy to install. In the console, type install.packages("data.table").

Something similar should display in your console after hitting enter. I already had the package data.table installed on my computer. In this case, install.packages() simply updates the data.table package.

Remember the Plot/Help panel? If we take a look under the tab "Packages", we should see that the data.table package is installed.

Notice that the package datasets has a blue checkmark next to it while data.table does not. This is because we have not yet loaded in the data.table package to R. We have simply installed the package to the computer.

What happens if we try to use a function in the data.table package right now? Type dt_Wages = data.table(Wages = vec_y, Gender = vec_x) into the console.

This should have generated an error. The first line Error in data.table(Wages = vec_y, Gender = vec_x) : tells you where the error is in your code. The second line could not find function "data.table" tells you what the error is.

Another way to think of packages is as vocabulary or slang. When we installed R at the beginning of the chapter, we taught our computer the language. When we install packages, we teach the computer additional vocabulary/slang in the R language. Every time you open RStudio, it assumes you're using "regular" language (base R). If you wish to use a package (slang), you need to tell R to talk that way. We do this by typing library(data.table). Note: When installing a package, the package name goes in quotes: install.packages("data.table"). When adding a package to the library, no quotes are needed: library(data.table).

After typing library(data.table), there should be a blue checkmark next to the data.table package.

If you forget what packages/slang you are currently using, you can always look at this panel to see which packages have checkmarks. Remember, every time you close and re-open R, you need to say which packages you want to use.

Now that we told the computer that we're using data.table slang, let's try to make a data.table of wages again, dt_Wages = data.table(Wages = vec_y, Gender = vec_x). There should not be an error in the console this time. Now that the computer understood what we were saying, it should have added dt_Wages to the environment panel.
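Putting the pieces together, the sequence can be sketched as below. I've commented out the install line on the assumption that data.table is already installed on your computer. It only needs to be run once.

```r
# Install once per computer (package name in quotes).
# install.packages("data.table")

# Load every time you open RStudio (no quotes needed).
library(data.table)

vec_y = c(24, 28, 27, 28, 22, 25)
vec_x = c("Male", "Male", "Male", "Female", "Female", "Female")

# Each named argument becomes a column.
dt_Wages = data.table(Wages = vec_y, Gender = vec_x)

# Print the data.table to the console.
dt_Wages
```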

Notice that dt_Wages is located under the subheading Data. All of the previous vectors that we made are under the subheading Values. Why? Recall that vectors and lists are "one-dimensional" while matrices and data.tables are "2-dimensional." A general rule of thumb is that R-objects with the capability to have more than one dimension are allocated to the Data subheading. So even if we made dt_Wages with just one column, it would still go under Data.

From here, there are several things we can do to view the data. Notice the blue arrow to the left of dt_Wages. If we click on the arrow, we should see

Now, we see two columns in dt_Wages. These are Wages and Gender. We also see the type of each column. Wages is numeric while Gender is a character column. Notice that this cuts off part of the Gender column. If we want to View() the entire data.table, then we need to click anywhere on the dt_Wages line except for on the blue arrow.

This should have created a new panel. I'll refer to this as the Source/Data Panel. This panel now displays a visual representation of the entire dataset. The "first" column is just the row index.

Now, close RStudio. A pop-up asking you to "Save workspace image" should appear. Click "Don't Save".

I almost always recommend clicking "Don't Save". Why? Clicking save would have saved all of the Data and Values in the Environment Panel. If you have a lot of data loaded in, this will slow down RStudio the next time you try to open it.

Now, reopen RStudio. Everything in the Environment Panel should be gone. The Console Panel should not have any code in it. "But what if we wanted to work with cross-section data of wages and gender again? You just told me not to save anything."

The next section shows the steps to create, write, and save a .R file (think .do file in Stata).

Writing Code

Why write a .R file? Why not just click "save the workspace image" when we closed out? Remember that writing code is essentially just having a conversation with your computer. Saving a .R file will allow you to "record" the entire conversation while saving the workspace environment only records the answers the computer gave to your questions. This is why saving .R files is important.

Now, let's create dt_Wages again, but not in the console. At the top left of the RStudio interface, there is a piece of paper with a green "+". We want to click this and then click "R Script".

The Source/Data panel should have opened again, and now we can start writing a .R file. (From now on, you should almost exclusively talk to your computer through the Source panel. The Console and the Environment panel will show you what the computer says back.)

Now, write the code to create dt_Wages again in the Source panel rather than the Console. If you forgot the values of the wage and gender vectors, we used vec_y = c(24, 28, 27, 28, 22, 25) and vec_x = c('Male', 'Male', 'Male', 'Female', 'Female', 'Female'). Remember, since we reopened RStudio, we have to say that we are using data.table slang. The end result should look like this:

Some quick notes while we're here. You can skip to the next shaded section if you like.

Always comment your code! I can't stress this enough. Projects last a long time. Our memories don't. There's nothing worse than wasting a day trying to figure out what your code says after you haven't looked at it for six months. This is even more important if you're working on a joint-project with multiple people coding. It's also important for people trying to replicate your project after it gets published in the AER.

While we're on organization, you've probably noticed how I like to name R-objects by now. I first write an abbreviation of the type of the object (dt for data.table, vec for vector, df for data.frame, mat for matrix). Why? Most of the time, you'll be working in the Source panel with a .R file. If you don't run the code, it won't generate the Data and Values in the Environment panel, so you might not know what type of R-object you're working with. This can create issues as some functions only work for data.frames, but not data.tables and vice versa.

After the R-object abbreviation, I write an underscore followed by the main variable the object is recording. For vectors, this is simple. For objects with more than one dimension, I like to write the dependent variable. Thus, dt_Wages. This is just personal preference. Do what works for you. Just try your best to stay consistent within a project (something I'm still working on).

Last part on organization. I like to start my .R files with two lines. The first line says which project the code is for. The second line says what the code does.

You should notice two things slightly different from before. I added a line of code to clear the contents of the working environment, rm(list = ls()). Let's say that you just ran a separate .R file that saved some Data and Values. These could mess up the analysis of our new file which is why we clear the contents.

You should also notice on line 13 that, rather than writing out Male and Female three times each, I used the function rep(). This function replicates a value x. In this case, I said rep(x, each = 3), but the default form is rep(x, times = 3), which can be shortened to rep(x, 3). What's the difference? The default rep function will replicate the entire vector x three times. This would result in ("Male", "Female", "Male", ...), but we wanted ("Male", "Male", "Male", "Female", ...). Using each = 3 says replicate each element in x three times sequentially. Note: rep() is a commonly used function for me. Having to distinguish between whether I want to use each or the default times is also a common occurrence, which is why I mention it here.
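A quick sketch of the two behaviors side by side (vec_times and vec_each are hypothetical names used only for this comparison):

```r
x = c("Male", "Female")

# times: repeat the whole vector three times.
vec_times = rep(x, times = 3)
vec_times
#> [1] "Male"   "Female" "Male"   "Female" "Male"   "Female"

# each: repeat every element three times before moving to the next.
vec_each = rep(x, each = 3)
vec_each
#> [1] "Male"   "Male"   "Male"   "Female" "Female" "Female"

# Leaving the argument name off means times.
identical(rep(x, 3), vec_times)
```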

Now, we want to run our code. There are three different ways to run the .R file:

  1. Line-by-line

  2. By Section

  3. Entire File

During writing and debugging, I don't typically recommend running the entire file. This is more useful after you've finished writing a file to make sure there aren't mistakes. To do this, click the "Source" button in the top right of the above figure.

I usually run line-by-line while I'm writing a new file. You do this by clicking the "Run" button in the top right of the figure above. Note that this will run the line where your cursor currently is in your .R file. So if you want to create the wage vector, click on line 12 to place your cursor there and then click "Run". Make sure to pay attention to your cursor! Running the wrong line of code can often mess things up.

To run a section of code, you also use the "Run" button. However, instead of having your cursor on a specific line, you need to highlight the lines to run. If you want to create the wage and gender vectors, it should look like this:

This will also run line 11 which is just a comment. The computer won't respond to comments. Think of these as "thoughts in your head".

Using data.tables

This section explains the basics of data.tables. For any empirical work, data.tables will be our workhorse. They're basically a data.frame on steroids. Honestly, one of the coolest things I've ever seen. I'm only describing the basics of data.tables in this section. They can do a lot more, which gets slowly introduced throughout the book. If you ever have issues or want to see some of the other cool things, then check the recommended readings.

Recommended Reading: Introduction to data.tables and the FAQ. Section 1 of the FAQ is specifically for beginners.

Note: After R inevitably wins the R vs Stata argument, most people move on to the data.table vs dplyr argument. dplyr was created as part of the "tidyverse" by Hadley Wickham. This book will focus only on data.tables. In your free time, it might be wise to acquaint yourself with the dplyr language as it varies quite drastically from base R and data.table.

Go ahead and click the "Source" button if you haven't yet. This should have run the entire file and created the data.table dt_Wages as before.

We currently have each individual's hourly earnings. Let's say that we want to add a new column of their yearly earnings. But we already made the data.table, how do we add the column? To add new columns in a data.table, we use := to assign the values. We can approximate yearly earnings by 2000*Wages. To update the dt_Wages, we write dt_Wages[, EarningsYearly := 2000*Wages]. Run this line by itself as was described above. Then click on dt_Wages in the Environment panel to open the data in the Source/Data panel.

So what happened step-by-step? First, we wrote dt_Wages. This tells the computer which object we are talking about. Then we wrote [, ]. Why is the comma needed? A data.table is set up as dt[i, j, by]. For now, forget about the by section. This is then just standard matrix notation, mat[i, j], where i represents the rows and j represents the columns. Since we left the i section blank, this means "for all rows". After the comma, we wrote EarningsYearly :=. This says to create/update a column named "EarningsYearly". Lastly, we wrote 2000*Wages. This says what to assign to the elements of the new column.
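As a self-contained sketch (assuming data.table is installed and rebuilding dt_Wages from the values used earlier), the step looks like this:

```r
library(data.table)

# Rebuild the example data so this block runs on its own.
dt_Wages = data.table(Wages  = c(24, 28, 27, 28, 22, 25),
                      Gender = rep(c("Male", "Female"), each = 3))

# i is blank (all rows); j creates the column EarningsYearly.
dt_Wages[, EarningsYearly := 2000 * Wages]

# := updates dt_Wages in place, so there is no need to re-assign it.
dt_Wages
```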

But what if we only wanted to select some rows? Assume that the approximation of yearly earnings of 2,000 times hourly wages was calibrated to the average male hours worked per year. Next, assume that females work 5% fewer hours per year. The values for females are then overestimated. To correct this, we can update the EarningsYearly column just for females by writing dt_Wages[Gender == 'Female', EarningsYearly := 0.95*EarningsYearly]. (You may need to click on the "Untitled1" tab to return to writing your .R file.)

After writing, run just this new line. If you click on the "dt_Wages" tab in the Source/Data panel, you may see that nothing updated. Why? Because you're still viewing the old data. To view the new data, again, click dt_Wages in the Environment panel. Now it should have updated the values for females.
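Here is the same update as a runnable sketch. It rebuilds dt_Wages first so that it does not depend on anything already in your environment.

```r
library(data.table)

dt_Wages = data.table(Wages  = c(24, 28, 27, 28, 22, 25),
                      Gender = rep(c("Male", "Female"), each = 3))
dt_Wages[, EarningsYearly := 2000 * Wages]

# i selects only the rows where Gender is "Female";
# j then rescales EarningsYearly for those rows and leaves the rest alone.
dt_Wages[Gender == "Female", EarningsYearly := 0.95 * EarningsYearly]

# Male rows are unchanged; female rows are 5% lower than before.
dt_Wages
```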

At this point, you've learned how to create/update columns. You've also learned how to assign values to specific rows in a column. So you should be starting to get used to what i and j do. But what about the by part in dt[i, j, by]? Let's assume that we also see the household each individual belongs to, vec_hh = c(1, 1, 2, 2, 3, 3) or if you remember from earlier vec_hh = rep(c(1, 2, 3), each = 2). Create a new column called id_household with these values in dt_Wages.

Now, let's calculate yearly household income. To do this, we write dt_Wages[, EarningsHH := sum(EarningsYearly), by = c('id_household')]. If you view the data, you should now see:

You can manually check that this calculated each household's income correctly. Now, you should start to have an understanding of dt[i, j, by]. As one more example, let's say we're writing a report on the gender wage gap. So we want to calculate the average earnings for males and females separately. Also assume we want to analyze both the unconditional gender wage gap for yearly earnings and the conditional (on hours worked) gap. Similar to above, we could make two new columns in dt_Wages by writing dt_Wages[, `:=`(EarningsGender = mean(EarningsYearly), WagesGender = mean(Wages)), by = Gender]. Note: A backtick (grave accent) ` goes before and after := when assigning multiple columns in a data.table. This is not an apostrophe, '.
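Both by examples can be sketched in one self-contained block. The data are rebuilt from the earlier steps, and the line breaks inside `:=`() are just my preferred spacing.

```r
library(data.table)

dt_Wages = data.table(Wages  = c(24, 28, 27, 28, 22, 25),
                      Gender = rep(c("Male", "Female"), each = 3))
dt_Wages[, EarningsYearly := 2000 * Wages]
dt_Wages[Gender == "Female", EarningsYearly := 0.95 * EarningsYearly]

# Household identifier: two people per household.
dt_Wages[, id_household := rep(c(1, 2, 3), each = 2)]

# Sum earnings within each household; every row gets its household's total.
dt_Wages[, EarningsHH := sum(EarningsYearly), by = c('id_household')]

# Assign two columns at once, computed separately for each gender.
dt_Wages[, `:=`(EarningsGender = mean(EarningsYearly),
                WagesGender    = mean(Wages)),
         by = Gender]

dt_Wages
```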

Note 1: I recommend using the spacing in the figure above when assigning multiple columns. Thus, there is one new/updated column per line. After all new columns are written out, I then return one more time for the by statement. Note 2: Notice the difference between the by statements on lines 29 and 34. Either is correct when conditioning the by statement on only one variable. Let's say we had a panel and we wanted to find each household's income for each year. We would have to write by = c('id_household', 'Year').

Although the method on lines 32-34 works fine, assume that we don't care about individual or household income. Now, dt_Wages is full of a bunch of extra information that we don't need. (This is common in other settings such as aggregating data from the county to the state-level and we just want to work with the state-level data.) In this case, we don't want to create new columns. We want to create a new data.table. Luckily, the code is almost identical. There are two differences.

First, we need to change `:=` to a period followed by parentheses, .( ), which is data.table shorthand for list( ). Instead of creating new columns, this tells the data.table to perform the operations and "collapse" on the by statement. Second, we need to name a new data.table. Let's call it dt_GenderGap. Altogether, we now have dt_GenderGap = dt_Wages[, .(Earnings = mean(EarningsYearly), Wages = mean(Wages)), by = Gender]. Running this line will create a new item in the Data section of the Environment panel. If you click the dt_GenderGap line, your RStudio interface should look like this:
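As a sketch of the collapse step on its own (again rebuilding dt_Wages so the block is self-contained):

```r
library(data.table)

dt_Wages = data.table(Wages  = c(24, 28, 27, 28, 22, 25),
                      Gender = rep(c("Male", "Female"), each = 3))
dt_Wages[, EarningsYearly := 2000 * Wages]
dt_Wages[Gender == "Female", EarningsYearly := 0.95 * EarningsYearly]

# .() returns a new data.table with one row per value of Gender.
# dt_Wages itself is left untouched.
dt_GenderGap = dt_Wages[, .(Earnings = mean(EarningsYearly),
                            Wages    = mean(Wages)),
                        by = Gender]

dt_GenderGap   # two rows: Male, then Female
```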

As the last step, we'll now calculate female earnings as a percentage of male earnings to summarize the wage gap. To do this for earnings, we write dt_GenderGap[, 100*Earnings/Earnings[Gender == 'Male']]. Notice that selecting specific rows works in the i position as we showed earlier, as well as the j position as in this example. Here, we selected a specific value of Earnings to divide the entire Earnings column by. Also, notice that we did not use the assignment operator :=. Thus, R does not update dt_GenderGap. It simply prints the output of the question to the Console panel.

Write and run this line for the wages as well to calculate the conditional gap on hours worked. In the Console panel, you should see:
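Both calculations can be sketched as follows. In the text, the results are only printed. Here I store them under the hypothetical names pct_earnings and pct_wages so they are easy to round and inspect.

```r
library(data.table)

dt_Wages = data.table(Wages  = c(24, 28, 27, 28, 22, 25),
                      Gender = rep(c("Male", "Female"), each = 3))
dt_Wages[, EarningsYearly := 2000 * Wages]
dt_Wages[Gender == "Female", EarningsYearly := 0.95 * EarningsYearly]
dt_GenderGap = dt_Wages[, .(Earnings = mean(EarningsYearly),
                            Wages    = mean(Wages)),
                        by = Gender]

# Divide every value by the male value. No := is used, so nothing is updated.
pct_earnings = dt_GenderGap[, 100 * Earnings / Earnings[Gender == 'Male']]
pct_wages    = dt_GenderGap[, 100 * Wages / Wages[Gender == 'Male']]

round(pct_earnings, 1)   # male is 100; female is about 90.2
round(pct_wages, 1)      # male is 100; female is about 94.9
```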

In this example, females make 90.2 cents for every dollar a male makes throughout the year. After accounting for differences in hours worked, females make 94.9 cents for every dollar a male makes. Now we have the information to write our report on the gender wage gap. What if we wanted to save our code to work on it later?

Saving Files

This section is admittedly less about saving files in R, and more about organizational structure. To save a file, simply click "File" then "Save as" or click the floppy disk image like you would in any other program. Note: Clicking the "double floppy disks" will save all open files, so be careful if you only want to save one file.

If you don't have an organizational structure for your projects yet, I highly suggest you use something close to this:

  1. Code

    • Data

    • Empirics

    • Model

  2. Data

    • Analytic

    • Raw

    • Temp

  3. Output

  4. readme.txt

This is for projects that are in progress. I've seen a few other variants of this structure, so feel free to change things a little to what works best for you. For projects that are no longer in progress (replication files), you can likely get rid of the "Temp" subfolder and the "Output" folder. The output folder is just for you to save figures/tables and move them to a LaTeX document. Note: Sometimes I have no idea what I'm doing when I try to write code. If you run into this issue, it might be helpful to make a "Temp" subfolder under the "Code" folder. After you figure out what you're doing, move the .R file to the correct subdirectory.

In the example for this chapter, we would save our .R file to the "Empirics" subfolder under "Code". I'm still working on proper names for files, but I would probably call this something like "CalcGenderGap.R" which is short for calculate gender wage gap. Again, do your best to name your files something that you'll remember in six months. It also helps to keep your readme.txt file updated when you add new code or data. In case you do forget what a file does, you can always check there.