R How-to videos

Terminology/Basic syntax

Here are a few things that are helpful to know when getting started

  • Data frame : This is from Base R; this is used for storing data tables. Tidyverse makes a “tibble” which is supposed to be a slightly better version of a data frame.

  • Assignment operator (<-) is used to assign data to a data frame (Keyboard shortcuts: Option+- on a Mac or Alt+- on Windows)

  • Pipe operator (%>%) is used in Tidyverse to connect actions. (Cmd+Shift+M on a Mac or Control+Shift+M on Windows). See the Tools menu in RStudio for more on keyboard shortcuts.

  • df$columnname : This is how to be explicit about exactly which dataframe a column is coming from. It’s the name of the data frame with a dollar sign and then followed by the name of the column.

  • Stringr::sub_str(): This is how to be explicit about which package a function is coming from. In this case it’s the “stringr” package and sub_str is the name of a function from that package.

  • getwd() : How to find out what your working directory is. You can type this in any file (script, markdown, etc) or in the console.

Importing data

Readr package in Tidyverse comes with 5 “parsers” for importing different text file types.

read_csv() – for comma-delimited files

read_tsv() – for tab-delimited files

read_fwf() – for fixed-width files

read_log() for web log files

(Note that BaseR has a function called “read.csv” that functions very differently)

If you need to import an Excel file, you need the readxl package from Tidyverse.

There are other packages for other file types, including JSON or SPSS. There are packages to connect to various kinds of SQL servers. And there are also ways to get data via APIs.


The readxl package works a little differently than readr. For example, if you want to override the column types that it guesses on the way in you need to use this syntax:

This will set all the columns to text:
df<- read_xlsx('datafile.xlsx', col_types="text")

or to set each column:

df <- read_xlsx('datafile.xlsx', col_types=c("text", "text", "guess", "skip", "numeric", "date"))

Basics of dplyr()

In Tidyverse, the dplyr package is the primary tool you’ll use for data analysis. The syntax is very similar to Structured Query Language (SQL).

Below are example pieces of code to use as a reference. Anywhere it says “df” is where you would insert your dataframe or tibble name that holds your data. This is case-sensitive. Anywhere it says “column_name” is where you would insert the name of a column in your dataframe. This is case-sensitve. Anywhere it says “df_new”, the code is making a new column on the fly and you can name it whatever you want.

Select columns to view in output

df %>% select(column_name1, column_name2)

Filter which rows are shown in output
(
Note the double equal sign)

df %>% filter(column_name== 'value')

Summarize your data by groups (like a Pivot Table)

tablename %>% group_by(column_name)

Do some math as part of that summarizing

Count the records:

df %>%
group_by(column_name) %>%
summarize(new_column = n() )

Sum the values in a column:

df %>%
group_by(column_name) %>%
summarize(new_column = sum(column_name))

Average values in a column:

df %>%
group_by(column_name) %>%
summarize(new_column = mean(column_name) )

Note: for median you will need the stats package.

Sort your results by a column

This will sort in ascending order

df %>%
group_by(column_name) %>%
summarize(new_column = n() ) %>%
arrange(column_name)

To sort descending, change it to: arrange(desc(column_name))

Joining data

Joining is how you put two data frames together side-by-side, matching on one or more idnumbers or other columns that match in the two data frames. Your new data frame will have more columns, not necessarily more rows, depending on the join type. More info here.

Join types:

Inner_join – this will return only the records that match between the two tables

Left_join – this will return all the records from the first table listed, but only the ones that match from the second table listed.

Right_join – the opposite of a left_join. All records from second table.

Anti_join – this will return all the records from the first table that are NOT IN the second table.

Full_join – this will return all the records from both tables, regardless if they match

Basic join syntax

df_new <- inner_join(df1, df2, by=c("column_df1"= "column_df2") )

This basic syntax will put the two data frames together and will put all the columns (from both tables) into a new data frame. It will not repeat the “joining” column if they have the same name. If there are other columns with the same names, your new columns will be named with a “.x” and “.y” attached to the end of the name to identify which table they came from. X refers to the first table listed in the join; Y refers to the second.

If you want to only bring a few columns from the second table (and avoid having duplicate columns), you can do that as part of the join. Note that we include a pipe and select command before the comma and you MUST include any columns that are being used in the join.

df_new <- inner_join(df1, df2 %>% select(column_df2, col1, col2, col3) , by=c("column_df1"= "column_df2") )

Appending data

Typically you would use this when you have two data files – hopefully with all the same columns and column names – that you want in one data file. In other words, you’re adding the rows from one to the bottom of the other data frame.

Tidyverse has a function called “bind_rows” that will do this, but thankfully it doesn’t insist that the number of columns (and their names) be identical between the two data frames. Base R has a function called “rbind” that does require these to be exact. So bind_rows tends to be more flexible.

Here we’re going to make a new data frame by binding two existing data frames together:

df_new <- bind_rows(df1, df2)

Bind_rows will line up the columns in df1 with those in df2 that have the same name. If you have a field in df1 called “state” and a column in df2 called “statename”, bind_rows will put those in separate columns in your new data frame. And if you have a column that exists in one table but not the other, that will still show up as a column in your new data frame.

Adding columns/recoding

Data cleanup often involves fixing inconsistent values in your data. This video shows you how to use mutate() from dplyr() and case_when() -- which is the kind of like an IF statement -- to standardize values.


Also check out the stringr package from Tidyverse and the grep functions from Base R for some other useful data cleanup tools.

Exporting data

Sometimes after you've done some work in R, you might need to export some or all of your data for other uses. Perhaps there is a summarization you did that will be turned into a graphic in another piece of software. You can export any data frame to a .csv file or other formats. It's also easy to export to JSON using the JSONLite package.

The one I use most frequently is to export a .csv file. Here's the syntax:
write.csv( df, 'myfilename.csv', row.names=FALSE)

If you don't include the row.names argument, it will add a column at the front of the file numbering each of the rows.

If you want to specify that it goes into a particular subdirectory (for example, I often have an "output" subdirectory), you can do this:

write.csv( df, './output/myfilename.csv', row.names=FALSE)


Renaming columns

There are a couple different ways to rename a column. This syntax is what you’d likely use when you are importing or building a new data frame. For example, this would create a data frame with just 4 columns selected from the old one and it will rename col1 in the process:

df_new <- df %>%
select(col1, col2, col3, col4)%>%
rename(newname = col1)

If you just want to keep your data frame with all the columns, but rename one or more fields you can do this:

df <- df %>%
rename(newname1 = col1, newname2 = col2)

You can also rename on the fly in your analysis. In this example, the field is currently called “gender” but I want to change it to “sex” in my output (it won’t be saved to the dataframe).

df %>% select(sex=gender, firstname, lastname)

Replacing NULL values

If you have a column with NULL values (it will show up as NA), this might prohibit you from doing things like adding up the values in that column. Or perhaps you simply want those filled in so you have a value in there.

A simple way to set NULL values in a column to zero looks like this:

dfname$fieldname[is.na(dfname$fieldname)] <- 0

You could also use mutate() and a case_when() statement, especially if you're in a situation where you have some other codes to fix in that same column. This example sets the NULL values to "NO VALUE" and then it recodes the Y to YES and the N to NO, all within the same column, but note that it's putting the answers in a new column.

df <- df %>% mutate( newfieldname = case_when( is.na(fieldname) ~ 'NO VALUE', fieldname=='Y' ~ 'YES', fieldname=='N' ~ 'NO'))

It's also possible to replace all the NULL values in an entire table. This is not something you should use on a regular basis, but there might be occasions when it comes in handy. This replaces all the NULL values with zero.

df <- df %>% mutate_all(~replace(., is.na(.), 0))