Reshaping and manipulating data with tidyverse

The total length of the videos in this section is approximately 42 minutes. Feel free to do this in multiple sittings! You will also spend time running R code and answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

This lecture was created by QAI Intern Susie Black '20. So, you will hear her voice in the videos.

As you go through this module, compare and contrast these techniques with the ones shown in the Data Cleaning module. We are giving you multiple sets of tools so that you can choose what you prefer in various situations as you go forward.

Knowing how to reshape and manipulate data in R is an important skill to learn for working in data science. The tidyverse package in R is a collection of packages, including dplyr and tidyr, that allows you to easily transform and sift through large datasets. For more information regarding the use of tidyverse, you might bookmark this excellent online textbook. 

In this tutorial, we will be using functions from the packages dplyr and tidyr to reorganize a dataset from UNICEF that details information about the health of young women around the world. To access the dataset, follow the link here

Note, you should download the file and save the second tab, which is labeled "Long", as a csv file. This is the dataset we will be working through. 

As with all of the R modules in the course, you are not trying to memorize the functions introduced in these videos. Instead, your goal is to know that functions with these capabilities exist and be able to figure out how to use them when needed. Note that tidyverse is quite useful and becoming extremely common in data science.

Please download the following code file and run it as you watch the videos.

Introduction to tidyverse

ReshapingAndManipulatingData.1.Intro to Tidyverse.mp4

Question 0: Were you able to open tidyverse?

Click here for a few suggestions if tidyverse is not working

Question 1: What is a tibble?

Show answer

A type of modern data frame that makes working with tidyverse functions easier. A tibble is a type of data frame in R with unique subsetting and printing defaults. There are several advantages to saving your dataset as a tibble. A major one is that you cannot print an entire dataset in your console unless you convert your dataset back to a data frame. This is especially helpful when working with large datasets because you won't accidentally print thousands of rows of data at once. If you want to learn more about tibbles input vignette("tibble").

Introduction to dplyr

ReshapingAndManipulatingData.2.Introdution to Dplyr.mp4

The following question and some later questions refer to a data set called CPS85 that is part of the mosaicData package.

install.packages("mosaicData") # only if you have never used this package on this computer

library(mosaicData) # to make the package (and therefore the data) available 

?CPS85 # to read about the data

head(CPS85)

CPS<-as_tibble(CPS85)

CPS

Try to answer the questions without actually running code. Then, try running the code that's in the answer.


Question 2: Which set of commands would sort the rows in CPS by wage, in descending order, while also keeping the sector column? Check all that apply.

Show answer

The first and third options.

You will want to use the arrange function with wage to order wage. You will need to use desc inside the arrange function to change the default (ascending order). You will use select to choose the two columns sector and wage. The order of the functions select and arrange does not matter. The function select will select the columns in the data set you would like to print in your console. Try running these lines of code to check for yourself:

_

1.) select then arrange with wage

CPS %>%

select(sector, wage) %>%

arrange(desc(wage))

_

2.) select then arrange with sector

CPS %>%

select(sector, wage) %>%

arrange(desc(sector))

_

3.) arrange with wage then select

CPS %>%

arrange(desc(wage)) %>%

select(sector, wage)

_

4.) arrange, without desc, with sector and then select

CPS %>%

arrange(sector) %>%

select(sector, wage)

Notes:

CPSnew <- CPS %>% ....


CPS %>%

select(sector, wage) %>%

arrange(desc(wage)) %>%

print(n=50)

Introduction to tidyr

Next, before watching the introduction to tidyr video, please familiarize yourself with "long" and "wide" datasets and transforming between the two. 

The following example dataset provides information on the number of passengers who use different modes of transportation on a given day in three cities around the world. 

The tidyr function, spread, could be used to widen this long dataset by creating separate columns for each type of transportation. The spread function could be used to widen the data in this manner: 

The tidyr function gather could be used to elongate this wide dataset by combining separate modes of transportation into one column. The gather function could be used to elongate the data in this manner: 

ReshapingAndManipulatingData.3.Introduction to TidyR.mp4

Question 3: Suppose you have a dataset, d, that contains information about the number of kites flown on different dates. This dataset consists of two columns: Date and Kites. The date column includes the name of the month, followed by a comma, and then a number denoting the day. Which of the following R commands would you use to create two different columns for month and day?

Show answer

The first option. You would need to use the function "separate" to split a column on a character, such as a comma, into two separate columns. Spread is used to widen long data and gather is used to elongate wide data. Try running these lines of code to check for yourself:

_

Here is an example dataset:

Kites <- c(3, 4, 5, 7, 8)

Date <- c("Mar, 17", "May, 10", "Jul, 22", "Sep, 24", "Nov, 11")

d <- data.frame(Date, Kites)

d

_

1.) Separate

d %>%

separate(Date, c("Month", "Day"), sep = ",")

_

2.) Spread

d %>%

spread(Date, Kites, sep = ",")

_

3.) Gather

d%>%

gather(Date, Kites)

The line above does not change the data set, because gather is not a relevant action here.

Question 4: Which of the following R commands would you use to express information regarding month and day from the kite dataset in one column?

Show answer

The last option. The function "unite" allows you to combine information from two columns into one, separated by a character such as a comma.

Try running these lines of code to check for yourself.

_

Here is an example dataset:

Month <- c("Mar", "May", "Jul", "Sep", "Nov")

Day <- c(17, 10, 22, 24, 11)

d <- data.frame(Month, Day, Kites)

_

1.) Gather

d %>%

gather(Date, Month, Day, sep = ",")

The line doesn't do anything, because gather is not an appropriate action here.

_

2.) Spread

d %>%

spread(Date, Month, Day, sep = ",")

The line above gives an error, because spread is not an appropriate action here.

_

3.) Unite

d %>%

unite(Date, Month, Day, sep = ",")

A note: we have said that spread usually generates an error. Students often ask for an explanation of what the error is, exactly, and how we fix it.

The issue is that, depending on which rows you've kept in the data set, there may be ambiguity about which lines should be combined when you run spread. In the lecture video, the first 10 lines of the tibble sub_health look like this:

# A tibble: 31,821 x 4

   Location     Year Indicator Value

   <chr>       <int> <chr>     <int>

 1 Afghanistan  2015 ANC_1T       51

 2 Afghanistan  2015 ANC_1T       55

 3 Afghanistan  2015 ANC_1T       50

 4 Afghanistan  2015 ANC_1T       51

 5 Afghanistan  2015 ANC_1T       49

 6 Afghanistan  2015 ANC_1T       50

 7 Afghanistan  2015 ANC_1T       48

 8 Afghanistan  2015 ANC_1T       48

 9 Afghanistan  2015 ANC_1T       48

10 Afghanistan  2015 ANC_1T       48

# … with 31,811 more rows

All 10 of the lines shown are identical on Location, Year, and Indicator! Here is a different subset of the rows in sub_health:

sub_health[13:22,]

# A tibble: 10 x 4

   Location     Year Indicator Value

   <chr>       <int> <chr>     <int>

 1 Afghanistan  2015 ANC_BLOOD    24

 2 Afghanistan  2015 ANC_BLOOD    25

 3 Afghanistan  2015 ANC_BLOOD    28

 4 Afghanistan  2015 ANC_BLOOD    29

 5 Afghanistan  2015 ANC_BLOOD    30

 6 Afghanistan  2015 ANC_BLOOD    30

 7 Afghanistan  2015 ANC_BLOOD    31

 8 Afghanistan  2015 ANC_BLOOD    32

 9 Afghanistan  2015 ANC_BLOOD    30

10 Afghanistan  2015 ANC_BLOOD    30

When we spread on Indicator, there is going to be a column called ANC_1T and another column called ANC_BLOOD. But the spread function doesn't know which ANC_1T rows should be lined up with which ANC_BLOOD rows. That is the purpose of 

mutate(grouped_id = row_number())

The above line of code creates a new variable called grouped_id that gives different numbers to rows that are otherwise identical on the variables other than "Value":

sub_health%>% 

+ group_by(Indicator) %>% 

+ mutate(grouped_id = row_number())

# A tibble: 31,821 x 5

# Groups:   Indicator [26]

   Location     Year Indicator Value grouped_id

   <chr>       <int> <chr>     <int>      <int>

 1 Afghanistan  2015 ANC_1T       51          1

 2 Afghanistan  2015 ANC_1T       55          2

 3 Afghanistan  2015 ANC_1T       50          3

 4 Afghanistan  2015 ANC_1T       51          4

 5 Afghanistan  2015 ANC_1T       49          5

 6 Afghanistan  2015 ANC_1T       50          6

 7 Afghanistan  2015 ANC_1T       48          7

 8 Afghanistan  2015 ANC_1T       48          8

 9 Afghanistan  2015 ANC_1T       48          9

10 Afghanistan  2015 ANC_1T       48         10

and

A tibble: 10 x 5

# Groups:   Indicator [1]

   Location     Year Indicator Value grouped_id

   <chr>       <int> <chr>     <int>      <int>

 1 Afghanistan  2015 ANC_BLOOD    24          1

 2 Afghanistan  2015 ANC_BLOOD    25          2

 3 Afghanistan  2015 ANC_BLOOD    28          3

 4 Afghanistan  2015 ANC_BLOOD    29          4

 5 Afghanistan  2015 ANC_BLOOD    30          5

 6 Afghanistan  2015 ANC_BLOOD    30          6

 7 Afghanistan  2015 ANC_BLOOD    31          7

 8 Afghanistan  2015 ANC_BLOOD    32          8

 9 Afghanistan  2015 ANC_BLOOD    30          9

10 Afghanistan  2015 ANC_BLOOD    30         10

There is an assumption that we want to match the first ANC_1T row with the first ANC_BLOOD row, etc.

Then, when we spread, the rows are matched up according to this id number. See how the first row in the new data set has the 51 for ANC_1T and the 24 for ANC_BLOOD on the same line.

sub_health%>%

+ group_by(Indicator) %>% 

+ mutate(grouped_id = row_number()) %>% 

+ spread(Indicator, Value)

# A tibble: 22,041 x 29

   Location     Year grouped_id ANC_1T ANC_BLOOD ANC_BP

   <chr>       <int>      <int>  <int>     <int>  <int>

 1 Afghanistan  2015          1     51        24     74

 2 Afghanistan  2015          2     55        25     75

 3 Afghanistan  2015          3     50        28     77

 4 Afghanistan  2015          4     51        29     78

 5 Afghanistan  2015          5     49        30     78

 6 Afghanistan  2015          6     50        30     80

 7 Afghanistan  2015          7     48        31     78

 8 Afghanistan  2015          8     48        32     78

 9 Afghanistan  2015          9     48        30     78

10 Afghanistan  2015         10     48        30     79

Last, the line select(-grouped_id) removes the id variable when it's no longer needed.

Sometimes, there is no ambiguity about which lines should be combined, and you can run spread without adding and later removing an id variable. You will know when adding an id is needed because spread will give you an error when you omit the id.

Using dplyr with cleaned data

ReshapingAndManipulatingData.4.Dplyr with Cleaned Data.mp4

Question 5: The order of operations matters for which of these combination of dplyr functions? Check all that apply.

Show answer

Third and fourth options. The functions arrange and filter when paired with select will produce the same output regardless of function order. The order of select and mutate matter because the printed output will be different. You would need to add the name of the new column created with the mutate function in the select function if you use mutate before select. If you use summarise before select, R will not be able to pull the name(s) of the column(s) you are selecting. This will result in an error because the names aren't "known". 

Advanced functions in dplyr

After these videos were recorded, R replaced several of the outdated functions mentioned in the video (summarise_each, mutate_if, transmute, and others) with the function across. You can still watch the video below if you want, but more important is to look at the updated code at the end of the code file, to see how to use across.

ReshapingAndManipulatingData.5.Advanced Functions in Dplyr.mp4

Question 6: What code would you use to create a new dataset containing all of the mean values for the continuous variables in the CPS dataset, which are wage, educ, exper, and age

Show answer

CPS %>%

  summarise(across(c(wage, educ, exper, age), mean))

In my opinion, tidyverse is simpler than original R for this task.

And now you know how to tidy in tidyverse.

During this tutorial you learned:


Operators in review:

%>%


Base R functions in review:

names(), order() [note, default is in ascending order], print(), with(), tapply(), aggregate(), is.na()


Some tidyverse functions in review:

as_tibble(), filter(), select(), contains(), arrange() [note, default is in ascending order], desc(), spread() [note, updated function is pivot_wider()], gather() [note, updated function is pivot_longer()], group_by(), mutate(), row_number(), unite(..,sep=), separate(..., sep=), transmutate()