Reshaping and manipulating data with tidyverse

The total length of the videos in this section is approximately 42 minutes. Feel free to do this in multiple sittings! You will also spend time running R code and answering short questions while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

This lecture was created by QAI Intern Susie Black '20. So, you will hear her voice in the videos.

As you go through this module, compare and contrast these techniques with the ones shown in the Data Cleaning module. We are giving you multiple sets of tools so that you can choose what you prefer in various situations as you go forward.

Knowing how to reshape and manipulate data in R is an important skill to learn for working in data science. The tidyverse package in R is a collection of packages, including dplyr and tidyr, that allows you to easily transform and sift through large datasets. For more information regarding the use of tidyverse, you might bookmark this excellent online textbook.

In this tutorial, we will be using functions from the packages dplyr and tidyr to reorganize a dataset from UNICEF that details information about the health of young women around the world. To access the dataset, follow the link here.

Note, you should download the file and save the second tab, which is labeled "Long", as a csv file. This is the dataset we will be working through.

As with all of the R modules in the course, you are not trying to memorize the functions introduced in these videos. Instead, your goal is to know that functions with these capabilities exist and be able to figure out how to use them when needed. Note that tidyverse is quite useful and becoming extremely common in data science.

Please download the following code file and run it as you watch the videos.

Introduction to tidyverse

ReshapingAndManipulatingData.1.Intro to Tidyverse.mp4

Question 0: Were you able to open tidyverse?

Click here for a few suggestions if tidyverse is not working

If you are not already using RStudio, try downloading it and installing tidyverse there
Try directly installing the packages dplyr and/or tidyr. Installing tidyverse is just a shortcut for installing these two packages and a few others.
One student had success with this: install.packages("tidyverse",type="binary")
One student had success after restarting R
If the above suggestions don't work, and you are using MacOS Catalina, you could use the process outlined in this link. However, it is not straightforward, and I'd rather you didn't spend a lot of time on it.

Question 1: What is a tibble?

a subset of a dataset
a type of data frame
a vector

Show answer

A type of modern data frame that makes working with tidyverse functions easier. A tibble is a type of data frame in R with unique subsetting and printing defaults. There are several advantages to saving your dataset as a tibble. A major one is that you cannot print an entire dataset in your console unless you convert your dataset back to a data frame. This is especially helpful when working with large datasets because you won't accidentally print thousands of rows of data at once. If you want to learn more about tibbles input vignette("tibble").

Introduction to dplyr

ReshapingAndManipulatingData.2.Introdution to Dplyr.mp4

The following question and some later questions refer to a data set called CPS85 that is part of the mosaicData package.

install.packages("mosaicData") # only if you have never used this package on this computer

library(mosaicData) # to make the package (and therefore the data) available

?CPS85 # to read about the data

head(CPS85)

CPS<-as_tibble(CPS85)

CPS

Try to answer the questions without actually running code. Then, try running the code that's in the answer.

Question 2: Which set of commands would sort the rows in CPS by wage, in descending order, while also keeping the sector column? Check all that apply.

First, use the function select to select the columns for sector and wage. Then use the function arrange with desc to order by wage.
First, use the function select to select the columns for sector and wage. Then use the function arrange with the desc to order by sector.
First, use the function arrange with desc to order by wage. Then use the function select to view the columns for sector and wage.
First, use the function arrange to order by sector. Then use the function select to view the columns for sector and wage.

Show answer

The first and third options.

You will want to use the arrange function with wage to order wage. You will need to use desc inside the arrange function to change the default (ascending order). You will use select to choose the two columns sector and wage. The order of the functions select and arrange does not matter. The function select will select the columns in the data set you would like to print in your console. Try running these lines of code to check for yourself:

1.) select then arrange with wage

CPS %>%

select(sector, wage) %>%

arrange(desc(wage))

2.) select then arrange with sector

CPS %>%

select(sector, wage) %>%

arrange(desc(sector))

3.) arrange with wage then select

CPS %>%

arrange(desc(wage)) %>%

select(sector, wage)

4.) arrange, without desc, with sector and then select

CPS %>%

arrange(sector) %>%

select(sector, wage)

Notes:

The code you have looked at so far manipulates a data set and prints part of the new data set on the screen. However, if you want to use the new data set further, you should give it a name:

CPSnew <- CPS %>% ....

By default, the first 10 rows of a tibble are printed. However, you can ask for the first n rows by adding print:

CPS %>%

select(sector, wage) %>%

arrange(desc(wage)) %>%

print(n=50)

Introduction to tidyr

Next, before watching the introduction to tidyr video, please familiarize yourself with "long" and "wide" datasets and transforming between the two.

The following example dataset provides information on the number of passengers who use different modes of transportation on a given day in three cities around the world.

The tidyr function, spread, could be used to widen this long dataset by creating separate columns for each type of transportation. The spread function could be used to widen the data in this manner:

The tidyr function gather could be used to elongate this wide dataset by combining separate modes of transportation into one column. The gather function could be used to elongate the data in this manner:

ReshapingAndManipulatingData.3.Introduction to TidyR.mp4

Question 3: Suppose you have a dataset, d, that contains information about the number of kites flown on different dates. This dataset consists of two columns: Date and Kites. The date column includes the name of the month, followed by a comma, and then a number denoting the day. Which of the following R commands would you use to create two different columns for month and day?

d %>% separate(Date, c("Month", "Day"), sep = ",")
d %>% spread(Date, Kites, sep = ",")
d %>% gather(Date, Kites)

Show answer

The first option. You would need to use the function "separate" to split a column on a character, such as a comma, into two separate columns. Spread is used to widen long data and gather is used to elongate wide data. Try running these lines of code to check for yourself:

Here is an example dataset:

Kites <- c(3, 4, 5, 7, 8)

Date <- c("Mar, 17", "May, 10", "Jul, 22", "Sep, 24", "Nov, 11")

d <- data.frame(Date, Kites)

1.) Separate

d %>%

separate(Date, c("Month", "Day"), sep = ",")

2.) Spread

d %>%

spread(Date, Kites, sep = ",")

3.) Gather

d%>%

gather(Date, Kites)

The line above does not change the data set, because gather is not a relevant action here.

Question 4: Which of the following R commands would you use to express information regarding month and day from the kite dataset in one column?

d %>% gather(Date, Month, Day)
d %>% spread(Date, Month, Day, sep = ",")
d %>% unite(Date, Month, Day, sep = ",")

Show answer

The last option. The function "unite" allows you to combine information from two columns into one, separated by a character such as a comma.

Try running these lines of code to check for yourself.

Here is an example dataset:

Month <- c("Mar", "May", "Jul", "Sep", "Nov")

Day <- c(17, 10, 22, 24, 11)

d <- data.frame(Month, Day, Kites)

1.) Gather

d %>%

gather(Date, Month, Day, sep = ",")

The line doesn't do anything, because gather is not an appropriate action here.

2.) Spread

d %>%

spread(Date, Month, Day, sep = ",")

The line above gives an error, because spread is not an appropriate action here.

3.) Unite

d %>%

unite(Date, Month, Day, sep = ",")

A note: we have said that spread usually generates an error. Students often ask for an explanation of what the error is, exactly, and how we fix it.

The issue is that, depending on which rows you've kept in the data set, there may be ambiguity about which lines should be combined when you run spread. In the lecture video, the first 10 lines of the tibble sub_health look like this:

# A tibble: 31,821 x 4

Location Year Indicator Value

1 Afghanistan 2015 ANC_1T 51

2 Afghanistan 2015 ANC_1T 55

3 Afghanistan 2015 ANC_1T 50

4 Afghanistan 2015 ANC_1T 51

5 Afghanistan 2015 ANC_1T 49

6 Afghanistan 2015 ANC_1T 50

7 Afghanistan 2015 ANC_1T 48

8 Afghanistan 2015 ANC_1T 48

9 Afghanistan 2015 ANC_1T 48

10 Afghanistan 2015 ANC_1T 48

# … with 31,811 more rows

All 10 of the lines shown are identical on Location, Year, and Indicator! Here is a different subset of the rows in sub_health:

sub_health[13:22,]

# A tibble: 10 x 4

Location Year Indicator Value

1 Afghanistan 2015 ANC_BLOOD 24

2 Afghanistan 2015 ANC_BLOOD 25

3 Afghanistan 2015 ANC_BLOOD 28

4 Afghanistan 2015 ANC_BLOOD 29

5 Afghanistan 2015 ANC_BLOOD 30

6 Afghanistan 2015 ANC_BLOOD 30

7 Afghanistan 2015 ANC_BLOOD 31

8 Afghanistan 2015 ANC_BLOOD 32

9 Afghanistan 2015 ANC_BLOOD 30

10 Afghanistan 2015 ANC_BLOOD 30

When we spread on Indicator, there is going to be a column called ANC_1T and another column called ANC_BLOOD. But the spread function doesn't know which ANC_1T rows should be lined up with which ANC_BLOOD rows. That is the purpose of

mutate(grouped_id = row_number())

The above line of code creates a new variable called grouped_id that gives different numbers to rows that are otherwise identical on the variables other than "Value":

sub_health%>%

+ group_by(Indicator) %>%

+ mutate(grouped_id = row_number())

# A tibble: 31,821 x 5

# Groups: Indicator [26]

Location Year Indicator Value grouped_id

1 Afghanistan 2015 ANC_1T 51 1

2 Afghanistan 2015 ANC_1T 55 2

3 Afghanistan 2015 ANC_1T 50 3

4 Afghanistan 2015 ANC_1T 51 4

5 Afghanistan 2015 ANC_1T 49 5

6 Afghanistan 2015 ANC_1T 50 6

7 Afghanistan 2015 ANC_1T 48 7

8 Afghanistan 2015 ANC_1T 48 8

9 Afghanistan 2015 ANC_1T 48 9

10 Afghanistan 2015 ANC_1T 48 10

and

A tibble: 10 x 5

# Groups: Indicator [1]

Location Year Indicator Value grouped_id

1 Afghanistan 2015 ANC_BLOOD 24 1

2 Afghanistan 2015 ANC_BLOOD 25 2

3 Afghanistan 2015 ANC_BLOOD 28 3

4 Afghanistan 2015 ANC_BLOOD 29 4

5 Afghanistan 2015 ANC_BLOOD 30 5

6 Afghanistan 2015 ANC_BLOOD 30 6

7 Afghanistan 2015 ANC_BLOOD 31 7

8 Afghanistan 2015 ANC_BLOOD 32 8

9 Afghanistan 2015 ANC_BLOOD 30 9

10 Afghanistan 2015 ANC_BLOOD 30 10

There is an assumption that we want to match the first ANC_1T row with the first ANC_BLOOD row, etc.

Then, when we spread, the rows are matched up according to this id number. See how the first row in the new data set has the 51 for ANC_1T and the 24 for ANC_BLOOD on the same line.

sub_health%>%

+ group_by(Indicator) %>%

+ mutate(grouped_id = row_number()) %>%

+ spread(Indicator, Value)

# A tibble: 22,041 x 29

Location Year grouped_id ANC_1T ANC_BLOOD ANC_BP

1 Afghanistan 2015 1 51 24 74

2 Afghanistan 2015 2 55 25 75

3 Afghanistan 2015 3 50 28 77

4 Afghanistan 2015 4 51 29 78

5 Afghanistan 2015 5 49 30 78

6 Afghanistan 2015 6 50 30 80

7 Afghanistan 2015 7 48 31 78

8 Afghanistan 2015 8 48 32 78

9 Afghanistan 2015 9 48 30 78

10 Afghanistan 2015 10 48 30 79

Last, the line select(-grouped_id) removes the id variable when it's no longer needed.

Sometimes, there is no ambiguity about which lines should be combined, and you can run spread without adding and later removing an id variable. You will know when adding an id is needed because spread will give you an error when you omit the id.

Using dplyr with cleaned data

ReshapingAndManipulatingData.4.Dplyr with Cleaned Data.mp4

Question 5: The order of operations matters for which of these combination of dplyr functions? Check all that apply.

select and arrange
select and filter
select and mutate
group_by and summarise

Show answer

Third and fourth options. The functions arrange and filter when paired with select will produce the same output regardless of function order. The order of select and mutate matter because the printed output will be different. You would need to add the name of the new column created with the mutate function in the select function if you use mutate before select. If you use summarise before select, R will not be able to pull the name(s) of the column(s) you are selecting. This will result in an error because the names aren't "known".

Advanced functions in dplyr

After these videos were recorded, R replaced several of the outdated functions mentioned in the video (summarise_each, mutate_if, transmute, and others) with the function across. You can still watch the video below if you want, but more important is to look at the updated code at the end of the code file, to see how to use across.

ReshapingAndManipulatingData.5.Advanced Functions in Dplyr.mp4

Question 6: What code would you use to create a new dataset containing all of the mean values for the continuous variables in the CPS dataset, which are wage, educ, exper, and age?

Show answer

CPS %>%

summarise(across(c(wage, educ, exper, age), mean))

In my opinion, tidyverse is simpler than original R for this task.

And now you know how to tidy in tidyverse.

During this tutorial you learned:

How to manage and reshape data with ‘dplyr’ and ‘tidyr’ packages, which are included in the ‘tidyverse’ package
About storing data in tibble format
How to rename variables in a tibble
How to pipe data using %>% operator
To subset columns with select()
To subset rows by a condition or column value with filter()
To sort data in ascending or descending order
About long and wide data
To create a new variable with mutate() or transmutate()
To manipulate string variables using unite() and separate()
How to calculate summary statistics with summarise()
How data manipulation operations in base R compare to the same operations in tidyverse

Operators in review:

%>%

Base R functions in review:

names(), order() [note, default is in ascending order], print(), with(), tapply(), aggregate(), is.na()

Some tidyverse functions in review:

as_tibble(), filter(), select(), contains(), arrange() [note, default is in ascending order], desc(), spread() [note, updated function is pivot_wider()], gather() [note, updated function is pivot_longer()], group_by(), mutate(), row_number(), unite(..,sep=), separate(..., sep=), transmutate()