The total length of the videos in this section is approximately 45 minutes. Feel free to do this in multiple sittings! You will also spend time running R code and answering short questions while completing this section.
An outdated version of this module is available at the YouTube playlist linked here; the current version is not on YouTube. If you are completing this module as part of a course or program, please watch the videos below rather than the outdated YouTube videos.
As you go through this module, compare and contrast these techniques with the ones shown in the Data Cleaning module. We are giving you multiple sets of tools so that you can choose what you prefer in various situations as you go forward.
Knowing how to reshape and manipulate data in R is an important skill to learn for working in data science. The tidyverse package in R is a collection of packages, including dplyr and tidyr, that allows you to easily transform and sift through large datasets. For more information regarding the use of tidyverse, you might bookmark this excellent online textbook.
In this tutorial, we will be using functions from the packages dplyr and tidyr to reorganize a dataset from UNICEF that details information about the health of young women around the world. To access the dataset, follow the link here.
Note that you should download the file and save the second tab, which is labeled "Long", as a CSV file. This is the dataset we will be working with.
As with all of the R modules in the course, you are not trying to memorize the functions introduced in these videos. Instead, your goal is to know that functions with these capabilities exist and be able to figure out how to use them when needed. Note that tidyverse is quite useful and becoming extremely common in data science.
Please download the following code file and run it as you watch the videos.
Question 0: Were you able to open tidyverse?
If you are not already using RStudio, try downloading it and installing tidyverse there.
Try directly installing the packages dplyr and/or tidyr. Installing tidyverse is just a shortcut for installing these two packages and a few others.
One student had success with this: install.packages("tidyverse",type="binary")
One student had success after restarting R.
If the above suggestions don't work, and you are using MacOS Catalina, you could use the process outlined in this link. However, it is not straightforward, and I'd rather you didn't spend a lot of time on it.
Question 1: What is a tibble?
a subset of a dataset
a type of data frame
a vector
A type of modern data frame that makes working with tidyverse functions easier. A tibble is a type of data frame in R with its own subsetting and printing defaults. There are several advantages to saving your dataset as a tibble. A major one is that printing a tibble shows only the first 10 rows by default, so you won't accidentally print thousands of rows at once when working with a large dataset. To see more rows, you have to ask for them explicitly, for example with print(n = ...), or convert the tibble back to a data frame. If you want to learn more about tibbles, run vignette("tibble").
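As a rough illustration of the printing and subsetting defaults described above, here is a minimal sketch. The data frame df and its columns id and score are invented for this example and are not part of the course data.

```r
library(tibble)

# Hypothetical data: 100 rows, invented purely for illustration
df <- data.frame(id = 1:100, score = (1:100) / 2)
tb <- as_tibble(df)

# A tibble is still a data frame, with extra classes layered on top
class(tb)
is.data.frame(tb)

# Printing a tibble shows only the first 10 rows by default
tb

# Ask for more rows explicitly; n = Inf would print every row
print(tb, n = 15)

# Subsetting differs: selecting one column with [ , ] keeps a tibble,
# whereas a base data frame drops to a plain vector
class(df[, "score"])
class(tb[, "score"])

# Convert back to a plain data frame if needed
df_again <- as.data.frame(tb)
class(df_again)
```

If you want a single column of a tibble as a vector, tb$score or tb[["score"]] should work the same way they do for a base data frame.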
The following question and some later questions refer to a data set called CPS85 that is part of the mosaicData package.
install.packages("mosaicData") # only if you have never used this package on this computer
library(mosaicData) # to make the package (and therefore the data) available
?CPS85 # to read about the data
head(CPS85)
CPS <- as_tibble(CPS85) # requires tidyverse (or tibble) to be loaded
CPS
Try to answer the questions without actually running code. Then, try running the code that's in the answer.
Question 2: Which set of commands would sort the rows in CPS by wage, in descending order, while also keeping the sector column? Check all that apply.
First, use the function select to select the columns for sector and wage. Then use the function arrange with desc to order by wage.
First, use the function select to select the columns for sector and wage. Then use the function arrange with desc to order by sector.
First, use the function arrange with desc to order by wage. Then use the function select to view the columns for sector and wage.
First, use the function arrange to order by sector. Then use the function select to view the columns for sector and wage.
The first and third options.
Use arrange with wage to sort the rows by wage, and wrap wage in desc inside arrange to change the default (ascending order). Use select to keep the two columns sector and wage. The order of select and arrange does not matter here; if select comes first, it just has to keep the variable you want to arrange on. Try running these lines of code to check for yourself:
_
1.) select then arrange with wage
CPS %>%
select(sector, wage) %>%
arrange(desc(wage))
_
2.) select then arrange with sector
CPS %>%
select(sector, wage) %>%
arrange(desc(sector))
_
3.) arrange with wage then select
CPS %>%
arrange(desc(wage)) %>%
select(sector, wage)
_
4.) arrange, without desc, with sector and then select
CPS %>%
arrange(sector) %>%
select(sector, wage)
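To see why the sort column has to survive select when select comes first, here is a small sketch. The tibble toy is a hypothetical stand-in that borrows the column names sector, wage, and age; its values are invented and it is not the CPS data.

```r
library(dplyr)

# Hypothetical stand-in for CPS; values are invented for illustration
toy <- tibble(
  sector = c("clerical", "manag", "service", "sales"),
  wage   = c(7.5, 12.0, 6.25, 9.0),
  age    = c(31, 45, 27, 38)
)

# Either order gives the same rows when the sort column is kept
a <- toy %>% select(sector, wage) %>% arrange(desc(wage))
b <- toy %>% arrange(desc(wage)) %>% select(sector, wage)
same_rows <- identical(a$sector, b$sector) && identical(a$wage, b$wage)
same_rows

# If select() drops wage first, arrange() can no longer find it.
# tryCatch() turns the resulting error into a message so the script keeps running.
dropped <- tryCatch(
  toy %>% select(sector, age) %>% arrange(desc(wage)),
  error = function(e) "error: wage is not available after select()"
)
dropped
```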
Notes:
The code you have looked at so far manipulates a data set and prints part of the new data set on the screen. However, if you want to use the new data set further, you should give it a name:
CPSnew <- CPS %>% ....
By default, the first 10 rows of a tibble are printed. However, you can ask for the first n rows by adding print:
CPS %>%
select(sector, wage) %>%
arrange(desc(wage)) %>%
print(n=50)
Next, before watching the introduction to tidyr video, please familiarize yourself with "long" and "wide" datasets and transforming between the two.
The following example dataset provides information on the number of passengers who use different modes of transportation on a given day in three cities around the world.
The tidyr function pivot_wider could be used to widen this long dataset by creating a separate column for each mode of transportation.
Conversely, the tidyr function pivot_longer could be used to lengthen the resulting wide dataset by combining the separate transportation columns back into one column.
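The two transformations might look roughly like the sketch below. The city names, transport modes, and passenger counts are invented for illustration, and the column names city, mode, and passengers are assumptions rather than the names used in the course example.

```r
library(dplyr)
library(tidyr)

# Hypothetical long data: one row per city and transport mode
long <- tribble(
  ~city,    ~mode,   ~passengers,
  "City A", "bus",   120,
  "City A", "train", 300,
  "City A", "bike",   45,
  "City B", "bus",    80,
  "City B", "train", 150,
  "City B", "bike",   60,
  "City C", "bus",   200,
  "City C", "train", 410,
  "City C", "bike",   30
)

# Long to wide: one row per city, one column per transport mode
wide <- long %>%
  pivot_wider(names_from = mode, values_from = passengers)
wide

# Wide to long: gather every column except city back into mode/passengers
back <- wide %>%
  pivot_longer(cols = -city, names_to = "mode", values_to = "passengers")
back
```

In the wide form each city appears once; in the long form each city appears once per transport mode.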
Question 3: Suppose you have a data set, d, that contains information about the number of kites flown on different dates. This data set consists of two columns: Date and Kites. The date column includes the name of the month, followed by a comma, and then a number denoting the day. Which of the following R commands would you use to create two different columns for month and day?
separate_wider_delim
unite
pivot_wider
pivot_longer
The first option. You would use the function separate_wider_delim to split a column on a character, such as a comma, into two separate columns. We know that pivot_wider is used to widen long data and pivot_longer is used to elongate wide data. Try running these lines of code to check for yourself:
Kites <- c(3, 4, 5, 7, 8)
Date <- c("Mar, 17", "May, 10", "Jul, 22", "Sep, 24", "Nov, 11")
d <- data.frame(Date, Kites)
d
d %>%
separate_wider_delim(Date, delim = ", ", names = c("Month", "Day")) # split on comma plus space so Day has no leading space
Question 4: Which of the following R commands would you use to express information regarding month and day from the kite dataset in one column?
separate_wider_delim
unite
pivot_wider
pivot_longer
Unite. The function unite allows you to combine information from two columns into one, separated by a character such as a comma.
Try running these lines of code to check for yourself.
Month <- c("Mar", "May", "Jul", "Sep", "Nov")
Day <- c(17, 10, 22, 24, 11)
d <- data.frame(Month, Day, Kites)
d
d %>%
unite(Date, Month, Day, sep = ",")
Question 5: The order of operations matters for which of these combinations of functions? Check all that apply.
select and mutate
group_by and summarise
Both.
The order of select and mutate matters because the output will be different. If you use mutate before select, you need to include the name of the new column created by mutate in select in order to keep it. If you use select before mutate, any column that mutate needs must be among those selected.
If you use summarise before group_by, R will not be able to find the name(s) of the grouping column(s), because summarise has already reduced the data to the summary values. This will result in an error.
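A minimal sketch of both ordering effects follows. The tibble toy and the new column wage_per_exper are hypothetical; only the column names sector, wage, and exper are borrowed from CPS, and the values are invented.

```r
library(dplyr)

# Hypothetical stand-in for CPS; values are invented for illustration
toy <- tibble(
  sector = c("clerical", "clerical", "manag", "manag"),
  wage   = c(7, 9, 12, 14),
  exper  = c(5, 10, 8, 20)
)

# mutate then select: the new column must be named in select() to be kept
m_then_s <- toy %>%
  mutate(wage_per_exper = wage / exper) %>%
  select(sector, wage_per_exper)

# select then mutate: fails here because select() dropped wage and exper
s_then_m <- tryCatch(
  toy %>% select(sector) %>% mutate(wage_per_exper = wage / exper),
  error = function(e) "error: mutate() cannot find the dropped columns"
)

# group_by then summarise: one mean wage per sector
by_sector <- toy %>%
  group_by(sector) %>%
  summarise(mean_wage = mean(wage))

# summarise then group_by: sector no longer exists after summarise()
wrong_order <- tryCatch(
  toy %>% summarise(mean_wage = mean(wage)) %>% group_by(sector),
  error = function(e) "error: group_by() cannot find sector"
)

m_then_s
s_then_m
by_sector
wrong_order
```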
Question 6: What code would you use to create a new dataset containing all of the mean values for the continuous variables in the CPS dataset, which are wage, educ, exper, and age?
CPS %>%
summarise(across(c(wage, educ, exper, age), mean))
In my opinion, tidyverse is simpler than base R for this task. I don't feel that way about most tasks.
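For comparison, one possible base R version of the same calculation is sketched below next to the tidyverse one. It runs on a small hypothetical data frame, toy, that borrows the CPS column names; the values are invented, and sapply is only one of several base R ways to do this.

```r
library(dplyr)

# Hypothetical stand-in for CPS; values are invented for illustration
toy <- data.frame(
  wage   = c(7.5, 12, 6.25, 9),
  educ   = c(12, 16, 10, 14),
  exper  = c(10, 20, 5, 15),
  age    = c(28, 42, 21, 35),
  sector = c("clerical", "manag", "service", "sales")
)
vars <- c("wage", "educ", "exper", "age")

# Base R: apply mean() to each chosen column; returns a named numeric vector
base_means <- sapply(toy[vars], mean)
base_means

# Tidyverse: returns a one-row data frame with one column per variable
tidy_means <- toy %>%
  summarise(across(all_of(vars), mean))
tidy_means
```

The two approaches should give the same numbers; the main difference is the shape of the result, a named vector versus a one-row data frame.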
And now you know how to tidy in tidyverse.
During this tutorial you learned:
How to manage and reshape data with ‘dplyr’ and ‘tidyr’ packages, which are included in the ‘tidyverse’ package
About storing data in tibble format
How to rename variables in a tibble
How to pipe data using the %>% operator
To subset columns with select()
To subset rows by a condition or column value with filter()
To sort data in ascending or descending order
About long and wide data, using pivot_wider and pivot_longer
To create a new variable with mutate()
To manipulate string variables using unite() and separate_wider_delim()
How to calculate summary statistics with summarise()
How data manipulation operations in base R compare to the same operations in tidyverse
Operators in review:
%>%
Base R functions in review:
order(), with(), tapply(), aggregate(), is.na()
Some tidyverse functions in review:
as_tibble(), filter(), select(), contains(), arrange(), desc(), pivot_wider(), pivot_longer(), group_by(), summarise(), mutate(), unite(..., sep=), separate_wider_delim(..., delim=)