This is the data wrangling part of the IBM Data Science with R specialization Capstone Project. The project's goal was to build an R Shiny application that predicts bike rental demand in each of five cities: New York, London, Paris, Suzhou, and Seoul. Below is what you actually see in the app when you look up the bike demand level for the next 5 days in Seoul.
You are welcome to play with my app here!
Building this app required applying all the skills I learned throughout the course: web scraping, data wrangling, EDA, predictive modeling, and R Shiny.
Today, I'll focus solely on the data wrangling part, which I think is one of the most important duties of a junior data analyst. This portfolio piece should showcase what I can do as a junior data analyst.
A look at one of the raw datasets I collected in the first part of the project, Web Scraping
Previously, I collected raw datasets from several different sources through web scraping.
One typical challenge of web scraping is that data extracted from HTML pages may contain unnecessary or inconsistently formatted information.
Therefore, I need to perform data wrangling tasks to improve data quality.
I'll be using stringr, dplyr and regular expressions to perform the following data-wrangling tasks:
Task 1: Standardize column names for all collected datasets
Task 2: Remove undesired reference links from the scraped bike-sharing systems dataset
Task 3: Extract only the numeric value from undesired text annotations
Task 4: Detect and handle missing values
Task 1: Standardize column names for all collected datasets (using stringr)
You can see in the picture above that the format of the column names in the dataset we collected is inconsistent.
To make the datasets easier to read for both humans and computer systems, I need to standardize their column names using the following naming convention:
Column names should be UPPERCASE
The word separator should be an underscore ("_"), for example, COLUMN_NAME
I'll use a for loop to do all the work at once.
library(stringr)  # for str_replace_all(), str_trim(), str_extract(), str_detect()
library(dplyr)    # for the pipes and mutate() used later on

dataset_list <- c(
  "C:/Users/yohan/Documents/R/IBM_Capstone/raw_bike_sharing_systems.csv",
  "C:/Users/yohan/Documents/R/IBM_Capstone/raw_cities_weather_forecast.csv",
  "C:/Users/yohan/Documents/R/IBM_Capstone/raw_seoul_bike_sharing.csv",
  "C:/Users/yohan/Documents/R/IBM_Capstone/raw_worldcities.csv"
)
file_path <- "C:/Users/yohan/Documents/R/IBM_Capstone"

for (dataset_name in dataset_list){
  # read the dataset
  dataset <- read.csv(dataset_name)
  # convert all column names to uppercase using the toupper() function
  colnames(dataset) <- toupper(colnames(dataset))
  # replace any white-space separators with underscores, using the str_replace_all() function
  colnames(dataset) <- str_replace_all(colnames(dataset), " ", "_")
  # generate the full output file path dynamically
  output_file <- file.path(file_path, basename(dataset_name))
  # save the standardized dataset
  write.csv(dataset, file = output_file, row.names = FALSE)
}
We can check whether the changes were made as intended with the code below.
for (dataset_name in dataset_list){
  dataset <- read.csv(dataset_name)
  print(colnames(dataset))
}
Task 2: Clean up the values from the web-scraped dataset (using Regex)
Now, I'm going to clean up the values in the bike-sharing systems dataset we just standardized.
bike_sharing_df <- read.csv("C:/Users/yohan/Documents/R/IBM_Capstone/raw_bike_sharing_systems.csv")
As mentioned previously, the challenge of web scraping is that data extracted from HTML pages may contain unnecessary or inconsistently formatted information. For example:
Textual annotations in numeric fields: 1000 (Updated with 1050), initially 800 (later 2500)
Attached reference links: 9 October 2016[123], 500[72]
Let's take care of reference links - "9 October 2016[123]", "500[72]" - in all relevant columns.
First, I'll define a regex pattern to search for those reference links.
ref_pattern <- "\\[[A-Za-z0-9]+\\]" # matches bracketed references like [ABC123] and [xyz789]
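As a quick sanity check, here is what the pattern picks out of the two example strings above:
str_extract(c("9 October 2016[123]", "500[72]"), ref_pattern)
# [1] "[123]" "[72]"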
Now, let’s write a custom function using str_replace_all() to replace each reference link with a space, then trim off the leftover whitespace with str_trim().
remove_ref <- function(strings) {
  ref_pattern <- "\\[[A-Za-z0-9]+\\]"
  # replace each reference link with a space, then trim leading/trailing whitespace
  result <- str_replace_all(strings, ref_pattern, " ")
  result <- str_trim(result)
  return(result)
}
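Running the function on the example strings confirms it behaves as expected:
remove_ref(c("9 October 2016[123]", "500[72]"))
# [1] "9 October 2016" "500"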
Then, use the mutate() function to apply our custom remove_ref() function to the CITY, SYSTEM, and BICYCLES columns we're interested in.
bike_sharing_df_ref_removed <- bike_sharing_df %>%
  mutate(
    CITY = remove_ref(CITY),
    SYSTEM = remove_ref(SYSTEM),
    BICYCLES = remove_ref(BICYCLES)
  )
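The verification step below relies on a helper, find_reference_pattern(), that flags strings still containing a reference link. A minimal version, built on stringr's str_detect() and the ref_pattern defined earlier:
find_reference_pattern <- function(strings) {
  # TRUE when a string still contains something like [123]
  str_detect(strings, ref_pattern)
}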
Finally, let's check whether all reference links were removed with the following code.
bike_sharing_df_ref_removed %>%
  select(CITY, SYSTEM, BICYCLES) %>%
  filter(find_reference_pattern(CITY) | find_reference_pattern(SYSTEM) | find_reference_pattern(BICYCLES))
The result should be zero rows, meaning no reference links remain.
Task 3: Extract the numeric value (using Regex)
Unfortunately, we have another problem in the BICYCLES column.
The column is supposed to contain only numeric data: the number of bicycles.
However, some rows in this column contain not only attached reference links (which we just took care of in the previous task) but also textual annotations, such as "1000 (Updated with 1050)" and "initially 800 (later 2500)", wrapped around the numeric values we actually want.
When a column in your dataset has a textual annotation problem
I'm going to create a function that extracts the first number in each row of a given column. I'll also ensure that the resulting column has a numeric data type, using as.numeric().
extract_num <- function(columns){
  # define a pattern that matches the first run of digits
  digits_pattern <- "\\d+"
  # find the first match using str_extract()
  result <- str_extract(columns, digits_pattern)
  # convert the result to numeric
  result <- as.numeric(result)
  return(result)
}
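Applied to the two annotated examples from earlier, the function pulls out the first number in each string:
extract_num(c("1000 (Updated with 1050)", "initially 800 (later 2500)"))
# [1] 1000  800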
Let's apply the function to the BICYCLES column.
bike_sharing_df_num_extracted <- bike_sharing_df_ref_removed %>%
mutate(BICYCLES = extract_num(BICYCLES))
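The check that follows uses another helper, find_character(), to flag values that still contain non-digit characters. A minimal version might look like this:
find_character <- function(values){
  # TRUE when a value contains anything other than digits (or a decimal point)
  str_detect(as.character(values), "[^0-9.]")
}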
# check to see if we successfully extracted numbers in the BICYCLES column
bike_sharing_df_num_extracted %>% select(BICYCLES) %>% filter(find_character(BICYCLES))
Again, the result should be zero rows.
# check again to see if the BICYCLES column shows descriptive statistics (only possible when the column contains nothing but numeric values)
summary(bike_sharing_df_num_extracted$BICYCLES)
With only numeric values left in the column, summary() returns descriptive statistics for BICYCLES.
Task 4: Detect and handle missing values (using dplyr)
In this task, we will look at a different dataset: Seoul bike-sharing demand ("raw_seoul_bike_sharing.csv").
This was the core dataset when I built the predictive model, because it includes variables related to bicycle rental demand, such as temperature, humidity, wind speed, visibility, rainfall, snowfall, and time of day.
Here, we're going to look at two columns of numeric data type:
RENTED_BIKE_COUNT, the response variable
TEMPERATURE, one of the main predictor variables
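First, read the file into a dataframe (it lives in the same folder as the datasets from Task 1):
seoul_bike_df <- read.csv("C:/Users/yohan/Documents/R/IBM_Capstone/raw_seoul_bike_sharing.csv")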
seoul_bike_df (loaded into RStudio)
[Dropping NAs]
Let’s drop the missing values in the RENTED_BIKE_COUNT column. First, check how many NAs are in the column.
sum(is.na(seoul_bike_df$RENTED_BIKE_COUNT)) # checking the number of NAs
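Dividing by the number of rows gives the share of missing values directly:
mean(is.na(seoul_bike_df$RENTED_BIKE_COUNT)) # 295 / 8760, about 0.034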
The column has only about 3% missing values (295 / 8760), so we can safely drop the rows whose RENTED_BIKE_COUNT is NA.
seoul_bike_na_removed <- seoul_bike_df[!is.na(seoul_bike_df$RENTED_BIKE_COUNT), ] # filtering for rows without NAs in the RENTED_BIKE_COUNT column
sum(is.na(seoul_bike_na_removed$RENTED_BIKE_COUNT)) # check again to see if all the NAs were removed successfully
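Since this task is about dplyr, note that the same drop can be written more idiomatically with filter():
seoul_bike_na_removed <- seoul_bike_df %>%
  filter(!is.na(RENTED_BIKE_COUNT))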
[Replacing NAs]
We're moving on to processing missing values for the TEMPERATURE column. Let's first take a look.
seoul_bike_na_removed %>% filter(is.na(TEMPERATURE))
The NAs in the TEMPERATURE column seem to occur only in the Summer season. Let's confirm it.
seoul_bike_na_removed %>%
  filter(is.na(TEMPERATURE)) %>%
  select(SEASONS) %>%
  distinct()
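Another way to see the same thing is to count missing temperatures by season; every season other than Summer should come back with zero:
seoul_bike_na_removed %>%
  group_by(SEASONS) %>%
  summarise(missing_temperature = sum(is.na(TEMPERATURE)))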
It is reasonable to impute those missing values (NAs) with the summer average temperature. Let’s first calculate the average temperature for the ‘Summer’ season.
avg_temp_summer <- mean(seoul_bike_na_removed$TEMPERATURE[seoul_bike_na_removed$SEASONS == "Summer"], na.rm = TRUE)
Now, let’s replace the NAs in the TEMPERATURE column with the ‘avg_temp_summer’ value we just calculated. Note that we only overwrite the missing entries (we already confirmed they all fall in Summer), not every summer temperature.
seoul_bike_na_removed$TEMPERATURE[is.na(seoul_bike_na_removed$TEMPERATURE)] <- avg_temp_summer
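If you prefer to stay in a dplyr pipe, the same replacement can be written with mutate() and ifelse():
seoul_bike_na_removed <- seoul_bike_na_removed %>%
  mutate(TEMPERATURE = ifelse(is.na(TEMPERATURE), avg_temp_summer, TEMPERATURE))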
# check if we still have NAs in the TEMPERATURE column
seoul_bike_na_removed %>% filter(is.na(TEMPERATURE))