photo by me
My wife is in the music concert business. She specializes in K-pop tours. She travels with K-pop bands from city to city, sometimes away from home for over a month.
She takes care of the tour team on the road. But, that's not her only job - one of her most important duties is securing the venue before taking off for the tour.
Photo by Marvin Meyer from Unsplash
That means she must research the venues before discussing potential events with the venue managers. I thought, "I can help her with that."
So, here's what I'm going to do now: build an interactive map with all the venues, where she can find basic information about the venue – capacity, main usage, major tenant, etc. - by simply clicking on the point.
Yes, this exactly looks like what I'm trying to make here.
According to my wife, her colleagues identify venues as ‘stadiums’ or ‘arenas.’ Stadiums typically have a larger seating capacity and do not have roofs, while arenas are more like indoor spaces with a smaller capacity.
I want to use Wikipedia pages as my data source, and luckily, there are perfect ones: ‘List of U.S. stadiums by capacity’ and ‘List of Indoor arenas in the United States.’
So, somebody else will be doing the hardest part of the process for me, which is actually updating the data tables for those Wikipedia pages. All I have to do is perform web scraping to get the data from the tables for my map and make sure my map updates following updates in Wikipedia tables.
First, I loaded the “tidyverse” package as an essential toolkit. And then, I installed and loaded the “rvest” package to scrape the data. Then, I found the URL of the Wikipedia page and identify the table you want to extract.
URL 1: List of U.S. stadiums by capacity
url1 <- "https://en.wikipedia.org/wiki/List_of_U.S._stadiums_by_capacity"
web_page1 <- read_html(url1)
tables1 <- web_page1 %>% html_table(fill = TRUE)
URL 2: List of indoor arenas in the United States
url2 <- "https://en.wikipedia.org/wiki/List_of_indoor_arenas_in_the_United_States"
web_page2 <- read_html(url2)
tables2 <- web_page2 %>% html_table(fill = TRUE)
I found one table on URL 1 and three on URL 2. Therefore, I extracted a total of four tables and loaded them into R as data frames.
stadiums_table <- tables1[[1]]
arena_table1 <- tables2[[2]]
arena_table2 <- tables2[[4]]
arena_table3 <- tables2[[5]]
Yes, I’m going to merge all four tables eventually. But I want to make sure they have the same column names and data types under each column so that I can use rbind() to merge them all.
At the end of Step 2, I want all tables to have those 8 columns.
name: name of a stadium/arena
city: the city where a stadium/arena is located
state: the state where a stadium/arena is located
capacity: the maximum number of seats in the stadium/arena
major_tenant: the organization that mainly uses the stadium/arena. It can be professional sports teams, school sports teams, or both.
main_usage: The purpose for which this venue was built. Typically, stadiums are built for soccer, football, and baseball games, and arenas for basketball and hockey games.
year_opened: the year when the stadium/arena first opened.
type: ‘Stadium’ or ‘arena’ will be entered in advance to be category values when later merged.
Based on the 8 columns listed above, I used the select() function to maintain the columns I needed in each table and the rename() function to ensure each column had the same name across the tables.
However, the arena Wiki page's first and second tables didn’t have the main_usage column. Instead, they had ‘Basket.’ and ‘Hockey’ columns to indicate the seats available for each sporting event.
So, I decided to use the 'Basket.' and 'Hockey' columns to create the main_usage column. Here's how I did it exactly: I made the new column, ‘main_usage’ to have value ‘Basketball, Hockey’ when both columns in a row have values, ‘Basketball’ when only ‘Basketball’ column has value, and ‘Hockey’ when only ‘Hockey’ column has value. It'll make sense when you see the code.
arena_table1 <- arena_table1 %>%
mutate(main_usage = case_when(
Basket.!="" & Hockey =="" ~ "Basketball",
Basket.=="" & Hockey!="" ~ "Hockey",
Basket.!="" & Hockey!="" ~ "Basketball, Hockey"
))
arena_table2 <- arena_table2 %>%
mutate(main_usage = case_when(
Basket.!="" & Hockey =="" ~ "Basketball",
Basket.=="" & Hockey!="" ~ "Hockey",
Basket.!="" & Hockey!="" ~ "Basketball, Hockey"
))
This is a screenshot of the arena_table2. The main_usage column has the values 'Basketball,' 'Hockey,' or 'Basketball, Hockey.'
I also created the new 'type' column for all four tables before merging so each data point could identify with a stadium or an arena after merging.
stadiums_table$type <- "stadium"
arena_table1$type <- "arena"
arena_table2$type <- "arena"
arena_table3$type <- "arena"
Now, let’s merge tables using the rbind() function.
venues <- rbind(stadiums_table, arena_table1, arena_table2, arena_table3)
head(venues)
str(venues)
Previously, I merged four Wikipedia tables into a single data frame called 'venues' to include 1,200 stadiums and arenas across the United States.
As you can see from the images above - snapshots showing the first several rows by head() and str() - we have a few things to do to clean our data.
The first thing I want to do is take care of markers like [1], [2], [3] in the ‘name’ column, which indicates that a comment exists for each item within the Wikipedia pages.
# Remove markers ([1], [2], [3]...) in the 'name' column
venues$name <- gsub("\\[\\d+\\]", "", venues$name)
The next thing to do is remove another marker, such as [citation needed], and the plus(+) signs next to some numbers in the 'capacity' column. Let's try to remove [1], [2], [3] markers again in case they're also used in the 'capacity' column.
# Remove markers and other signs in the 'capacity' column
venues$capacity <- gsub("\\[\\d+\\]", "", venues$capacity)
venues$capacity <- gsub("\\[citation needed\\]", "", venues$capacity)
venues$capacity <- gsub("\\+$", "", venues$capacity)
Also, I want to remove thousand-separating commas in numbers(in fact, they're number-like characters) from the 'capacity' column.
# Remove thousand-separating commas in 'capacity' numbers
venues$capacity <- gsub(",", "", venues$capacity)
Now, I can change the data type of the ‘capacity’ column to numeric.
# Change the data type of the 'capacity' column (character >>> numeric)
venues$capacity <- as.numeric(venues$capacity)
We can now see [1], [2], [3] markers are removed from the 'name' column.
Also, the data type in the 'capacity' column has changed to number, and thousand-separating commas have been removed.
Remember, I’m trying to build a map here, and I'm going to create an interactive map with Leaflet. But no matter what library you use to make a map, you’re going to need geocodes for your data points anyway.
So, for this geocoding part, I will use the Google Maps API. To do this, I must install the “ggmap” packages and register a Google Maps API key in R, which I can acquire from the Google Cloud Platform.
install.packages(“ggmap”)
library(ggmap)
register_google(key = "your_google_maps_api_keys_here")
Now, we can run the geocode() function for geocoding. The geocode() function requires address information that looks like ‘Michigan Stadium, Michigan.’
Since I already have both the name and the state information in the 'venues' data frame, I can just combine the two to create the exact address form that can be fed into the geocode() function.
venues$full_address <- paste(venues$name, venues$state, sep = ", ")
With the code above, I can create a new column, ‘full_address,’ to store the address in the form that can be fed into the geocode() function.
Now, let's use the geocode() function to get coordinates (latitude and longitude) for each venue.
geocode_results <- geocode(venues$full_address)
Executing the geocode() function results in a new data frame, 'geocode_result,' which contains the coordinates (lat and long) of each venue.
We need to feed the coordinates from the 'geocode_result' back to our 'venues' data frame with the codes below:
venues$latitude <- geocode_results$lat
venues$longitude <- geocode_results$lon
We have added two more variables ('latitude' and 'longitude' columns) to our 'venues' data frame.
Now, our data frame 'venues' has everything it needs to turn into an interactive map.
The leaflet() package in R is known to have several benefits for building interactive maps, which make it popular among data analysts and developers.
Personally, the biggest advantage is that it does not require an API key to access its default maps, which is a good starting point.
We can get into it simply by installing the “leaflet” packages.
install.packages("leaflet")
library(leaflet)
For now, I would like to keep the map as simple as possible; keep it very close to the default form, with little customization.
The only thing I'm going to change is the colors. I want different types of venues to appear in different colors on the map. As you know, there are only two types of venues from the 'type' variable: stadium and arena. I want stadiums to appear in dark green and arenas in dark red.
pal <- colorFactor(palette = c("darkred", "darkgreen"), domain = venues$type) # applying different colors to the data points by the type of the venue
Well, that's about everything I wanted to make sure before writing the main code with leaflet() function.
Here's the main code, and we're completing the entire task with this:
leaflet(data = venues_us) %>%
addTiles() %>% # adding default OpenStreetMap tiles
addCircleMarkers(~longitude, ~latitude,
popup = ~paste("<b>Name:</b>", name, "<b>Type:</b>", type, "<br><b>Capacity:</b>", capacity, "<br><b>Main Usage:</b>", main_usage, "<br><b>Major Tenant:</b>", major_tenant), # information displaying on click
label = ~paste(name, "(", main_usage, ")"), # information displaying on hover
radius = ~capacity/20000, # different marker circle sizes by capacity
color = ~pal(type), # different marker circle colors by venue type (stadium or arena)
fillOpacity = 1) %>%
setView(lng = -95.7129, lat = 37.0902, zoom = 4) %>% # initial view centered on the United States
addLegend("bottomright", pal = pal, values = ~type, title = "Venue Type") # adding a legend to show the color categories (stadium - dark green, arena - dark red)
The initial view of the map centered on the U.S.
Still, you can zoom in on any city in the world.
You can see the name and main usage of the venue
by hovering over a venue point.
And then you click on the point,
You can see the detailed information about the venue.