We did the following to clean our datset:
Changed our variables names so that it is shorter, but still conveys the information.
Spaces in variable names were removed.
Since there were 350,000 rows, observations with lesser number of people coming in were neglected (less than 50).
Date column had two pieces of information. Time and Date. Time information was removed because it was same for all observations and it provided no additional information.
There were two unique values of Border variable. These two values were changed to a shorter name. For example, US-Canada Border was changed to US-Canada.
Any rows with missing values for variables were ignored. For example, if the value was empty.
Made a rough graph to get an idea of which border had the most number of border crossings. Found that US-Mexico has the most border crossings.
Made a graph to see the pattern of number of border crossings from 1996-2020. Found less people are coming in recent years as compared to the past.
Group Project: Data Exploration
Group X
10/28/2020
library(tidyverse)
## -- Attaching packages -------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
Border_Crossing_Entry_Data <- read_csv("Border_Crossing_Entry_Data.csv")
## Parsed with column specification:
## cols(
## `Port Name` = col_character(),
## State = col_character(),
## `Port Code` = col_double(),
## Border = col_character(),
## Date = col_character(),
## Measure = col_character(),
## Value = col_double()
## )
View(Border_Crossing_Entry_Data)
write.csv(Border_Crossing_Entry_Data,"border.csv")
bdata <- Border_Crossing_Entry_Data
head(bdata)
## # A tibble: 6 x 7
## `Port Name` State `Port Code` Border Date Measure Value
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicle~ 1414
## 2 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 412
## 4 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 122
## 5 Alcan AK 3104 US-Canada B~ 2/1/2020~ Trucks 545
## 6 Alexandria B~ NY 708 US-Canada B~ 2/1/2020~ Bus Passengers 1174
summary(bdata)
## Port Name State Port Code Border
## Length:355511 Length:355511 Min. : 101 Length:355511
## Class :character Class :character 1st Qu.:2304 Class :character
## Mode :character Mode :character Median :3013 Mode :character
## Mean :2454
## 3rd Qu.:3402
## Max. :4105
## Date Measure Value
## Length:355511 Length:355511 Min. : 0
## Class :character Class :character 1st Qu.: 0
## Mode :character Mode :character Median : 100
## Mean : 28448
## 3rd Qu.: 2598
## Max. :4447374
head(bdata)
## # A tibble: 6 x 7
## `Port Name` State `Port Code` Border Date Measure Value
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicle~ 1414
## 2 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 412
## 4 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 122
## 5 Alcan AK 3104 US-Canada B~ 2/1/2020~ Trucks 545
## 6 Alexandria B~ NY 708 US-Canada B~ 2/1/2020~ Bus Passengers 1174
cleaned_data <- bdata %>%
rename(Name = `Port Name`) %>%
rename(Code = `Port Code`) %>%
rename(Transportation = Measure) %>%
rename(Count = Value)
glimpse(cleaned_data)
## Rows: 355,511
## Columns: 7
## $ Name <chr> "Alcan", "Alcan", "Alcan", "Alcan", "Alcan", "Alexan...
## $ State <chr> "AK", "AK", "AK", "AK", "AK", "NY", "NY", "NY", "NY"...
## $ Code <dbl> 3104, 3104, 3104, 3104, 3104, 708, 708, 708, 708, 70...
## $ Border <chr> "US-Canada Border", "US-Canada Border", "US-Canada B...
## $ Date <chr> "2/1/2020 00:00", "2/1/2020 00:00", "2/1/2020 00:00"...
## $ Transportation <chr> "Personal Vehicle Passengers", "Personal Vehicles", ...
## $ Count <dbl> 1414, 763, 412, 122, 545, 1174, 36, 68630, 31696, 18...
cleaned_data <- cleaned_data %>%
filter(Count > 50)
glimpse((cleaned_data))
## Rows: 192,189
## Columns: 7
## $ Name <chr> "Alcan", "Alcan", "Alcan", "Alcan", "Alcan", "Alexan...
## $ State <chr> "AK", "AK", "AK", "AK", "AK", "NY", "NY", "NY", "NY"...
## $ Code <dbl> 3104, 3104, 3104, 3104, 3104, 708, 708, 708, 708, 70...
## $ Border <chr> "US-Canada Border", "US-Canada Border", "US-Canada B...
## $ Date <chr> "2/1/2020 00:00", "2/1/2020 00:00", "2/1/2020 00:00"...
## $ Transportation <chr> "Personal Vehicle Passengers", "Personal Vehicles", ...
## $ Count <dbl> 1414, 763, 412, 122, 545, 1174, 68630, 31696, 1875, ...
head(cleaned_data %>% group_by(Border) %>% summarise(Count = n()))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Border Count
## <chr> <int>
## 1 US-Canada Border 138703
## 2 US-Mexico Border 53486
cleaned_data <- cleaned_data %>%
mutate(Border = case_when(Border == "US-Canada Border" ~ "US-Canada", Border == "US-Mexico Border" ~ "US-Mexico"))
summary(cleaned_data)
## Name State Code Border
## Length:192189 Length:192189 Min. : 101 Length:192189
## Class :character Class :character 1st Qu.:2301 Class :character
## Mode :character Mode :character Median :2604 Mode :character
## Mean :2372
## 3rd Qu.:3325
## Max. :4105
## Date Transportation Count
## Length:192189 Length:192189 Min. : 51
## Class :character Class :character 1st Qu.: 401
## Mode :character Mode :character Median : 1985
## Mean : 52618
## 3rd Qu.: 14700
## Max. :4447374
cleaned_data <- na.omit(cleaned_data)
cleaned_data <- cleaned_data %>% separate(Date, into = c("Date", "Time"), sep =" ")
cleaned_data <- cleaned_data %>% select(-Time)
head(cleaned_data)
## # A tibble: 6 x 7
## Name State Code Border Date Transportation Count
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Alcan AK 3104 US-Canada 2/1/2020 Personal Vehicle Passenge~ 1414
## 2 Alcan AK 3104 US-Canada 2/1/2020 Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada 2/1/2020 Truck Containers Empty 412
## 4 Alcan AK 3104 US-Canada 2/1/2020 Truck Containers Full 122
## 5 Alcan AK 3104 US-Canada 2/1/2020 Trucks 545
## 6 Alexandria Bay NY 708 US-Canada 2/1/2020 Bus Passengers 1174
cleaned_data <- cleaned_data %>% separate(Date, into = c("Month", "Day", "Year"), sep ="/")
cleaned_data <- cleaned_data %>% select(-Month, -Day)
cleaned_data$Year <- as.numeric(cleaned_data$Year)
head(cleaned_data)
## # A tibble: 6 x 7
## Name State Code Border Year Transportation Count
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 Alcan AK 3104 US-Canada 2020 Personal Vehicle Passengers 1414
## 2 Alcan AK 3104 US-Canada 2020 Personal Vehicles 763
## 3 Alcan AK 3104 US-Canada 2020 Truck Containers Empty 412
## 4 Alcan AK 3104 US-Canada 2020 Truck Containers Full 122
## 5 Alcan AK 3104 US-Canada 2020 Trucks 545
## 6 Alexandria Bay NY 708 US-Canada 2020 Bus Passengers 1174
unique(cleaned_data$State) %>% length()
## [1] 15
cleaned_data %>% ggplot() + geom_point(mapping = aes(x=Border, y=Count))
From our preliminary analysis, we get an idea that there are more people coming from US-Mexico border than from US-Canada. We will continue our research with this information in consideration.