Data Exploration

Follow this link to see cleaned dataset.

Follow this link to see data cleaning RMD file.

We did the following to clean our datset:

Changed our variables names so that it is shorter, but still conveys the information.
Spaces in variable names were removed.
Since there were 350,000 rows, observations with lesser number of people coming in were neglected (less than 50).
Date column had two pieces of information. Time and Date. Time information was removed because it was same for all observations and it provided no additional information.
There were two unique values of Border variable. These two values were changed to a shorter name. For example, US-Canada Border was changed to US-Canada.
Any rows with missing values for variables were ignored. For example, if the value was empty.
Made a rough graph to get an idea of which border had the most number of border crossings. Found that US-Mexico has the most border crossings.
Made a graph to see the pattern of number of border crossings from 1996-2020. Found less people are coming in recent years as compared to the past.

Group Project: Data Exploration

Group X

10/28/2020

library(tidyverse)

## -- Attaching packages -------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2 v purrr 0.3.4

## v tibble 3.0.3 v dplyr 1.0.1

## v tidyr 1.1.1 v stringr 1.4.0

## v readr 1.3.1 v forcats 0.5.0

## -- Conflicts ----------------------------------- tidyverse_conflicts() --

## x dplyr::filter() masks stats::filter()

## x dplyr::lag() masks stats::lag()

Download the csv file from the url:“https://www.kaggle.com/divyansh22/us-border-crossing-data/download” and unzip it before running.

Border_Crossing_Entry_Data <- read_csv("Border_Crossing_Entry_Data.csv")

## Parsed with column specification:

## cols(

## `Port Name` = col_character(),

## State = col_character(),

## `Port Code` = col_double(),

## Border = col_character(),

## Date = col_character(),

## Measure = col_character(),

## Value = col_double()

## )

View(Border_Crossing_Entry_Data)

write.csv(Border_Crossing_Entry_Data,"border.csv")

bdata <- Border_Crossing_Entry_Data

Seeing a summary of original dataset

head(bdata)

## # A tibble: 6 x 7

## `Port Name` State `Port Code` Border Date Measure Value

## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>

## 1 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicle~ 1414

## 2 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicles 763

## 3 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 412

## 4 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 122

## 5 Alcan AK 3104 US-Canada B~ 2/1/2020~ Trucks 545

## 6 Alexandria B~ NY 708 US-Canada B~ 2/1/2020~ Bus Passengers 1174

summary(bdata)

## Port Name State Port Code Border

## Length:355511 Length:355511 Min. : 101 Length:355511

## Class :character Class :character 1st Qu.:2304 Class :character

## Mode :character Mode :character Median :3013 Mode :character

## Mean :2454

## 3rd Qu.:3402

## Max. :4105

## Date Measure Value

## Length:355511 Length:355511 Min. : 0

## Class :character Class :character 1st Qu.: 0

## Mode :character Mode :character Median : 100

## Mean : 28448

## 3rd Qu.: 2598

## Max. :4447374

Changing the variable name of the dataset so that it is easier to read. Also removing spaces from variable name if any.

head(bdata)

## # A tibble: 6 x 7

## `Port Name` State `Port Code` Border Date Measure Value

## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>

## 1 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicle~ 1414

## 2 Alcan AK 3104 US-Canada B~ 2/1/2020~ Personal Vehicles 763

## 3 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 412

## 4 Alcan AK 3104 US-Canada B~ 2/1/2020~ Truck Containers~ 122

## 5 Alcan AK 3104 US-Canada B~ 2/1/2020~ Trucks 545

## 6 Alexandria B~ NY 708 US-Canada B~ 2/1/2020~ Bus Passengers 1174

cleaned_data <- bdata %>%

rename(Name = `Port Name`) %>%

rename(Code = `Port Code`) %>%

rename(Transportation = Measure) %>%

rename(Count = Value)

glimpse(cleaned_data)

## Rows: 355,511

## Columns: 7

## $ Name <chr> "Alcan", "Alcan", "Alcan", "Alcan", "Alcan", "Alexan...

## $ State <chr> "AK", "AK", "AK", "AK", "AK", "NY", "NY", "NY", "NY"...

## $ Code <dbl> 3104, 3104, 3104, 3104, 3104, 708, 708, 708, 708, 70...

## $ Border <chr> "US-Canada Border", "US-Canada Border", "US-Canada B...

## $ Date <chr> "2/1/2020 00:00", "2/1/2020 00:00", "2/1/2020 00:00"...

## $ Transportation <chr> "Personal Vehicle Passengers", "Personal Vehicles", ...

## $ Count <dbl> 1414, 763, 412, 122, 545, 1174, 36, 68630, 31696, 18...

Rows with smaller count values (less than 50) were eliminated because of the negligible number of people entering as compared to other rows

cleaned_data <- cleaned_data %>%

filter(Count > 50)

glimpse((cleaned_data))

## Rows: 192,189

## Columns: 7

## $ Name <chr> "Alcan", "Alcan", "Alcan", "Alcan", "Alcan", "Alexan...

## $ State <chr> "AK", "AK", "AK", "AK", "AK", "NY", "NY", "NY", "NY"...

## $ Code <dbl> 3104, 3104, 3104, 3104, 3104, 708, 708, 708, 708, 70...

## $ Border <chr> "US-Canada Border", "US-Canada Border", "US-Canada B...

## $ Date <chr> "2/1/2020 00:00", "2/1/2020 00:00", "2/1/2020 00:00"...

## $ Transportation <chr> "Personal Vehicle Passengers", "Personal Vehicles", ...

## $ Count <dbl> 1414, 763, 412, 122, 545, 1174, 68630, 31696, 1875, ...

Verifying that there are two borders mentioned in the dataset

head(cleaned_data %>% group_by(Border) %>% summarise(Count = n()))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2

## Border Count

## <chr> <int>

## 1 US-Canada Border 138703

## 2 US-Mexico Border 53486

Removing redundant information from Border column

cleaned_data <- cleaned_data %>%

mutate(Border = case_when(Border == "US-Canada Border" ~ "US-Canada", Border == "US-Mexico Border" ~ "US-Mexico"))

Seeing a summary of cleaned dataset

summary(cleaned_data)

## Name State Code Border

## Length:192189 Length:192189 Min. : 101 Length:192189

## Class :character Class :character 1st Qu.:2301 Class :character

## Mode :character Mode :character Median :2604 Mode :character

## Mean :2372

## 3rd Qu.:3325

## Max. :4105

## Date Transportation Count

## Length:192189 Length:192189 Min. : 51

## Class :character Class :character 1st Qu.: 401

## Mode :character Mode :character Median : 1985

## Mean : 52618

## 3rd Qu.: 14700

## Max. :4447374

Any rows with missing values for variables were ignored. For example, if the value was empty

cleaned_data <- na.omit(cleaned_data)

Removing time value from Date column because we don’t need it as it gives no information besides making data look dirty

cleaned_data <- cleaned_data %>% separate(Date, into = c("Date", "Time"), sep =" ")

cleaned_data <- cleaned_data %>% select(-Time)

head(cleaned_data)

## # A tibble: 6 x 7

## Name State Code Border Date Transportation Count

## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>

## 1 Alcan AK 3104 US-Canada 2/1/2020 Personal Vehicle Passenge~ 1414

## 2 Alcan AK 3104 US-Canada 2/1/2020 Personal Vehicles 763

## 3 Alcan AK 3104 US-Canada 2/1/2020 Truck Containers Empty 412

## 4 Alcan AK 3104 US-Canada 2/1/2020 Truck Containers Full 122

## 5 Alcan AK 3104 US-Canada 2/1/2020 Trucks 545

## 6 Alexandria Bay NY 708 US-Canada 2/1/2020 Bus Passengers 1174

Keeping only Year information because we don’t need Month and Day for the purpose of our study

cleaned_data <- cleaned_data %>% separate(Date, into = c("Month", "Day", "Year"), sep ="/")

cleaned_data <- cleaned_data %>% select(-Month, -Day)

cleaned_data$Year <- as.numeric(cleaned_data$Year)

head(cleaned_data)

## # A tibble: 6 x 7

## Name State Code Border Year Transportation Count

## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>

## 1 Alcan AK 3104 US-Canada 2020 Personal Vehicle Passengers 1414

## 2 Alcan AK 3104 US-Canada 2020 Personal Vehicles 763

## 3 Alcan AK 3104 US-Canada 2020 Truck Containers Empty 412

## 4 Alcan AK 3104 US-Canada 2020 Truck Containers Full 122

## 5 Alcan AK 3104 US-Canada 2020 Trucks 545

## 6 Alexandria Bay NY 708 US-Canada 2020 Bus Passengers 1174

How many borders are there

unique(cleaned_data$State) %>% length()

## [1] 15

Making scatterplot between numeric variables to determine relationships if any.

cleaned_data %>% ggplot() + geom_point(mapping = aes(x=Border, y=Count))

From our preliminary analysis, we get an idea that there are more people coming from US-Mexico border than from US-Canada. We will continue our research with this information in consideration.