Links to the data
Data Preparation
Pre-importation:
As per the project instructions, I whittled the 7 GB file down to roughly 1 GB by keeping only the specified columns. I did this with the PowerShell Import-Csv cmdlet, piping its output through Select-Object ("select") to pick the columns and Export-Csv to write the result. Here was the specific command:
"Import-Csv '.\Taxi.csv'| select "Trip Start Timestamp","Trip Seconds","Trip Miles","Pickup Community Area","Drop-off community Area","Company"|Export-Csv -Path .\Taxi2.csv -NoTypeInformation"
From there, I kept reducing the data as specified by the instructions by piping Import-Csv through the Where-Object ("Where") cmdlet:
Removing trips shorter than 0.5 miles: Import-Csv 'Taxi2.csv' | Where { [double]$_."Trip Miles" -gt 0.5 } | Export-Csv -Path 'Taxi3.csv' -NoTypeInformation
Removing trips longer than 100 miles: Import-Csv 'Taxi3.csv' | Where { [double]$_."Trip Miles" -lt 100 } | Export-Csv -Path 'Taxi4.csv' -NoTypeInformation
Removing trips shorter than 60 seconds: Import-Csv 'Taxi4.csv' | Where { [int]$_."Trip Seconds" -ge 60 } | Export-Csv -Path 'Taxi5.csv' -NoTypeInformation
Removing trips longer than 5 hours (18,000 seconds): Import-Csv 'Taxi5.csv' | Where { [int]$_."Trip Seconds" -le 18000 } | Export-Csv -Path 'Taxi6.csv' -NoTypeInformation
Removing trips that start outside of a Chicago community area: Import-Csv 'Taxi6.csv' | Where { $_."Pickup Community Area" } | Export-Csv -Path 'Taxi7.csv' -NoTypeInformation
Removing trips that end outside of a Chicago community area: Import-Csv 'Taxi7.csv' | Where { $_."Dropoff Community Area" } | Export-Csv -Path 'Taxi8.csv' -NoTypeInformation
Lastly, I split the final data file into multiple data files with 10,000 rows in each file, using a short Python script that I found out about from a video tutorial.
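The script itself is not reproduced here, but as a rough illustration of the same idea, the 10,000-row split could also be done in R with data.table (the file and folder names below are placeholders, not the ones I actually used):

library(data.table)

# Read the fully filtered file produced by the last Where-Object step above
taxi <- fread("Taxi8.csv")

# Write consecutive 10,000-row chunks out as separate CSV files
dir.create("split", showWarnings = FALSE)
chunk_size <- 10000
starts <- seq(1, nrow(taxi), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(taxi))
  fwrite(taxi[rows], sprintf("split/taxi_part_%03d.csv", i))
}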
Post-importation
Read in all of the split-up data files using the ldply function from the plyr library, with the pattern *.csv so that every file with the CSV extension would be read (a sketch of these read-in and clean-up steps appears after this list)
Used fread from the data.table library instead of read.csv because fread is considerably faster on files of this size
Named the combined, read-in data "Data"
Changed the column names of "Data" to replace spaces with underscores
Created additional date/time columns in "Data" for the timeframe filters by parsing the trip start timestamp with lubridate
Read in the Chicago community area boundary file using readOGR from the rgdal library and created a data frame version of it called "myspdf.df" (also sketched after this list)
Changed some of the "myspdf.df" column names so that it can be merged with "Data" easily
Merged the community area name column from "myspdf.df" into "Data"
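Here is a minimal sketch of the read-in and clean-up steps above, assuming the split files live in a "split" folder and that the timeframe columns come from Trip Start Timestamp (the folder name, timestamp format, and derived columns are assumptions rather than my exact code):

library(plyr)        # ldply
library(data.table)  # fread
library(lubridate)   # date/time parsing

# Stack every CSV in the split folder into one data frame called "Data"
files <- list.files("split", pattern = "\\.csv$", full.names = TRUE)
Data <- ldply(files, fread)

# Replace the spaces in the column names with underscores
names(Data) <- gsub(" ", "_", names(Data))

# Parse the timestamp (assumed to be MM/DD/YYYY hh:mm:ss AM/PM) and derive
# columns that the timeframe filters can work with
Data$Trip_Start_Timestamp <- parse_date_time(Data$Trip_Start_Timestamp, orders = "mdY IMS p")
Data$Trip_Year  <- year(Data$Trip_Start_Timestamp)
Data$Trip_Month <- month(Data$Trip_Start_Timestamp, label = TRUE)
Data$Trip_Day   <- wday(Data$Trip_Start_Timestamp, label = TRUE)
Data$Trip_Hour  <- hour(Data$Trip_Start_Timestamp)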
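And a minimal sketch of the community area merge, assuming the boundary file is the City of Chicago "Community Areas" shapefile (whose attribute columns are usually named area_numbe and community) and matching on the pickup area purely for illustration; the paths and column names here are assumptions:

library(rgdal)

# Read the community area boundaries (dsn and layer names are placeholders)
myspdf <- readOGR(dsn = "CommAreas", layer = "CommAreas")

# Keep a plain data frame copy of the attribute table
myspdf.df <- as.data.frame(myspdf)

# Rename and retype the key columns so they line up with "Data"
names(myspdf.df)[names(myspdf.df) == "area_numbe"] <- "Pickup_Community_Area"
names(myspdf.df)[names(myspdf.df) == "community"]  <- "Pickup_Community_Area_Name"
myspdf.df$Pickup_Community_Area <- as.numeric(as.character(myspdf.df$Pickup_Community_Area))

# Attach the community area name to each trip by its pickup area number
Data <- merge(Data,
              myspdf.df[, c("Pickup_Community_Area", "Pickup_Community_Area_Name")],
              by = "Pickup_Community_Area",
              all.x = TRUE)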