Links to the data
Data Preparation
Pre-importation:
As per the project instructions, I whittled the 7 GB file down to roughly 1 GB by keeping only the specified columns. I did this with the PowerShell Import-Csv cmdlet, piping its output through Select-Object ("select") to pick the columns and Export-Csv to write the result. Here was the specific command:
"Import-Csv '.\Taxi.csv'| select "Trip Start Timestamp","Trip Seconds","Trip Miles","Pickup Community Area","Drop-off community Area","Company"|Export-Csv -Path .\Taxi2.csv -NoTypeInformation"
From there, I kept reducing the data as specified by the instructions by piping Import-Csv through the Where-Object ("Where") cmdlet:
Removing trips shorter than 0.5 miles: Import-Csv 'Taxi2.csv' | Where { [double]$_."Trip Miles" -gt 0.5 } | Export-Csv -Path 'Taxi3.csv' -NoTypeInformation
Removing trips longer than 100 miles: Import-Csv 'Taxi3.csv' | Where { [double]$_."Trip Miles" -lt 100 } | Export-Csv -Path 'Taxi4.csv' -NoTypeInformation
Removing trips shorter than 60 seconds: Import-Csv 'Taxi4.csv' | Where { [int]$_."Trip Seconds" -ge 60 } | Export-Csv -Path 'Taxi5.csv' -NoTypeInformation
Removing trips longer than 5 hours (18,000 seconds): Import-Csv 'Taxi5.csv' | Where { [int]$_."Trip Seconds" -le 18000 } | Export-Csv -Path 'Taxi6.csv' -NoTypeInformation
Removing trips that start outside of a Chicago community area: Import-Csv 'Taxi6.csv' | Where { $_."Pickup Community Area" } | Export-Csv -Path 'Taxi7.csv' -NoTypeInformation
Removing trips that end outside of a Chicago community area: Import-Csv 'Taxi7.csv' | Where { $_."Dropoff Community Area" } | Export-Csv -Path 'Taxi8.csv' -NoTypeInformation
Lastly, I split the final data file into multiple data files with 10,000 rows in each file, using a short Python script that I found out about from a video tutorial.
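The script itself is not reproduced here, but as a rough illustration of the same idea, the 10,000-row split could also be done in R with data.table (the file and folder names below are placeholders, not the ones I actually used):

library(data.table)

# Read the fully filtered file produced by the last Where-Object step above
taxi <- fread("Taxi8.csv")

# Write consecutive 10,000-row chunks out as separate CSV files
dir.create("split", showWarnings = FALSE)
chunk_size <- 10000
starts <- seq(1, nrow(taxi), by = chunk_size)
for (i in seq_along(starts)) {
  rows <- starts[i]:min(starts[i] + chunk_size - 1, nrow(taxi))
  fwrite(taxi[rows], sprintf("split/taxi_part_%03d.csv", i))
}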
Post-importation
Read in all of the split-up data files using the ldply function from the plyr library, with the pattern *.csv so that every file with the CSV extension would be read (a sketch of these read-in and clean-up steps appears after this list)
Used fread from the data.table library instead of read.csv because fread is considerably faster on files of this size
Named the combined, read-in data "Data"
Changed the column names of "Data" to replace spaces with underscores
Created additional date/time columns in "Data" for the timeframe filters by parsing the trip start timestamp with lubridate
Read in the Chicago community area boundary file using readOGR from the rgdal library and created a data frame version of it called "myspdf.df" (also sketched after this list)
Changed some of the "myspdf.df" column names so that it can be merged with "Data" easily
Merged the community area name column from "myspdf.df" into "Data"
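Here is a minimal sketch of the read-in and clean-up steps above, assuming the split files live in a "split" folder and that the timeframe columns come from Trip Start Timestamp (the folder name, timestamp format, and derived columns are assumptions rather than my exact code):

library(plyr)        # ldply
library(data.table)  # fread
library(lubridate)   # date/time parsing

# Stack every CSV in the split folder into one data frame called "Data"
files <- list.files("split", pattern = "\\.csv$", full.names = TRUE)
Data <- ldply(files, fread)

# Replace the spaces in the column names with underscores
names(Data) <- gsub(" ", "_", names(Data))

# Parse the timestamp (assumed to be MM/DD/YYYY hh:mm:ss AM/PM) and derive
# columns that the timeframe filters can work with
Data$Trip_Start_Timestamp <- parse_date_time(Data$Trip_Start_Timestamp, orders = "mdY IMS p")
Data$Trip_Year  <- year(Data$Trip_Start_Timestamp)
Data$Trip_Month <- month(Data$Trip_Start_Timestamp, label = TRUE)
Data$Trip_Day   <- wday(Data$Trip_Start_Timestamp, label = TRUE)
Data$Trip_Hour  <- hour(Data$Trip_Start_Timestamp)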
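And a minimal sketch of the community area merge, assuming the boundary file is the City of Chicago "Community Areas" shapefile (whose attribute columns are usually named area_numbe and community) and matching on the pickup area purely for illustration; the paths and column names here are assumptions:

library(rgdal)

# Read the community area boundaries (dsn and layer names are placeholders)
myspdf <- readOGR(dsn = "CommAreas", layer = "CommAreas")

# Keep a plain data frame copy of the attribute table
myspdf.df <- as.data.frame(myspdf)

# Rename and retype the key columns so they line up with "Data"
names(myspdf.df)[names(myspdf.df) == "area_numbe"] <- "Pickup_Community_Area"
names(myspdf.df)[names(myspdf.df) == "community"]  <- "Pickup_Community_Area_Name"
myspdf.df$Pickup_Community_Area <- as.numeric(as.character(myspdf.df$Pickup_Community_Area))

# Attach the community area name to each trip by its pickup area number
Data <- merge(Data,
              myspdf.df[, c("Pickup_Community_Area", "Pickup_Community_Area_Name")],
              by = "Pickup_Community_Area",
              all.x = TRUE)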