This is the continuation of the explanation of the process I followed to try to merge all the data sets provided by Divvy.
Given that all the data sets from 2020 to 2022 have already been uploaded to RStudio, I am going to continue by uploading the data sets from 2013 to 2019, so that I can afterwards merge them with the data set already created, called "merged_20_21_22" (a sketch of the import pattern is given after the year headings below).
Year 2013
Year 2014
Year 2015
Year 2016
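The import step for each year is not reproduced here, so below is a minimal sketch of the pattern, using readr and dplyr. The file names and object names (Year_2013, Year_2014, and so on) are placeholders rather than the ones used originally; the real Divvy exports are split into a different number of files per year.

```r
library(readr)
library(dplyr)

# 2013 was published as a single file (placeholder name).
Year_2013 <- read_csv("Divvy_Trips_2013.csv")

# Later years are split into several files, which are stacked into one frame.
Year_2014 <- bind_rows(
  read_csv("Divvy_Trips_2014_Q1Q2.csv"),
  read_csv("Divvy_Trips_2014_Q3.csv"),
  read_csv("Divvy_Trips_2014_Q4.csv")
)

# Year_2015 and Year_2016 are built the same way from their quarterly files.
```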
Now, I try to merge these 4 years, but there is an error. I find that for 2014, 2015 and 2016 the starttime and endtime columns are of character type rather than date-time (POSIXct).
So, I proceed to join 2013, 2014 and 2015 so that I can amend their column types at the same time:
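A minimal sketch of this step, assuming all three frames still hold starttime and endtime as character after import and that the timestamps follow a "month/day/year hour:minute" pattern. The format string and the object name merged_13_14_15 are assumptions.

```r
library(dplyr)

# Stack the three years and convert the timestamp columns to POSIXct.
# The format string is an assumption and must match the raw data.
merged_13_14_15 <- bind_rows(Year_2013, Year_2014, Year_2015) %>%
  mutate(
    starttime = as.POSIXct(starttime, format = "%m/%d/%Y %H:%M"),
    endtime   = as.POSIXct(endtime,   format = "%m/%d/%Y %H:%M")
  )
```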
Once this is done, I try to merge 2013, 2014, 2015 and 2016.
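A sketch of the four-year merge; since the 2016 columns were also read in as character (as noted above), they are converted in the same way first. merged_13_14_15_16 is the name referenced later in this write-up.

```r
# Convert Year_2016's timestamp columns before binding.
Year_2016 <- Year_2016 %>%
  mutate(
    starttime = as.POSIXct(starttime, format = "%m/%d/%Y %H:%M"),
    endtime   = as.POSIXct(endtime,   format = "%m/%d/%Y %H:%M")
  )

merged_13_14_15_16 <- bind_rows(merged_13_14_15, Year_2016)
```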
The following step is to upload the data sets from 2017 to 2019, but before that, as 2019_Q2 has different column names from the rest of the data sets, I need to rename them to be able to merge everything.
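The rename itself is not shown in this write-up; one simple way, assuming Q2_2019 has the same columns in the same order as the other 2019 quarters (the object names are placeholders), is to copy the names across:

```r
# Overwrite Q2_2019's column names with the ones used by the other quarters.
colnames(Q2_2019) <- colnames(Q1_2019)
```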
Next, I upload the rest of the data sets.
I have realised that the start_time and end_time columns are character instead of POSIXct for all 2017 data sets. In addition, the tripduration column for Q1_2018 is character and I want it to be numeric. So, I now proceed to convert the column types as follows (a sketch of the conversions is shown after the list of quarters below).
Q1_2017
Q2_2017
Q3_2017
Q4_2017
Q1_2018
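A sketch of the conversions, assuming the 2017 timestamps follow a "month/day/year hour:minute:second" pattern; the format string is an assumption and should be adjusted to whatever the raw data contains.

```r
library(dplyr)

# Convert the 2017 timestamp columns from character to POSIXct.
Q1_2017 <- Q1_2017 %>%
  mutate(
    start_time = as.POSIXct(start_time, format = "%m/%d/%Y %H:%M:%S"),
    end_time   = as.POSIXct(end_time,   format = "%m/%d/%Y %H:%M:%S")
  )
# Q2_2017, Q3_2017 and Q4_2017 are converted in exactly the same way.

# Convert tripduration to numeric, stripping any thousands separators first.
Q1_2018 <- Q1_2018 %>%
  mutate(tripduration = as.numeric(gsub(",", "", tripduration)))
```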
Once all the column types coincide for 2017, 2018 and 2019, I merge them:
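A sketch of that merge; the quarterly object names for the 2018 and 2019 files not shown above are assumed to follow the same Qx_YYYY pattern.

```r
merged_17_18_19 <- bind_rows(
  Q1_2017, Q2_2017, Q3_2017, Q4_2017,
  Q1_2018, Q2_2018, Q3_2018, Q4_2018,
  Q1_2019, Q2_2019, Q3_2019, Q4_2019
)
```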
Now that I have merged_13_14_15_16 and merged_17_18_19, I can join them into one data frame. But in order to do that, I first have to make the column names coincide with the function below:
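The original function is not reproduced here; as a sketch, and assuming the only mismatch is the timestamp naming noted earlier (starttime/endtime versus start_time/end_time), dplyr::rename would do it:

```r
merged_13_14_15_16 <- merged_13_14_15_16 %>%
  rename(
    start_time = starttime,
    end_time   = endtime
  )
```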
After this, I can merge them:
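The merge itself, assuming dplyr::bind_rows is the function used:

```r
merged_2013_to_2019 <- bind_rows(merged_13_14_15_16, merged_17_18_19)
```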
In order to be able to merge this new data frame later with merged_20_21_22, there are some column types to convert and also some columns to rename.
First, I will convert the trip_id, from_station_id and to_station_id columns of merged_2013_to_2019 from numeric to character.
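A sketch of the conversion using mutate() and across():

```r
# Convert the id columns to character so they match merged_20_21_22.
merged_2013_to_2019 <- merged_2013_to_2019 %>%
  mutate(across(c(trip_id, from_station_id, to_station_id), as.character))
```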
Next, I am going to rename some columns of merged_2013_to_2019 so that they coincide with merged_20_21_22 and can be merged later.
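A sketch of the renaming, assuming merged_20_21_22 uses the post-2020 Divvy naming (ride_id, started_at, ended_at, start_station_name, and so on); the exact mapping in the original code may differ.

```r
merged_2013_to_2019 <- merged_2013_to_2019 %>%
  rename(
    ride_id            = trip_id,
    started_at         = start_time,
    ended_at           = end_time,
    start_station_id   = from_station_id,
    start_station_name = from_station_name,
    end_station_id     = to_station_id,
    end_station_name   = to_station_name
  )
```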
Now that I have all the columns ready to be merged, I need to find a function that will help me join merged_2013_to_2019 and merged_20_21_22, including the columns they do not have in common.
Unfortunately, I could not find any function that merges two data frames with different observations (rows) and different variables (columns).
The only idea that comes to my mind is to remove the columns they do not have in common and try to add them back later with a function similar to VLOOKUP in Excel.
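For reference, one way that idea could be implemented is to keep only the columns the two frames share, stack them, and plan to add the missing columns back later with a lookup. The object name all_trips is a placeholder.

```r
# Keep only the columns both frames have in common, then stack them.
common_cols <- intersect(names(merged_2013_to_2019), names(merged_20_21_22))

all_trips <- bind_rows(
  select(merged_2013_to_2019, all_of(common_cols)),
  select(merged_20_21_22, all_of(common_cols))
)
```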
Before continuing, I have realised that I have another issue with station_name and station_id between merged_2013_to_2019 and merged_20_21_22. There are many trips without this information in merged_20_21_22, and among the stations gathered from 2013 to 2019 the same station_id appears for different station_names, which is a problem.
So, in order to check what we have from 2020 to 2022, I am going to create a new data frame which records the different station names and ids and the number of times each station name appears.
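A sketch of that summary, assuming merged_20_21_22 uses the post-2020 column names; the start coordinates are included in the grouping since the next paragraph refers to them.

```r
stations_20_21_22 <- merged_20_21_22 %>%
  count(start_station_name, start_station_id, start_lat, start_lng,
        name = "appearances")
```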
After running the code above, I realised that there are many different coordinates for the same stations and names. Taking this into account, I deduced that the differences may come from the year, so I decided to create a separate data frame per year.
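A sketch of the per-year split, using lubridate::year() on the start timestamp; the object names are placeholders.

```r
library(lubridate)

stations_2020 <- merged_20_21_22 %>%
  filter(year(started_at) == 2020) %>%
  count(start_station_name, start_station_id, start_lat, start_lng,
        name = "appearances")
# The same code with 2021 and 2022 produces stations_2021 and stations_2022.
```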
Once this is done, my plan is to extract the list of different stations per year and, if they are small enough, export them to Excel and merge them with the Excel file I already have with the stations from 2013 to 2019.
Now, I am going to export them to Excel, and in order to do that I need to install and load a new package.
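The package is not named in this write-up; writexl is one option and is used here as an assumption. The file names are placeholders.

```r
install.packages("writexl")
library(writexl)

# Write each yearly station summary to its own .xlsx file.
write_xlsx(stations_2020, "stations_2020.xlsx")
write_xlsx(stations_2021, "stations_2021.xlsx")
write_xlsx(stations_2022, "stations_2022.xlsx")
```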
After doing this, I have realised that the problem described previously is still there and that less than 5% of the observations in each Excel file have NA values.
I was hoping to be able to VLOOKUP those NA values, but because many different coordinates are recorded for each location, it is almost impossible to VLOOKUP them without making mistakes.
After giving it some thought, it is very complicated to join both data frames (merged_20_21_22 and merged_2013_to_2019) because all the data from 2013 to 2019 lacks longitude and latitude information. But there is still the possibility of adding that information if I export the data frames to Excel:
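A sketch of that export attempt, again assuming writexl; the yearly objects and file names are placeholders for whatever the original code used.

```r
# Export the 2013-2016 trip data so the missing coordinates can be added in Excel.
write_xlsx(Year_2013, "Year_2013.xlsx")
write_xlsx(Year_2014, "Year_2014.xlsx")
write_xlsx(Year_2015, "Year_2015.xlsx")
write_xlsx(Year_2016, "Year_2016.xlsx")
```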
After trying the above functions, the only data set that could be amended in Excel was Year 2013. Sadly, the data sets for 2014 to 2016 are too big to be opened in Excel. So, the only option left is to upload them to BigQuery and perform the equivalent of a VLOOKUP (a join) in SQL.
But again, the free version of BigQuery has a limit on the size of the data sets that can be uploaded, so I cannot continue down this path either.
The conclusion is that I will have to proceed with just the data from 2020 to 2022, because the data from 2013 to 2019 is incomplete and, currently, I lack the knowledge to join it all together.
Thank you for your interest in how I tried to merge all the available data. Please continue reading my analysis of this case on the main page.