Tidyverse R
An opinionated collection of R packages sharing a unified philosophy and grammar for Data Science. An office productivity resource that makes Data Science more a tidy universe than a METAVERSE.
A good starting point for introducing Data Science to newcomers is the area of Visualization and Data Transformation. These are key tools for writing up reports, developing business plans and presenting ideas. Producing stunning graphs and state-of-the-art data dashboards can help supply some of the sophistication that third parties may wish to observe in you. In particular, the tidyverse umbrella package provides a free-to-use, enterprise-level set of solutions. The "tidyverse" suite assembles some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. In addition to R for Data Science, we also find these texts useful: ggplot2: Elegant Graphics for Data Analysis and the R Graphics Cookbook. Jenny Bryan's STAT545 text and coursework also provide valuable insights for data wrangling and exploration using R. The tidyverse packages work in harmony to clean, process, model, and visualize data. Installing the tidyverse in RStudio is straightforward. Please follow the video clips below:
RStudio Desktop App Installation
RStudio Cloud Installation
Installation of RStudio for Launching in Anaconda
R Set-up for Google Colab
Graphing is highly important for populating written reports and communicating with prospective financial backers, suppliers and clients. ggplot2 is a dedicated data visualization package for R. Hadley Wickham pioneered ggplot2 in 2005, departing somewhat from base R by dis-aggregating visualization semantics across scales and layers. ggplot2 can serve as a replacement for, or add-on to, the base graphics in R and contains a number of defaults for web and print display of common scales. It is licensed under the GNU GPL. The ggplot2 package from the tidyverse provides a simplified syntax for producing a sophisticated range of visualizations for small to large datasets. It is ideal for dismantling hard-to-explain concepts and distilling business intelligence into capsule form. ggplot2 offers an unparalleled level of intuition, easily projected through dashboards, PDFs, PowerPoint slides and so on. The playlist introduces the basic tools touched on in R for Data Science:
# Variable Type Description Details
# manufacturer string car manufacturer 15 manufacturers
# model string model name 38 models
# displ numeric engine displacement in liters 1.6 - 7.0, median: 3.3
# year integer year of manufacturing 1999, 2008
# cyl integer number of cylinders 4, 5, 6, 8
# trans string type of transmission automatic, manual (many sub types)
# drv string drive type f, r, 4, f=front wheel, r=rear wheel, 4=4 wheel
# cty integer city mileage miles per gallon
# hwy integer highway mileage miles per gallon
# fl string fuel type 5 fuel types (diesel, petrol, electric, etc.)
# class string vehicle class 7 types (compact, SUV, minivan etc.)
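As a quick illustration of how these variables are used, the sketch below plots the mpg dataset with ggplot2; the particular aesthetic mapping is our own illustrative choice rather than an exercise taken from R for Data Science:
library(ggplot2)

# a minimal sketch: engine displacement vs highway mileage, coloured by vehicle class
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, colour = class)) +
  labs(x = "Engine displacement (litres)",
       y = "Highway mileage (mpg)",
       colour = "Vehicle class")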
The tidyverse is an amalgam of packages that can be installed relatively trouble-free and expeditiously via a single "meta" package, called "tidyverse". This provides a convenient way of downloading and installing all tidyverse packages with a single R command:
install.packages("tidyverse")
The core tidyverse includes the packages that you’re likely to use in everyday data analyses, and these are attached when you attach the tidyverse package:
library(tidyverse)
#> -- Attaching packages ---------------------- tidyverse 1.3.0.9000 --
#> v ggplot2 3.2.1 v purrr 0.3.3
#> v tibble 2.1.3 v dplyr 0.8.3
#> v tidyr 1.0.0 v stringr 1.4.0
#> v readr 1.3.1 v forcats 0.4.0
#> -- Conflicts ------------------------------ tidyverse_conflicts() --
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
The underlying philosophy is regularly presented using the following iconography. The workflow outline below has become something of a signature of the tidyverse's simplicity and style.
For any enterprise, a mastery of data/business intelligence helps reveal the productive engine and the underlying strengths, weaknesses, opportunities and threats. More data can sometimes lead to more confusion unless you have a tool that organizes the key talking points and leverages those vital nuggets of knowledge that you want to convey to peers. dplyr, from the tidyverse suite, again offers a simplified syntax to express your best ideas through numbers. It is useful for creating tables of summary statistics across specific groups of data. Rarely will data arrive in exactly the form you anticipate or desire in order to analyze it appropriately. As part of the data science workflow, you will need to transform your data. To demonstrate some of the functionality of the tidyverse, and in particular dplyr, Hadley Wickham compiled the nycflights13 dataset, which is available as a downloadable package in R. This dataset presents the information below for airline flights departing the three main NYC airports in 2013. The package also includes useful metadata on airlines, airports, weather, and planes. Using the str() command, we tease out the dimensions and composition of this voluminous dataset:
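The listing below can be reproduced with a couple of lines; a minimal sketch, assuming the nycflights13 package is installed from CRAN:
# install.packages("nycflights13")   # run once if the package is not yet installed
library(nycflights13)
str(flights)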
tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
$ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
$ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
$ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
$ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
$ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
$ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
$ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
$ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
$ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
$ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
$ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
$ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
$ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
$ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
$ distance : num [1:336776] 1400 1416 1089 1576 762 ...
$ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
$ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
$ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
The nycflights13 flights table contains 336,776 rows by 19 columns of data points. The data was originally obtained from the Bureau of Transportation Statistics.
US Airport codes can be found here.
Variables include:
year, month, day Date of departure.
dep_time, arr_time Actual departure and arrival times (format HHMM or HMM), local tz.
sched_dep_time, sched_arr_time Scheduled departure and arrival times (format HHMM or HMM), local tz.
dep_delay, arr_delay Departure and arrival delays, in minutes. Negative times represent early departures/arrivals.
carrier Two letter carrier abbreviation. See airlines to get name.
flight Flight number.
tailnum Plane tail number. See planes for additional metadata.
origin, dest Origin and destination.
air_time Amount of time spent in the air, in minutes.
distance Distance between airports, in miles.
hour, minute Time of scheduled departure broken into hour and minutes.
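Because carrier and tailnum are only abbreviations, the companion lookup tables can be joined in to recover the full metadata. A minimal sketch, assuming dplyr and nycflights13 are attached:
library(dplyr)
library(nycflights13)

# attach the full carrier name to each flight via the airlines lookup table
flights %>%
  left_join(airlines, by = "carrier") %>%
  select(year, month, day, carrier, name, flight, origin, dest)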
Below, we set out step-by-step instructions introducing the dplyr verbs from chapter 5 of R for Data Science.
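As a taster, the pipeline below strings several of those verbs together; the choice of January JFK departures and the delay summary are illustrative assumptions rather than an exercise taken verbatim from the book:
library(dplyr)
library(nycflights13)

flights %>%
  filter(origin == "JFK", month == 1) %>%          # keep January departures from JFK
  select(carrier, dest, dep_delay, arr_delay) %>%  # keep a handful of columns
  mutate(gain = dep_delay - arr_delay) %>%         # minutes made up in the air
  group_by(carrier) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE),
            n_flights = n()) %>%
  arrange(desc(mean_arr_delay))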
Google Colab provides an enormous resource to those in the data community. It is free, like many things on the internet, contrary to the Economist adage of "no free lunch". Still, it is free where you possibly need it most and it will, no doubt, be transformative like many of the Google technologies. It is essentially an online, browser-based platform that permits users to train their models on machines gratis and enables a full suite of executable code to run, including Python, R, C++ and JavaScript. It enables working with large datasets, complex modelling and sharing work seamlessly with collaborators. It makes no difference which computer you possess, how it is configured or how old it is. You can still use Google Colab, and it has transformed the fortunes and mojo of my ailing 2011 MacBook Pro. All you require is a Google/Gmail account, a web browser and, of course, an internet connection. Colab provides practically free, almost unimpeded access to GPUs such as the Tesla K80 and even a TPU, with some provisos (e.g. no bitcoin mining allowed).
Google Colab is typically used to execute Python code in the browser environment. The integration of Python into Google Drive no doubt serves to democratize data science and expand opportunities for learning and developing skills outside the university nexus. An important recent development has been the extension of the same to R. There are two ways to execute R in Google Colaboratory:
1 Run Rmagic by executing the command %load_ext rpy2.ipython and then adding %%R at the top of each cell thereafter (see just below). This permits R and Python to be executed in the same notebook.
It is quite likely you will have to perform a pip install if you experience difficulties:
!pip install rpy2==3.5.1
or
2 Load the following link into your browser: https://colab.to/r (this link seems to be more robust: https://colab.research.google.com/#create=true&language=r )
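A minimal sketch of option 1, written as two notebook cells (note that %%R must be the very first line of its cell; the cell-separator comments are annotations only, and the summary itself is just an illustrative example):
# --- Cell 1 (Python) ----------------------------------------------------
%load_ext rpy2.ipython

# --- Cell 2 (R) ---------------------------------------------------------
%%R
# assumes the tidyverse is available in Colab's R installation;
# if not, run install.packages("tidyverse") first
library(tidyverse)
mpg %>% group_by(class) %>% summarise(mean_hwy = mean(hwy))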
using %load_ext rpy2.ipython and %%R
The Google Colaboratory, although normally used for running Python, can also be used for running R. This increases versatility, and the code will open directly once you click on the yellow Colab icon.
# Chapt 3 Visualisation
# https://r4ds.had.co.nz/data-visualisation.html
# This mpg dataset provides fuel economy data from 1999 and 2008
# for 38 popular models of cars. The dataset is
# shipped with the ggplot2 package.
using %load_ext rpy2.ipython and %%R
The Google Colaboratory, although normally used for running Python, can also be used for running an R script. Double click the yellow CO icon to open the Google Colaboratory environment.
# Chapt 5 Data Transformation
# https://r4ds.had.co.nz/transform.html
# The nycflights13 dataset and the dplyr verbs filter(), select(), rename(), summarise(), mutate(), plus summary(), count(), n() and the na.rm argument
1. Hartsfield–Jackson Atlanta International Airport (ATL) - 103.9 Million Passengers
2. Los Angeles International Airport (LAX) - 84.5 Million Passengers
3. O’Hare International Airport (ORD) - 79.8 Million Passengers
4. Dallas/Fort Worth International Airport (DFW) - 75 Million Passengers
5. Denver International Airport (DEN) - 61 Million Passengers
6. John F. Kennedy International Airport (JFK) - 61 Million Passengers
7. San Francisco International Airport (SFO) - 56 Million Passengers
8. McCarran International Airport (LAS) - 51 Million Passengers
9. Seattle-Tacoma International Airport (SEA) - 50 Million Passengers
10. Charlotte Douglas International Airport (CLT) - 46 Million Passengers
11. Orlando International Airport (MCO) - 45 Million Passengers
12. Miami International Airport (MIA) - 44 Million Passengers
13. Phoenix Sky Harbor International Airport (PHX) - 44 Million Passengers
14. Newark Liberty International Airport (EWR) - 43 Million Passengers
15. George Bush Intercontinental Airport (IAH) - 41 Million Passengers
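Those national figures can be set against the destinations that dominate the 2013 NYC departures; a minimal sketch using count() (the top-10 cut-off is our own choice):
library(dplyr)
library(nycflights13)

# most frequent destination airports among 2013 NYC departures
flights %>%
  count(dest, sort = TRUE) %>%
  head(10)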
Below we include a Google Colab project that combines R code from R for Data Science and Python code (from Lampu Bhutia). The combination of powerful Python and R libraries means you can manipulate data flexibly using slightly different approaches. Interesting combinations would include using the R tidyverse and then applying the pandas library from Python. To compare pandas to dplyr, it is worth exploring this link. Also, Python for Data Science mirrors R for Data Science.
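As a rough illustration of how the two vocabularies line up (our own example, not code from the Colab project; the pandas equivalent is shown only as a comment):
library(dplyr)
library(nycflights13)

# dplyr: average arrival delay per destination
flights %>%
  group_by(dest) %>%
  summarise(mean_arr_delay = mean(arr_delay, na.rm = TRUE))

# pandas equivalent in Python:
#   flights.groupby("dest")["arr_delay"].mean()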
We can load https://colab.to/r into the browser. It is also possible to use:
https://colab.research.google.com/#create=true&language=r
ggplot2 is an R package for producing data graphics/visualizations. Unlike most other graphics packages, ggplot2 has an underlying grammar, based on the Grammar of Graphics, that allows you to compose graphs by combining independent components.
Two useful HTML-based texts for learning ggplot2:
https://ggplot2-book.org/index.html
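To see that composability in action, the sketch below builds one plot specification from independent components; the particular geoms, facetting and labels are our own illustrative choices:
library(ggplot2)

# data + aesthetic mapping + two geoms + facetting + labels,
# each an independent component of the same plot specification
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "loess", se = FALSE) +
  facet_wrap(~ drv) +
  labs(title = "Highway mileage vs engine displacement, by drive type")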
We can implement R in a dedicated R notebook. The embedded Google Colab below uses this approach where https://colab.to/r is entered into the browser. This avoids having to use %load_ext rpy2.ipython and repeat %%R in each cell thereafter.
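In such a dedicated R runtime every cell is plain R; a tiny sketch, assuming nothing beyond base R, to confirm the kernel before loading anything heavier:
# confirm we are in an R kernel and check whether the tidyverse is already installed
R.version.string
"tidyverse" %in% rownames(installed.packages())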
IBM Watson Studio provides a useful platform for running Machine Learning, and the Lite version may prove particularly fruitful for small businesses wanting to become smart. The following resources and tools are available for free to explore data collaboratively with AI and machine learning in the Watson Studio Cloud Lite Plan:
50 capacity unit hours/month
Integrated environments
Publish and collaborate in the cloud
Notebook servers and RStudio for interactivity and data visualization with Python, R, and Scala
Below I will run through a number of cloud resources available in the IBM Watson Studio and set out a more thorough exploratory data analysis of the HDMA dataset using R Tidyverse. R Tidyverse code is provided below the video clips.
with Tidyverse Installation
with dplyr and ggplot2
and some more