
Machine Learning with Tidy Models in R 2020

May 5, 2020

Tidymodels in R

In early 2020, the R community began promoting "tidymodels", a meta-package on CRAN. It is a collection of packages for modeling and machine learning that follows the Tidyverse principles.

R has long been able to streamline workflows that run several supervised machine learning algorithms side by side using the caret package and dplyr, but tidymodels takes this a step further.

For those who follow me and maybe don't speak data science, or are new to data science, here is the short version:

To use predictive modeling, or "supervised machine learning", to make predictions, you need powerful tools that can harness the data you feed into them and quickly produce the best predictions possible. Tidymodels helps speed this process up and lets everyone using it write the same code syntax.

For beginners, review this course by renowned Data Scientist Julia Silge.

To install tidymodels in your R console:

install.packages("tidymodels")

What is the Tidyverse? The Tidyverse is a collection of R packages that share a design philosophy, grammar, and data structures, which together promote good coding practices for efficiency and repeatability. The packages "ggplot2" and "dplyr" are built around this philosophy, as are many more.
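As a small illustration of that shared grammar, here is a sketch using dplyr's pipe on mtcars, a dataset that ships with R. The dataset and the column choices are mine, picked only for illustration; they are not part of the Airbnb case study. Each verb takes a data frame and returns a data frame, so the steps read top to bottom.

```r
library(dplyr)

# mtcars is a built-in dataset, used here only for illustration
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                       # keep cars with more than 100 horsepower
  group_by(cyl) %>%                          # one group per cylinder count
  summarise(mean_mpg = mean(mpg), n = n())   # average mpg and row count per group

mpg_by_cyl
```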

The Pipeline of Prediction Modeling in R:

1. Define your data and ingest it.

2. Exploratory Data Analysis.

3. Transforming Data for Modeling.

4. Executing your model(s). Also known as Supervised Machine Learning.

5. Evaluating your models.
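The five steps above can be sketched end to end with tidymodels. This is a minimal sketch under my own assumptions, not the Airbnb analysis itself: it uses the built-in mtcars data, two predictors I picked arbitrarily (wt and hp), and a plain linear regression.

```r
library(tidymodels)

set.seed(123)

# 1. Ingest: mtcars ships with R and stands in for a real dataset
cars <- as_tibble(mtcars)

# 2. Explore: quick look at column types and first values
glimpse(cars)

# 3. Transform: hold out a test set and normalize the predictors
cars_split <- initial_split(cars, prop = 0.8)
cars_train <- training(cars_split)
cars_test  <- testing(cars_split)

rec <- recipe(mpg ~ wt + hp, data = cars_train) %>%
  step_normalize(all_predictors())

# 4. Model: a linear regression fitted through a workflow
lm_spec <- linear_reg() %>%
  set_engine("lm")

wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(lm_spec)

lm_fit <- fit(wf, data = cars_train)

# 5. Evaluate: predict on the held-out rows and compute regression metrics
results <- predict(lm_fit, new_data = cars_test) %>%
  bind_cols(cars_test)

metrics(results, truth = mpg, estimate = .pred)
```

The recipe and the model are kept as separate objects so that either one can be swapped out without touching the other.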

When I am dabbling with a new package in R, especially one revolving around machine learning, I typically turn to Kaggle for a dataset.

I scrolled through and found a dataset of New York City Airbnb listings.

For the first case study, I am going to try to predict the price of a stay at any location in the dataset using a regression model with very few predictors.

Regression models predict a quantitative outcome, that is, one that is numeric and continuous.

New York City AirBnB Data

The following information comes from the description on Kaggle's website linked above.

Context

Since 2008, guests and hosts have used Airbnb to expand on traveling possibilities and present more unique, personalized way of experiencing the world. This dataset describes the listing activity and metrics in NYC, NY for 2019.

Content

This data file includes all needed information to find out more about hosts, geographical availability, necessary metrics to make predictions and draw conclusions.

Acknowledgements

This public dataset is part of Airbnb, and the original source can be found on this website.

Inspiration

What can we learn about different hosts and areas?
What can we learn from predictions? (ex: locations, prices, reviews, etc)
Which hosts are the busiest and why?
Is there any noticeable difference of traffic among different areas and what could be the reason for it?

Step 1. Define your data and ingest it.

I tend to use .csv files only when I am not connecting to a database over ODBC. The following R code reads your .csv file into a data frame.

This data has a multitude of predictors that could be used, even if it is not that clean. The dataset covers a single geographical location in the United States, New York City, and only the year 2019.

AirBnb <- read.csv("AB_NYC_2019.csv")
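Since the data is not perfectly clean, it can help to tell read.csv() which strings mean "missing" and then count the gaps per column. The sketch below is self-contained: it writes a tiny made-up CSV to a temporary file so it runs anywhere. The rows are invented and the column names are my assumption about the Kaggle file; with the real data you would pass your own file path instead.

```r
# Tiny made-up CSV standing in for the real file (illustration only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,room_type,price,last_review,reviews_per_month",
  "1,Private room,149,2018-10-19,0.21",
  "2,Entire home/apt,225,2019-05-21,0.38",
  "3,Private room,150,,"
), csv_path)

# Treat empty strings and "NA" as missing; keep text columns as character
listings <- read.csv(csv_path,
                     na.strings = c("", "NA"),
                     stringsAsFactors = FALSE)

str(listings)              # column types
colSums(is.na(listings))   # number of missing values per column
```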

Step 2. Exploratory Data Analysis.

Features in the Dataset:

1. id

2. Name

3. Host Id

4. Host Name

5. Neighborhood Group

6. Neighborhood

7. Latitude

8. Longitude

9. Room Type - One of the values Entire home/apt, Private room, or Shared room.

10. Price

11. Minimum Nights

12. Number of Reviews

13. Last Review - The date of the most recent review.

14. Reviews Per Month

15. Calculated Host Listings Count

16. Availability 365 - The number of days the rental is available in a 365-day calendar year.

We want to predict the cost of a stay at any of these locations using the characteristics of the listings, such as neighborhood, number of reviews, and host reputation.

library(tidyverse)

AirBnb <- read.csv("AB_NYC_2019.csv")

# Print the structure of the AirBnb object using glimpse().
glimpse(AirBnb)

Price is a continuous numeric value that ranges from $0 to $10,000 in the dataset.

Let's visualize the distribution of price across the entire dataset using the ggplot2 package.

Build a histogram with Bins for Price.

# currency() comes from formattable, classIntervals() from classInt
library(formattable)
library(classInt)

# Create 6 quantile-based bins for price
AirBnb$price <- currency(AirBnb$price, digits = 0L)
price_breaks <- classIntervals(AirBnb$price, n = 6, style = "quantile")$brks
AirBnb$Price_Grp <- cut(AirBnb$price, breaks = price_breaks,
                        include.lowest = TRUE, dig.lab = 10)

# Build the plot: one bar per price group
ggplot(AirBnb, aes(x = Price_Grp)) +
  stat_count(width = 0.5) +
  labs(x = "Price Grouped (U.S. Dollars)", y = "Number of Rentals") +
  ggtitle("Distribution of Price by Group")
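To see what quantile binning does, here is a sketch in base R only, with no extra packages. The price vector is made up for illustration. The idea is that the break points are chosen so each bin holds roughly the same number of listings, which is useful when a few very expensive rentals would otherwise stretch equal-width bins.

```r
# Made-up prices for illustration only (not taken from the dataset)
prices <- c(0, 40, 69, 90, 106, 150, 175, 200, 350, 1000, 2500, 10000)

# Seven break points give six bins: the 0%, 16.7%, ..., 100% quantiles
breaks <- quantile(prices, probs = seq(0, 1, length.out = 7))

# include.lowest = TRUE keeps the minimum price inside the first bin
bins <- cut(prices, breaks = breaks, include.lowest = TRUE, dig.lab = 10)

# Count of prices per bin; quantile breaks keep the counts balanced
table(bins)
```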

Step 3. Transforming Data for Modeling.

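Remove identifiers. Identifiers such as the listing ID carry no predictive signal and have little to no worth in any model, so drop them and use all of the remaining features. The snippet below shows the pattern on a cars2018 data frame rather than the Airbnb data: it drops the two identifier columns, Model and Model Index.

# Deselect the 2 identifier columns to create car_vars
car_vars <- cars2018 %>%
  select(-Model, -`Model Index`)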

Step 4. Executing your model(s). Also known as Supervised Machine Learning.

The quickest and easiest way to get a first model is to build the simplest model possible. Let's fit a linear model using R's lm() function (short for "linear model"). The formula MPG ~ . regresses MPG on every other column in car_vars.

fit_all <- lm(MPG ~ ., data = car_vars)

# Print the summary of the model

summary(fit_all)
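The same two steps can be applied to the Airbnb data. The sketch below is hedged: it runs on a small made-up data frame so it works without the CSV, and the column names (id, name, host_id, host_name, room_type, minimum_nights, number_of_reviews, price) are my assumption about how the Kaggle file is laid out. With the real data you would replace airbnb_toy with the AirBnb object.

```r
library(dplyr)

# Made-up rows for illustration only; column names are assumed
airbnb_toy <- data.frame(
  id                = 1:8,
  name              = paste("Listing", 1:8),
  host_id           = 101:108,
  host_name         = paste("Host", 1:8),
  room_type         = rep(c("Entire home/apt", "Private room"), 4),
  minimum_nights    = c(1, 2, 3, 1, 5, 2, 7, 1),
  number_of_reviews = c(10, 0, 45, 3, 120, 8, 1, 60),
  price             = c(150, 60, 200, 75, 180, 55, 225, 80)
)

# Step 3: drop the identifier columns
airbnb_vars <- airbnb_toy %>%
  select(-id, -name, -host_id, -host_name)

# Step 4: regress price on every remaining column
fit_price <- lm(price ~ ., data = airbnb_vars)

# Print the summary of the model
summary(fit_price)
```

On the full dataset, high-cardinality text columns such as the neighborhood would each become many dummy variables under price ~ ., so it may be worth starting with a handful of predictors.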

Step 5. Evaluating your models.
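One common way to evaluate a regression model is cross-validation: fit on part of the data, score on the held-out part, and average the results. The sketch below is a minimal example under my own assumptions. It uses the built-in mtcars data and two arbitrarily chosen predictors rather than the Airbnb listings, and it reports the default regression metrics, RMSE and R-squared.

```r
library(tidymodels)

set.seed(42)

# mtcars ships with R; it stands in for real modeling data
folds <- vfold_cv(mtcars, v = 5)

# A plain linear regression wrapped in a workflow
wf <- workflow() %>%
  add_formula(mpg ~ wt + hp) %>%
  add_model(linear_reg() %>% set_engine("lm"))

# Fit on 4 folds and score on the held-out fold, once per fold
cv_results <- fit_resamples(wf, resamples = folds)

# RMSE and R-squared averaged across the 5 folds
cv_metrics <- collect_metrics(cv_results)
cv_metrics
```

A lower RMSE means predictions sit closer to the observed values, in the units of the outcome; for the Airbnb case that would be dollars.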
