Information is from Tidy Modeling with R by Max Kuhn and Julia Silge. These are notes I took on the material.
Types of Models:
Descriptive Models
def. to describe or illustrate characteristics of some data. The analysis might have no other purpose than to visually emphasize some trend or artifact in the data. - Page 6
Inferential Models
def. "to produce a decision for a research question or to explore a specific hypothesis, similar to how statistical tests are used. An inferential model starts with a predefined conjecture or idea about a population and produces a statistical conclusion such as an interval estimate or the rejection of a hypothesis" - Page 8
Predictive Models
def. Data modeled to produce the most accurate prediction possible for new data. The primary goal is that the predicted values have the highest possible fidelity to the true value of the new data. -page 9
Models can be categorized as being supervised or unsupervised. Unsupervised models are those that learn patterns, clusters, or other characteristics of the data, but lack an outcome (the dependent variable). Supervised models are those that have an outcome variable. - page 11
Within Supervised models, there are two types:
Regression
def. predicts a numeric outcome -page 11
Classification
def. predicts an outcome that is an ordered or unordered set of qualitative values - page 11
Outcomes can be called labels, endpoints, or dependent variables. These are what is being predicted in supervised models. Independent variables (also known as predictors, features, or covariates) are the substrate for making predictions of the outcome.
I prefer outcomes and features.
Two types of variables:
1. Quantitative
def. Real numbers and integers
2. Qualitative (Nominal data)
def. Represent some sort of discrete state that cannot be placed on a numeric scale, such as colors.
Clean the data
Understanding the data (Exploratory Data Analysis or EDA)
Back and forth between numerical analysis and data visualization in which different discoveries lead to more questions and data analysis side quests to gain more understanding.
Identify an outcome to be predicted
Identify a performance metric
Feature engineering
creation of feature set to use in modeling. Can use complex methodologies such as PCA or simple features such as the ratio of two predictors.
Model Tuning and Model Selection
Some models require parameter tuning in which some structural parameters must be specified or optimized.
Model Evaluation
Assess model performance metrics, examine residual plots, and conduct EDA-like analyses to understand how well the models work.
This entire process is then iterative to gain more insights and more predictive power.
After coming to an end, finalize, document, and communicate the model.
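The steps above can be sketched end to end in a few lines of base R. This is only an illustration under assumptions that are not in the book: it uses the built-in mtcars data, a hypothetical power_to_weight ratio as the engineered feature, and RMSE as the performance metric.

```r
# Hypothetical walk-through on the built-in mtcars data
set.seed(42)
train_rows <- sample(nrow(mtcars), size = 24)

# Feature engineering: a simple ratio of two predictors
cars <- transform(mtcars, power_to_weight = hp / wt)
train <- cars[train_rows, ]
test  <- cars[-train_rows, ]

# Fit on the training rows only
fit <- lm(mpg ~ power_to_weight + disp, data = train)

# Performance metric: root mean squared error on the held-out rows
rmse <- sqrt(mean((test$mpg - predict(fit, newdata = test))^2))
rmse

# Residual plot for a visual check of the fit
plot(fitted(fit), resid(fit), xlab = "Fitted mpg", ylab = "Residual")
```

In practice each step would be revisited several times, for example by trying other features or models and comparing the metric.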
Tidyverse
def. a collection of R packages for data analysis that are developed with common ideas and norms.
This framework focuses on designing R packages and functions that can be easily understood.
Designed for the pipe (%>%) and functional programming.
The magrittr pipe operator (%>%) chains a sequence of R functions, which increases readability. A pipeline starts with the data set, which is then passed as the first argument to the next function.
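A small sketch of the difference the pipe makes, assuming the dplyr package (which re-exports %>%) and the built-in mtcars data; neither the data set nor the particular steps come from the book.

```r
library(dplyr)  # re-exports the magrittr pipe %>%

# Nested calls have to be read inside-out
nested <- head(arrange(filter(mtcars, cyl == 4), desc(mpg)), 3)

# The same steps, chained left to right with the pipe
piped <- mtcars %>%
  filter(cyl == 4) %>%
  arrange(desc(mpg)) %>%
  head(3)

# Both forms produce the same result
identical(nested, piped)
```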
The majority of model functions cannot operate on non-numeric data. For example, qualitative data must be encoded in a numeric format. The most common approach is to use indicator variables (or dummy variables) in place of the original qualitative values.
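To make indicator variables concrete, here is a minimal base R sketch. The houses data frame and its columns are made up for illustration; within tidymodels the same encoding is usually handled by the recipes package.

```r
# Hypothetical toy data with one qualitative column
houses <- data.frame(
  price = c(200, 250, 180, 300),
  color = factor(c("red", "blue", "green", "blue"))
)

# model.matrix() expands the factor into 0/1 indicator columns.
# The first level ("blue", alphabetically) is the reference level
# and is absorbed into the intercept.
indicators <- model.matrix(price ~ color, data = houses)
colnames(indicators)
indicators
```

A factor with three levels needs only two indicator columns, because the third level is implied when both indicators are zero.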
The tidymodels framework contains multiple packages. Loading the tidymodels metapackage shows whether any function names conflict with those in your already loaded packages.
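A minimal sketch of loading the metapackage, assuming tidymodels is installed. The call to tidymodels_prefer() is optional and resolves name clashes in favor of the tidymodels functions.

```r
library(tidymodels)   # attaches the core packages and reports name conflicts

# Optionally tell R to prefer the tidymodels versions when names clash
tidymodels_prefer(quiet = TRUE)
```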
rsample package
focuses on data splitting and resampling
yardstick
focuses on performance metrics
broom
designed to convert output into tidy tibbles
dials
contains infrastructure to create and manage values of tuning parameters for the tidymodels packages
dplyr
a fast consistent tool for working with data frame like objects
ggplot2
a system for declaratively creating graphics based on the grammar of graphics
infer
used to perform statistical inference
modeldata
data sets used for demonstrating or testing model-related packages
parsnip
provides a tidy, unified interface to models that can be used to try a range of models effectively
purrr
a set of tools for working with functions and vectors
recipes
preprocessing and feature engineering steps for modeling
tibble
simple data frames
tidyr
creates tidy data, where each column is a variable, each row is an observation, and each cell contains a single value
tune
Used to find reasonable values of hyper-parameters in models, pre-processing methods, and post-processing steps.
workflows
streamlines preprocessing and the parsnip model by bundling them together
workflowsets
used to create workflows en masse, as well as to train them and visualize the results
The Ames Housing Data - Book Example introduced on page 45
This dataset contains information about 2,930 properties in Ames, Iowa including information related to:
House characteristics (bedrooms, garage, fireplace, pool, porch, etc.)
Location (Neighborhood)
Lot information (zoning, shape, size, etc.)
Ratings of condition and quality
Sale Price
Step 1: Exploratory Data Analysis
Define outcome. Looking to predict the last sale price of the house in USD.
Create a histogram of the distribution of sale prices.
Throughout the book, the outcome column is pre-logged (log-transformed).
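A sketch of the histogram step, assuming the ggplot2 and modeldata packages are installed. The copy of ames in modeldata stores Sale_Price in raw USD, so this example puts the x-axis on a log scale rather than transforming the column; the bin count and labels are arbitrary choices.

```r
library(ggplot2)

# The ames data ships with the modeldata package (Sale_Price is in USD here)
data(ames, package = "modeldata")

price_hist <- ggplot(ames, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, color = "white") +
  scale_x_log10() +
  labs(x = "Sale price (USD, log10 scale)", y = "Number of properties")

price_hist
```

On the raw scale the distribution is right-skewed; the log scale makes it look more symmetric, which is one reason the outcome is logged.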
Questions to ask during EDA:
1. Is there anything odd or noticeable about the distribution of the individual predictors? Is there much skewness or any pathological distribution?
2. Are there high correlations between predictors?
3. Are there associations between predictors and the outcome?
When data are reused for multiple tasks, instead of being carefully "spent" from the finite data budget, certain risks increase, such as the risk of accentuating bias or compounding effects from methodological errors.
For example, utilize a subset of the data to determine which predictors are informative, before considering parameter estimation at all.
Traditionally, empirical model validation splits the existing pool of data into two distinct sets.
1. Training Set - Used to develop and optimize the model. Typically the majority of the data. Sandbox.
2. Test Set - Held in reserve until one or two models are chosen as the methods most likely to succeed. This set is used as the arbiter to determine the efficacy of the model.
Suppose you allocate 80% of the data to training and 20% to testing. The most common way to do this is simple random sampling, which the rsample package provides through rsample::initial_split(). It takes a data frame as an argument, along with the proportion (prop) to be placed into training.
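A minimal sketch of the 80/20 split, assuming the rsample and modeldata packages are installed. The seed value is arbitrary and only there to make the split reproducible.

```r
library(rsample)

data(ames, package = "modeldata")

set.seed(501)  # makes the random split reproducible
ames_split <- initial_split(ames, prop = 0.80)

# Extract the two data frames from the split object
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# Every row ends up in exactly one of the two sets
nrow(ames_train) + nrow(ames_test) == nrow(ames)
```

The split object only records which rows go where; training() and testing() materialize the actual data frames.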
Simple random sampling is usually appropriate, but there are exceptions. When there is drastic class imbalance in a classification problem, one class occurs much less often than another, and a random split may leave that class under-represented in one of the sets. To avoid this, use stratified sampling: the training/test split is conducted within each class, and the subsamples are then combined into the overall training and test sets.
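Stratified sampling can be sketched with the strata argument of initial_split(). The churn data frame below is made up for illustration, with 10% of rows in the rare class.

```r
library(rsample)

# Hypothetical imbalanced data: 10% "yes", 90% "no"
set.seed(123)
churn <- data.frame(
  x = rnorm(1000),
  left = factor(rep(c("yes", "no"), times = c(100, 900)))
)

# Split within each level of `left`, then combine the pieces
churn_split <- initial_split(churn, prop = 0.80, strata = left)

# The class proportions are close to 10% / 90% in both sets
prop.table(table(training(churn_split)$left))
prop.table(table(testing(churn_split)$left))
```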
With time series data, random sampling is often not the best choice; it is more common to use the most recent data as the test set. Use rsample::initial_time_split() instead. The prop argument denotes the proportion of the data, taken from the start, that goes to training. The function assumes that the data are already sorted in an appropriate order.
Once the data have been encoded in a format ready for a modeling algorithm, such as a numeric matrix, they can be used in the model building process.
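A sketch of a time-ordered split, assuming the rsample package is installed. The sales data frame is hypothetical and is already sorted from oldest to newest.

```r
library(rsample)

# Hypothetical daily series, already sorted from oldest to newest
sales <- data.frame(
  day = seq(as.Date("2023-01-01"), by = "day", length.out = 100),
  units = seq_len(100)
)

# The first 80% of rows go to training, the most recent 20% to testing
sales_split <- initial_time_split(sales, prop = 0.80)

# No training day is later than any test day
max(training(sales_split)$day) < min(testing(sales_split)$day)
```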
With the parsnip package, you must:
1. Specify the type of model based on its mathematical structure
(Linear regression, random forest, KNN, etc.)
2. Declare the mode of the model
(Either regression or classification)
Once the model specification is completed, the model estimation can be done with the fit() function. Use translate() to see how parsnip converts the user's code to the underlying package's syntax.
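A sketch of a specification, translate(), and fit(), assuming the tidymodels and modeldata packages are installed. The choice of the "lm" engine and of Longitude and Latitude as predictors is for illustration only, and the outcome is logged inside the formula because the modeldata copy of ames stores raw prices.

```r
library(tidymodels)

data(ames, package = "modeldata")

# Model type, computational engine, and mode
lm_spec <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

# Show the call parsnip will make to the underlying stats::lm() function
translate(lm_spec)

# Estimate the model on the data
lm_fit <- fit(lm_spec, log10(Sale_Price) ~ Longitude + Latitude, data = ames)
lm_fit
```

The specification itself holds no data, so the same lm_spec can be fit to different data sets or formulas.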
Modeling functions in parsnip separate model arguments into two categories:
1. Main Arguments
2. Engine Arguments - specific to a particular engine
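The two kinds of arguments can be sketched with a random forest specification. This assumes tidymodels is installed; the "ranger" engine and the particular values are illustrative choices, and the ranger package itself would only be needed when fitting.

```r
library(tidymodels)

rf_spec <- rand_forest(trees = 1000, min_n = 5) %>%   # main arguments
  set_engine("ranger", verbose = TRUE) %>%            # engine argument
  set_mode("regression")

# Printing lists main and engine-specific arguments separately
rf_spec
```

Main arguments such as trees and min_n go in the model function, while anything specific to one engine is passed through set_engine().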
Once the model is created and fit, we can use the results in a variety of ways.
plot
print
examine model output
Make Predictions
1. Results are always a tibble
2. Column names of the tibble are always predictable.
3. There are always as many rows in the tibble as there are in the input data set.
These 3 rules make it easy to merge predictions with original data.
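A sketch of the three rules in action, assuming tidymodels and modeldata are installed. The model and the five-row new_houses subset are illustrative, and the predictions are on the log10 scale because the outcome is logged in the formula.

```r
library(tidymodels)

data(ames, package = "modeldata")

lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(log10(Sale_Price) ~ Longitude + Latitude, data = ames)

# Hypothetical "new" data: the first five rows of ames
new_houses <- ames %>% slice(1:5)

# predict() returns a tibble with a .pred column and one row per input row
preds <- predict(lm_fit, new_data = new_houses)

# Because the rows line up, predictions can be column-bound to the original data
new_houses %>%
  select(Sale_Price, Longitude, Latitude) %>%
  bind_cols(preds)
```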