R Part 4a

The total length of the videos in this section is approximately 22 minutes, but you will also spend time running code while completing this section.

You can also view all the videos in this section at the YouTube playlist linked here.

Please download the code file:

Splines

RPart4a.1.Splines.mp4

Question 1: In which of the following situations would the use of splines be appropriate?

  • You know outcomes for 2 treatments, and you want to predict outcomes for a 3rd treatment for which you have no data {discrete variable}

  • You have data for several outcomes on a continuous scale and want to try to predict what the outcome would be for another data point somewhere between your known outcomes {continuous variable}

Show answer

The second option. Splines are used to predict outcomes on a continuous scale when you already have data for several values of the variable. They can't be used to predict the outcome for a discrete category for which you have no data at all.
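As a quick sketch, base R's spline() function can interpolate between known points on a continuous scale (the numbers below are made up for illustration):

```r
# Made-up data on a continuous scale
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

# Predict the outcome at a new point between the known x values
s <- spline(x, y, xout = 2.5)
s$y   # an interpolated value between the outcomes at x = 2 and x = 3
```

The xout argument asks for the spline's value at a specific point; without it, spline() returns a fine grid of interpolated points suitable for plotting with lines().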

Regression trees

The code you downloaded has been updated since this video was made, so the code and video don't match at the moment. The reason for the mismatch is that new and improved packages for regression trees come out frequently.

The video uses an old package. You should still watch the video, because the overall usage is similar.

The code you downloaded shows a package called party. This is my favorite tree package because the labels in the graphics are easy to understand. There is an updated version called partykit that you should try if you have any problems with party. Other popular tree packages include rpart and tree - feel free to try those, too.

Also, the end of the video below discusses random forests, which I chose to omit from these modules.

RPart4a.2.Regression Trees & Random Forests.mp4

Question 2: Sometimes each node shows a boxplot, and sometimes each node shows a bar plot. Why?

Show answer

If the outcome variable is continuous, it's appropriate to show the distribution of the outcomes for data points in that node using a boxplot. If the outcome is categorical, it's appropriate to show the distribution of the outcomes for data points in that node using a bar plot.
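As a hedged sketch of the two cases, here is a tree fit to a categorical outcome and a tree fit to a continuous outcome, using rpart (one of the alternative tree packages mentioned above, which ships with standard R installations) and R's built-in iris data rather than this module's data set:

```r
library(rpart)

# Categorical outcome: with party/partykit, plot() would show
# a bar plot in each terminal node
fit.cat <- rpart(Species ~ ., data = iris)

# Continuous outcome: with party/partykit, plot() would show
# a boxplot in each terminal node
fit.cont <- rpart(Sepal.Length ~ ., data = iris)

plot(fit.cat); text(fit.cat)   # rpart's plainer base-graphics tree plot
```

The formula syntax is the same across party, partykit, and rpart; only the node graphics differ.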

Additions to plots

RPart4a.3.Graphs & Legends.mp4

Question 3: If you don't see something that you've added to your plot, what should you do?

Show answer

You should always make sure that the x and y values you've specified are within the range of your plot, and that you haven't set the item to the same color as the background. Running the same code again won't do anything different than it did the first time, and you shouldn't give up!
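A quick way to check both problems, sketched with made-up values:

```r
plot(1:10, 1:10)                     # x and y ranges are roughly 1 to 10
points(15, 15, pch = 19)             # outside the range: silently invisible
points(5, 5, pch = 19, col = "red")  # inside the range: visible
# col = "white" on a white background would also be invisible
par("usr")                           # shows the actual x and y limits in use
```

par("usr") returns the current plot's limits as c(xmin, xmax, ymin, ymax), so you can confirm whether your added item falls inside them.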

Writing your own functions

Note to all: I thought about excluding the video below, because the example for making a new function refers to "k-means clustering," which I used to include in these modules but recently removed. However, I decided to include the video with this warning: don't worry about the details of clustering. The point is the basic information about how to make your own functions.

RPart4a.4.Writing Functions.mp4

Question 4: Why might you want to write your own function?

Show answer

If you know that your function works, you are less likely to encounter errors later when you run it, since you don't have to type out each step again. It saves a lot of time when you have to perform the same procedure more than once.
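A minimal made-up example of the function() {...return()} pattern, here computing the standard error of the mean:

```r
# Define the function once...
se <- function(x) {
  out <- sd(x) / sqrt(length(x))
  return(out)
}

# ...then reuse it with a single call instead of retyping the steps
se(c(2, 4, 6, 8))
```

Anything computed inside the braces stays local to the function; only the value passed to return() comes back to you.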

More graphics

RPart4a.5.Graphics.mp4

Question 5: What is the first thing you should do if you don't know how to use a new function?

Show answer

Reading the help page is always the first step to understanding what a function does and what parameters it needs. If you still have questions, you might ask a friend or teacher for help and/or search the internet.
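For reference, here are the usual ways to pull up a help page in R, using spline() as the example function:

```r
?spline           # shortcut for opening the help page for spline()
help("spline")    # the same thing, in function form
example(spline)   # runs the runnable examples from the help page
```

The Usage, Arguments, and Examples sections of a help page are usually the fastest route to a working call.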

Summary of ways to run a model

This is a good moment to show you the notation for running both linear regression models and regression trees, even though you may not have seen linear regression in these modules yet. The reason is that R uses similar syntax for various model types, including trees. You can refer back to this page as needed when you run models.

Suppose that the data is called d, the outcome variable is Y, and the predictor variables are named X and Z.

The following lines show different ways to run a regression tree or linear regression.

ctree(d$Y~d$X) # works fine

lm(d$Y~d$X)


ctree(Y~X, data=d) # better, and necessary if using the predict function afterwards

lm(Y~X, data=d)


ctree(Y~X+Z, data=d) # multiple predictors

lm(Y~X+Z, data=d)


lm(Y~X+Z+X:Z, data=d) # interaction term (you don't ask for interactions when running a tree, because the idea of interactions is built into the tree algorithm)


lm(Y~X*Z, data=d) # short-cut for including main effects and also interaction


ctree(Y~., data=d) # short-cut for including all columns in d as predictors. Note that you can create a "d" that is a subset of the original data set before using this code, if helpful.

lm(Y~., data=d)
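To see why the data= form is necessary before using predict(), here is a sketch with made-up data:

```r
set.seed(1)
d <- data.frame(X = 1:10)
d$Y <- 2 * d$X + rnorm(10)

m <- lm(Y ~ X, data = d)                  # the model stores the names Y and X
predict(m, newdata = data.frame(X = 11))  # so new data can be matched by name

# lm(d$Y ~ d$X) would make this awkward: the stored variable name
# would be "d$X", which a newdata data frame can't easily supply.
```

The same pattern works for trees fit with ctree(Y~X, data=d).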

Question 6: If you want to use 10 predictors, should you type out their names with plus signs in between?

Show answer

Nope. It is usually easier to write a line of code that creates a subset of the data with only these predictors and the outcome variable:

d.subset<-d[ , 5:15]

and then run the regression this way:

output<-lm(Y~., data=d.subset)

And, you're done.

During this tutorial you learned:

  • How to graph splines

  • To run a regression tree with the party package (or partykit, rpart, tree)

  • To plot a regression tree and how to change features of the tree graphic

  • How to customize graphics

  • How to code your own functions with function() {...return()}

  • Multiple ways to define the model you want to run


Functions in review:

spline(), legend(), points(), abline(), text(), function() {...return()}, polygon(), grid(), segments(), box(), axis(), gray(), rainbow(), heat.colors(), colors(), layout(), image()


party function in review:

ctree()