Is it poisonous!??!? Or should I eat it? Mushrooms are notoriously difficult to categorize as poisonous or not. The same attribute in one yummy species can be present in a deadly one. Maybe some machine learning might help.
Download the mushrooms dataset, load it, and put it into a variable called dataSet so the rest of the code works. The documentation is here if you ever get curious about what some of the columns mean.
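If you saved the data as a CSV, loading it might look like this (mushrooms.csv is just an assumed filename — use whatever your download is called):

```r
dataSet <- read.csv("mushrooms.csv")  # assumed filename; adjust to match your download
str(dataSet)          # quick check that the columns loaded as expected
table(dataSet$class)  # counts of poisonous ("p") vs. edible ("e")
```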
You'll be predicting the "class" variable which has "p" for poisonous and "e" for edible.
Install and load some packages:
install.packages("rpart") # only need to do this once ever
install.packages("rpart.plot") # and this
library(rpart) # do this every time you reopen R
library(rpart.plot) # and this
Splitting your data
dataSet <- # put your data set here, then run the rest of this block as is (don't change anything else)
n <- nrow(dataSet) # total number of rows
trainrows <- sample(1:n, n*0.75) # randomly pick 75% of the row numbers
train <- dataSet[trainrows,] # those rows become the training set
test <- dataSet[-trainrows,] # the remaining rows become the test set
Then you can fit your model on your training set and test it using your test set. Here is the basic code to fit a tree. Pick around 6 predictor variables.
decisionTree <- rpart(Predicted ~ predictor + predictor + predictor, data = train, method = "class") # CHANGE Predicted to the column you're predicting, and the predictors to your chosen variables
prp(decisionTree)
To make the tree have more or fewer items in each node, add the argument minsplit = NUM (the minimum number of observations a node must contain before it will attempt a split). For example, with minsplit = 1, a branch could split off from a single observation. With minsplit = 10, a node would need at least 10 observations from the dataset before splitting into new branches.
decisionTree <- rpart(Predicted ~ predictor + predictor + predictor, data = train, method = "class", minsplit = __ )
prp(decisionTree)
Testing your predictions. For a decision tree, use the model to predict on the training set and see what fraction it classified correctly. Then do the same for the test set. The two percentages should be similar; if the training accuracy is much higher, your model is overfit.
train$predictedValues <- predict(decisionTree,train,type="class")
trainCorrect <- train$predictedValues == train$actualValues # CHANGE actualValues to the column you're predicting
sum(trainCorrect)/length(trainCorrect)
test$predictedValues <- predict(decisionTree,test,type="class")
testCorrect <- test$predictedValues == test$actualValues # CHANGE actualValues to the column you're predicting
sum(testCorrect)/length(testCorrect)
Make an overfit model and a well-fit model.
How well does your good model perform? Would you trust it if you found a mushroom in the wild?
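One possible sketch of the two models (the minsplit and cp values here are just assumptions to experiment with, not the "right" answers): a tiny minsplit combined with cp = 0 lets the tree grow until it memorizes the training data, while larger values keep it more general.

```r
overfitTree <- rpart(class ~ ., data = train, method = "class", minsplit = 1, cp = 0) # grows until it memorizes the training set
wellFitTree <- rpart(class ~ ., data = train, method = "class", minsplit = 20)        # forces more general branches
prp(wellFitTree)
```

Compare both models' accuracy on train and test using the code above; the overfit tree's test accuracy should drop the most.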
Okay, well... there are a lot of variables! Let's use a random forest instead: a form of machine learning that builds many trees and then classifies each observation by taking a majority vote across all of them. Random forests tend to outperform individual trees, which are often overfit to their dataset.
You'll need to install another library.
install.packages("randomForest")
library(randomForest)
Then, let's build the model:
train$class <- as.factor(train$class) # randomForest needs the response to be a factor for classification
set.seed(555) # this just makes it so we're all looking at the same "random" set
rf <- randomForest(class ~ ., ntree=5, data = train)
rf
This grew 5 trees and combined their votes. Can you find the two-way table that shows how accurate it was on the training set? Let's look at how the error rate changed as more trees were added and shifted the probability that we got mushrooms correct.
plot(rf)
Let's look at what variables ended up being most important!
varImpPlot(rf, sort = T, n.var=10, main="Top 10 - Variable Importance")
And then let's see how it performed on the test set (the two-way table above already showed you the training set).
test$predictions <- predict(rf, test)
sum(test$predictions == test$class) # number correct
sum(test$predictions != test$class) # number incorrect
Go back and grow 1000 trees instead!
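For example, the same call with ntree = 1000 (then re-inspect the table and the plot):

```r
rf1000 <- randomForest(class ~ ., ntree = 1000, data = train)
rf1000        # two-way table for the bigger forest
plot(rf1000)  # the error rate should flatten out as trees are added
```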
Use the Titanic dataset to build a tree that predicts whether someone survived.
Do so with a manual tree and then also do so with a random forest.
What variables ended up being important with the random forest?