A professional sports team is essentially a business. It needs to attract fans to come to its games and purchase various items, from tickets to concessions to special deals. As with any business, advertising is more effective if the organization can predict whether a fan is the type of person who will purchase a specific item. To do this, we need a technique similar to multiple linear regression, but for a categorical prediction. It is called logistic regression.
Will little 6-year-old Timmy like a Max Scherzer bobblehead? How about 78-year-old Ethel? Does it matter if she's been a season ticket holder since 1978?
Before we make predictions with customer data, we will try to predict which characteristics of a Titanic passenger made them more or less likely to survive. Download the dataset here.
Looking at the dataset, the most important column is called "Survived", because it is the outcome we want to predict. A few columns are unusable for various reasons. See if you can figure out why. They are:
PassengerID
Name
Ticket
Cabin
We will use the rest to predict whether a passenger survived or not. One other issue: notice that the ticket class (Pclass) is entered as numbers (1, 2, 3). Those aren't really quantities; they're categories. So we need to tell R to treat them as categories instead.
titanic$Pclass <- as.factor(titanic$Pclass)
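To see what as.factor changes, here is a small sketch on a made-up vector (pclass and pclass_factor are illustrative names, not columns of the dataset). It shows that the numbers become category labels that R no longer treats as quantities.

```r
# Toy ticket classes stored as numbers, the way Pclass is entered
pclass <- c(1, 3, 2, 3, 1)

# As plain numbers, R treats them as quantities
is.numeric(pclass)

# As a factor, each distinct value becomes a category (a "level")
pclass_factor <- as.factor(pclass)
levels(pclass_factor)

# A factor is no longer numeric, so a model gives each class its own term
is.numeric(pclass_factor)
```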
Okay, now we're ready. The process is similar to multiple linear regression, but it uses a different command: glm instead of lm.
model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, family = "binomial", data = titanic)
There's an extra argument in there, family = "binomial", which tells R the outcome has two possible values. Otherwise, it's the same idea as multiple linear regression: the term before the ~ is the variable being predicted, and all the terms after it are the predictors. Let's chop out the terms that are not significant. Look at the output from the summary and see if you can figure out what to knock out.
summary(model)
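One way to read significance off the summary is to pull the p-value column out of the coefficient table. The sketch below does this on simulated data (the data frame toy and its columns x1, x2, and y are made up for illustration and are not part of the Titanic dataset). Terms with large p-values are the usual candidates to drop.

```r
set.seed(1)
n <- 200

# Simulated predictors: the outcome depends on x1 only, x2 is pure noise
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(1.5 * toy$x1))

toy_model <- glm(y ~ x1 + x2, family = "binomial", data = toy)

# The coefficient table from the summary, as a matrix
coefs <- coef(summary(toy_model))

# For a binomial glm, the p-values are in the "Pr(>|z|)" column
p_values <- coefs[, "Pr(>|z|)"]
round(p_values, 4)
```

The same two lines (coef(summary(...)) and the "Pr(>|z|)" column) should work on the Titanic model as well.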
Come up with the best model! Then, pause so Mr. Dickson can explain what to look for in this type of model...
It's harder to evaluate these models than multiple linear regression models because there is no R^2, but there are various things that we can do. One is to see how well the model predicts the survival of the passengers in this dataset.
results <- predict(model, newdata=titanic, type="response")
results <- ifelse(results>0.5,1,0)
mean(results == titanic$Survived, na.rm=TRUE)
The number you see at the end is the proportion of passengers predicted correctly by the model (multiply by 100 to get a percentage). The na.rm=TRUE part leaves out any passengers for whom the model could not make a prediction.
One thing to be wary of, though, is that we can work really hard to make a model but then overfit it to our data. We can make our model work really well on the data we have, but we really want it to work on FUTURE data. If we don't have future data, what we can do is split our data into two parts: one part to train the model and one part to test it. Usually, we would reserve 20-25% of the data for testing.
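A single accuracy number hides which kind of mistake the model makes. As a rough sketch, a cross-tabulation of predicted against actual outcomes shows this. The vectors actual and predicted below are made-up values for illustration, not results from the Titanic model.

```r
# Hypothetical outcomes and predictions (1 = survived, 0 = did not)
actual    <- c(1, 0, 0, 1, 1, 0, 0, 1, 0, 0)
predicted <- c(1, 0, 1, 1, 0, 0, 0, NA, 0, 0)

# Overall proportion correct, skipping the case with no prediction
accuracy <- mean(predicted == actual, na.rm = TRUE)
accuracy

# Rows are what the model said, columns are what actually happened;
# table() leaves out the NA prediction by default
confusion <- table(Predicted = predicted, Actual = actual)
confusion
```

The diagonal cells are the correct predictions. The off-diagonal cells separate predicted survivors who did not survive from predicted non-survivors who did.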
set.seed(123) # fix the random seed so everyone gets the same "random" split
train_rows <- sample(1:891, 668, replace=FALSE) # randomly choose ~75% of the 891 row numbers
train <- titanic[train_rows,] # put those rows into the training set
test <- titanic[-train_rows,] # and all the other rows into the test set
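The numbers 891 and 668 only fit this particular dataset (668 is about 75% of 891). A sketch of the same split that works for a data frame of any size is below. The data frame toy is a stand-in invented for illustration, and names such as n_train and toy_train are arbitrary.

```r
set.seed(123)

# Stand-in data frame; in the lesson this would be the real dataset
toy <- data.frame(id = 1:40, value = rnorm(40))

# Work out the sizes from the data instead of typing them in
n_rows  <- nrow(toy)
n_train <- round(0.75 * n_rows)

# Randomly pick the row numbers for the training set
picked <- sample(seq_len(n_rows), n_train, replace = FALSE)

toy_train <- toy[picked, ]   # the picked rows
toy_test  <- toy[-picked, ]  # everything else

nrow(toy_train)
nrow(toy_test)
```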
Now, recalibrate your model. Train it on your training set, and then test it on both your training and test sets.
# Besides tinkering with what variables are included, you need to change something in this line... the word after data =...
model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked, family = "binomial", data = titanic)
# testing it on the training set... notice the changes from above
results <- predict(model, newdata = train, type="response")
results <- ifelse(results>0.5,1,0)
mean(results == train$Survived, na.rm=TRUE)
# testing it on the test set... notice the changes...
results <- predict(model, newdata = test, type="response")
results <- ifelse(results>0.5,1,0)
mean(results == test$Survived, na.rm=TRUE)
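Since the same three lines are repeated for each dataset, they could be wrapped in a small function. The sketch below is one way to do that on simulated data. The function name accuracy, its arguments, and the toy data frame are all hypothetical, not part of the lesson's datasets.

```r
set.seed(42)
n <- 300

# Simulated data: one predictor x and a 0/1 outcome y
toy <- data.frame(x = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 * toy$x))

# 75/25 train/test split
picked    <- sample(seq_len(n), round(0.75 * n))
toy_train <- toy[picked, ]
toy_test  <- toy[-picked, ]

# Train on the training set only
toy_model <- glm(y ~ x, family = "binomial", data = toy_train)

# Hypothetical helper: proportion of rows in `data` predicted correctly
accuracy <- function(model, data, outcome, cutoff = 0.5) {
  probs <- predict(model, newdata = data, type = "response")
  predicted <- ifelse(probs > cutoff, 1, 0)
  mean(predicted == data[[outcome]], na.rm = TRUE)
}

train_acc <- accuracy(toy_model, toy_train, "y")
test_acc  <- accuracy(toy_model, toy_test, "y")
c(train = train_acc, test = test_acc)
```

If the training accuracy is much higher than the test accuracy, that is a hint the model may be overfit to the training data.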
Here is a dataset with some customer information and whether they purchased a promotion or not. Use this to come up with a model for whether to run an ad for that customer or not. Do the following:
Split the data into a training set and a test set.
Include all variables and see what proportion of customers is predicted correctly. Test it on both the training and test sets.
Include only the statistically significant variables and do the same. Test it on both the training and test sets.