This project was also completed for a statistical learning course at Queen's University. It is taken from a Kaggle competition and written in R.
Over the past several years, bike sharing has become increasingly popular around the world. A bike sharing system lets people rent a bike from one location and return it at a different one. In this project, I use the R programming language to analyze the bike share data provided by the Kaggle competition and Capital Bikeshare. The goal is to forecast hourly bike rental demand in the Capital Bikeshare program in Washington, D.C.
I will be using the following dataset in this project: https://www.kaggle.com/c/bike-sharing-demand/data. It includes three files: a training dataset, a test dataset, and a sample submission file. A brief explanation of each:
Training dataset: used for training. It contains only the first 19 days of each month, from 2011 to 2012.
Test dataset: used for testing. It contains the 20th day through the end of each month, from 2011 to 2012.
I will start by cleaning the training dataset and performing some preprocessing steps.
library(lubridate)
library(dplyr)
train_data <- read.csv('C:/Users/Simon/Documents/R/train_bikeshare.csv', header = TRUE) ## General view of training data
test_data <- read.csv('C:/Users/Simon/Documents/R/test_bikeshare.csv', header = TRUE)
Notice that casual plus registered equals count. Therefore, I decide to drop these two columns.
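A quick sanity check of that relationship before dropping the columns (it should return TRUE):
all(train_data$casual + train_data$registered == train_data$count)
With that confirmed, I convert season, weather, and holiday/workingday to factors and add several columns derived from datetime: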
preprocessing <- function(old_data){
# drop the 'casual' and 'registered' variables by keeping only these columns
cf = c("datetime","season","holiday","workingday",
"weather","temp","atemp","humidity","windspeed")
new_data = old_data[, cf]
# add more variables derived from datetime
new_data$hour <- hour(ymd_hms(old_data$datetime)) #hour is an integer
new_data$hourf <- as.factor(new_data$hour)
new_data$year <- as.factor(year(old_data$datetime))
new_data$month <- as.factor(month(old_data$datetime))
new_data$mday <- as.factor(mday(old_data$datetime))
new_data$wday <- ifelse(wday(old_data$datetime)==1, 7, wday(old_data$datetime)-1) # shift so Monday = 1, ..., Sunday = 7
new_data$wday <- as.factor(new_data$wday)
new_data$season = as.factor(new_data$season)
# weather: merge the rare category 4 into category 3
new_data$weather[new_data$weather == 4] = 3
new_data$weather = as.factor(new_data$weather)
# convert holiday and workingday to factors
new_data$holiday = as.factor(new_data$holiday)
new_data$workingday = as.factor(new_data$workingday)
return(new_data)
}
train_new <- preprocessing(train_data)
train_new$count <- train_data$count
test_new <- preprocessing(test_data)
test_new <- select(test_new, -c(hourf))
In this section, I perform some data visualization and feature engineering on both datasets.
First, let's check whether there are any missing values in the training and test datasets:
which(is.na(train_new))
## integer(0)
which(is.na(test_new))
## integer(0)
Both commands return integer(0), so there are no missing values in either dataset.
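An equivalent per-column check:
colSums(is.na(train_new))
colSums(is.na(test_new))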
[Plots omitted: a boxplot to detect outliers; a histogram and several line charts to better understand the dataset; and a correlation plot.]
First, notice the inconsistency of variance in count across hours, so I begin with a log transformation of the response variable.
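A minimal sketch of this transformation (the exact point in the pipeline where it is applied is not shown in this excerpt; the prediction loops later undo it with exp(pred) - 1):
train_new$count <- log(train_new$count + 1) # log(count + 1) stabilizes the variance and keeps zero counts finite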
Note that there are many 0 values for windspeed in this dataset. A plausible explanation is that these were originally missing values that were later replaced with 0, which will cause problems when predicting the response. The issue occurs in both the training and test datasets, so I will use a popular imputation method: keep the rows with nonzero windspeed, train a Random Forest model on them, and then predict windspeed for the rows where it is 0.
library(randomForest)
set.seed(3)
predictWind <- function(old_data){
cf = c("datetime", "season","holiday","workingday",
"weather","temp","atemp","humidity","windspeed", "hour", "year", "month", "mday", "wday")
new_data <- old_data[, cf]
new_data_win0 <- new_data[which(new_data["windspeed"] == 0), ] # rows with windspeed 0, to be imputed
new_data_winN0 <- new_data[which(new_data["windspeed"] != 0), ] # rows with observed windspeed, used for model fitting
winModel <- randomForest(windspeed ~ season+weather+humidity+temp+month+year+atemp, data = new_data_winN0, importance = TRUE, ntree= 500)
rf.pred <- predict(winModel, newdata = new_data_win0)
new_data_win0["windspeed"] <- rf.pred
new_final <- rbind(new_data_win0, new_data_winN0)
new_final <- new_final[order(new_final$datetime), ] # restore chronological order
return(new_final)
}
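The function is then applied to both datasets. A sketch of this step, under the assumption that the train_final_bin and test_final_bin objects used below are derived from these imputed frames (any additional processing behind the _bin suffix is not shown in this excerpt):
train_final <- predictWind(train_new)
train_final$count <- train_new$count # reattach the response; predictWind keeps only the covariates
test_final <- predictWind(test_new)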
In this section, I fit three models suited to this dataset. Before that, I use the batch function provided by Prof. Song to find the rows corresponding to the start and end of each month.
batches <- function(dataset){
batch_index <- matrix(nrow = 2, ncol = 24) # two years, total of 24 months
cnt <- 1
tt_index <- 1:dim(dataset)[1]
for (y in unique(dataset$year)){ # find rows for corresponding month
for (m in unique(dataset$month)) {
tmp = tt_index[dataset$year == y & dataset$month == m]
batch_index[1,cnt] <- min(tmp)
batch_index[2,cnt] <- max(tmp)
cnt = cnt + 1
}
}
return(batch_index)
}
train_bat <- batches(train_final_bin)
test_bat <- batches(test_final_bin)
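The prediction loops below fill in submission data frames (submission_mars, new_submission, submission_gbm) whose creation is not shown. A plausible setup, assuming the Kaggle sample submission file sits alongside the other data files (the path and file name here are assumptions):
submission_mars <- read.csv('C:/Users/Simon/Documents/R/sampleSubmission.csv', header = TRUE)
new_submission <- submission_mars # will hold the Random Forest predictions
submission_gbm <- submission_mars # will hold the GBM predictions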
Reference: https://bradleyboehmke.github.io/HOML/mars.html
I will start with a linear-style model and use it as a baseline. I selected multivariate adaptive regression splines (MARS) for the following reasons:
1. Looking at the dataset, there is no evidence that the covariates have a linear relationship with the response.
2. The algorithm is a convenient way to fit non-linear data: it builds a piecewise fit by placing knots throughout the range of the data (see the small illustration after this list).
3. Unlike linear regression, MARS does not require the data to be standardized.
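MARS builds its fit from hinge functions, each of which contributes only on one side of a knot (the knot value 15 here is made up for demonstration):
hinge <- function(x, knot) pmax(0, x - knot) # the basic MARS basis function max(0, x - c)
hinge(c(10, 15, 20), 15) # returns 0 0 5: the term is active only above the knot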
library(earth)
library(caret)
for (i in 1:dim(test_bat)[2]){
train_mars2 <- train_final_bin[train_bat[1,1]:train_bat[2,i], c(2:10,15,16)] # use all historical data up to month i
test_mars2 <- test_final_bin[test_bat[1,i]:test_bat[2,i], c(2:10,15)]
hyper_grid_mars <- expand.grid(# hyperparameter tuning using train function in caret
degree = c(1:6),
nprune = seq(2,20, length.out = 10)
)
tuned_mars <- train(
x = subset(train_mars2, select = -count),
y = train_mars2$count,
method = "earth",
metric = "RMSE",
trControl = trainControl(method = "cv", number = 10), # use 10-fold cv to train
tuneGrid = hyper_grid_mars
)
best_para <- tuned_mars$bestTune
best_nprune <- best_para[, "nprune"] # get the best parameter for nprune
best_degree <- best_para[, "degree"]# get the best parameter for degree
mars_fit <- earth(count ~ ., data = train_mars2, degree = best_degree, nprune = best_nprune, minspan = 1, endspan = 1)
pred_mars <- predict(mars_fit, test_mars2)
submission_mars[test_bat[1,i]:test_bat[2,i], "count"] <- exp(pred_mars) - 1 # undo the log(count + 1) transform
}
I tuned degree and nprune in this model. degree is the maximum degree of interaction allowed between hinge terms; nprune is the maximum number of terms retained in the pruned model.
This model gives a prediction error of 0.54 on Kaggle. I will use this score as the baseline for the algorithms that follow.
Reference: https://topepo.github.io/caret/model-training-and-tuning.html
I selected Random Forest for the following reasons:
1. The algorithm maintains accuracy on large datasets and helps guard against overfitting.
2. It can handle large datasets with high dimension; in this case, the dimension is 12.
3. The bagging procedure builds multiple trees and combines them into a stronger model.
library(ranger)
library(dplyr)
library(doParallel) # provides registerDoParallel
registerDoParallel(cores = 4)
for (i in 1:dim(test_bat)[2]){
month_train <- train_final_bin[train_bat[1,1]:train_bat[2,i], c(2:10,15,16)] # use all historical data
month_test <- test_final_bin[test_bat[1,i]:test_bat[2,i], c(2:10,15)]
hyper_grid <- expand.grid( # hyperparameter tuning by GridSearch
mtry = seq(2,8), # values for mtry
node_size = seq(3,6), # values for node size
max_depth = seq(2,7), # values for max depth
OOB_RMSE = 0 # this is used for find the minimum RMSE value
)
for (j in 1:nrow(hyper_grid)){ # loop though all hyperparameters
rf <- ranger( # train model
formula = count ~ .,
data = month_train,
num.trees = 650, # use a large number of trees, I selected 650 trees
mtry = hyper_grid$mtry[j], # record each mtry values
min.node.size = hyper_grid$node_size[j], # record minimum node size values
max.depth = hyper_grid$max_depth[j], # record max depth values
num.threads = 4,
seed = 3
)
hyper_grid$OOB_RMSE[j] <- sqrt(rf$prediction.error) # record the OOB RMSE for this parameter combination
}
ind <- which.min(hyper_grid[, "OOB_RMSE"]) # find the index for the minimum prediction error
optimal_mtry <- hyper_grid[ind, 1] # find the set of optimal parameters in that index
node_size <- hyper_grid[ind, 2]
max_dep <- hyper_grid[ind, 3]
rf <- ranger(count ~ ., data = month_train, num.trees = 650, mtry = optimal_mtry, min.node.size = node_size, sample.fraction = 0.7, max.depth = max_dep, seed = 3, num.threads = 4) # fit final model
pred <- predict(rf, month_test)
new_submission[(test_bat[1,i]:test_bat[2,i]), 'count'] <- exp(pred$predictions)-1
}
For Random Forest, the best score achieved is 0.47, which is much better than MARS. The sample fraction is the proportion of observations used to train each tree: a lower value may result in high bias, while a higher value may result in high variance. The default is 63.2%; I use 70% here.
mtry is an important parameter: it is the number of variables randomly sampled as split candidates at each node. For regression, a common rule of thumb is (number of covariates) / 3. In this case, the mtry value that minimizes the prediction error is 5.
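As a side check, ranger can report variable importance if the final model is refit with importance tracking. A sketch, reusing the tuned parameters (month_train here holds the training slice from the last loop iteration):
rf_imp <- ranger(count ~ ., data = month_train, num.trees = 650, mtry = optimal_mtry,
min.node.size = node_size, max.depth = max_dep, importance = "impurity", seed = 3)
sort(rf_imp$variable.importance, decreasing = TRUE) # impurity-based importance, largest first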
Reference: https://www.listendata.com/2015/07/gbm-boosted-models-tuning-parameters.html, STAT457 Notes from Prof. Song
I selected GBM for the following reasons:
1. It is an ensemble method that improves model accuracy.
2. In gradient boosting, models are added sequentially to the ensemble, and each new model corrects the residual errors of its predecessor (see the toy sketch after this list).
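A toy, self-contained sketch of this residual-correction loop, using rpart for the shallow trees and simulated data (none of this is part of the actual pipeline):
library(rpart)
set.seed(1)
x <- data.frame(x1 = runif(500, 0, 10)) # simulated covariate
y <- sin(x$x1) + rnorm(500, sd = 0.2) # simulated response
y_hat <- rep(mean(y), length(y)) # start from a constant prediction
eta <- 0.1 # learning rate (shrinkage)
for (m in 1:50) {
r <- y - y_hat # residuals left by the current ensemble
tree_m <- rpart(r ~ x1, data = cbind(x, r), maxdepth = 2) # fit a shallow tree to the residuals
y_hat <- y_hat + eta * predict(tree_m, x) # shrink the correction and add it
}
sqrt(mean((y - y_hat)^2)) # training RMSE decreases as trees are added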
I will tune the learning rate, interaction depth, n.minobsinnode, and bag fraction in this model.
library(gbm)
registerDoParallel(cores = 4)
for (j in 1:dim(test_bat)[2]){
train <- train_final_bin[train_bat[1,1]:train_bat[2,j], c(2:10,15,16)] # get train and test data for GBM
test <- test_final_bin[test_bat[1,j]:test_bat[2,j], c(2:10,15)]
hyper_grid_gbm <- expand.grid( # hyperparameter tuning by GridSearch
shrinkage = c(0.01, 0.03, 0.05, 0.1, 0.3), # learning rate
interaction.depth = c(1, 3, 5), # interaction depth
n.minobsinnode = c(2:6),# min number of observation allowed in terminal node
bag.fraction = c(.65, .8, 1), # subsampling fraction (stochastic gradient boosting)
optimal_trees = 0,
min_RMSE = 0
)
for(i in 1:nrow(hyper_grid_gbm)) {
set.seed(123)
gbm.tune <- gbm( # loop through all parameter to find the optimal set
formula = count ~ .,
distribution = "gaussian",
data = train,
n.trees = 2500,
interaction.depth = hyper_grid_gbm$interaction.depth[i],
shrinkage = hyper_grid_gbm$shrinkage[i],
n.minobsinnode = hyper_grid_gbm$n.minobsinnode[i],
bag.fraction = hyper_grid_gbm$bag.fraction[i],
train.fraction = .80, # use 80% training data for train, and 20% for validation(used to compute validation error)
n.cores = 4,
verbose = FALSE
)
# record the minimum validation error and the corresponding optimal number of trees
hyper_grid_gbm$optimal_trees[i] <- which.min(gbm.tune$valid.error)
hyper_grid_gbm$min_RMSE[i] <- sqrt(min(gbm.tune$valid.error))
}
ind_gbm <- which.min(hyper_grid_gbm[, "min_RMSE"]) # Get the minimum validation error
shrinkage <- hyper_grid_gbm[ind_gbm, 1] # get each optimal parameter
interaction <- hyper_grid_gbm[ind_gbm, 2]
min_node <- hyper_grid_gbm[ind_gbm, 3]
fraction <- hyper_grid_gbm[ind_gbm, 4]
trees <- hyper_grid_gbm[ind_gbm, 5]
gbm1 <- gbm( # fit final model
formula = count ~ .,
data = train,
distribution = "gaussian",
n.trees = trees,
interaction.depth = interaction,
shrinkage = shrinkage,
n.minobsinnode = min_node,
bag.fraction = fraction,
train.fraction = 1,
n.cores = 4,
verbose = FALSE
)
pred_gbm <- predict(gbm1, test, n.trees = gbm1$n.trees) # prediction
submission_gbm[(test_bat[1,j]:test_bat[2,j]), 'count'] <- exp(pred_gbm)-1 # record
}
For GBM, more trees usually lead to a better score; however, because of my computer's limitations, I tried up to 2500 trees.
The optimal interaction depth is consistently 3 to 5, and the optimal learning rate falls between 0.05 and 0.3, as this dataset is relatively large. A train fraction of 0.80 means the first 80% of the data is used for training and the remainder is used to compute an out-of-sample estimate of the loss function; this serves a purpose similar to cross-validation.
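Finally, the predictions can be written out for upload to Kaggle (a sketch; the output path follows the pattern of the input paths and is an assumption):
write.csv(submission_gbm, 'C:/Users/Simon/Documents/R/submission_gbm.csv', row.names = FALSE)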
For more details, you can refer to my GitHub page.
Citation:
@article{mz2021,
title = "Kaggle Competition: Washington D.C. Bike Share Demand",
author = "Zhou, Meng",
year = "2021",
url = "https://sites.google.com/view/mengzhou/project/kaggle-project-1"
}