Our aim is to analyze the data of Vision Hotels to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
Skills Covered:
Exploratory Data Analysis (EDA)
Data Preprocessing
Logistic Regression
Multicollinearity
AUC-ROC Curve
Decision Tree Pruning
Tools Used:
Python: Jupyter Notebook
Libraries: NumPy, pandas, Matplotlib, seaborn, scikit-learn.
● InnHotels Group has a chain of hotels in Europe. They are facing problems with the high number of booking cancellations.
● A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons include changes of plans, scheduling conflicts, etc. Cancelling is often made easier by the option to do so free of charge or at a low cost, which benefits hotel guests but is a less desirable and possibly revenue-diminishing factor for hotels.
Such losses are particularly high for last-minute cancellations.
● The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior.
● The task at hand is to analyze the data provided, to find which factors have a high influence on booking cancellations, build a predictive model that can predict which booking is going to be canceled in advance, and help in formulating profitable policies for cancellations and refunds.
There are very few repeat customers, but cancellations among them are very rare.
This is a good sign, as repeat customers are important for the hospitality industry: they help spread word of mouth.
A loyal guest is usually more profitable for the business because they are more familiar with what is on offer at a hotel they have visited before.
Attracting new customers is tedious and costs more than retaining a repeat guest.
Rooms booked online have high variations in prices.
The offline and corporate room prices are nearly the same.
The Complementary market segment gets rooms at very low prices, which makes sense.
Around 40% of online bookings were canceled.
Bookings made offline are less prone to cancellations.
Corporate segment shows very low cancellations.
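The segment-level cancellation rates above can be reproduced with a simple group-by. The sketch below uses a tiny made-up table; the column names market_segment_type and booking_status and the label "Canceled" are assumptions, not confirmed by the analysis.

```python
import pandas as pd

# Toy data for illustration only; column names and values are assumptions.
bookings = pd.DataFrame({
    "market_segment_type": ["Online", "Online", "Online",
                            "Offline", "Offline", "Corporate"],
    "booking_status": ["Canceled", "Not_Canceled", "Canceled",
                       "Not_Canceled", "Not_Canceled", "Not_Canceled"],
})

# Flag canceled bookings as 1 so that the group mean is the cancellation rate.
bookings["is_canceled"] = (bookings["booking_status"] == "Canceled").astype(int)
cancel_rate = bookings.groupby("market_segment_type")["is_canceled"].mean()

print(cancel_rate.sort_values(ascending=False))
```

The same pattern, grouped by arrival month or by number of special requests, would give the other rates quoted in these observations.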
If a customer has made more than 2 special requests, there is a very high chance that the booking will not be canceled.
The median price of rooms where customers made special requests is slightly higher than that of rooms where customers made no requests.
The distribution of price for canceled bookings and not canceled bookings is quite similar.
The prices for canceled bookings are slightly higher than for bookings that were not canceled.
● There's a big difference in the median lead time between bookings that were canceled and bookings that were not canceled.
● The higher the lead time, the higher the chance of a booking being canceled.
● The trend shows the number of bookings remains consistent from April to July and the hotel sees around 3000 to 3500 guests.
● October saw the most bookings (more than 5000), but 40% of these bookings were canceled.
● The fewest bookings were canceled in January and December.
Outlier checking
There are quite a few outliers in the data.
However, we will not treat them as they are proper values.
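The analysis does not say how outliers were identified. One common approach, assumed here, is the 1.5 x IQR rule; the lead-time values below are made up for illustration.

```python
import numpy as np

# Hypothetical lead times in days; the last value is extreme but plausible.
lead_time = np.array([5, 12, 30, 45, 60, 75, 90, 120, 443])

# Quartiles and interquartile range.
q1, q3 = np.percentile(lead_time, [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = lead_time[(lead_time < lower) | (lead_time > upper)]

print(outliers)
```

A flagged value such as a booking made over a year ahead is still a genuine observation, which is the reason given above for leaving such points untreated.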
Model Building
A prediction can be wrong in two ways:
False negative: predicting that a customer will not cancel their booking when in reality they do.
False positive: predicting that a customer will cancel their booking when in reality they do not.
Both cases are important because:
If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.
If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage the brand equity.
The hotel would want the F1 score to be maximized: the higher the F1 score, the better the chances of minimizing both false negatives and false positives.
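The F1 score is the harmonic mean of precision and recall, F1 = 2 * precision * recall / (precision + recall), so it is high only when both error types are kept low. A minimal sketch with scikit-learn, using made-up labels where 1 means canceled:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# 1 = canceled, 0 = not canceled (illustrative labels, not project data).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)

print(precision, recall, f1)
```

Here there is one false positive and one false negative against three true positives, so precision, recall, and F1 all come out to 0.75.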
Negative values of the coefficient show that the probability of customers canceling the booking decreases with the increase of the corresponding attribute value.
Positive values of the coefficient show that the probability of customer canceling increases with the increase of corresponding attribute value.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
But these variables might contain multicollinearity, which will affect the p-values.
We will have to remove multicollinearity from the data to get reliable coefficients and p-values.
There are different ways of detecting (or testing for) multicollinearity; one such way is the Variance Inflation Factor (VIF).
None of the numerical variables show moderate or high multicollinearity.
We will ignore the VIF for the dummy variables.
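For a predictor j, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this directly with NumPy on synthetic data; the helper name vif and the variables a, b, c are hypothetical, and a statistics library would normally be used instead.

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (2-D array without a constant column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    values = []
    for j in range(k):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        values.append(1.0 / (1.0 - r2))
    return np.array(values)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)  # nearly a copy of a, so a and c are collinear

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

The independent column b gets a VIF close to 1, while a and c get very large values. Common rules of thumb treat a VIF above 5 as moderate and above 10 as high multicollinearity.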
We will drop the predictor variables having a p-value greater than 0.05 as they do not significantly impact the target variable.
But sometimes p-values change after dropping a variable. So, we'll not drop all variables at once.
Instead, we will do the following:
Build a model, check the p-values of the variables, and drop the column with the highest p-value.
Create a new model without the dropped feature, check the p-values of the variables, and drop the column with the highest p-value.
Repeat the above two steps till there are no columns with p-value > 0.05.
The above process can also be done manually by picking one variable at a time that has a high p-value, dropping it, and building a model again. But that might be a little tedious and using a loop will be more efficient.
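The loop described above can be sketched as follows. To keep the example self-contained, the logistic regression is fitted with Newton-Raphson in NumPy and Wald p-values are derived from it; in practice the p-values would come from a statistics library's model summary. The function names, the synthetic data, and the rule of never dropping the intercept are assumptions.

```python
import math
import numpy as np

def logit_pvalues(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson and return Wald p-values."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])
        beta = beta + np.linalg.solve(hessian, X.T @ (y - p))
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    z = beta / np.sqrt(np.diag(cov))
    # Two-sided p-value under the standard normal distribution.
    return np.array([math.erfc(abs(v) / math.sqrt(2.0)) for v in z])

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the column with the highest p-value until all are <= alpha."""
    names = list(names)
    while len(names) > 1:
        pvals = logit_pvalues(X, y)
        worst = 1 + int(np.argmax(pvals[1:]))  # column 0 is the intercept
        if pvals[worst] <= alpha:
            break
        X = np.delete(X, worst, axis=1)
        names.pop(worst)
    return names

# Synthetic data: two real effects and one pure-noise column.
rng = np.random.default_rng(42)
n = 2000
lead_time = rng.normal(size=n)
special_requests = rng.normal(size=n)
noise = rng.normal(size=n)
logit = -0.5 + 1.5 * lead_time - 1.0 * special_requests
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)
X = np.column_stack([np.ones(n), lead_time, special_requests, noise])

kept = backward_eliminate(X, y, ["const", "lead_time", "special_requests", "noise"])
print(kept)
```

Refitting after each removal matters because the remaining p-values can shift once a correlated column is gone, which is why the columns are not dropped all at once.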
We have been able to build a predictive model that can be used by the hotel to predict which bookings are likely to be cancelled with an F1 score of 0.69 on the training set and formulate marketing policies accordingly.
The logistic regression models perform consistently on the training and test sets, which suggests they generalize well.
With the default threshold, the model gives low recall but good precision.
- The hotel will be able to identify bookings that will not be cancelled and provide satisfactory services to those customers, which helps maintain brand equity, but it will lose resources on the cancellations the model misses.
With a 0.37 threshold, the model gives high recall but low precision.
- The hotel will be able to save resources by correctly identifying bookings that are likely to be cancelled, but it might damage brand equity.
With a 0.42 threshold, the model gives balanced recall and precision.
- The hotel will be able to maintain a balance between resources and brand equity.
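The trade-off between the three thresholds can be illustrated by applying each one to predicted cancellation probabilities. The thresholds 0.37 and 0.42 come from the analysis; the default is assumed to be 0.5, and the probabilities and labels below are invented so that the pattern is easy to see.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels (1 = canceled) and predicted probabilities.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_prob = np.array([0.90, 0.60, 0.45, 0.40, 0.55, 0.41, 0.38, 0.30, 0.20, 0.10])

results = {}
for threshold in (0.50, 0.42, 0.37):
    # A booking is predicted as canceled when its probability exceeds the threshold.
    y_pred = (y_prob > threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )

for threshold, (precision, recall) in results.items():
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold flags more bookings as likely cancellations, so recall rises while precision falls. How the values 0.37 and 0.42 were selected is not stated here; the ROC curve and the precision-recall curve are two usual sources for such cut-offs.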
The coefficients of required_car_parking_space, arrival_month, repeated_guest, no_of_special_requests and some others are negative: an increase in these will lead to a decrease in the chances of a customer canceling their booking.
The coefficients of no_of_adults, no_of_children, no_of_weekend_nights, no_of_week_nights, lead_time, avg_price_per_room, type_of_meal_plan_Not Selected and some others are positive: an increase in these will lead to an increase in the chances of a customer canceling their booking.
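Logistic regression coefficients are on the log-odds scale, so exponentiating a coefficient gives the multiplicative change in the odds of cancellation per one-unit increase in the feature. The coefficient values below are hypothetical; only their signs follow the text.

```python
import math

# Hypothetical coefficients; signs match the text, magnitudes are made up.
coefs = {"lead_time": 0.016, "no_of_special_requests": -1.47}

odds_ratios = {}
for name, coef in coefs.items():
    odds_ratios[name] = math.exp(coef)
    pct_change = (odds_ratios[name] - 1.0) * 100.0
    print(f"{name}: odds ratio {odds_ratios[name]:.3f} ({pct_change:+.1f}% per unit)")
```

An odds ratio above 1 corresponds to a positive coefficient (cancellation becomes more likely) and one below 1 to a negative coefficient.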
We can see that the tree has become simpler and the rules of the trees are readable.
The model's performance has become more generalized.
We observe that the most important features are:
Lead Time
Market Segment - Online
Number of special requests
Average price per room
The rules obtained from the decision tree can be interpreted as:
The rules show that lead time plays a key role in identifying if a booking will be cancelled or not. 151 days has been considered as a threshold value by the model to make the first split.
Bookings made more than 151 days before the date of arrival:
If the average price per room is greater than 100 euros and the arrival month is December, then the booking is less likely to be cancelled.
If the average price per room is less than or equal to 100 euros and the number of special requests is 0, then the booking is likely to get canceled.
Bookings made under 151 days before the date of arrival:
If a customer has at least 1 special request the booking is less likely to be cancelled.
If the customer didn't make any special requests and the booking was made online, it is more likely to get canceled; if the booking was not made online, it is less likely to be canceled.
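The rules above can be written down as a small function, which makes it easy to see which cases they cover. This is only a paraphrase of the listed rules, not the fitted tree: the function name and arguments are hypothetical, a lead time of exactly 151 days is assumed to fall in the second group, and combinations not mentioned above return None.

```python
def likely_canceled(lead_time, avg_price_per_room, no_of_special_requests,
                    booked_online, arrival_month):
    """Return True/False for the branches listed above, None otherwise."""
    if lead_time > 151:
        if avg_price_per_room > 100 and arrival_month == 12:
            return False
        if avg_price_per_room <= 100 and no_of_special_requests == 0:
            return True
        return None  # branch not covered by the rules listed above
    if no_of_special_requests >= 1:
        return False
    # No special requests: online bookings are more likely to be canceled.
    return bool(booked_online)

# A late online booking with no special requests.
print(likely_canceled(lead_time=30, avg_price_per_room=90,
                      no_of_special_requests=0, booked_online=True,
                      arrival_month=7))
```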
The decision tree model with default parameters overfits the training data and is not able to generalize well.
The pre-pruned tree gives a generalized performance with balanced values of precision and recall.
The post-pruned tree gives a higher F1 score than the other models, but the difference between its precision and recall is large.
The hotel will be able to maintain a balance between resources and brand equity using the pre-pruned decision tree model.
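The three tree variants compared above can be sketched with scikit-learn on synthetic data. The hyperparameter values (max_depth=4, min_samples_leaf=20), the use of a validation split to pick the pruning strength, and the dataset itself are assumptions; the analysis does not state the settings actually used.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bookings data, with some label noise.
X, y = make_classification(n_samples=1500, n_features=8, n_informative=4,
                           flip_y=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=1)

def val_f1(model):
    """F1 score of a fitted model on the validation split."""
    return f1_score(y_val, model.predict(X_val))

# Default tree: grown until the leaves are pure, so it memorizes the training data.
default_tree = DecisionTreeClassifier(random_state=1).fit(X_train, y_train)

# Pre-pruning: limit growth up front with depth and leaf-size constraints.
pre_pruned = DecisionTreeClassifier(
    max_depth=4, min_samples_leaf=20, random_state=1).fit(X_train, y_train)

# Post-pruning: grow fully, then choose a cost-complexity alpha.
path = default_tree.cost_complexity_pruning_path(X_train, y_train)
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0, None))  # last alpha leaves only the root
candidates = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=1).fit(X_train, y_train)
    for a in alphas
]
post_pruned = max(candidates, key=val_f1)

for name, model in [("default", default_tree), ("pre-pruned", pre_pruned),
                    ("post-pruned", post_pruned)]:
    train_f1 = f1_score(y_train, model.predict(X_train))
    print(f"{name}: leaves={model.get_n_leaves()} "
          f"train F1={train_f1:.2f} validation F1={val_f1(model):.2f}")
```

A large gap between training and validation F1 for the default tree is the overfitting signature described above, and the pruned trees trade some training fit for a smaller gap.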
Model Performance Summary
● We want to predict whether a booking will be cancelled or not using the information provided to us.
● We will use the F1 Score as the performance metric for our model because:
If we predict that a booking will not be canceled and the booking gets canceled then the hotel will lose resources and will have to bear additional costs of distribution channels.
If we predict that a booking will get canceled and the booking doesn't get canceled the hotel might not be able to provide satisfactory services to the customer by assuming that this booking will be canceled. This might damage the brand equity.
F1 score will help us minimize both false positives and false negatives.
● The Logistic Regression and Decision Tree models indicate that the most significant predictors of booking status are:
Lead Time
Number of special requests
Average price per room
The pre-pruned decision tree is the best-performing model.
Overall, the decision tree model performs better on this dataset.
Looking at the important variables based on p-values in the logistic regression model and feature importance in the decision tree model:
Lead time, number of special requests, and average price per room are important in both models.
From the logistic regression model, we observe that lead time and average price per room have a positive relationship with bookings getting cancelled, while the number of special requests has a negative relationship.
The lead time and the number of special requests made by the customer play a key role in identifying if a booking will be cancelled or not. Bookings where a customer has made a special request and the booking was done under 151 days to the date of arrival are less likely to be canceled.
Using this information, the hotel can take the following actions:
Set up a system that can send a prompt like an automated email to the customers before the arrival date asking for a re-confirmation of their booking and any changes they would like to make in their bookings.
Remind guests about imminent deadlines.
The response given by the customer will give the hotel ample time to re-sell the room or make preparations for the customers' requests.
Stricter cancellation policies can be adopted by the hotel.
Bookings where the average price per room is high and special requests were made should not get a full refund, as the loss of resources will be high in these cases.
Ideally, cancellation policies should be consistent across all market segments, but our analysis shows that a high percentage of bookings made online are cancelled. Bookings made online and then cancelled should yield a smaller refund percentage to the customers.
The refunds, cancellation fees, etc. should be highlighted on the website/app before a customer confirms their booking to safeguard guests' interests.
The length of stay at the hotel can be restricted.
We saw in our analysis that bookings where the total length of stay was more than 5 days had a higher chance of getting cancelled.
The hotel can allow bookings of up to 5 days only and ask customers to re-book if they wish to stay longer. These policies can be relaxed for the Corporate and Aviation market segments. For other market segments, the re-booking process should be easy so that it does not hamper their experience with the hotel.
Such restrictions can be strategized by the hotel to generate additional revenue.
In December and January, the ratio of cancellations to non-cancellations is low. Customers might travel to celebrate Christmas and New Year. The hotel should ensure that enough staff are available to cater to the needs of the guests.
October and September saw the highest number of bookings but also a high number of cancellations. This should be investigated further by the hotel.
Post-booking interactions can be initiated with the customers.
Post-booking interactions will show the guests the level of attention and care they would receive at the hotel.
To give guests a personalized experience, information about local events, nearby places to explore, etc. can be shared from time to time.
Improving the experience of repeated customers.
Our analysis shows that there are very few repeat customers and that cancellations among them are very rare. This is a good sign, as repeat customers are important for the hospitality industry: they help spread word of mouth.
A loyal guest is usually more profitable for the business because they are more familiar with offerings from the hotel they have visited before.
Attracting new customers is tedious and costs more than retaining a repeat guest.
A loyalty program that offers special discounts, access to hotel services, etc. to these customers can help improve their experience.