Hotel Booking
Using Python
In this project, we analyze hotel booking data and, based on the insights gained, give both types of hotel (City Hotel and Resort Hotel) actionable recommendations for reducing their cancellation rates and improving their revenue.
ABOUT DATA
The hotel booking dataset is a collection of information about hotel bookings made by customers. It contains 119390 rows and 32 columns of data.
The dataset includes information such as hotel type, location, booking dates, room type, number of adults and children, and whether the booking was cancelled or not.
Some of the notable columns in the dataset include:
hotel: the type of hotel (City Hotel or Resort Hotel)
arrival_date_month: the month of the arrival date
arrival_date_week_number: the week number of the arrival date
arrival_date_day_of_month: the day of the month of the arrival date
adults: the number of adults included in the booking
children: the number of children included in the booking
is_canceled: whether the booking was cancelled or not (1 for cancelled, 0 for not cancelled)
This data can be used for various analyses and predictions related to hotel bookings, such as predicting the likelihood of a booking being cancelled or identifying trends in seasonal booking patterns.
PROJECT BEGINNING
To start the project, open Anaconda Navigator and launch JupyterLab.
Create a folder for the project, import the data into it, and create a Python file named Hotel Booking Analysis.
Then continue working in that file and save it when you are done.
IMPORT and LOAD DATA
First, import the required libraries: Pandas, Matplotlib, Seaborn and Warnings.
Then load the 'Hotel booking Data' set.
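The loading step can be sketched as follows. This is a minimal sketch under assumptions: the file name hotel_bookings.csv is a guess, and a two-row in-memory CSV with invented values stands in for the real file so that the snippet runs on its own.

```python
import io
import warnings

import pandas as pd
# The plotting libraries would be imported here as well in the notebook:
# import matplotlib.pyplot as plt
# import seaborn as sns

warnings.filterwarnings("ignore")  # keep notebook output free of warning noise

# Tiny in-memory stand-in for the real CSV file (values invented for illustration)
sample_csv = io.StringIO(
    "hotel,is_canceled,adr,reservation_status_date\n"
    "Resort Hotel,0,75.0,2015-07-01\n"
    "City Hotel,1,120.5,2016-03-15\n"
)

# With the real data this would be pd.read_csv("hotel_bookings.csv"),
# where the file name is an assumption
df = pd.read_csv(sample_csv)
print(df.shape)
```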
EDA and DATA CLEANING
Here we start the EDA. The first image shows the first 5 rows of the data, and the second image shows the last 5 rows.
The third image shows the number of rows and columns, i.e. 119390 rows and 32 columns.
In the fourth and last image, we use the ".info()" method, which reports the number of non-null values and the data type of each column, along with the overall memory usage of the hotel booking data set.
Here, children, meal, country, agent and company columns contain null values.
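The four inspection steps shown in the images might look like the sketch below. The six-row DataFrame is a hypothetical miniature with invented values, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the booking data (values invented for illustration)
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "City Hotel", "City Hotel"],
    "adults": [2, 1, 2, 2, 3, 2],
    "children": [0.0, np.nan, 1.0, 0.0, 2.0, 0.0],
    "country": ["PRT", "GBR", None, "PRT", "ESP", "FRA"],
})

print(df.head())   # first 5 rows
print(df.tail())   # last 5 rows
print(df.shape)    # (number of rows, number of columns)
df.info()          # non-null counts, dtypes and memory usage
```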
This code snippet converts the reservation_status_date column of the pandas DataFrame df into datetime format using the pd.to_datetime() function.
Here's how the code works:
pd.to_datetime() is a function provided by the pandas library that converts a given input into datetime format.
In this case, the input is the reservation_status_date column in the DataFrame df.
Because the input is a single column (a Series), pd.to_datetime() returns a Series with a datetime64 dtype rather than a DatetimeIndex, and that Series is assigned back to the reservation_status_date column in df.
Once the conversion is done, the reservation_status_date column in df will contain datetime values, which allows for more convenient handling and manipulation of dates and times in the DataFrame.
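As a small illustration of the conversion, the sketch below applies pd.to_datetime() to three invented date strings and then uses the .dt accessor, which only becomes available once the column holds datetime values.

```python
import pandas as pd

# Invented sample dates, stored as plain strings at first
df = pd.DataFrame({"reservation_status_date": ["2015-07-01", "2016-03-15", "2017-08-30"]})

# Convert the column and assign the result back to the same column
df["reservation_status_date"] = pd.to_datetime(df["reservation_status_date"])

# The column now has a datetime dtype, so date parts can be extracted
is_datetime = pd.api.types.is_datetime64_any_dtype(df["reservation_status_date"])
years = df["reservation_status_date"].dt.year.tolist()
print(is_datetime, years)
```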
Here we use a for loop to print the unique values of the object-datatype columns.
df.describe(include = 'object') is used to generate summary statistics for only the categorical variables in the DataFrame df.
.columns is used to retrieve the column names of the categorical variables.
The for loop iterates through each column in the categorical variables.
For each column, the column name is printed using print(col).
The unique values for that column are then printed using print(df[col].unique()).
A line of underscores is printed using print('_'*50) to separate the output for each column.
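Put together, the loop described above could look like this sketch. The three-row DataFrame is invented, and the text columns are created with an explicit object dtype (an addition of this sketch) so that the selection behaves the same across pandas versions.

```python
import pandas as pd

# Invented sample; object dtype is set explicitly for the text columns
df = pd.DataFrame({
    "hotel": pd.Series(["Resort Hotel", "City Hotel", "City Hotel"], dtype=object),
    "meal": pd.Series(["BB", "HB", "BB"], dtype=object),
    "adults": [2, 1, 2],
})

# Column names of the object-dtype (categorical) variables only
object_columns = df.describe(include="object").columns

for col in object_columns:
    print(col)               # column name
    print(df[col].unique())  # its distinct values
    print("_" * 50)          # separator between columns
```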
This code snippet is used to count the number of missing values in each column of a pandas DataFrame df. Here's how the code works:
df.isnull() is a pandas DataFrame method that returns a DataFrame of the same shape as df, but with True values wherever a missing value (i.e., NaN) is present and False values elsewhere.
.sum() is a method that operates on the resulting DataFrame from df.isnull() and calculates the sum of missing values for each column in the DataFrame.
The final output is a Series object that contains the count of missing values for each column in df.
Overall, this code is useful for quickly identifying the number of missing values in each column of a pandas DataFrame. It can be used to determine if there are any missing values in the dataset, and to decide on an appropriate strategy for handling those missing values, such as imputation or dropping missing values.
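A minimal sketch of the missing-value count, using a four-row DataFrame with invented values:

```python
import numpy as np
import pandas as pd

# Invented sample with a few missing entries
df = pd.DataFrame({
    "children": [0.0, np.nan, 1.0, 0.0],
    "country": ["PRT", None, "GBR", None],
    "adults": [2, 2, 1, 3],
})

# True/False mask of missing cells, summed per column
missing = df.isnull().sum()
print(missing)
```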
How drop and dropna work:
df.drop() is a pandas DataFrame method that drops the specified column(s) from the DataFrame. By default, df.drop() does not modify the original DataFrame, but instead returns a new DataFrame with the specified columns dropped.
['company','agent'] specifies a list of column names to be dropped from the DataFrame df.
axis=1 specifies that the columns to be dropped are along the column axis (i.e., horizontally).
inplace=True specifies that the DataFrame df is to be modified in place, meaning that the original DataFrame is changed rather than returning a new DataFrame.
df.dropna() is a pandas DataFrame method that drops all rows with any missing values from the DataFrame. By default, df.dropna() does not modify the original DataFrame, but instead returns a new DataFrame with the specified rows dropped.
inplace=True specifies that the DataFrame df is to be modified in place, meaning that the original DataFrame is changed rather than returning a new DataFrame.
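The two calls can be tried on a small invented DataFrame, as in the sketch below. Only the column names come from the text; the values are made up.

```python
import numpy as np
import pandas as pd

# Invented sample: 'company' and 'agent' are mostly empty, 'children' has one gap
df = pd.DataFrame({
    "adr": [80.0, 95.5, 120.0, 60.0],
    "children": [0.0, np.nan, 1.0, 0.0],
    "agent": [9.0, np.nan, 14.0, np.nan],
    "company": [np.nan, np.nan, 40.0, np.nan],
})

# Drop the two sparse columns, modifying df itself
df.drop(["company", "agent"], axis=1, inplace=True)

# Drop every remaining row that still contains a missing value
df.dropna(inplace=True)
print(df)
```

Dropping the sparse columns first matters: calling dropna() while 'company' and 'agent' are still present would remove almost every row.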
BoxPlot :
df['adr'] selects the "adr" column from the DataFrame df.
.plot() is a pandas method that creates a plot of the specified data. By default, .plot() creates a line plot.
kind='box' specifies that a box plot should be created instead of a line plot.
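A possible box plot call is sketched below. The adr values are invented, including one extreme value added to show how an outlier appears as a separate point, and the backend line is only needed when running outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented adr values with one extreme outlier
df = pd.DataFrame({"adr": [60.0, 75.0, 80.0, 95.5, 110.0, 120.0, 5400.0]})

# Select the column and draw it as a box plot instead of the default line plot
ax = df["adr"].plot(kind="box")
plt.show()
```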
Reservation Cancel Percentage:
df['is_canceled'] selects the "is_canceled" column from the DataFrame df.
.value_counts() is a pandas method that calculates the frequency of each unique value in the selected column.
normalize=True specifies that the output should be normalized to represent proportions rather than counts.
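For example, with eight invented bookings of which three were cancelled, the call returns the share of each status. The variable name cancelled_perc is an assumption of this sketch.

```python
import pandas as pd

# Invented sample: 5 bookings kept (0) and 3 cancelled (1)
df = pd.DataFrame({"is_canceled": [0, 0, 1, 0, 1, 0, 0, 1]})

# Proportion of each status instead of raw counts
cancelled_perc = df["is_canceled"].value_counts(normalize=True)
print(cancelled_perc)
```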
plt.figure() creates a new figure with a specified size (in this case, 5 inches wide by 4 inches high).
plt.title() sets the title of the plot to "Reservation Status Count".
plt.bar() creates a bar plot of the frequency of canceled and non-canceled reservations, with two bars representing the two unique values in the "is_canceled" column.
['Not Canceled','Canceled'] specifies the labels for the x-axis (i.e., the two bars).
df['is_canceled'].value_counts() specifies the height of each bar, with the frequency of each unique value in the "is_canceled" column representing the number of observations in each category.
edgecolor='k' specifies that the edges of the bars should be black.
width=0.7 specifies the width of each bar as a fraction of the available space.
plt.show() displays the plot.
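The bar plot steps can be assembled as in the sketch below, on invented data. One defensive addition that is not in the description: value_counts() orders its result by frequency, so the counts are reindexed to [0, 1] to make sure the bars line up with the hard-coded labels.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample: 5 bookings kept (0) and 3 cancelled (1)
df = pd.DataFrame({"is_canceled": [0, 0, 1, 0, 1, 0, 0, 1]})

# Fix the order to [0, 1] so it always matches the labels below
counts = df["is_canceled"].value_counts().reindex([0, 1])

plt.figure(figsize=(5, 4))
plt.title("Reservation Status Count")
bars = plt.bar(["Not Canceled", "Canceled"], counts, edgecolor="k", width=0.7)
plt.show()
```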
plt.figure(figsize = (8,4)) sets the size of the figure
ax1 = sns.countplot(x ='hotel',hue = 'is_canceled',data = df, palette = 'Blues') creates a countplot using Seaborn. The x parameter is set to 'hotel' which means that the plot will show the counts for different hotels. The hue parameter is set to 'is_canceled', which means that the plot will differentiate between the two cancellation statuses (canceled or not canceled). The data parameter is set to df, which is the DataFrame that the data is being pulled from. The palette parameter is set to 'Blues', which sets the color palette of the plot.
legend_labels = ax1.get_legend_handles_labels() stores the legend handles and labels of the plot as a (handles, labels) tuple
plt.title('Reservation Status In Different Hotels',fontsize = 15) sets the title of the plot
plt.xlabel('Hotels') sets the x-axis label of the plot
plt.ylabel('Number of Reservation') sets the y-axis label of the plot
plt.legend(['Not Canceled','Canceled']) sets the legend of the plot
plt.show() displays the plot.
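The numbers behind such a count plot can be checked without drawing anything. The sketch below uses pd.crosstab, a different technique from the Seaborn call described above, on six invented bookings to tabulate the same counts per hotel and cancellation status.

```python
import pandas as pd

# Invented sample of six bookings
df = pd.DataFrame({
    "hotel": ["City Hotel", "City Hotel", "City Hotel",
              "Resort Hotel", "Resort Hotel", "City Hotel"],
    "is_canceled": [0, 1, 1, 0, 0, 0],
})

# One row per hotel, one column per status: the bar heights of the count plot
counts = pd.crosstab(df["hotel"], df["is_canceled"])
counts = counts.rename(columns={0: "Not Canceled", 1: "Canceled"})
print(counts)
```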
Resort and City Hotel :
df[df['hotel'] == 'Resort Hotel'] creates a subset of df that contains only the rows where the "hotel" column is equal to "Resort Hotel".
['is_canceled'] selects the "is_canceled" column from the resulting subset.
.value_counts() is a pandas method that calculates the frequency of each unique value in the selected column.
normalize=True specifies that the output should be normalized to represent proportions rather than counts.
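A sketch of the per-hotel calculation on invented data follows. The same pattern is applied to the City Hotel here for comparison, and the names resort_rate and city_rate are assumptions of this sketch.

```python
import pandas as pd

# Invented sample: four bookings per hotel type
df = pd.DataFrame({
    "hotel": ["Resort Hotel"] * 4 + ["City Hotel"] * 4,
    "is_canceled": [0, 0, 0, 1, 0, 1, 1, 0],
})

# Subset by hotel type, then take the share of each cancellation status
resort_hotel = df[df["hotel"] == "Resort Hotel"]
resort_rate = resort_hotel["is_canceled"].value_counts(normalize=True)

city_hotel = df[df["hotel"] == "City Hotel"]
city_rate = city_hotel["is_canceled"].value_counts(normalize=True)

print(resort_rate)
print(city_rate)
```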
Resort and City Hotel :
df[df['hotel'] == 'Resort Hotel'] and df[df['hotel'] == 'City Hotel'] create subsets of df that contain only the rows where the "hotel" column is equal to "Resort Hotel" and "City Hotel", respectively.
.groupby('reservation_status_date') groups the resulting subsets by the "reservation_status_date" column.
[['adr']].mean() selects the "adr" column from the resulting groups and calculates the mean of each group.
plt.figure() creates a new figure with a specified size (in this case, 20 inches wide by 8 inches high).
plt.title() sets the title of the plot to "Average Daily rate in City and Resort Hotel" with a font size of 30.
plt.plot() creates a line plot of the data.
resort_hotel.index specifies the x-values (i.e., the "reservation_status_date" column of the "Resort Hotel" group).
resort_hotel['adr'] specifies the y-values (i.e., the average daily rate for the "Resort Hotel" group).
label='Resort Hotel' specifies the label for the "Resort Hotel" line in the legend.
city_hotel.index and city_hotel['adr'] specify the x-values and y-values, respectively, for the "City Hotel" group.
label='City Hotel' specifies the label for the "City Hotel" line in the legend.
plt.legend() creates a legend for the plot and specifies the labels for each line.
plt.show() displays the plot.
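The grouping and plotting steps above might fit together as in this sketch. The data is invented, and the backend line is only needed when running outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample: a few bookings per hotel over two dates
df = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "Resort Hotel",
              "City Hotel", "City Hotel", "City Hotel"],
    "adr": [80.0, 100.0, 70.0, 120.0, 100.0, 140.0],
    "reservation_status_date": pd.to_datetime([
        "2016-01-01", "2016-01-01", "2016-01-02",
        "2016-01-01", "2016-01-02", "2016-01-02",
    ]),
})

# Mean adr per date, computed separately for each hotel type
resort_hotel = df[df["hotel"] == "Resort Hotel"].groupby("reservation_status_date")[["adr"]].mean()
city_hotel = df[df["hotel"] == "City Hotel"].groupby("reservation_status_date")[["adr"]].mean()

plt.figure(figsize=(20, 8))
plt.title("Average Daily rate in City and Resort Hotel", fontsize=30)
plt.plot(resort_hotel.index, resort_hotel["adr"], label="Resort Hotel")
plt.plot(city_hotel.index, city_hotel["adr"], label="City Hotel")
plt.legend()
ax = plt.gca()  # kept only so the plotted lines can be inspected afterwards
plt.show()
```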
This code is creating a count plot to visualize the reservation status (canceled or not) by month. The first line of code is creating a new column in the DataFrame called 'month' and it is extracting the month information from the 'reservation_status_date' column. The second line of code is creating a figure with a size of 16x8. The third line of code is creating the count plot, with the 'month' column on the x-axis, the hue as the 'is_canceled' column (which will show the count of reservations that are canceled and not canceled), and using the 'df' DataFrame as the data source. The fourth and fifth lines of code are adding a title and labels for the x and y axes. Finally, the last line of code is adding a legend to the plot to show which color represents 'Not Canceled' and 'Canceled'.
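The month extraction, and the counts that such a plot would display, can be sketched as follows. The data is invented, and pd.crosstab is used here as a numeric stand-in for the Seaborn count plot rather than as the plotting code itself.

```python
import pandas as pd

# Invented sample of six bookings
df = pd.DataFrame({
    "reservation_status_date": pd.to_datetime([
        "2016-01-05", "2016-01-20", "2016-01-25",
        "2016-07-04", "2016-07-10", "2016-08-01",
    ]),
    "is_canceled": [1, 1, 0, 0, 0, 1],
})

# New 'month' column holding the month number of each status date
df["month"] = df["reservation_status_date"].dt.month

# Bookings per month and status: the bar heights of the count plot
counts = pd.crosstab(df["month"], df["is_canceled"])
print(counts)
```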
This code creates a bar plot of ADR per month for canceled reservations. The first line creates a figure with a size of 15x8 and adds a title. The second line creates the bar plot with the Seaborn package, with 'month' on the x-axis and 'adr' on the y-axis. Its data source is the 'df' DataFrame filtered to canceled reservations with the boolean condition 'df['is_canceled']==1', then grouped by month with the 'adr' column summed, and finally passed through reset_index() to turn the month from an index back into a column. Because the values are summed rather than averaged, each bar shows the total ADR of canceled reservations in that month, not a monthly average.
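The data preparation behind this bar plot can be sketched on invented values. The second variable, monthly_mean, is an alternative that is not part of the described code; it is included only to show how an averaged version would differ from the summed one.

```python
import pandas as pd

# Invented sample with a ready-made 'month' column
df = pd.DataFrame({
    "month": [1, 1, 1, 7, 8],
    "is_canceled": [1, 1, 0, 1, 0],
    "adr": [100.0, 50.0, 80.0, 120.0, 90.0],
})

cancelled = df[df["is_canceled"] == 1]

# As described: sum of adr per month, with month turned back into a column
monthly_sum = cancelled.groupby("month")[["adr"]].sum().reset_index()

# Alternative (assumption of this sketch): average adr per month
monthly_mean = cancelled.groupby("month")[["adr"]].mean().reset_index()

print(monthly_sum)
print(monthly_mean)
```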
The first line of code creates a new DataFrame called 'cancelled_data' by filtering the original DataFrame 'df' to only include the canceled reservations. It does this by using the boolean condition 'df['is_canceled']==1', which checks if the value in the 'is_canceled' column is equal to 1 (which indicates a canceled reservation).
The second line of code creates a new variable called 'top_10_country' that contains the count of canceled reservations for each country in the 'cancelled_data' DataFrame. It does this by using the 'value_counts()' function on the 'country' column and selecting the top ten countries using the slice operator [:10].
The third line of code creates a new figure with a size of 8x8 and sets the title to 'Top Ten Countries with Reservation Cancelation Rate'.
The fourth line of code creates a pie chart using the 'plt.pie()' function. The 'top_10_country' variable is used as the data source, which contains the count of canceled reservations for each country. The 'autopct' parameter is set to '%.2f' to format the percentage values with two decimal places. The 'labels' parameter is set to 'top_10_country.index', which assigns the country names to each slice of the pie chart.
The last line of code displays the resulting pie chart.
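In summary, the code creates a pie chart showing how canceled reservations are split among the ten countries with the most cancellations. Because the values are counts of cancellations, each percentage is a country's share of the cancellations shown in the chart, not that country's cancellation rate.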
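The filtering and counting steps can be sketched on invented data as below. Instead of drawing the pie chart, the sketch computes the percentages that a pie chart of these counts would display, namely each value divided by the total of the plotted values.

```python
import pandas as pd

# Invented sample of six bookings
df = pd.DataFrame({
    "country": ["PRT", "PRT", "PRT", "GBR", "ESP", "PRT"],
    "is_canceled": [1, 1, 1, 1, 0, 0],
})

# Keep only canceled reservations, then count them per country (top ten)
cancelled_data = df[df["is_canceled"] == 1]
top_10_country = cancelled_data["country"].value_counts()[:10]

# Share of each country among the plotted cancellations, in percent
share = top_10_country / top_10_country.sum() * 100
print(share.round(2))

# The chart itself would then be drawn with:
# plt.pie(top_10_country, autopct="%.2f", labels=top_10_country.index)
```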
This line of code calculates the share of reservations for each unique value in the 'market_segment' column of the DataFrame 'df'. It does this by using the 'value_counts()' function, which counts the frequency of each unique value in the 'market_segment' column. The 'normalize=True' parameter returns the frequencies as proportions of the total number of reservations (values between 0 and 1) rather than as counts; multiply by 100 to read them as percentages.
This code creates a new dataframe for canceled reservations and calculates the average daily rate (ADR) for each day that a reservation was canceled. It also creates a new dataframe for not canceled reservations and calculates the ADR for each day that a reservation was not canceled. Then, it plots a line graph that shows the trend of ADR over time for both canceled and not canceled reservations. The x-axis shows the reservation status date and the y-axis shows the average daily rate. The graph is helpful to compare the ADR trends between canceled and not canceled reservations and to identify any patterns or insights.
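One way the two per-day ADR tables could be built is sketched below. The data is invented, and the exact sequence of calls (groupby, reset_index, sort_values) is an assumption based on the description rather than the original code.

```python
import pandas as pd

# Invented sample of five bookings over two dates
df = pd.DataFrame({
    "reservation_status_date": pd.to_datetime([
        "2016-01-02", "2016-01-01", "2016-01-01", "2016-01-02", "2016-01-01",
    ]),
    "is_canceled": [1, 1, 1, 0, 0],
    "adr": [120.0, 100.0, 80.0, 90.0, 70.0],
})

# Average adr per status date for canceled reservations
cancelled_df_adr = df[df["is_canceled"] == 1].groupby("reservation_status_date")[["adr"]].mean()
cancelled_df_adr.reset_index(inplace=True)
cancelled_df_adr.sort_values("reservation_status_date", inplace=True)

# The same for reservations that were not canceled
not_cancelled_df_adr = df[df["is_canceled"] == 0].groupby("reservation_status_date")[["adr"]].mean()
not_cancelled_df_adr.reset_index(inplace=True)
not_cancelled_df_adr.sort_values("reservation_status_date", inplace=True)

print(cancelled_df_adr)
print(not_cancelled_df_adr)
```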
cancelled_df_adr = cancelled_df_adr[(cancelled_df_adr['reservation_status_date']>'2016') & (cancelled_df_adr['reservation_status_date'] < '2017-09')]
This line of code is filtering the cancelled_df_adr DataFrame.
It starts by accessing the reservation_status_date column of the DataFrame using the syntax cancelled_df_adr['reservation_status_date'].
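It then applies two inequality operators (> and <) to this column, checking if each value in the column is greater than '2016' and less than '2017-09', respectively. Since the column holds datetime values, pandas parses these strings as the dates 2016-01-01 and 2017-09-01 before comparing.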
The two inequality conditions are combined using the bitwise & operator, so that only rows that satisfy both conditions are kept.
Finally, the resulting filtered DataFrame is assigned to the variable cancelled_df_adr.
not_cancelled_df_adr = not_cancelled_df_adr[(not_cancelled_df_adr['reservation_status_date'] > '2016') & (not_cancelled_df_adr['reservation_status_date'] < '2017-09')]
This line filters the not_cancelled_df_adr DataFrame in the same way as the previous line: it compares the reservation_status_date column against '2016' and '2017-09', combines the two conditions with the bitwise & operator so that only rows satisfying both are kept, and assigns the filtered result back to the variable not_cancelled_df_adr.
To summarize, these two lines of code are filtering two separate DataFrames based on the reservation_status_date column, and keeping only the rows where the date falls within a specific range. The resulting DataFrames will only contain data for reservations that fall within that time frame and meet the condition of being either cancelled or not cancelled.
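The effect of the date filter can be seen on four invented rows, two of which fall outside the range:

```python
import pandas as pd

# Invented sample: one date before the range, two inside, one at the upper bound
cancelled_df_adr = pd.DataFrame({
    "reservation_status_date": pd.to_datetime([
        "2015-12-31", "2016-06-01", "2017-08-31", "2017-09-01",
    ]),
    "adr": [70.0, 95.0, 110.0, 105.0],
})

# Keep rows strictly after 2016-01-01 and strictly before 2017-09-01
cancelled_df_adr = cancelled_df_adr[
    (cancelled_df_adr["reservation_status_date"] > "2016")
    & (cancelled_df_adr["reservation_status_date"] < "2017-09")
]

kept_dates = cancelled_df_adr["reservation_status_date"].dt.strftime("%Y-%m-%d").tolist()
print(kept_dates)
```

Both bounds are strict, so a row dated exactly 2017-09-01 is excluded.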
In the second code block:
plt.figure(figsize = (20,6))
This line creates a new figure object with a specified size of 20 inches by 6 inches. The figure object is used to contain the line plot that will be generated by the subsequent code.
plt.title('Average Daily Rate',fontsize = 20)
This line adds a title to the line plot with the text "Average Daily Rate". The fontsize argument sets the size of the title text to 20.
plt.plot(not_cancelled_df_adr['reservation_status_date'],not_cancelled_df_adr['adr'],label = 'not cancelled')
This line creates a line plot of the ADR for not-cancelled reservations over time.
The x-axis values are taken from the reservation_status_date column of the not_cancelled_df_adr DataFrame.
The y-axis values are taken from the adr column of the not_cancelled_df_adr DataFrame.
The label argument sets the label for this line in the legend to "not cancelled".
plt.plot(cancelled_df_adr['reservation_status_date'],cancelled_df_adr['adr'],label = 'cancelled')
This line creates a line plot of the ADR for cancelled reservations over time.
The x-axis values are taken from the reservation_status_date column of the cancelled_df_adr DataFrame.
The y-axis values are taken from the adr column of the cancelled_df_adr DataFrame.
The label argument sets the label for this line in the legend to "cancelled".
plt.legend(fontsize = 15)
This line adds a legend to the line plot, with labels for each line.
The fontsize argument sets the size of the legend text to 15.
plt.show()
This line displays the line plot in the current figure object.
Overall, this code generates a line plot that shows the ADR over time for cancelled and not-cancelled reservations. The x-axis represents the reservation status dates, and the y-axis represents the average ADR of each group on that date. The legend shows which line represents which group of reservations.
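Assembled in one place, the plotting lines walked through above can be run as in the sketch below. The two small DataFrames are invented stand-ins for the filtered per-day ADR tables, and the backend line is only needed when running outside Jupyter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this also runs as a script; not needed in Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Invented stand-ins for the two filtered per-day ADR tables
dates = pd.to_datetime(["2016-01-01", "2016-01-02", "2016-01-03"])
not_cancelled_df_adr = pd.DataFrame({"reservation_status_date": dates, "adr": [90.0, 95.0, 92.0]})
cancelled_df_adr = pd.DataFrame({"reservation_status_date": dates, "adr": [110.0, 105.0, 115.0]})

plt.figure(figsize=(20, 6))
plt.title("Average Daily Rate", fontsize=20)
plt.plot(not_cancelled_df_adr["reservation_status_date"], not_cancelled_df_adr["adr"], label="not cancelled")
plt.plot(cancelled_df_adr["reservation_status_date"], cancelled_df_adr["adr"], label="cancelled")
plt.legend(fontsize=15)
ax = plt.gca()  # kept only so the plotted lines can be inspected afterwards
plt.show()
```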
INSIGHTS
1. Cancellation rates rise as prices rise. To reduce cancellations, hotels could revise their pricing strategies and lower the rates for specific hotels based on location. They can also offer discounts to customers.
2. The ratio of cancellations to non-cancellations is higher for the resort hotel than for the city hotel, so hotels should offer a reasonable discount on room prices on weekends and holidays.
3. Cancellations are highest in January, so hotels can run campaigns or marketing with a reasonable budget in that month to increase their revenue.
4. They can also improve the quality of their hotels and services, mainly in Portugal, to reduce the cancellation rate.
CONCLUSION
To prevent rising cancellation rates, hotels can consider adjusting their pricing strategies, offering discounts during peak times, and improving the quality of their hotels and services. These measures can improve customer satisfaction and increase revenue.