This dataset contains Airbnb listings in New York City, New York. It comes from the "listings.csv" file published on the "insideairbnb.com" website, which can be downloaded from http://insideairbnb.com/get-the-data.
New York City has 5 boroughs (or counties):
Manhattan
Queens
Brooklyn
Bronx
Staten Island
Within each borough (or county) there are further subdivisions of neighborhoods.
There are 4 types of places to stay:
entire place (homes/apartments)
private rooms
shared rooms
hotel rooms
For more information about types of places in Airbnb
When visiting an accommodation website, review material is provided for the booker to read in order to assess the accommodation before booking.
Each column in the dataset has the following meaning.
id: This is a unique identifier for each accommodation listing on Airbnb.
name: The name or title of the accommodation listing.
host_id: A unique identifier for each host who owns the accommodation.
host_name: The name of the host who owns the accommodation.
neighbourhood_group: The name of the borough (neighborhood group) where the accommodation is located. There are five boroughs: Manhattan, Queens, Brooklyn, Bronx, and Staten Island.
neighbourhood: The name of the specific neighborhood or community where the accommodation is located.
latitude: The latitude coordinates of the accommodation's location.
longitude: The longitude coordinates of the accommodation's location.
room_type: The type of accommodation available for rent. There are four types: entire homes/apartments, private rooms, shared rooms, and hotel rooms.
price: The price per night for the accommodation.
minimum_nights: The minimum number of nights required for a stay.
number_of_reviews: The total number of reviews received for the accommodation.
number_of_reviews_ltm: The number of reviews received for the accommodation in the last twelve months (LTM).
last_review: The date of the last review received for the accommodation.
reviews_per_month: The average number of reviews received per month, indicating the frequency of stays and reviews in each month.
calculated_host_listings_count: The number of accommodations the host has listed on Airbnb.
availability_365: The number of days in a year that the accommodation is available for booking through Airbnb.
When examining the dataset, it's evident that the columns "last_review" and "reviews_per_month" have empty values, accounting for approximately 4% of the data.
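The share of empty values per column can also be checked in pandas. The following is a minimal sketch on a small made-up frame; the values in `sample` are invented for illustration and are not taken from the dataset. On the real data, `sample` would be the DataFrame loaded from "listings.csv".

```python
import pandas as pd

# Toy stand-in for the listings data; all values are invented for illustration.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "last_review": ["2023-01-05", None, "2022-11-30", "2023-02-14"],
    "reviews_per_month": [0.5, None, 1.2, 2.0],
})

# isna() marks empty cells; the column mean of those flags is the missing share.
missing_pct = sample.isna().mean() * 100
print(missing_pct.round(2))
```

Columns with a non-zero percentage are the ones that need a missing-data strategy.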
To address these gaps, we need a thoughtful strategy to handle missing data (e.g., imputation, removal).
For "last_review" Column:
Imputation with a Default Date: if the missing values represent cases where no reviews have been given, impute missing "last_review" values with '01-Jan-1900'. This indicates that no review was received.
For "reviews_per_month" Column:
Imputation with Zero: If the missing "reviews_per_month" values indicate that the listing has not received any reviews, impute the missing values with 0.
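The two imputation rules could be expressed in pandas roughly as follows. This is a sketch under the assumption that every empty value really means "no reviews yet"; the `sample` frame is made-up illustration data, not part of the dataset.

```python
import pandas as pd

# Toy data with one listing that has never been reviewed (invented values).
sample = pd.DataFrame({
    "last_review": ["2023-01-05", None, "2022-11-30"],
    "reviews_per_month": [0.5, None, 1.2],
})

# Parse the text into dates, then fill the gaps with the sentinel date 1900-01-01.
sample["last_review"] = pd.to_datetime(sample["last_review"]).fillna(
    pd.Timestamp("1900-01-01")
)

# A listing without reviews has a review rate of zero.
sample["reviews_per_month"] = sample["reviews_per_month"].fillna(0)

print(sample)
```

A sentinel date keeps the column typed as a date, but it has to be filtered out before computing anything like "days since last review".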
Check if columns have the appropriate data types (e.g., dates, numbers, text).
We convert the "id" and "host_id" columns from being considered “whole numbers” to being treated as “text” data.
To prevent any data redundancy and guarantee the reliability of our analysis, we will employ Power Query's "Remove Duplicates" feature to eliminate any duplicated rows present in the dataset.
We rename the column "neighbourhood_group" to "borough" for simplicity.
The column "id" will be renamed to "listings" to enhance clarity.
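For readers working outside Power Query, the same cleaning steps (type conversion, duplicate removal, renaming) might look like this in pandas. It is only a sketch on invented rows; the `sample` frame and its values are hypothetical.

```python
import pandas as pd

# Toy data containing one fully duplicated row (invented values).
sample = pd.DataFrame({
    "id": [101, 102, 102],
    "host_id": [9001, 9002, 9002],
    "neighbourhood_group": ["Manhattan", "Brooklyn", "Brooklyn"],
})

# Treat identifiers as text so they are never summed or averaged by accident.
sample = sample.astype({"id": str, "host_id": str})

# Remove exact duplicate rows, then apply the two renames described above.
sample = sample.drop_duplicates().rename(
    columns={"neighbourhood_group": "borough", "id": "listings"}
)

print(sample)
```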
Price Analysis:
What is the average price distribution of boroughs and neighborhoods?
What is the average price distribution of different types of accommodations (entire homes/apartments, private rooms, shared rooms, hotel rooms) in each borough?
Availability Analysis:
What is the average number of days in a year that accommodations are available for booking in each borough and neighborhood?
Host Analysis:
Who are the top hosts in terms of the number of accommodations listed?
How does the number of listings a host has impact their average reviews per month?
Is there any correlation between the host's experience (calculated_host_listings_count) and the average price of their listings?
Review Analysis:
What are the neighborhoods or boroughs with the highest and lowest number of reviews per month?
How does the number of reviews correlate with the price of accommodations? Are more expensive listings reviewed more often?
Geospatial Insights:
Can we identify clusters of accommodations on a map? Are there specific neighborhoods that have a high concentration of listings?
What are the average prices for accommodations in different parts of the city? Are there notable spatial patterns?
Stay Duration:
How does the minimum number of nights required for a stay vary across different types of accommodations and neighborhoods?
Are there any neighborhoods or boroughs that tend to have longer or shorter minimum stay requirements?
Room Type Analysis:
How does the distribution of room types differ across boroughs and neighborhoods?
Are there neighborhoods that primarily offer a specific type of room (e.g., private rooms)?
Which variables show strong positive or negative correlations, and what insights can we derive from these relationships?
Python Code in Google Colaboratory:
# Import the 'drive' module from the 'google.colab' library to enable Google Drive integration.
# Mount the Google Drive to the '/content/drive' directory in the Colab environment.
# This step will prompt you to authenticate and provide an authorization code to access your Google Drive.
from google.colab import drive
drive.mount('/content/drive')
# After mounting, you can access your Google Drive files and directories within the Colab environment
# using the path '/content/drive'.
# Import the necessary libraries, including pandas for data manipulation.
import pandas as pd
# Read the CSV file into a pandas DataFrame.
# Replace the placeholder path with the location of your "listings.csv" file.
dataset = pd.read_csv("path/to/listings.csv")
# Display the first 3 rows of the DataFrame to inspect the data.
dataset.head(3)
# Original column names mapping for renaming
original_column_names = {
'neighbourhood_group': 'borough', # Renaming 'neighbourhood_group' to 'borough'
'id': 'listings' # Renaming 'id' to 'listings'
}
# Rename the columns in the dataset using the defined mapping
dataset.rename(columns=original_column_names, inplace=True)
# Keep only the columns needed for the analysis and drop duplicate rows
selected_columns = [
    'borough', 'neighbourhood', 'room_type', 'availability_365',
    'calculated_host_listings_count', 'listings', 'minimum_nights',
    'number_of_reviews', 'number_of_reviews_ltm', 'price', 'reviews_per_month',
]
dataset = dataset[selected_columns].drop_duplicates()
# Import seaborn and matplotlib libraries for data visualization
import seaborn as sns
import matplotlib.pyplot as plt
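# Create a heatmap to visualize the correlation matrix of the numeric columns with annotations.
# numeric_only=True skips text columns such as 'borough' and 'room_type', which would otherwise raise an error in recent pandas versions.
hm = sns.heatmap(dataset.corr(numeric_only=True), annot=True)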
# Display the heatmap
plt.show()
The variables "reviews_per_month" and "number_of_reviews_ltm" exhibit a strong positive correlation with a coefficient of 0.83.
This suggests that as the number of reviews per month increases, the number of reviews in the last twelve months also tends to increase, and vice versa.
On the other hand, the remaining variable pairs show correlations close to 0, indicating little or no linear relationship between them.
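Reading the strongest pairs off a heatmap by eye is error-prone, so it can help to list them as a ranked table. The sketch below uses a small invented frame (the numbers are not from the dataset and will not reproduce the 0.83 figure); it keeps each pair once and sorts by absolute correlation.

```python
import numpy as np
import pandas as pd

# Toy numeric data; values are invented for illustration only.
sample = pd.DataFrame({
    "price": [100, 250, 80, 150, 300],
    "number_of_reviews_ltm": [12, 0, 30, 6, 2],
    "reviews_per_month": [1.1, 0.0, 2.4, 0.6, 0.1],
})

corr = sample.corr(numeric_only=True)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Flatten to (variable, variable) -> coefficient and sort by absolute strength.
pairs = upper.stack().dropna().sort_values(key=abs, ascending=False)
print(pairs)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, which a plain descending sort would hide at the bottom.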
How does the distribution of listings differ across boroughs or neighborhoods?
Are there specific boroughs or neighborhoods that have a high concentration of listings?
What is the ratio of the number of listings for each room type?
How does the distribution of room types differ across boroughs?
What is the average price distribution of boroughs and neighborhoods?
What is the average price distribution of room types in each borough?
What is the average number of days in a year that accommodations are available for booking in each borough and neighborhood?
Who are the top hosts in terms of the number of accommodations listed?
Who are the top hosts in terms of average reviews per month?
Who are the top hosts in terms of number of reviews in the last 12 months?
Who are the top hosts in terms of number of reviews?
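Several of the questions above reduce to a count or an average per group. As a rough illustration of how they could be answered in pandas instead of Power BI, the following sketch uses a small invented frame; the column names follow the dataset, but every value in `sample` is hypothetical.

```python
import pandas as pd

# Toy listings; all values are invented for illustration.
sample = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Queens"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt",
                  "Private room", "Private room"],
    "price": [300, 120, 200, 80, 70],
    "availability_365": [100, 200, 150, 365, 300],
})

# Number of listings per borough.
listings_per_borough = sample["borough"].value_counts()

# Average price for each borough / room type combination.
avg_price = sample.pivot_table(
    index="borough", columns="room_type", values="price", aggfunc="mean"
)

# Average number of days per year a listing is available, per borough.
avg_availability = sample.groupby("borough")["availability_365"].mean()

print(listings_per_borough, avg_price, avg_availability, sep="\n\n")
```

Replacing "borough" with "neighbourhood" in the same calls gives the neighborhood-level view.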
Through a comprehensive analysis of the Airbnb listings dataset for New York City, we've gained valuable insights that shed light on various aspects of the accommodation landscape. This analysis was empowered by Power BI's capabilities and involved thorough data cleaning and exploration. Here are the key takeaways:
Data Cleaning:
Missing values in the "last_review" and "reviews_per_month" columns were handled strategically. "last_review" values were imputed with '01-Jan-1900' to indicate no reviews, and "reviews_per_month" gaps were filled with zeros.
Columns were standardized for data types, ensuring uniformity and accuracy.
Duplicate rows were removed to ensure the reliability of analysis results.
Column names were made more intuitive, with "neighbourhood_group" becoming "borough" and "id" being renamed to "listings."
Key Insights:
A strong positive correlation (0.83) was observed between "reviews_per_month" and "number_of_reviews_ltm," indicating that higher monthly review rates correspond to more reviews over the past year.
The distribution of listings across boroughs and neighborhoods showed that Manhattan had the highest count, with specific neighborhoods like Bedford-Stuyvesant, Williamsburg, and Harlem having significant listings.
"Entire home/apt" was the dominant room type, comprising 56.58% of listings, while the distribution of room types varied across boroughs.
Average prices were found to vary across neighborhoods and boroughs, with Willowbrook having the highest average price.
Geospatial analysis revealed that Staten Island had the highest average availability of accommodations, while the distribution varied across neighborhoods.
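Figures such as a room type's share of listings or the neighborhood with the highest average price can be cross-checked in pandas. This is a minimal sketch on invented rows; the prices and counts in `sample` are hypothetical and do not reproduce the percentages reported above.

```python
import pandas as pd

# Toy listings; neighbourhood names are real, the numbers are invented.
sample = pd.DataFrame({
    "neighbourhood": ["Harlem", "Harlem", "Williamsburg", "Willowbrook"],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"],
    "price": [150, 90, 200, 400],
})

# Share of listings per room type, in percent.
room_type_share = sample["room_type"].value_counts(normalize=True).mul(100).round(2)

# Average price per neighbourhood, most expensive first.
avg_price_by_neighbourhood = (
    sample.groupby("neighbourhood")["price"].mean().sort_values(ascending=False)
)

print(room_type_share)
print(avg_price_by_neighbourhood.head(5))
```

An average based on very few listings can be dominated by a single expensive one, so it is worth checking the listing count next to each average price.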
Host Analysis:
Top hosts were identified based on the number of accommodations listed, average reviews per month, and number of reviews in the last 12 months.
"Blueground" emerged as a top host with a significant number of calculated host listings.
Hosts like "Alex And Zeena" exhibited high average reviews per month and the highest number of reviews in the last 12 months.
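A host ranking of this kind could be rebuilt in pandas with one grouped aggregation. The sketch below is an assumption about how such a ranking might be computed, not the Power BI logic used in this analysis: it counts rows per host rather than reading "calculated_host_listings_count", and the hosts and values in `sample` are invented.

```python
import pandas as pd

# Toy listings for three hypothetical hosts (invented values).
sample = pd.DataFrame({
    "host_id": ["h1", "h1", "h2", "h3", "h3", "h3"],
    "host_name": ["Host A", "Host A", "Host B", "Host C", "Host C", "Host C"],
    "reviews_per_month": [1.0, 3.0, 0.5, 0.2, 0.4, 0.0],
    "number_of_reviews_ltm": [10, 30, 5, 2, 4, 0],
})

# One row per host: listing count, mean review rate, total reviews in the last 12 months.
host_stats = sample.groupby(["host_id", "host_name"]).agg(
    listings=("reviews_per_month", "size"),
    avg_reviews_per_month=("reviews_per_month", "mean"),
    reviews_ltm=("number_of_reviews_ltm", "sum"),
)

top_by_listings = host_stats.sort_values("listings", ascending=False).head(10)
top_by_reviews = host_stats.sort_values("reviews_ltm", ascending=False).head(10)

print(top_by_listings)
print(top_by_reviews)
```

Grouping by "host_id" as well as "host_name" avoids merging different hosts who happen to share a name.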
In summary, this analysis has provided a comprehensive understanding of the Airbnb landscape in New York City. The correlations, geospatial insights, room type distributions, pricing trends, availability patterns, and host performances all contribute to a richer understanding of the city's accommodation market. This information can be used to inform strategic decisions for both guests and hosts, enhancing the overall Airbnb experience in New York City.
About The Author
Polakrit Chulanutrakul
Making Numbers Meaningful, Turning Data into Action, Unleashing Business Potential